I have 584 .txt files that I would like to merge into one 584 x 4 tibble.

Important Background Info:

The files can be divided into three categories according to the labels embedded in the file names. Thus:

A_1_COD.txt, A_23_COD.txt, A_235_COD.txt, ..., A_457_COD.txt -> belong in Category A;

B_3_COD.txt, B_19_COD.txt, B_189_COD.txt, ..., B_355_COD.txt -> belong in Category B;

C_5_COD.txt, C_11_COD.txt, C_196_COD.txt, ..., C_513_COD.txt -> belong in Category C.

The file names shown in this section have been simplified for ease of comprehension. Examples of the real file names are ENTITY_117_MOR.txt, INCREMENTAL_208_MOR.txt, and MODERATE_173_MOR.txt. The real categories are ENTITY, INCREMENTAL, and MODERATE.

What the resulting tibble structure should be like:

A tibble: 584 x 4

row   filename    category   text
      <?>         <fct>      <chr>
1     A_1_COD     A          "Lorem ipsu-
2     B_2_COD     B          "Lorem ipsu-
3     C_3_COD     C          "Lorem ipsu-
.     .           .          .
.     .           .          .
.     .           .          .
584   A_584_COD   A          "Lorem ipsu-

What I have managed to do so far: Thanks to @awaji98, I managed to get three of the four columns I intend to have by using the following code:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <-
  folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")

# if you prefer a tibble output
dat %>% tibble()

The result can be seen in the image below:

The picture shows the resulting table with all the data except for category

Remaining problem to be solved: I need to get R to extract the categories embedded in the file names (i.e., ENTITY, INCREMENTAL, MODERATE) to fill the category column with the respective values.
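For context, `str_extract(doc_id, "^.")` keeps only the first character of each name, which suits single-letter labels such as A, B, and C but not multi-letter ones. The sketch below uses a hypothetical `doc_id` vector (clean names, not the real data) to contrast that pattern with `"^[A-Z]+"`, which takes the whole run of capitals before the first underscore.

```r
library(stringr)

# Hypothetical file names following the real naming scheme
doc_id <- c("ENTITY_117_MOR.txt", "INCREMENTAL_208_MOR.txt", "MODERATE_173_MOR.txt")

# "^." captures just the first character of each name
first_char <- str_extract(doc_id, "^.")

# "^[A-Z]+" captures every capital letter up to the first underscore
category <- factor(str_extract(doc_id, "^[A-Z]+"))

print(first_char)
print(levels(category))
```

This assumes the names start directly with the category label; any stray character in front of it would make both patterns behave differently.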

@awaji98 suggested two possible paths. Here's the first one:

dat <- folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  tidyr::extract(filename, into = "category", regex = "^([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

This resulted in a category column filled with red "NAs."

The second one,

## Use tidyr::extract to create two new columns from doc_id
dat <- folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^([A-Z]+)_(.*).txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

as shown in the image below, produced two columns filled with red "NAs," which was not the expected output.

Final Solution

@awaji98 realized that the problem was with the regex. As it turned out, the file names had a leading whitespace character, so a pattern anchored with ^ directly before the capital letters never matched. The solution was to add a space right after the ^ in each regex in the answer. Thus, the code that delivered the expected result was:
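A diagnosis like this can be checked on a made-up name with and without a stray space at the start. The sketch below is only an illustration: the `ids` vector is hypothetical, and the whitespace-tolerant pattern `^\\s*([A-Z]+)_` is an alternative assumed here, not the pattern used in the thread.

```r
library(stringr)

# Hypothetical doc_id values: the second one starts with a stray space
ids <- c("ENTITY_117_MOR.txt", " INCREMENTAL_208_MOR.txt")

# The strict pattern requires a capital letter as the very first character
strict <- str_match(ids, "^([A-Z]+)_")[, 2]

# Allowing optional leading whitespace matches both forms
tolerant <- str_match(ids, "^\\s*([A-Z]+)_")[, 2]

print(strict)
print(tolerant)
```

An `NA` in `strict` alongside a value in `tolerant` points to whitespace in front of the label. Calling `str_trim()` on the names before extracting would be another way to handle it.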

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"  
  
dat <- folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

The final result is shown in the following photo:

This picture shows the successful final result

Kind regards,
Á_C

1 Answer

You can use a combination of some common tidyverse functions and the useful readtext() from the package with the same name:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <-
  folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")

# if you prefer a tibble output
dat %>% tibble()
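To try the list-read-combine pattern without the real folder, a small self-contained sketch may help. It writes three throwaway files to a temporary directory and reads them with base `readLines()` instead of `readtext()`; that substitution, the `texts_demo` folder name, and the file contents are all assumptions made for illustration.

```r
library(tibble)
library(purrr)

# Create a throwaway folder with three tiny files (stand-ins for the real texts)
folder <- file.path(tempdir(), "texts_demo")
dir.create(folder, showWarnings = FALSE)
writeLines("Lorem ipsum", file.path(folder, "ENTITY_117_MOR.txt"))
writeLines("Dolor sit", file.path(folder, "INCREMENTAL_208_MOR.txt"))
writeLines("Amet", file.path(folder, "MODERATE_173_MOR.txt"))

# List the .txt files; note that pattern is a regular expression, not a glob
paths <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Read each file into a single string and collect the results in a tibble
dat <- tibble(
  filename = tools::file_path_sans_ext(basename(paths)),
  text     = map_chr(paths, ~ paste(readLines(.x, warn = FALSE), collapse = "\n"))
)

print(dat)
```

Because the names come from `basename()` here, they carry no extra whitespace; the `doc_id` values produced by `readtext()` may differ, as the question's final solution shows.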

UPDATED:

Perhaps one of the following will get what you need. The first example keeps the filename column with the category at the front of each value:

folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()

The second one uses tidyr::extract to create two columns from the doc_id, so filename drops the category part:

## Use tidyr::extract to create two new columns from doc_id
folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$")) %>%
  extract(doc_id, into = c("category", "filename"), regex = "^ ([A-Z]+)_(.*).txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
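The two-column split can be tried on a toy tibble before running it on all the files. In the sketch below, `toy` is a hypothetical stand-in for the `readtext()` output, and the pattern uses `\\s*` and an escaped dot as assumed refinements rather than the exact regex above.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the readtext() output
toy <- tibble(
  doc_id = c("ENTITY_117_MOR.txt", "MODERATE_173_MOR.txt"),
  text   = c("Lorem ipsum", "Dolor sit amet")
)

# Split doc_id into the category and the remainder of the name;
# doc_id itself is dropped because remove = TRUE is the default
out <- extract(
  toy, doc_id,
  into  = c("category", "filename"),
  regex = "^\\s*([A-Z]+)_(.*)\\.txt$"
)

print(out)
```

Rows whose `doc_id` does not match the pattern come back as `NA` in both new columns, which is the symptom described in the question.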