I have 584 .txt files that I would like to merge into one 584 x 4 tibble.
Important Background Info:
The files can be divided into three categories according to the labels embedded in the file names. Thus:
A_1_COD.txt, A_23_COD.txt, A_235_COD.txt, ..., A_457_COD.txt -> belong in Category A;
B_3_COD.txt, B_19_COD.txt, B_189_COD.txt, ..., B_355_COD.txt -> belong in Category B;
C_5_COD.txt, C_11_COD.txt, C_196_COD.txt, ..., C_513_COD.txt -> belong in Category C.
The file names shown in this section have been modified for ease of comprehension. Examples of the real file names are: ENTITY_117_MOR.txt; INCREMENTAL_208_MOR.txt; MODERATE_173_MOR.txt. The real categories are: ENTITY, INCREMENTAL, and MODERATE.
What the resulting tibble structure should be like:
A tibble: 584 x 4
row | filename <?> | category <fct> | text <chr> |
---|---|---|---|
1 | A_1_COD | A | "Lorem ipsu- |
2 | B_2_COD | B | "Lorem ipsu- |
3 | C_3_COD | C | "Lorem ipsu- |
. | . | . | . |
. | . | . | . |
. | . | . | . |
584 | A_584_COD | A | "Lorem ipsu- |
What I have managed to do so far: Thanks to @awaji98, I managed to get three of the four columns I intend to have by using the following code:
library(tidyverse)
library(readtext)
folder <- "path_to_folder_of_texts"
dat <-
  folder %>%
  # get full path names for each text
  dir(pattern = "*.txt",
      full.names = T) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, ".txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")
# if you prefer a tibble output
dat %>% tibble()
The result can be seen in the image below:
[Image: the resulting table, with all the data except for category]
Remaining problem to be solved: I need to get R to extract the categories embedded in the file names (i.e., ENTITY, INCREMENTAL, MODERATE) to fill the category column with the respective values.
@awaji98 suggested two possible paths. Here's the first one:
> dat <- folder %>%
+ # get full path names for each text
+ dir(pattern = "*.txt",
+ full.names = T) %>%
+ # map readtext function to each path name into a dataframe
+ map_df(., readtext) %>%
+ # add and change columns as desired
+ mutate(filename= str_remove(doc_id, ".txt$")) %>%
+ tidyr::extract(filename, into = "category", regex = "^([A-Z]+)_", remove = FALSE) %>%
+ mutate(category = factor(category)) %>%
+ select(filename,category,text) %>%
+ rowid_to_column(var = "row") %>%
+ tibble()
This first attempt resulted in a category column filled with red NAs.
Here's the second one:
> dat <- ## Use tidyr::extract to create two new columns from doc_id
+ folder %>%
+ # get full path names for each text
+ dir(pattern = "*.txt",
+ full.names = T) %>%
+ # map readtext function to each path name into a dataframe
+ map_df(., readtext) %>%
+ # add and change columns as desired
+ mutate(filename= str_remove(doc_id, ".txt$")) %>%
+ tidyr::extract(doc_id, into = c("category","filename"), regex = "^([A-Z]+)_(.*).txt$") %>%
+ mutate(category = factor(category)) %>%
+ select(filename,category,text) %>%
+ rowid_to_column(var = "row") %>%
+ tibble()
This second attempt, as shown in the photo below, produced two columns filled with red NAs.
[Image: tibble with two columns containing red NAs, which was not the expected output]
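An all-NA column from tidyr::extract() generally means the regex matched none of the inputs. Here is a minimal sketch of how one might check that. The file names are made up for illustration, and two of them carry a leading space purely as an assumption, to show how an invisible character defeats a regex anchored with ^.

```r
library(tibble)
library(tidyr)

# Hypothetical file names (not the real data); the 2nd and 3rd start with a space
docs <- tibble(doc_id = c("ENTITY_117_MOR.txt",
                          " INCREMENTAL_208_MOR.txt",
                          " MODERATE_173_MOR.txt"))

# Same regex as in the attempts above: start of string, capitals, underscore
res <- tidyr::extract(docs, doc_id, into = "category",
                      regex = "^([A-Z]+)_", remove = FALSE)

# Rows whose name does not match the regex get NA in the new column
print(res$category)

# dput() prints each string in quotes, which makes stray whitespace visible
dput(docs$doc_id)
```

Comparing the quoted strings from dput() against the regex is a quick way to spot characters that the tibble printout hides.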
Final Solution
@awaji98 realized that the problem was with the regex. As it turned out, the file names had a leading whitespace, so a regex that expects capital letters immediately after ^ never matched. The solution was to add a space to the front of each regex in the answer, right after the ^ anchor. Thus, the code that delivered the expected result was:
library(tidyverse)
library(readtext)
folder <- "path_to_folder_of_texts"
dat <- folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # note the space after ^, which accounts for the leading whitespace in the names
  extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
The final result is shown in the following photo:
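Hard-coding a space in the regex works only if every name starts with exactly one space. A possibly more forgiving variant, sketched here on hypothetical file names rather than the real folder, trims the whitespace with stringr::str_trim() before extracting, so the original unpadded regex can be reused. The docs tibble stands in for the output of readtext and is an assumption for illustration only.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for the readtext output; names padded inconsistently
docs <- tibble(doc_id = c(" ENTITY_117_MOR.txt",
                          "INCREMENTAL_208_MOR.txt ",
                          " MODERATE_173_MOR.txt"))

dat <- docs %>%
  # strip leading/trailing whitespace first, then drop the extension
  mutate(filename = str_remove(str_trim(doc_id), "\\.txt$")) %>%
  # the regex no longer needs to account for a leading space
  tidyr::extract(filename, into = "category",
                 regex = "^([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category) %>%
  rowid_to_column(var = "row")

print(dat)
```

A side effect of trimming is that the filename column also ends up without the stray whitespace, which may or may not be wanted if the names are later matched back to files on disk.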