How to assign vector of strings as variable names, in for loop, in data.table, in dplyr

Question

I have been searching for this a long time and never found suitable answer here. I have a variable "extern" which is a vector containing names of my data frames:

extern <- list.files(pattern = "*.csv")
extern <– c("altitude.csv", "area.csv", "farmland.csv", "GDPpercapita.csv", "population.csv")

I can load them into R like this and the characters are converted to variables. It works for me assuming my wd is the folder same as data.

for (i in 1:length(extern)) assign(extern[i], fread(extern[i], sep = ","))

But this is only case that works for me. Now I want to iterate through loaded datasets and replace all commas nad % signs by nothing "". If I put a single dataset (instead of extern[i]) there, the code works but when put to for cycle it throws an error.

for (i in 1:length(extern)) {
  mutate_all(extern[i], function(x) {
    str_replace_all(string = x, pattern = ",", replacement = "")
    str_replace_all(x, pattern = "%", replacement = "")})
}

Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "character"

So I tried similar think with assign() as before without any success, than somebody used eval(parse()) combo and many different thinks.

for (i in 1:length(extern)) {
  assign(extern[i], mutate_all(extern[i], function(x) {
    str_replace_all(string = x, pattern = ",", replacement = "")
    str_replace_all(x, pattern = "%", replacement = "")}))

This is quite general problem, I encountered it many times and nothing worked so every time I had to go around. But now I have so many files that it is not possible to do it manually. Same problem is for example in data.table - I have stored column names in a vector and I wanted to do whatever. If you write there the name of the vector the data.table takes it as a name of single column and do not evaluate it. Also it is usually needed to cbind() or rbind() these files like this:

data <- rbindlist(l = list("altitude.csv", "area.csv", "farmland.csv", "GDPpercapita.csv", "population.csv"), idcol = "id")

and id like to have it easier way something like below - extern is a vector full of names (character strings)

rbindlist(l = extern, idcol = "id")

in this case it is possible to store names in the list because function itself accepts list of names, but majority of other functions including mutate_all() I used as an example, accepts variable not a character string as a name.

I hope somebody has same issue or some other way how to do it, because I really don't know and I spend searching like half year.

You can use get() on a character string to get the object it refers to--not sure how well that works with dplyr's non-standard evaluation. The more common way to do this is to make a list of data frames and iterate over the list. — Gregor Thomas

Uwe Uwe · Accepted Answer · 2021-03-06T18:22:21

This is a common situation which can be handled with ease by using a list.

This is what I would do if the data files are different in structure, i.e., columns differ in names, data types, or order:

library(data.table)
file_names <- list.files(pattern = "*.csv")
list_of_df <- lapply(file_names, fread)
list_of_df <- setNames(list_of_df, file_names)
list_of_df

$area.csv
   id         name
1:  1  normal name
2:  2   with,comma
3:  3 with%percent

$farmland.csv
   id         name
1:  1  normal name
2:  2   with,comma
3:  3 with%percent

$GDPpercapita.csv
   id         name
1:  1  normal name
2:  2   with,comma
3:  3 with%percent

Note that I have made up three sample files for demonstration. See Data section for details.

The elements of the resulting list object list_of_df are named like the files the data were loaded from.

Now, we can operate on the elements of the list using lapply() or a for loop, e.g.,

lapply(
  list_of_df, 
  function(df) df[, lapply(.SD, function(col) if (is.character(col)) stringr::str_remove_all(col, "[,%]") else col)]
  )

$area.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$farmland.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$GDPpercapita.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

Note that the code to remove , and % has been simplified.

lapply() has the advantage over a for loop that is returns a list again which is convenient for subsequent processing steps.

As a side note: there is a speciality with data.table as it is able to update by reference, i.e., without copying the data.table. So, we can update list_of_df in place which might be a benefit in terms of speed and memory consumption for large datasets:

address(list_of_df) # just for demonstration
for (df in list_of_df) {
  cols <- which(sapply(df, is.character))
  df[, (cols) := lapply(.SD, stringr::str_remove_all, "[,%]"), .SDcols = cols]
}
address(list_of_df)

The calls to address(list_of_df) before and after the for loop have been added to demonstrate that list_of_df still occupies the same storage location but has been changed in place.

list_of_df

$area.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$farmland.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$GDPpercapita.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

In case the datasets read from file have a similar structure, i.e. same name, order and data type of columns, we can combine the single pieces into one large dataset using rbindlist()

My preferred workflow for this use case is along

library(data.table)
library(magrittr)
file_names <- list.files(pattern = "*.csv")
big_df <- lapply(file_names, fread) %>% 
  set_names(file_names) %>% 
  rbindlist(idcol = "file_name")
big_df

          file_name id         name
1:         area.csv  1  normal name
2:         area.csv  2   with,comma
3:         area.csv  3 with%percent
4:     farmland.csv  1  normal name
5:     farmland.csv  2   with,comma
6:     farmland.csv  3 with%percent
7: GDPpercapita.csv  1  normal name
8: GDPpercapita.csv  2   with,comma
9: GDPpercapita.csv  3 with%percent

Note that rbindlist() has created an id column from the names of the list elements. This allows for distinguishing the origin of each row.

Working with one uniform data structure simplifies subsequent processing

cols <- which(sapply(big_df, is.character))
big_df[, (cols) := lapply(.SD, stringr::str_remove_all, "[,%]"), .SDcols = cols]
big_df

          file_name id        name
1:         area.csv  1 normal name
2:         area.csv  2   withcomma
3:         area.csv  3 withpercent
4:     farmland.csv  1 normal name
5:     farmland.csv  2   withcomma
6:     farmland.csv  3 withpercent
7: GDPpercapita.csv  1 normal name
8: GDPpercapita.csv  2   withcomma
9: GDPpercapita.csv  3 withpercent

As the OP is using mutate() here is an all "tidyverse" approach. It does essentially the same as the data.table versions above:

library(purrr)
library(dplyr)
file_names <- list.files(pattern = "*.csv")
list_of_df <- map(file_names, readr::read_csv) %>% 
  set_names(file_names)

list_of_df %>% 
  map( ~ mutate(.x, across(where(is.character), ~ stringr::str_remove_all(.x, "[,%]"))))

$area.csv
# A tibble: 3 x 2
     id name       
  <dbl> <chr>      
1     1 normal name
2     2 withcomma  
3     3 withpercent

$farmland.csv
# A tibble: 3 x 2
     id name       
  <dbl> <chr>      
1     1 normal name
2     2 withcomma  
3     3 withpercent

$GDPpercapita.csv
# A tibble: 3 x 2
     id name       
  <dbl> <chr>      
1     1 normal name
2     2 withcomma  
3     3 withpercent

map() is the equivalent of base R's lapply(). Also readr::read_csv() is used instead of data.table's fread().

Data

Caveat: The code below will create 3 files in the current working directory!

library(data.table)
dummy <- data.table(id = 1:3, name = c("normal name", "with,comma", "with%percent"))
extern <- c("area.csv", "farmland.csv", "GDPpercapita.csv")
for (fn in extern) fwrite(dummy, fn)

The code saves a dummy data.table three times as csv file to disk using three different file names.

How to assign vector of strings as variable names, in for loop, in data.table, in dplyr

3 Answers

Data