0
votes

I'm trying to extract one column from multiple .csv files using Reduce. What I have is:

a vector with the path to every single .csv

filepaths

a function to read a .csv and return one of its columns

getData <- function(path,column) {
   d = read.csv(path)
   d[,column]
}

and the Reduce function, to apply the getData function to every single filepath and store the results in a single collection (for demonstration I only take the first three path strings)

Reduce(function(path,acc) append(acc, getData(path,column)), filepaths[1:3],c())

If I do this, I get the following error, which occurs when read.csv is called with one of the filepaths:

Error in read.table(file = file, header = header, sep = sep, quote = quote, : 'file' must be a character string or connection

This is strange, because if I call the "getData" function manually like

getData(filepaths[1],col)
getData(filepaths[2],col)
getData(filepaths[3],col)

it works.

I know I could do this with a for loop, but I want to understand what the problem is.

4
Try do.call(rbind, lapply(filepaths, fread, select = "colname")) - mtoto
You can also do this with your function: unlist(lapply(filepaths, function(x) { getData(x, 1) })) will read the first column. - fishtank
Why read the whole .csv and then extract only one column? That's inefficient. fread, for example, has a select argument... - MichaelChirico

4 Answers

1
votes

You could use fread from data.table to read in only the desired column, instead of reading in the entire CSVs and subsequently dropping all columns but one, as your function does.

library(data.table)
unlist(lapply(filepaths, fread, select= "colname")) #output is a vector
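To try this without real data, here is a minimal sketch under a few assumptions: the two temporary files, the directory name demo_dir and the column names id and colname are made up for illustration, and data.table must be installed. It pulls the single column out of each result with [[1]] before combining, so the output is an unnamed vector.

```r
library(data.table)

# Hypothetical sample data: two small CSV files in a temporary directory
demo_dir <- tempfile("freaddemo")
dir.create(demo_dir)
filepaths <- file.path(demo_dir, paste0("file", 1:2, ".csv"))
fwrite(data.table(id = 1:2, colname = c("a", "b")), filepaths[1])
fwrite(data.table(id = 3:4, colname = c("c", "d")), filepaths[2])

# select = "colname" makes fread() return a one-column data.table per file;
# [[1]] extracts that column as a plain vector before the pieces are combined
values <- unlist(lapply(filepaths, function(p) fread(p, select = "colname")[[1]]))
print(values)
```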
1
votes

I just figured it out. The problem is that Reduce expects a function that takes the accumulator as its FIRST parameter and the element as its second. I had switched them. So the solution looks like this:

getData <- function(path,column) {
  d = read.csv(path)
  d[,column]
}

Reduce(function(acc,path) append(acc, getData(path,column)), filepaths[1:3],c())
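To see both orderings side by side, here is a small self-contained sketch. The three temporary files, the directory name demo_dir and the column names id and value are assumptions made up for the demonstration. With the parameters swapped, the first call receives the initial value c() (that is, NULL) as path, which appears to be what triggers the read.csv error from the question.

```r
# Hypothetical sample data: three small CSV files in a temporary directory
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
filepaths <- file.path(demo_dir, paste0("file", 1:3, ".csv"))
for (i in seq_along(filepaths)) {
  write.csv(data.frame(id = i, value = i * 10), filepaths[i], row.names = FALSE)
}

getData <- function(path, column) {
  d <- read.csv(path)
  d[, column]
}

# Accumulator first, element second: every file path reaches read.csv()
ok <- Reduce(function(acc, path) append(acc, getData(path, "value")),
             filepaths, c())

# Swapped parameter names: on the first call `path` is the initial value NULL,
# so read.csv() stops; tryCatch() captures the message instead of aborting
err <- tryCatch(
  Reduce(function(path, acc) append(acc, getData(path, "value")),
         filepaths, c()),
  error = function(e) conditionMessage(e)
)

print(ok)
print(err)
```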

Thanks for the hint about fread. I see that it is much better than read.csv.

0
votes

Reduce() is typically used with functions that take two values of the same kind and return a value of that kind. For example, a function reduceFun(x1, x2) that compares x1 and x2 and returns the larger one is first called with x1 and x2 being the first two elements of the vector; the result is then passed as x1, with the third element as x2, and so on:

reduceFun <- function(x1,x2) 
{
  print(paste("x1=",x1, " : x2=",x2, " : max=",max(x1,x2)));
  return(max(x1,x2))
}
> res <- Reduce(reduceFun, 1:10)
[1] "x1= 1  : x2= 2  : max= 2"
[1] "x1= 2  : x2= 3  : max= 3"
[1] "x1= 3  : x2= 4  : max= 4"
[1] "x1= 4  : x2= 5  : max= 5"
[1] "x1= 5  : x2= 6  : max= 6"
[1] "x1= 6  : x2= 7  : max= 7"
[1] "x1= 7  : x2= 8  : max= 8"
[1] "x1= 8  : x2= 9  : max= 9"
[1] "x1= 9  : x2= 10  : max= 10"
> res
[1] 10

So Reduce() is probably not what you want to use here; there are many other ways, as shown in the other answers.

0
votes

This works for me!

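library(data.table)

# Folder containing the .csv files
WD <- "C:/Users/your_path_here/CSV Files/"
setwd(WD)

# Empty table with the expected column names
data <- data.table(read.csv(text = "CashFlow,Cusip,Period"))

# Only pick up .csv files, in case the folder holds anything else
csv.list <- list.files(WD, pattern = "\\.csv$", ignore.case = TRUE)
k <- 1

for (i in csv.list) {
  # File names are relative, so this relies on the setwd() call above
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))

  # Print the fraction of files processed after every 100 files
  if (k %% 100 == 0)
    print(k / length(csv.list))

  k <- k + 1
}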