0
votes

I have been researching a way to efficiently extract information from large csv data sets using R. Many seem to recommend the package ff. I was successful in reading the data sets but am now running into problem trying to subset it.

The largest data set contains over 650,000 rows and 1005 columns. Not all columns contain the same data types. Viewed as a dataframe, the structure would look like this:

'data.frame':   5 obs. of  1005 variables:
 $ SAMPLING_EVENT_ID      : Factor w/ 5 levels "S6230404","S6252242",..: 2 1 3 4 5
 $ LATITUDE               : num  24.4 24.5 24.5 24.5 24.5
 $ LONGITUDE              : num  -81.9 -81.9 -82 -82 -82
 $ YEAR                   : int  2010 2010 2010 2010 2010
 $ MONTH                  : int  4 3 10 10 10
 $ DAY                    : int  97 88 299 298 300
 $ TIME                   : num  9 10 10 11.58 9.58
 $ COUNTRY                : Factor w/ 1 level "United_States": 1 1 1 1 1
 $ STATE_PROVINCE         : Factor w/ 1 level "Florida": 1 1 1 1 1
 $ COUNT_TYPE             : Factor w/ 2 levels "P21","P22": 2 2 1 1 1
 $ EFFORT_HRS             : num  6 2 7 6 3.5
 $ EFFORT_DISTANCE_KM     : num  48.28 8.05 0 0 0
 $ EFFORT_AREA_HA         : int  0 0 0 0 0
 $ OBSERVER_ID            : Factor w/ 3 levels "obs132426","obs58643",..: 3 2 1 1 1
 $ NUMBER_OBSERVERS       : Factor w/ 2 levels "?","1": 2 1 2 2 2
 $ Zenaida_macroura       : int  0 0 1 0 0

All other variables being similar to this last one i.e. various species of bird.

Here is the code I used to “successfully: read the csv:

B2010 <- read.table.ffdf (x = NULL, “filePath&Name", nrows = -1, first.rows = 50000, next.rows = 50000)

Trying to learn about ffdf output, I entered command lines such as dim(B2010), str(B2010), ls(B2010), etc. dim(B2010) resulted in the appropriate number of rows but only one column (a string per record of the values separated by commas), and ls(B2010) outputted “[1] "physical" "row.names" "virtual" instead of the usual list of variables.

I not sure how to handle this type of output to be able to extract say STATE_PROVINCE == “California”? How do I tell B2010 what the variables are? I think I need to look at this differently but need some of your help to figure it out.

The ultimate goal for me is to subset a bunch of csv data sets (since I have one per year) and put the results back together as dataframe for various analysis.

Thanks, Joe

3
I don't see any separator specified in the read.table call. Is read.table.ffdf different than read.table in assumptions regarding delimiters?IRTFM
Indeed, read.table is different from read.table.ffdf. While the first one accepts sep="," . . . it is not obvious where that information needs to be entered in the read.table.ffdf command (rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=ff:read.table.ffdf)JoeBird
I did look at subset.ff question. The question here is different in that I am not able to have variable names associated with the ddfd output. Once I have variable names, I should be able to use the subset.ff question/answer to subset my data.JoeBird
read.table.ffdf has arguments ..., the separator can be passed in the ... arguments. See ?read.table.ffdfuser1600826

3 Answers

3
votes

To subset an ffdf, use the ffbase package. As in

require(ffbase)
x <- subset(B2010, BB2010$STATE_PROVINCE == “California”)
2
votes

I finally found the solution to getting the ffdf variable names and types properly read and accessible for subsetting:

B2010 <- read.csv.ffdf (file = "filepath/name", colClasses = c("factor", "numeric", "numeric", "integer", "integer", "integer", "numeric", rep("factor",998)), first.rows = 10000, next.rows = 50000, nrows = -1)

This took forever to read but seemed to have worked i.e. I was able to create a subset of the data. Next step: to save the subset back to a "normal" dataframe and/or to a csv.

0
votes

According to the help page at ?read.table.ffdf, you should be using read.csv.ffdf(...). Then go to the page cited by Brandon.