I have been researching a way to efficiently extract information from large csv data sets using R. Many seem to recommend the package ff. I was successful in reading the data sets but am now running into problem trying to subset it.
The largest data set contains over 650,000 rows and 1005 columns. Not all columns contain the same data types. Viewed as a dataframe, the structure would look like this:
'data.frame': 5 obs. of 1005 variables:
$ SAMPLING_EVENT_ID : Factor w/ 5 levels "S6230404","S6252242",..: 2 1 3 4 5
$ LATITUDE : num 24.4 24.5 24.5 24.5 24.5
$ LONGITUDE : num -81.9 -81.9 -82 -82 -82
$ YEAR : int 2010 2010 2010 2010 2010
$ MONTH : int 4 3 10 10 10
$ DAY : int 97 88 299 298 300
$ TIME : num 9 10 10 11.58 9.58
$ COUNTRY : Factor w/ 1 level "United_States": 1 1 1 1 1
$ STATE_PROVINCE : Factor w/ 1 level "Florida": 1 1 1 1 1
$ COUNT_TYPE : Factor w/ 2 levels "P21","P22": 2 2 1 1 1
$ EFFORT_HRS : num 6 2 7 6 3.5
$ EFFORT_DISTANCE_KM : num 48.28 8.05 0 0 0
$ EFFORT_AREA_HA : int 0 0 0 0 0
$ OBSERVER_ID : Factor w/ 3 levels "obs132426","obs58643",..: 3 2 1 1 1
$ NUMBER_OBSERVERS : Factor w/ 2 levels "?","1": 2 1 2 2 2
$ Zenaida_macroura : int 0 0 1 0 0
All other variables being similar to this last one i.e. various species of bird.
Here is the code I used to “successfully: read the csv:
B2010 <- read.table.ffdf (x = NULL, “filePath&Name", nrows = -1, first.rows = 50000, next.rows = 50000)
Trying to learn about ffdf output, I entered command lines such as dim(B2010), str(B2010), ls(B2010), etc. dim(B2010) resulted in the appropriate number of rows but only one column (a string per record of the values separated by commas), and ls(B2010) outputted “[1] "physical" "row.names" "virtual" instead of the usual list of variables.
I not sure how to handle this type of output to be able to extract say STATE_PROVINCE == “California”? How do I tell B2010 what the variables are? I think I need to look at this differently but need some of your help to figure it out.
The ultimate goal for me is to subset a bunch of csv data sets (since I have one per year) and put the results back together as dataframe for various analysis.
Thanks, Joe