I'm working with a large data frame and I'd like to perform a pivot table type of function on it. I've been trying to use the reshape2 package but for some reason my molten data frame is not reshaping.
I would like to take a frame like this:
County Industry Type Variable Value
LA Plumbing Tax Rev 1000
LA Plumbing No tax Emp 100
LA Plumbing Tax Pay 500
And make it into (aggregating over Type):
Plumbing Tailors
County Rev Emp Pay Rev Emp Pay
LA 1000 100 500 1000 50 65
I'm running the following code:
dcast(m.data, county ~ variable + industry)
But it's not changing the dataframe at all. Where am I messing up?
EDIT:
I'm including a bit more info about this issue. Before getting to the molten data frame, I'm doing some really bad clean up to get the data to where I want it. I know the code below is not ideal and really should be fixed, but I'm basically uploading multiple CSV files (with the same column names), joining them, recoding some values, dropping some columns, selecting a subset of the data, turning it into a molten frame, and then trying to use dcast to reshape it. I removed the code where I recode certain values but that part seems to work okay. Here's some of the code:
data1 <- read.table("census_data_r_1.csv",header=TRUE,sep=",",stringsAsFactors=FALSE)
data2 <- read.table("census_data_r_2.csv",header=TRUE,sep=",", stringsAsFactors=FALSE)
fulldata <- rbind(data1,data2)
delete <- c("GEO.id","GEO.id2","NAICS.id","OPTAX.id","YEAR.id")
fulldata <- fulldata[, !(names(fulldata) %in% delete)]
colnames(fulldata) <- c("county","industry","tax_type","firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")
fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")] <- recode.variables(fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")],"'N' -> 'Nothing';'D' -> 'Withheld';'b' -> 20;'c' -> 100;'e' -> 250;'a' -> 10;'g' -> 1000;'f' -> 500;'Q' -> 'No Rev Collected';'h' -> 2500;'i' -> 5000;'j' -> 10000;'l' -> 50000;'k' -> 25000;'S' -> 'Bad Data';'m' -> 100000;")
fulldata.sub <- subset(fulldata, subset = (tax_type %in% c('Total', 'All establishments')) & (!(revenue %in% c('Nothing', 'Withheld','No Rev Collected'))) & (!(non_emp_firms %in% c('Nothing','Withheld'))))
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)
Now I'm getting the following error:
Error in structure(ordered, dim = ns) : dims [product 18300] do not match the length of object [0]
Output from dput(head(fulldata.sub,40))
:
structure(list(county = c("Autauga County, Alabama", "Autauga County, Alabama",
"Autauga County, Alabama", "Autauga County, Alabama", "Autauga County, Alabama",
"Autauga County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama",
"Barbour County, Alabama", "Barbour County, Alabama"), industry = c("Rental and leasing services",
"Professional, scientific, and technical services", "Professional, scientific, and technical services",
"Accounting, tax preparation, bookkeeping, and payroll services",
"Accounting, tax preparation, bookkeeping, and payroll services",
"Architectural, engineering, and related services", "Real estate and rental and leasing",
"Real estate", "Lessors of real estate", "Offices of real estate agents and brokers",
"Offices of real estate agents and brokers", "Activities related to real estate",
"Real estate property managers", "Offices of real estate appraisers",
"Consumer goods rental", "Accounting, tax preparation, bookkeeping, and payroll services",
"Accounting, tax preparation, bookkeeping, and payroll services",
"Offices of certified public accountants", "Tax preparation services",
"Architectural, engineering, and related services", "Architectural services",
"Engineering services", "Specialized design services", "Computer systems design and related services",
"Computer systems design and related services", "Management, scientific, and technical consulting services",
"Advertising, public relations, and related services", "Veterinary services",
"Administrative and support and waste management and remediation services",
"Administrative and support services", "Employment services",
"Business support services", "Investigation and security services",
"Services to buildings and dwellings", "Exterminating and pest control services",
"Janitorial services", "Landscaping services", "Waste management and remediation services",
"Lessors of real estate", "Legal services"), tax_type = c("Total",
"All establishments", "All establishments", "All establishments",
"All establishments", "All establishments", "Total", "Total",
"Total", "Total", "Total", "Total", "Total", "Total", "Total",
"All establishments", "All establishments", "All establishments",
"All establishments", "All establishments", "All establishments",
"All establishments", "All establishments", "All establishments",
"All establishments", "All establishments", "All establishments",
"All establishments", "Total", "Total", "Total", "Total", "Total",
"Total", "Total", "Total", "Total", "Total", "Total", "All establishments"
), firms = c("10", "61", "61", "14", "14", "10", "358", "312",
"77", "161", "161", "74", "52", "16", "28", "79", "79", "36",
"20", "77", "13", "37", "19", "27", "27", "63", "17", "26", "250",
"238", "26", "14", "17", "157", "16", "29", "96", "12", "11",
"19"), revenue = c("8433", "42285", "42285", "8581", "8581",
"5571", "266692", "201777", "59742", "104768", "104768", "37267",
"32141", "4615", "20691", "33203", "33203", "19805", "3160",
"39318", "10494", "21167", "6833", "12391", "12391", "21496",
"11097", "18388", "163661", "145935", "30746", "4048", "13849",
"77076", "9934", "15832", "47411", "17726", "1585", "6439"),
payroll = c("1641", "15473", "15473", "3506", "3506", "2229",
"59476", "47937", "4053", "30180", "30180", "13704", "11902",
"1674", "4854", "17298", "17298", "9718", "1122", "15263",
"3688", "8649", "908", "4429", "4429", "7335", "2634", "6073",
"67526", "62354", "19529", "1002", "6824", "27688", "3181",
"8632", "14434", "5172", "265", "1431"), num_employees = c("56",
"386", "386", "127", "127", "41", "1987", "1643", "160",
"1030", "1030", "453", "406", "42", "217", "491", "491",
"217", "138", "356", "69", "204", "45", "111", "111", "165",
"101", "282", "2807", "2686", "806", "53", "399", "1241",
"110", "399", "675", "121", "23", "36"), non_emp_firms = c("8",
"330", "330", "49", "49", "35", "2358", "2289", "648", "840",
"840", "801", "186", "32", "19", "208", "208", "20", "40",
"203", "21", "74", "107", "99", "99", "356", "82", "10",
"1452", "1435", "25", "153", "61", "982", "12", "526", "350",
"17", "40", "16"), non_emp_firms_rev = c("882", "10111",
"10111", "493", "493", "1280", "164778", "160968", "55888",
"33321", "33321", "71759", "25870", "1504", "692", "2961",
"2961", "533", "466", "9220", "889", "5387", "4448", "3235",
"3235", "14395", "10337", "602", "35998", "33953", "708",
"3991", "806", "18726", "329", "6246", "9974", "2045", "1978",
"488")), .Names = c("county", "industry", "tax_type", "firms",
"revenue", "payroll", "num_employees", "non_emp_firms", "non_emp_firms_rev"
), row.names = c(6L, 7L, 9L, 19L, 21L, 25L, 54L, 55L, 56L, 65L,
66L, 70L, 71L, 74L, 77L, 99L, 101L, 103L, 105L, 109L, 111L, 115L,
119L, 125L, 127L, 131L, 139L, 143L, 147L, 148L, 152L, 155L, 159L,
162L, 163L, 165L, 167L, 169L, 174L, 180L), class = "data.frame")
EDIT
One more edit, including the output from >str(fulldata.sub) and str(m.data)
data.frame': 130098 obs. of 9 variables:
$ county : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
$ industry : Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
$ tax_type : Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
$ firms : num 10 61 61 14 14 10 4 4 358 312 ...
$ revenue : num 31466 21347 21347 31717 31717 ...
$ payroll : num 5521 4863 4863 13729 13729 ...
$ num_employees : num 4664 3625 3625 571 571 ...
$ non_emp_firms : num 3122 1887 1887 2486 2486 ...
$ non_emp_firms_rev: num 17550 96 96 12669 12669 ...
'data.frame': 780588 obs. of 5 variables:
$ county : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
$ industry: Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
$ tax_type: Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
$ variable: Factor w/ 6 levels "firms","revenue",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 10 61 61 14 14 10 4 4 358 312 ...
dput(head(fulldata.sub,40))
ordput(head(m.data,40))
? – Brian Diggs