1
votes

I'm working with a large data frame and I'd like to perform a pivot table type of function on it. I've been trying to use the reshape2 package but for some reason my molten data frame is not reshaping.

I would like to take a frame like this:

County      Industry   Type    Variable   Value 
LA          Plumbing     Tax       Rev       1000 
LA          Plumbing     No tax    Emp       100 
LA          Plumbing     Tax       Pay       500

And make it into (aggregating over Type):

        Plumbing       Tailors
County  Rev   Emp  Pay Rev   Emp  Pay
LA      1000  100  500 1000  50   65

I'm running the following code:

dcast(m.data, county ~ variable + industry)

But it's not changing the dataframe at all. Where am I messing up?

EDIT:

I'm including a bit more info about this issue. Before getting to the molten data frame, I'm doing some really bad clean up to get the data to where I want it. I know the code below is not ideal and really should be fixed, but I'm basically uploading multiple CSV files (with the same column names), joining them, recoding some values, dropping some columns, selecting a subset of the data, turning it into a molten frame, and then trying to use dcast to reshape it. I removed the code where I recode certain values but that part seems to work okay. Here's some of the code:

data1 <- read.table("census_data_r_1.csv",header=TRUE,sep=",",stringsAsFactors=FALSE) 
data2 <- read.table("census_data_r_2.csv",header=TRUE,sep=",", stringsAsFactors=FALSE)
fulldata <- rbind(data1,data2)
delete <- c("GEO.id","GEO.id2","NAICS.id","OPTAX.id","YEAR.id")
fulldata <- fulldata[, !(names(fulldata) %in% delete)]
colnames(fulldata) <- c("county","industry","tax_type","firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")
fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")] <- recode.variables(fulldata[c("firms","revenue","payroll","num_employees","non_emp_firms","non_emp_firms_rev")],"'N' -> 'Nothing';'D' -> 'Withheld';'b' -> 20;'c' -> 100;'e' -> 250;'a' -> 10;'g' -> 1000;'f' -> 500;'Q' -> 'No Rev Collected';'h' -> 2500;'i' -> 5000;'j' -> 10000;'l' -> 50000;'k' -> 25000;'S' -> 'Bad Data';'m' -> 100000;")
fulldata.sub <- subset(fulldata, subset = (tax_type %in% c('Total', 'All establishments')) & (!(revenue %in% c('Nothing', 'Withheld','No Rev Collected'))) & (!(non_emp_firms %in% c('Nothing','Withheld'))))
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)

Now I'm getting the following error:

Error in structure(ordered, dim = ns) : dims [product 18300] do not match the length of object [0]

Output from dput(head(fulldata.sub,40)):

structure(list(county = c("Autauga County, Alabama", "Autauga County, Alabama", 
"Autauga County, Alabama", "Autauga County, Alabama", "Autauga County, Alabama", 
"Autauga County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Baldwin County, Alabama", "Baldwin County, Alabama", "Baldwin County, Alabama", 
"Barbour County, Alabama", "Barbour County, Alabama"), industry = c("Rental and leasing services", 
"Professional, scientific, and technical services", "Professional, scientific, and technical services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Architectural, engineering, and related services", "Real estate and rental and leasing", 
"Real estate", "Lessors of real estate", "Offices of real estate agents and brokers", 
"Offices of real estate agents and brokers", "Activities related to real estate", 
"Real estate property managers", "Offices of real estate appraisers", 
"Consumer goods rental", "Accounting, tax preparation, bookkeeping, and payroll services", 
"Accounting, tax preparation, bookkeeping, and payroll services", 
"Offices of certified public accountants", "Tax preparation services", 
"Architectural, engineering, and related services", "Architectural services", 
"Engineering services", "Specialized design services", "Computer systems design and related services", 
"Computer systems design and related services", "Management, scientific, and technical consulting services", 
"Advertising, public relations, and related services", "Veterinary services", 
"Administrative and support and waste management and remediation services", 
"Administrative and support services", "Employment services", 
"Business support services", "Investigation and security services", 
"Services to buildings and dwellings", "Exterminating and pest control services", 
"Janitorial services", "Landscaping services", "Waste management and remediation services", 
"Lessors of real estate", "Legal services"), tax_type = c("Total", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "Total", "Total", 
"Total", "Total", "Total", "Total", "Total", "Total", "Total", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "All establishments", "All establishments", 
"All establishments", "Total", "Total", "Total", "Total", "Total", 
"Total", "Total", "Total", "Total", "Total", "Total", "All establishments"
), firms = c("10", "61", "61", "14", "14", "10", "358", "312", 
"77", "161", "161", "74", "52", "16", "28", "79", "79", "36", 
"20", "77", "13", "37", "19", "27", "27", "63", "17", "26", "250", 
"238", "26", "14", "17", "157", "16", "29", "96", "12", "11", 
"19"), revenue = c("8433", "42285", "42285", "8581", "8581", 
"5571", "266692", "201777", "59742", "104768", "104768", "37267", 
"32141", "4615", "20691", "33203", "33203", "19805", "3160", 
"39318", "10494", "21167", "6833", "12391", "12391", "21496", 
"11097", "18388", "163661", "145935", "30746", "4048", "13849", 
"77076", "9934", "15832", "47411", "17726", "1585", "6439"), 
    payroll = c("1641", "15473", "15473", "3506", "3506", "2229", 
    "59476", "47937", "4053", "30180", "30180", "13704", "11902", 
    "1674", "4854", "17298", "17298", "9718", "1122", "15263", 
    "3688", "8649", "908", "4429", "4429", "7335", "2634", "6073", 
    "67526", "62354", "19529", "1002", "6824", "27688", "3181", 
    "8632", "14434", "5172", "265", "1431"), num_employees = c("56", 
    "386", "386", "127", "127", "41", "1987", "1643", "160", 
    "1030", "1030", "453", "406", "42", "217", "491", "491", 
    "217", "138", "356", "69", "204", "45", "111", "111", "165", 
    "101", "282", "2807", "2686", "806", "53", "399", "1241", 
    "110", "399", "675", "121", "23", "36"), non_emp_firms = c("8", 
    "330", "330", "49", "49", "35", "2358", "2289", "648", "840", 
    "840", "801", "186", "32", "19", "208", "208", "20", "40", 
    "203", "21", "74", "107", "99", "99", "356", "82", "10", 
    "1452", "1435", "25", "153", "61", "982", "12", "526", "350", 
    "17", "40", "16"), non_emp_firms_rev = c("882", "10111", 
    "10111", "493", "493", "1280", "164778", "160968", "55888", 
    "33321", "33321", "71759", "25870", "1504", "692", "2961", 
    "2961", "533", "466", "9220", "889", "5387", "4448", "3235", 
    "3235", "14395", "10337", "602", "35998", "33953", "708", 
    "3991", "806", "18726", "329", "6246", "9974", "2045", "1978", 
    "488")), .Names = c("county", "industry", "tax_type", "firms", 
"revenue", "payroll", "num_employees", "non_emp_firms", "non_emp_firms_rev"
), row.names = c(6L, 7L, 9L, 19L, 21L, 25L, 54L, 55L, 56L, 65L, 
66L, 70L, 71L, 74L, 77L, 99L, 101L, 103L, 105L, 109L, 111L, 115L, 
119L, 125L, 127L, 131L, 139L, 143L, 147L, 148L, 152L, 155L, 159L, 
162L, 163L, 165L, 167L, 169L, 174L, 180L), class = "data.frame")

EDIT

One more edit, including the output from >str(fulldata.sub) and str(m.data)

data.frame': 130098 obs. of 9 variables:
$ county : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
$ industry : Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
$ tax_type : Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
$ firms : num 10 61 61 14 14 10 4 4 358 312 ...
$ revenue : num 31466 21347 21347 31717 31717 ...
$ payroll : num 5521 4863 4863 13729 13729 ...
$ num_employees : num 4664 3625 3625 571 571 ...
$ non_emp_firms : num 3122 1887 1887 2486 2486 ...
$ non_emp_firms_rev: num 17550 96 96 12669 12669 ...
'data.frame': 780588 obs. of 5 variables:
$ county : Factor w/ 3237 levels "Abbeville County, South Carolina",..: 121 121 121 121 121 121 121 121 131 131 ...
$ industry: Factor w/ 369 levels "Accounting, tax preparation, bookkeeping, and payroll services",..: 283 239 239 1 1 33 358 358 274 273 ...
$ tax_type: Factor w/ 4 levels "All establishments",..: 4 1 1 1 1 1 1 1 4 4 ...
$ variable: Factor w/ 6 levels "firms","revenue",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 10 61 61 14 14 10 4 4 358 312 ...

1
Can you give the output of either dput(head(fulldata.sub,40)) or dput(head(m.data,40))?Brian Diggs
another update. i realized the value column was a factor, so i converted that to a numeric type by introducing the following bit of code: m.data$value <- as.numeric(as.character(m.data$value))jlatif
updating original question with output from dput(head(fulldata.sub,40))jlatif
Personally, I don't engage in questions like this where I answer the original question, and then the OP replies by entirely changing the question (I have discovered that doing so is extremely deleterious to my mental health), so I removed my answer, and won't be participating on this question any further. Others may feel differently, though, so if you're patient perhaps someone else will stop by.joran
well, i changed the question because i found more information about my request. its really the same question still - i just found a more specific error code and cleaned it up a bit so others could ready it more easily. thanks anyways, i think it's now more informative and a better question to ask. also, other info was added as people leaving comments asked for additional output. also, your answer didn't work just a fyi.jlatif

1 Answers

2
votes

If you look at str(fulldata.sub) you'll see that your numbers in columns 4 through 9 are being treated as character. So, when you melt() the character strings are converted to factors. Then you are trying to run a sum() evaluation on a variable of type factor - which does not compute.

You can just convert to numeric, as follows:

...    
fulldata.sub[4:9] <- sapply(fulldata.sub[4:9],as.numeric)
# Then run your melt/cast sequence
m.data <- melt(fulldata.sub, id.vars = 1:3)
dcast(m.data, county ~ variable, sum)

Or, fix the data import. This could be happening because there are character strings like "", "-" ,"," ,"n/a" ,"n.a" ,or " ", etc. Using read.csv you can fix this by setting the na.strings=c("erroneous_string","other_erroneous_string",...) argument.