1
votes

I have a basic understanding of R that mostly entails the ability to run regressions and summary statistics, so if there appear any gaps in my knowledge I would appreciate being pointed in the correct direction.

I have time series data in CSV that is formatted as follows:

Facility ID, Utility Type, Account No, Unit Name, Date 1, Date 2, Date 3, Date 4

There will be multiple rows for a specific account number referencing a unique utility type and facility (i.e., one row entry for Unit Name = L, one row entry for Unit Name = USD). The account number values for a particular unit at every date are entered in each "date" column. I would like to be able to write a script that enables me to re-export the data where each Date column doesn't contain entries for multiple units. I would also like to then designate to R that the Date columns represent monthly time series data points, and from there do various time series analysis.

I appreciate your help in telling me how to clean up this data.

As requested, sample data:

Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Unit Name, 7/1/14, 8/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, USD, 42333, 41775
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, ton-hr, 244278, 238035
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, USD, 4860, 5890
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, M^3, 7639, 8895

Example output:

Facility ID, Facility Name, State, Utility Type, Supplier, Account No., Quantity Consumed, Unit of Measure, Utility Bill, Currency, Date
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 244278, ton-hr, 42333, USD, 7/1/14
4015, Palm Court Apts, CA, Chilled Water, PG&E, 87993, 238035, ton-hr, 41775, USD, 8/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 7639, M^3, 4860, USD, 7/1/14
4044, 18 Sawtelle, CA, Natural Gas, Chevron, 17965, 8895, M^3, 5890, USD, 8/1/14
1
You should try help(unstack) or , look into the reshape2 package for some ideas of how to proceed. - vpipkt
Can you please show some sample data in these columns? And how you would like the output to look like? (So that we have a reproducible example) - slhck
Thank you @slhck, I added a few rows of data - milesaway
@slhck, the output depends on what you think would be the best for time series data. Column headers of 7/1/14-USD, 7/1/14-ton-hr would work; but then I would not have a column header that is a dedicated date value. - milesaway
The "238,035" is an error, right? There's one column too many :) - slhck

1 Answers

0
votes
library(reshape2)
d = read.csv("data.csv")
d.molten = melt(d, 
  id.vars=c("Facility.ID", "Facility.Name", "State", "Utility.Type", "Supplier", "Account.No.", "Unit.Name"), 
  variable.name = "Date"
)

The melt function breaks up a "wide" format (with an undefined numbers of columns) to a "long" format, where each row is an observation. This is actually the preferred format for most things you'd do in R, at least when using packages from the "Hadleyverse". Especially for time series.

But we're not done yet. Now you have the following structure:

Facility.ID    Facility.Name …  Date  value
       4015  Palm Court Apts X7.1.14  42333

We have to fix the dates that are currently just "strings". They had an "X" prepended since column names cannot start with a number, and cannot contain spaces.

d.molten$Date=as.Date(d.molten$Date, "X%m.%d.%y")

Now your dates will look correct, and you have one row for each observation:

Facility.ID    Facility.Name …     Date  value
       4015  Palm Court Apts 2014-07-01  42333

And now we can easily plot time series:

library(ggplot2)
ggplot(d.molten, 
  aes(x = Date, y = value, color = Facility.Name)) + 
  geom_point()