How to turn variable names into factors in a data frame in R

Question

Say I have a data frame containing time-series data, where the first column is the index, and the remaining columns all contain different data streams, and are named descriptively, as in the following example:

temps = data.frame(matrix(1:20,nrow=2,ncol=10))
names(temps) <- c("flr1_dirN_areaA","flr1_dirS_areaA","flr1_dirN_areaB","flr1_dirS_areaB","flr2_dirN_areaA","flr2_dirS_areaA","flr2_dirN_areaB","flr2_dirS_areaB","flr3_dirN_areaA","flr3_dirS_areaA")
temps$Index <- as.Date(2013,7,1:2)

temps
  flr1_dirN_areaA flr1_dirS_areaA    ...       Index
1               1               3    ...  1975-07-15
2               2               4    ...  1975-07-16

Now I want to prep the data frame for plotting with ggplot2, and I want to include the three factors: flr, dir, and area.

I can achieve this for this simple example as follows:

temps.m <- melt(temps,"Index")
temps.m$flr <- factor(rep(1:3,c(8,8,4)))
temps.m$dir <- factor(rep(c("N","S"),each=2,len=20))
temps.m$area <- factor(rep(c("A","B"),each=4,len=20))
temps.m
        Index        variable value flr dir area
1  1975-07-15 flr1_dirN_areaA     1   1   N    A
2  1975-07-16 flr1_dirN_areaA     2   1   N    A
3  1975-07-15 flr1_dirS_areaA     3   1   S    A
4  1975-07-16 flr1_dirS_areaA     4   1   S    A
5  1975-07-15 flr1_dirN_areaB     5   1   N    B
6  1975-07-16 flr1_dirN_areaB     6   1   N    B
7  1975-07-15 flr1_dirS_areaB     7   1   S    B
8  1975-07-16 flr1_dirS_areaB     8   1   S    B
9  1975-07-15 flr2_dirN_areaA     9   2   N    A
10 1975-07-16 flr2_dirN_areaA    10   2   N    A
11 1975-07-15 flr2_dirS_areaA    11   2   S    A
12 1975-07-16 flr2_dirS_areaA    12   2   S    A
13 1975-07-15 flr2_dirN_areaB    13   2   N    B
14 1975-07-16 flr2_dirN_areaB    14   2   N    B
15 1975-07-15 flr2_dirS_areaB    15   2   S    B
16 1975-07-16 flr2_dirS_areaB    16   2   S    B
17 1975-07-15 flr3_dirN_areaA    17   3   N    A
18 1975-07-16 flr3_dirN_areaA    18   3   N    A
19 1975-07-15 flr3_dirS_areaA    19   3   S    A
20 1975-07-16 flr3_dirS_areaA    20   3   S    A

In reality, I have data streams (columns) of varying lengths. Each comes from its own file, has missing data, and has more than 3 factors encoded in the column (file) name, so this simple method of applying factors won't work. I need something more robust, and I'm inclined to parse the variable names into the different factors and use them to populate the factor columns of the melted data frame.

My end goal is to plot something like this:

ggplot(temps.m,aes(x=Index,y=value,color=area,linetype=dir))+geom_line()+facet_grid(flr~.)

[Figure: example of what I would like to plot, with multiple factors]

I imagine that the reshape, reshape2, plyr, or some other package can do this in one or two statements - but I struggle with melt/cast/ddply and the rest of them. Any suggestions?

Also, if you can suggest an entirely different [better] approach to structuring my data, I'm all ears.

Thanks in advance

I think you need to reduce your question to specific components and make them reproducible with minimal code. — geotheory
there are 8 lines of code up there, the first three provide an example of the data i have, the next 4 lead to a reformatted set of data (formatted in the structure that i need for plotting), and the last line creates a plot, to show what my end goal is. what specifically could i improve or clarify? i'm happy to edit if it will help — RyanStochastic
@RyanStochastic Mainly you have a string with a certain pattern and you want to extract/split it into 3 or more factors. That's it. So all the other information, like the plot context and the plot itself, is just confusing... — agstudy
@RyanStochastic Try to reformulate your question and focus on the string pattern (see comment above) and you will get a better solution. — agstudy
I added the plot and context to hopefully get a more generic solution, or suggestions on how to approach the problem differently, rather than a specific solution tailored to this specific, very simple data set. I understand your perspective, and thank you for your time and effort. — RyanStochastic

agstudy · Accepted Answer · 2013-07-18T00:52:37

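You can use a regular expression to create your factors. The pattern captures the floor number, the direction letter, and the area letter, and the replacement joins them with commas so that strsplit can separate them: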

res <- do.call(rbind,strsplit(gsub('flr([0-9]+).*dir([A-Z]).*area([A-Z])',
              '\\1,\\2,\\3',  
              temps.m$variable),
         ','))

    [,1] [,2] [,3]
 [1,] "1"  "N"  "A" 
 [2,] "1"  "N"  "A" 
 [3,] "1"  "S"  "A" 
 [4,] "1"  "S"  "A" 
 [5,] "1"  "N"  "B" 
 [6,] "1"  "N"  "B" 
 [7,] "1"  "S"  "B" 
 [8,] "1"  "S"  "B" 
 ........
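
One caveat, shown here as a minimal sketch: gsub returns a string unchanged when the pattern does not match, so a column name that does not follow the naming scheme would pass through silently. The example below uses hypothetical column names (the third one, "basement_temp", is invented for illustration) and checks for matches with grepl before substituting.

```r
# Hypothetical column names; the last one deliberately does not follow the pattern.
vars <- c("flr1_dirN_areaA", "flr2_dirS_areaB", "basement_temp")
pattern <- "flr([0-9]+).*dir([A-Z]).*area([A-Z])"

# grepl() flags the names the pattern can parse;
# gsub() alone would return the others unchanged.
ok <- grepl(pattern, vars)

# Apply the substitution only to the matching names.
parsed <- gsub(pattern, "\\1,\\2,\\3", vars[ok])

print(parsed)     # "1,N,A" "2,S,B"
print(vars[!ok])  # "basement_temp"
```

Names reported by the last line need either a different pattern or manual handling before the results are bound back to the melted data.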

You may need a further step to convert the columns to factors, for example with colwise from the plyr package:

library(plyr)  # provides colwise()
res <- colwise(as.factor)(data.frame(res))
  X1 X2 X3
1   1  N  A
2   1  N  A
3   1  S  A
4   1  S  A
........
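
If you would rather not depend on plyr, the same conversion can be sketched in base R. This is an alternative to colwise, not the method used above; the small matrix is a stand-in for res, and the column names flr, dir, and area are taken from the question.

```r
# Stand-in for the character matrix `res` built above (three example rows).
res <- rbind(c("1", "N", "A"),
             c("1", "S", "A"),
             c("2", "N", "B"))

# stringsAsFactors = TRUE turns each character column into a factor.
res <- as.data.frame(res, stringsAsFactors = TRUE)

# Descriptive names instead of the default V1, V2, V3.
names(res) <- c("flr", "dir", "area")

str(res)
```

Naming the columns here also means the plot code from the question can refer to flr, dir, and area directly after the columns are bound to the melted data.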

To combine the result with your melted data, you can use cbind:

 temps.m <- cbind(temps.m,res)
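
Since the real data has more than three factors in each name, here is a rough sketch of a more general variant. It assumes every name consists of the same number of tokens separated by "_", and that each token is a lowercase key followed by its value (for example "flr" + "1"). The helper name parse_names is hypothetical, not part of any package.

```r
# Hypothetical helper: one factor column per "_"-separated token.
# Assumes tokens of the form <lowercase key><value>, e.g. "flr1", "dirN".
parse_names <- function(x) {
  x <- as.character(x)                      # melt() stores the names as a factor
  tokens <- strsplit(x, "_", fixed = TRUE)  # list of character vectors
  # Stop early if some names have a different number of tokens.
  stopifnot(length(unique(lengths(tokens))) == 1)
  mat <- do.call(rbind, tokens)             # one row per name, one column per token
  # Keys come from the first row: "flr", "dir", "area".
  keys <- sub("^([[:lower:]]+).*$", "\\1", mat[1, ])
  # Values are whatever follows the key: "1", "N", "A".
  vals <- mat
  vals[] <- sub("^[[:lower:]]+", "", mat)
  out <- as.data.frame(vals, stringsAsFactors = TRUE)
  names(out) <- keys
  out
}

vars <- c("flr1_dirN_areaA", "flr1_dirS_areaB", "flr2_dirN_areaA")
parsed <- parse_names(vars)
print(parsed)
```

With the melted data from the question, this could be used as cbind(temps.m, parse_names(temps.m$variable)). The token-count check is there because names with a different layout would otherwise be recycled by rbind into misaligned rows.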
