I am in the process of reorganizing a large weather dataset. I am trying to attach a repeated character string to each element of a vector of column names, so that every name in a block carries the same suffix.

For example, imagine a table containing monthly temperature and precipitation (nedbor) data over time, in two separate cities (K and S). It is currently structured such that each row represents a year ranging from 2000 to 2015 and there is a column for each weather variable for each month. This makes for a very wide table (which I want).

The problem is that the dataframe was constructed from 12 .csv files, each named after the month of the data it represents, as well as two separate vectors that describe a different variable across years (NAO). The output table from

> Weather<-data.frame(Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,NAO,NAOPrevYr)

yields a table with 16 rows (one for each year 2000-2015) and 170 columns structured so that these columns:

(Year, Month, S.HighTemp, S.LowTemp, S.MeanTemp, S.Nedbor, S.Nedbordage, K.Year, K.Month, K.HighTemp, K.LowTemp, K.MeanTemp,K.Nedbor,K.Nedbordage)

are associated with each month (14*12=168) and two additional columns (NAO and NAOPrevYr) sit at the end. Entries in the Month column are obviously repeated for the entirety of their respective month. However, because each source file contains the same column names, the column names in the Weather dataframe are followed by ".1" for the February segment of columns, ".2" for March, etc.

I want to rename the columns so that the generic descriptor (e.g., "S.HighTemp") is followed by a period and then the month with which it is associated. The desired output is still a table with 16 rows and 170 columns, except that rather than the August section of columns reading

(Year.7, Month.7, S.HighTemp.7, S.LowTemp.7, S.MeanTemp.7, S.Nedbor.7, S.Nedbordage.7, K.Year.7, K.Month.7, K.HighTemp.7, K.LowTemp.7, K.MeanTemp.7,K.Nedbor.7,K.Nedbordage.7)

I want it to read

(Year.Aug, Month.Aug, S.HighTemp.Aug, S.LowTemp.Aug, S.MeanTemp.Aug, S.Nedbor.Aug, S.Nedbordage.Aug, K.Year.Aug, K.Month.Aug, K.HighTemp.Aug, K.LowTemp.Aug, K.MeanTemp.Aug,K.Nedbor.Aug,K.Nedbordage.Aug)

and similarly for each of the other 14-variable monthly blocks.

What I tried:

names(Weather)<-c(c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                    "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                    "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                    "K.Nedbordage")+c(rep(".Jan",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Feb",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Mar",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Apr",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".May",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Jun",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Jul",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Aug",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Sep",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Oct",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Nov",times=14)),
                    c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
                      "S.Nedbor","S.Nedbordage","K.Year","K.Month",
                      "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
                      "K.Nedbordage")+c(rep(".Dec",times=14)),
                  NAO, NAOPrevYr)

Unfortunately this gives me an error indicating I'm trying to apply the non-numeric argument to a binary operator. I'm assuming this is because I combined a "+" with vectors of character strings.

I searched for information related to merging character strings. The related material I found online is largely too linear in its design for what I'm trying to do.

For example,

"R Programming: Automating Merge of Character Strings" adds character strings together into a vector of strings. But I want to merge strings across vectors, almost like taking two adjacent columns of variables and months and removing the cell divider between them (the list would then be in a top-to-bottom order).
"Merging vectors of strings in a list in R" is really just a rearrangement of entries in a vector. And
"How to merge vectors into a list in R?" still claims to be merging vectors but really seems to just be appending vectors.

Basically I'm pretty new to this and still figuring the whole R thing out. If you have any ideas for what more I can look up please let me know. There has got to be a better way of doing this...

1 Answer

Indeed, when you want to combine character strings you should not use the + operator (which is for numeric data). Instead, you can use the paste function (type ?paste within R for more information).

Here is an example:

# The first part of your column names
base_names = c("Year","Month","S.HighTemp","S.LowTemp","S.MeanTemp",
    "S.Nedbor","S.Nedbordage","K.Year","K.Month",
    "K.HighTemp","K.LowTemp","K.MeanTemp","K.Nedbor",
    "K.Nedbordage")

# Paste a month
paste0(base_names, ".Jan")

This returns a vector like so:

[1] "Year.Jan"         "Month.Jan"        "S.HighTemp.Jan"   "S.LowTemp.Jan"    "S.MeanTemp.Jan"   "S.Nedbor.Jan"     "S.Nedbordage.Jan"
 [8] "K.Year.Jan"       "K.Month.Jan"      "K.HighTemp.Jan"   "K.LowTemp.Jan"    "K.MeanTemp.Jan"   "K.Nedbor.Jan"     "K.Nedbordage.Jan"

To do all your months, you don't necessarily need to build the names vector by "hand" (like you tried in your example). You can automate it somehow. Here are some different solutions.

# Create a vector with months
months = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

1) Using a for loop

# Create an empty vector to store the new column names
new_names = c()

# Paste each month to the base_names and add it to the new_names vector
for(month in months){
    new_names = c(new_names, paste0(base_names, ".", month))
}

2) Using sapply function

# This creates a matrix with each base_name and month pasted together
new_names = sapply(months, function(month, base_names){
    paste0(base_names, ".", month)
}, base_names = base_names)

# Convert the result to a vector
new_names = as.vector(new_names)

3) Using expand.grid

# This creates a table with all combinations of base_names and months
new_names = expand.grid(base_names, months)

# Paste the two columns together to return a vector
new_names = paste0(new_names[,1], ".", new_names[,2])
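Whichever option you pick, the result still has to be assigned to the data frame, together with the names of the two NAO columns. Here is a minimal sketch, assuming the column order described in the question (12 monthly blocks of 14 columns, then NAO and NAOPrevYr). The data frame below is an empty stand-in built only so the example runs on its own, and month.abb is R's built-in vector of abbreviated English month names:

```r
base_names <- c("Year", "Month", "S.HighTemp", "S.LowTemp", "S.MeanTemp",
                "S.Nedbor", "S.Nedbordage", "K.Year", "K.Month",
                "K.HighTemp", "K.LowTemp", "K.MeanTemp", "K.Nedbor",
                "K.Nedbordage")

# Stand-in for the real Weather table: 16 rows, 170 columns (assumption)
Weather <- as.data.frame(matrix(NA_real_, nrow = 16, ncol = 170))

# Monthly names built with rep(): the 14 base names are repeated once per
# month, and each month is repeated 14 times in a row
new_names <- paste0(rep(base_names, times = length(month.abb)), ".",
                    rep(month.abb, each = length(base_names)))

# Append the two non-monthly column names and rename everything at once
names(Weather) <- c(new_names, "NAO", "NAOPrevYr")

names(Weather)[c(1, 99, 170)]
# [1] "Year.Jan"  "Year.Aug"  "NAOPrevYr"
```

Assigning all 170 names in one go means R will complain if the lengths do not match, which is a cheap check that no month was left out.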

EDIT:

to answer the OP's questions in the comments, I'm adding some (hopefully clear) explanations for how the above solutions work.

Question 1)

In the for loop the variable month is taking each of the values in the vector months, one at a time. So in each iteration of the loop the variable month will have a different value. Try it out by simply printing the variable month:

for(month in months){ print(month) }

You could also build an "iterator" variable, and then call the i-th element of the months vector. In this case I'm making a variable i that takes the values 1 to 12 (length of months). This approach works, but is unnecessary in your case:

# seq_along(months) gives 1, 2, ..., length(months)
for(i in seq_along(months)){
    print(months[i])
}

Question 2)

That is the nice thing about vector operations in R. Indeed, paste() will "recycle" a vector if it's shorter than the other vectors being pasted. To understand this, see what happens if you paste two vectors with the same length:

paste(c("A", "B", "C", "D", "E"), 1:5)
[1] "A 1" "B 2" "C 3" "D 4" "E 5"

And now vectors of different lengths:

paste(c("A", "B", "C", "D", "E"), 1:2)
[1] "A 1" "B 2" "C 1" "D 2" "E 1"

See how the values of the second vector were re-used until all elements of the first vector were finished. So, if you only use one value for the second vector, paste() will repeat that value as many times as needed:

paste(c("A", "B", "C", "D", "E"), 1)
[1] "A 1" "B 1" "C 1" "D 1" "E 1"
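The same recycling is what makes paste0(base_names, ".Jan") work in the first example: the single string is reused for every element of the longer vector. A small sketch that spells out the equivalence with rep(), using a shortened, purely illustrative vars vector rather than the full list of column names:

```r
vars <- c("S.HighTemp", "S.LowTemp", "S.MeanTemp")

# Implicit recycling: ".Jan" is reused for every element of vars
implicit <- paste0(vars, ".Jan")

# Explicit version: repeat ".Jan" by hand to the same length
explicit <- paste0(vars, rep(".Jan", times = length(vars)))

identical(implicit, explicit)
# [1] TRUE
```

So the rep(".Jan", times=14) in your original attempt was not wrong, just unnecessary once paste0() replaces the + operator.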

Question 3)

Essentially, the apply family of functions (apply(), lapply(), sapply(), ...) works a bit like a for loop, so the answer to this is similar to the answer to question 1). Basically, sapply() will iterate through each element of the months vector and pass it as the first argument of our function (which I've called month). Again, as in the for loop, you could have used indexes, but it was unnecessary in this case.

It's worth noting that the apply functions are often considered the idiomatic "R" way of writing loops. A for loop is not inherently slow, but one that grows a vector with c() on every iteration, as in solution 1, does more copying than necessary. With only 12 months the difference is negligible.
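If you stay with the apply family, one variant worth knowing is vapply(), which behaves like sapply() but checks that every result has the type and length you declare. A minimal sketch with a shortened, illustrative set of names and months (vars and mons are made up for the example):

```r
vars <- c("S.HighTemp", "S.LowTemp")
mons <- c("Jan", "Feb", "Mar")

# Each call must return a character vector of length(vars),
# otherwise vapply() stops with an error
name_matrix <- vapply(mons,
                      function(m) paste0(vars, ".", m),
                      FUN.VALUE = character(length(vars)))

# One column per month, so flattening keeps each month's block together
new_names <- as.vector(name_matrix)

head(new_names, 2)
# [1] "S.HighTemp.Jan" "S.LowTemp.Jan"
```

The declared FUN.VALUE is the design choice here: if one of the monthly blocks ever came back with the wrong number of names, the call would fail immediately instead of silently producing an oddly shaped result.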