Finding correlation between two data sets in R

Question

UPDATED DATA SET 2 and 1 STRUCTURE: Sorry for this sudden update. I have two data sets. The structure of my first data set is (when using print(matr1) in R):

        month_year  income
 [1,]  "Jan 2000"  "30000"
 [2,]  "Feb 2000"  "12364"
 [3,]  "Mar 2000"  "37485"
 [4,]  "Apr 2000"  "2000"
 [5,]  "Jun 2000"  "7573"
          .     .      .
          .     .      .

Thus the first data set has one income value for each month of each year.

The structure of my second data set is (when using print(matr2) in R):

     month_year     value
 [1,] "Jan 2000" "84737476"
 [2,] "Jan 2000" "39450334"
 [3,] "Jan 2000" "48384943"
 [4,] "Feb 2000" "12345678"
 [5,] "Feb 2000" "49595340"

          .     .      .
          .     .      .

So in this second data set I have n(say 100 but not constant all the time) number of values for each month of each year.

Both the data sets have values month wise for many subsequent years(like for all the months for year 2000, 2001 and so on). Now I want to find the correlation between these two data sets but month-wise and not as whole. When I use the R command cor(as.numeric(matr1[,"income"]),as.numeric(matr2[,"value"])) then I get the overall correlation but I want the correlation per month rather than as a whole. I want the correlation something like this:

                  Jan | Feb | Mar | Apr | May | .....
Correlation        x  |  y  |  z  |  p  |  q  | .....

The issue that I have is:

How do I get the correlation value per month instead of the overall correlation?

NOTE: I am not sure whether I should have posted this question here or on Cross Validated. I had posted a question for this data set only regarding an error in getting correlation and it was migrated from there to here. So please forgive if I am posting this on wrong place.

UPDATE1: After some suggestion I have modified this post to point in the correct dimension. First of all, the data sets as of now are in matrix format and hence the quotes. I can convert it to data.frame as suggested by some comments but right now I have been calculating the correlation by converting the columns using as.numeric.

I see you're using a matrix. data.frame or ts object would be more appropriate for these sort of matters. — Roman Luštrik
Does each of your data sets have one income/value for each month/year? Then how do you expect to compute a correlation? Apart from that statistical query, your question needs a lot of other help, such as why is everything in quotes (A: you haven't read your data in properly) and how do I convert things to dates (A: see about a hundred other SO questions and/or install the "lubridate" package). — Spacedman
@Spacedman, everything in quotes because he has a matrix, and matrix can't have both character and numeric classes at the same time, so the coercion hierarchy was implemented here by default I guess. — David Arenburg
@Spacedman I have done some modification to above post. I understand that by having only one value of month in each data set we cannot get the correlation. I have values for month for each year in both data sets(for April in 2000 as well as 2001 and so on in each data set and for each month in that way) — Jason Donnald

akrun akrun · Accepted Answer · 2014-08-10T08:44:07

May be you can try:

dat1 <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 
2001L, 2001L, 2001L, 2001L), month = c(1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L), income = c(30000L, 12364L, 37485L, 2000L, 7573L, 
25000L, 14364L, 38485L, 4000L, 7873L)), .Names = c("year", "month", 
"income"), class = "data.frame", row.names = c(NA, -10L))

dat2 <- structure(list(month_year = c("Jan 2000", "Feb 2000", "Mar 2000", 
"Apr 2000", "May 2000", "Jan 2001", "Feb 2001", "Mar 2001", "Apr 2001", 
"May 2001"), value = c(84737476L, 39450334L, 48384943L, 12345678L, 
49595340L, 84337476L, 34450334L, 48984943L, 124545678L, 49525340L
)), .Names = c("month_year", "value"), class = "data.frame", row.names = c(NA, 
-10L))



 dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
 dat1$month <- gsub(" \\d+","", dat1$month_year)
 dat2$month <- gsub(" \\d+","", dat2$month_year)
 dat1$indx <- with(dat1, ave(month, month, FUN=seq_along))
 dat2$indx <- with(dat2, ave(month, month, FUN=seq_along))
 dat1 <- dat1[,c(2,3,5)]
 dat2 <- dat2[,c(3,2,4)]
 colnames(dat2)[2] <- "income"

 library(reshape2)

 dat2C <- dcast(dat2, indx~month, value.var="income")
 dat1C <- dcast(dat1, indx~month, value.var="income")
 m1 <- as.matrix(dat1C[,-1])
 m2 <- as.matrix(dat2C[,-1])
 cor(m1,m2)
  diag(cor(m1,m2))
 # Apr Feb Jan Mar May 
  #1   -1   1   1  -1

Also, if you can merge the two datasets together, this could be done using data.table. Using the dput() data above

 library(data.table)
 dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
 dat1 <- dat1[,c(4,3)]
 setDT(dat1)
 setDT(dat2)
 setkey(dat2, month_year)

 dat2[dat1, income := i.income]
 dat2[,month:= gsub(" \\d+", "", month_year)][,cor(value, income), by=month] 
 #    month V1
 #1:   Apr  1
 #2:   Feb -1
 #3:   Jan  1
 #4:   Mar  1
 #5:   May -1

Update

dat1 <- structure(list(month_year = structure(c(5L, 3L, 8L, 1L, 7L, 6L, 
4L, 9L, 2L), .Label = c("Apr 2000", "Apr 2001", "Feb 2000", "Feb 2001", 
"Jan 2000", "Jan 2001", "Jun 2000", "Mar 2000", "Mar 2001"), class = "factor"), 
income = c(30000, 12364, 37485, 2000, 7573, 42000, 15764, 
38465, 5000)), .Names = c("month_year", "income"), row.names = c(NA, 
-9L), class = "data.frame")


 dat2 <-  structure(list(month_year = structure(c(5L, 5L, 5L, 3L, 3L, 7L, 
 7L, 7L, 1L, 1L, 6L, 6L, 4L, 4L, 8L, 8L, 2L, 2L, 2L, 2L), .Label = c("Apr 2000", 
 "Apr 2001", "Feb 2000", "Feb 2001", "Jan 2000", "Jan 2001", "Mar 2000", 
 "Mar 2001"), class = "factor"), value = c(84737476, 39450334, 
 48384973, 12345678, 49595340, 4534353, 43353325, 84333535, 35343232, 
 4334353, 3434353, 5355322, 5223345, 4523535, 345353, 32235, 423553, 
 233553, 423535, 884455)), .Names = c("month_year", "value"), row.names = c(NA, 
 -20L), class = "data.frame")


 datN <- merge(dat1, dat2, all=T)
 library(data.table)
 DT <- data.table(datN)
 DT[, month:= gsub(" \\d+", "", month_year)][,cor(value, income),by=month]
 #   month         V1
 #1:   Apr -0.7136049
 #2:   Feb -0.7037676
 #3:   Jan -0.8637808
 #4:   Jun         NA
 #5:   Mar -0.6484684

Finding correlation between two data sets in R

2 Answers

Update