2
votes

Im trying to produce a correlation plot for my data but i get 'x must be numeric error', other fixes have not worked for my case. Do i have to change the month to numeric as well? or is there a way of selecting only the numeric columns for my plot

Tried converting all to numeric but it just changes back to factor automatically

getwd()
myDF <- read.csv("qbase.csv")
head(myDF)
str(myDF)

cp <-cor(myDF)
head(round(cp,2))

'data.frame':   12 obs. of  8 variables:
 $ Month                    : Factor w/ 12 levels "18-Apr","18-Aug",..: 5 4 8 1 9 7 6 2 12 11 ...
 $ Monthly.Recurring.Revenue: Factor w/ 2 levels "$25,000 ","$40,000 ": 1 1 1 1 1 2 2 2 2 2 ...
 $ Price.per.Seat           : Factor w/ 2 levels "$40 ","$50 ": 2 2 2 2 2 1 1 1 1 1 ...
 $ Paid.Seats               : int  500 500 500 500 500 1000 1000 1000 1000 1000 ...
 $ Active.Users             : int  10 50 50 100 450 550 800 900 950 800 ...
 $ Support.Cases            : int  0 0 1 5 35 155 100 75 50 45 ...
 $ Users.Trained            : int  1 5 0 50 100 300 50 30 0 100 ...
 $ Features.Used            : int  5 5 5 5 8 9 9 10 15 15 ...

The results to dput(myDF) as are follows:

dput( myDF)

structure(list(Month = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L, 
2L, 12L, 11L, 10L, 3L), .Label = c("18-Apr", "18-Aug", "18-Dec", 
"18-Feb", "18-Jan", "18-Jul", "18-Jun", "18-Mar", "18-May", "18-Nov", 
"18-Oct", "18-Sep"), class = "factor"), Monthly.Recurring.Revenue = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("$25,000 ", 
"$40,000 "), class = "factor"), Price.per.Seat = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("$40 ", 
"$50 "), class = "factor"), Paid.Seats = c(500L, 500L, 500L, 
500L, 500L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L), 
    Active.Users = c(10L, 50L, 50L, 100L, 450L, 550L, 800L, 900L, 
    950L, 800L, 700L, 600L), Support.Cases = c(0L, 0L, 1L, 5L, 
    35L, 155L, 100L, 75L, 50L, 45L, 10L, 5L), Users.Trained = c(1L, 
    5L, 0L, 50L, 100L, 300L, 50L, 30L, 0L, 100L, 50L, 0L), Features.Used = c(5L, 
    5L, 5L, 5L, 8L, 9L, 9L, 10L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA, 
-12L))
4
You have multiple answers, several of which are claiming to remove an error because their code does not say "Error" when run. But how many of them are actually changing the values to be valid months? None f them I suspect. If you turn a factor value into numeric without a proper converstion to Date class then April and Augus will both be lower than January. Better answers might come if you posted the output of dput( myDF). You should do this with the original version. Not the one after the code below has been applied.IRTFM
@42- Thanks a lot for the tips, i'm completely new to this thats why my posting of comments and code is a bit off. i will repost the output of dput(myDF). In the end i intend to produce a correlation plot using corrplot(myDF, method="circle") and also run a multiple linear regression to see which variables affect active users the most. Any help with that will also be very welcome.NeverQuit101

4 Answers

1
votes

You can convert dates to POSIXct and also remove the dollar sign to convert the second and third columns to numeric:

myDF$Month <- as.numeric(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <-  sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))

cp <-cor(myDF)

library(ggcorrplot)
ggcorrplot(cp)
0
votes

You are trying to get a correlation between factors and numeric columns, wich can't happen (cor handles only numeric, hence the error). You can do:

library(data.table)

ir <- data.table(iris) # since you didn't produce a reproducible example

ir[, cor(.SD), .SDcols = names(ir)[(lapply(ir, class) == "numeric")]]

what is in there:

cor(.SD) will calculate the correlation matrix for a new dataframe composed of a subset data.table (.SD, see ?data.table).

.SDcols establish wich columns will go into that subset data.table. They are only those which class is numeric.

0
votes

You can remove the dollar sign and change the integer variables to numeric using sapply, then calculate the correlation.

myDF[,c(2,3)] <-  sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <-  sapply(myDF[,2:8],as.numeric)
cor(newdf)

Edited:

If you want to use the month variable. Please install lubridate and use month function.

For example:

library(lubridate)
myDF$Month<- month(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <-  sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <-  sapply(myDF,as.numeric)
cor(as.data.frame(newdf))
0
votes

The way to convert those months to Date class:

myDF$MonDt <- as.Date( paste0(myDF$Month, "-15"), format="%y-%b-%d")

Could also have used zoo::as.yearmon. Either method would allow you to apply as.numeric to get a valid time scaled value. The other answers are adequate when using single year data but because they incorrectly make the assumption the the leading two digits are day of the month rather than the year, they are going to fail to deliver valid answers in any multi-year dataset, but will not throw any warning about this.

with(myDF, cor(Active.Users, as.numeric(MonDt) )  )
[1] 0.8269705

As one of the other answers illustrated removing the $ and commas is needed before as.numeric will succeed on currency-formatted text. Again, this is also factor data so as.numeric could have yielded erroneous answers, although in this simple example it would not. A safe method would be:

myDF[2:3] <- lapply(myDF[2:3], function(x) as.numeric( gsub("[$,]", "", x)))

myDF
    Month Monthly.Recurring.Revenue Price.per.Seat Paid.Seats Active.Users
1  18-Jan                     25000             50        500           10
2  18-Feb                     25000             50        500           50
3  18-Mar                     25000             50        500           50
4  18-Apr                     25000             50        500          100
5  18-May                     25000             50        500          450
6  18-Jun                     40000             40       1000          550
7  18-Jul                     40000             40       1000          800
8  18-Aug                     40000             40       1000          900
9  18-Sep                     40000             40       1000          950
10 18-Oct                     40000             40       1000          800
11 18-Nov                     40000             40       1000          700
12 18-Dec                     40000             40       1000          600
   Support.Cases Users.Trained Features.Used      MonDt
1              0             1             5 2018-01-15
2              0             5             5 2018-02-15
3              1             0             5 2018-03-15
4              5            50             5 2018-04-15
5             35           100             8 2018-05-15
6            155           300             9 2018-06-15
7            100            50             9 2018-07-15
8             75            30            10 2018-08-15
9             50             0            15 2018-09-15
10            45           100            15 2018-10-15
11            10            50            15 2018-11-15
12             5             0            15 2018-12-15

This question gets an answer that allows multiple correlation coefficients to be calculated and the two way data associations plotted on one page:

How to add p values for correlation coefficients plotted using splom in lattice?