Is there a way in R to ignore a "." in my data when calculating mean/sd/etc

Question

I have a large data set for which I need to calculate the mean, standard deviation, minimum, and maximum of several columns. The data set uses a "." to denote a missing value for a subject. Running the mean or sd function on such a column causes R to return NA. Is there a simple way around this?

My code is just this:

xCAL<-mean(longdata$CAL)
sdCAL<-sd(longdata$CAL)
minCAL<-min(longdata$CAL)
maxCAL<-max(longdata$CAL)

but R returns NA for all of these variables. I get the following warning:

Warning message: In mean.default(longdata$CAL) : argument is not numeric or logical: returning NA

How did you import the data? Usually there are options to specify which values count as missing during import. It's better to fix the problem there than to try to clean up afterward. — MrFlick
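As a minimal sketch of the import-time fix suggested in the comment above, and assuming the data comes from a CSV file read with read.csv, the na.strings argument can declare "." as a missing-value marker. The inline csv_text and its column names are hypothetical stand-ins for the real file.

```r
# Hypothetical CSV content standing in for the real file; "." marks a missing value.
csv_text <- "subject,CAL\n1,2.5\n2,.\n3,4.0\n"

# na.strings tells read.csv which strings to treat as NA on import,
# so CAL is read as a numeric column rather than character or factor.
longdata <- read.csv(text = csv_text, na.strings = ".")

str(longdata$CAL)

# The summary functions still need na.rm = TRUE to skip the NA.
xCAL <- mean(longdata$CAL, na.rm = TRUE)
xCAL
```

With a real file, the same argument would be passed alongside the file path instead of text =.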

Gregor Thomas · Accepted Answer · 2020-03-31T15:06:03

You need to convert your data to numeric to be able to do any calculations on it. When you run as.numeric, each "." will be converted to NA (with a coercion warning), which is what R uses for missing values. Then, all of the functions you mention take an argument na.rm that can be set to TRUE to remove (rm) missing values (na).

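If your data is a factor, you need to convert it to character first to avoid loss of information, as explained in this FAQ. Calling as.numeric directly on a factor returns its internal integer level codes, not the original values.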
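To illustrate the factor pitfall, here is a small sketch using a made-up factor f (not the asker's data). It compares converting the factor directly with converting it through character first.

```r
# Toy factor standing in for a column that was read in as a factor.
f <- factor(c("10", ".", "2.5", "10"))

# Direct conversion returns the internal level codes, not the original numbers.
codes <- as.numeric(f)
codes

# Going through character recovers the values; "." becomes NA.
# The "NAs introduced by coercion" warning is expected, so it is suppressed here.
values <- suppressWarnings(as.numeric(as.character(f)))
values
```

The codes are small integers indexing levels(f), which is why they can look plausible while being wrong.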

Overall, to be safe, try this:

longdata$CAL <- as.numeric(as.character(longdata$CAL))
xCAL <- mean(longdata$CAL, na.rm = TRUE)
sdCAL <- sd(longdata$CAL, na.rm = TRUE)
# etc

Do note that na.rm is a property of the function - it's not magic that works everywhere. If you look at the help pages for ?mean, ?sd, ?min, etc., you'll see the na.rm argument documented. If you want to remove missing values in general, the na.omit() function works well.
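Since the question mentions several columns, the same steps can be applied to all of them at once. The following is a rough sketch under assumptions: the data frame contents and the PD column name are invented for illustration, and only CAL comes from the question.

```r
# Hypothetical data; only the CAL column name comes from the question.
longdata <- data.frame(
  CAL = c("2.5", ".", "4.0"),
  PD  = c("1", "3", "."),
  stringsAsFactors = FALSE
)
cols <- c("CAL", "PD")

# Convert each column to numeric; "." becomes NA (coercion warnings are expected).
longdata[cols] <- lapply(longdata[cols], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

# One row per statistic, one column per variable.
stats <- sapply(longdata[cols], function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
})
stats

# na.omit() on a data frame drops every row with an NA in any column.
complete <- na.omit(longdata)
nrow(complete)
```

Note the difference in behavior: na.rm = TRUE drops missing values per column inside each function call, while na.omit() on a data frame drops whole rows, so the two approaches can give different results when the missing values sit in different rows.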
