1
votes

I have a data set with body temperatures taken every minute for 8 hours. I removed aberrant data and now have NA values, sometimes just one alone, and sometimes more than 10 in a row. I would like to replace the missing data using linear interpolation.

I tried different things, but I couldn't get 'approx' to work (NA values stayed NA...), or even find a way to tell R to use the value before (same column, minus 1 row) or the value after (same column, plus 1 row). In this example, where I try to replace just one NA, the [+1] and [-1] are just read as [1], so it doesn't work:

df$var1_lini <- ifelse (!is.na(df$var1),df$var1,
                ifelse (!is.na(df$var[+1]),df$var[-1]+(df$var1[-1]+df$var1[+1])/2,NA))

I'm open to any kind of solution. I am a beginner, so a detailed answer would be great! Thank you

Eve

2
library(zoo); help("na.approx") - Roland
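
To make the comment above concrete, here is a minimal sketch of `na.approx` applied to a data frame column. The data frame and its values are made up for illustration; only the column names `var1` and `var1_lini` come from the question. Setting `na.rm = FALSE` keeps leading or trailing NAs in place, so the result has the same length as the column, and `maxgap` is an optional limit on how many consecutive NAs get filled.

```r
library(zoo)

# Hypothetical data: one leading NA, one single NA, and a run of three NAs
df <- data.frame(var1 = c(NA, 36.5, NA, 36.7, NA, NA, NA, 37.1))

# Linear interpolation between the neighbouring known values.
# na.rm = FALSE leaves NAs that cannot be interpolated (here the leading one)
# where they are, so the result can be assigned back as a column.
df$var1_lini <- na.approx(df$var1, na.rm = FALSE)

# Optional: only fill gaps of at most 2 consecutive NAs, leave longer gaps as NA
df$var1_short <- na.approx(df$var1, na.rm = FALSE, maxgap = 2)

print(df)
```

Whether long gaps (10 or more minutes) should be interpolated at all depends on the data, which is why `maxgap` is shown here as an option rather than a recommendation.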

2 Answers

2
votes

Another approach is to build a linear model using the existing data you have and then use that model (model predictions) to replace NAs.

A simple example to help you understand is this:

library(ggplot2)

# create example dataset
df = data.frame(value = mtcars$qsec,
                time = 1:nrow(mtcars))

# replace some values with NA (you can experiment with different values)
df$value[c(5,12,17,18,30)] = NA

# build linear model based on existing data (model ignores rows with NAs)
m = lm(value ~ time, data = df)

# add predictions as a column
df$pred_value = predict(m, newdata = df)

# replace (only) NAs with predictions
df$interp_value = ifelse(is.na(df$value), df$pred_value, df$value)

# plot existing and interpolated data
ggplot()+
  geom_point(data=df, aes(time, value), size=5)+
  geom_point(data=df, aes(time, interp_value), col="red")


In the resulting plot, the black points represent the existing values and the red points represent the existing values plus the NA replacements.

1
votes

The easiest way to solve this is to use a package that provides functions for replacing missing data, such as imputeTS, forecast, or zoo.

The process of replacing missing values with reasonable estimations is also called 'imputation' in statistics.

For interpolating a time series, vector or data.frame it is as easy as this:

library("imputeTS")
na_interpolation(yourDataWithNAs)  # called na.interpolation() in imputeTS releases before 3.0
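
For comparison, the same kind of linear interpolation can be sketched in base R with `approx()`, without any extra package. The vector `temp` is made-up example data. The key point is that `approx()` does not fill a vector in place: it returns interpolated values at the positions given in `xout`, so the known points go in as `x` and `y`, and the result has to be assigned back.

```r
# Hypothetical minute-by-minute temperatures with a single NA and a run of NAs
temp <- c(36.5, NA, 36.7, 36.8, NA, NA, NA, 37.2)

idx <- seq_along(temp)   # position of each measurement (1, 2, 3, ...)
ok  <- !is.na(temp)      # TRUE where a value is present

# Interpolate linearly between the known points and ask for a value
# at every position; known values are returned unchanged.
filled <- approx(x = idx[ok], y = temp[ok], xout = idx)$y

print(filled)
```

With the default `rule = 1`, positions before the first or after the last known value stay NA, because there is nothing to interpolate between.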

Keep in mind that there are other imputation methods beyond linear interpolation, e.g. moving average imputation or seasonality-based imputation. Depending on the problem, another method may give better results. (Here are some further explanations: Time Series Imputation)