
I have the following problem in R when calculating the median of a series of time values. Can someone explain why R behaves so strangely when computing something as simple as a median?

  • Task: calculate the median finishing time from a running race results dataset.
  • Problem: taking the median of the time values makes R return the warning "argument is not numeric or logical: returning NA".
  • Data is read from the file "NEJ_21_km_results.csv", with strings kept as character values rather than factors. Converting the time values from character to numeric produces the warning "NAs introduced by coercion" (although there are no NA values in the data frame).
  • With some other files, the warning appears only when the data is filtered by gender (and sometimes only for one gender).

1) Read data into "all_runners" dataframe

all_runners <- read.csv("NEJ_21_km_results.csv", stringsAsFactors=FALSE, strip.white = TRUE)

The "RESULT" column is of type "chr":

str(all_runners)

'data.frame':   100 obs. of  10 variables:
 $ POS  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ BIB     : int  3 2 1 9 5 10 8 33 34 67 ...
 $ NAME    : chr  "DOMINIC KIPTARUS" "TIIDREK NURME" "ROMAN FOSTI" "RAIDO MITT"...
 $ YOB     : int  1996 1985 1983 1991 1984 1982 1993 1992 1984 1996 ...
 $ NAT     : chr  "KEN" "EST" "EST" "EST" ...
 $ CITY    : chr  "" "" "" "" ...
 $ RESULT  : chr  "01:03:55" "01:03:57" "01:06:18" "01:09:33" ...
 $ BEHIND  : chr  "" "00:00:02" "00:02:23" "00:05:38" ...
 $ NET.TIME: chr  "01:03:55" "01:03:57" "01:06:18" "01:09:31"...
 $ CAT     : chr  "MN" "M" "M" "M" ...

2) Calculate the median of all runners' results

> all_runners_median = median(all_runners$RESULT, na.rm = TRUE)

Warning message: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) : argument is not numeric or logical: returning NA

3) Convert time value from char to numeric

> results_to_numeric <- as.numeric(all_runners$RESULT)

Warning message: NAs introduced by coercion

4) Calculate the median of all women's results ('N' => women, 'M' => men)

all_womens <- all_runners %>%
  filter(str_sub(CAT, 1, 1) == "N") %>%
  select(RESULT)

The 'RESULT' column is still of type 'chr':

> str(all_womens)

'data.frame':   8 obs. of  1 variable:
 $ RESULT: chr  "01:18:36" "01:20:07" "01:22:52" "01:25:11" ...

Warning message: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) : argument is not numeric or logical: returning NA

> all_womens
    RESULT
1 01:18:36
2 01:20:07
3 01:22:52
4 01:25:11
5 01:26:04
6 01:26:09
7 01:26:42
8 01:26:55
You have two problems: you cannot calculate a median/mean of a data frame, nor of a character column. First convert RESULT to a date-time class and you will be fine. Instead of select, use pull. Sorry, I'm working from my phone; otherwise I would be more helpful. - A. Suliman

1 Answer


Here is how you can compute the median of time values:

# Get sample of 'Date/Time Type'
x <- c("01:03:55", "01:03:57", "01:06:18", "01:09:33")

# Convert to proper format 
y <- as.POSIXct(x, format = "%H:%M:%S")

# Find the median
y <- median(y)

#  Updated, no need to use strsplit and sapply, directly use format
#  ys <- strsplit(as.character(y), split = " ")
#  sapply(ys, function(x) x[2])

# Get the time
format(y,"%H:%M:%S" )
[1] "01:05:07"

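When you apply as.POSIXct to a time-only string, it fills in the current date (in the session's time zone), which is why format is used afterwards to extract just the time.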
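If the attached date is unwanted, one alternative is to treat the values as durations in seconds rather than as points in time. The following is only a minimal sketch, assuming every value is a well-formed "HH:MM:SS" string shorter than 24 hours; the helper name to_seconds is hypothetical and not part of base R.

```r
# Hypothetical helper (not a base R function): "HH:MM:SS" -> seconds
to_seconds <- function(x) {
  as.numeric(as.difftime(x, format = "%H:%M:%S", units = "secs"))
}

x <- c("01:03:55", "01:03:57", "01:06:18", "01:09:33")
secs <- to_seconds(x)

# Median duration in seconds
med <- median(secs)

# Format back to "HH:MM:SS", dropping any fractional second
med_label <- sprintf("%02d:%02d:%02d",
                     med %/% 3600,
                     (med %% 3600) %/% 60,
                     floor(med %% 60))
med_label
# [1] "01:05:07"
```

Working in plain seconds keeps the result numeric, so it can be passed to mean, quantile, or plotting functions without any date handling.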

Edit: based on a suggestion by Rich Scriven, we can use format directly, which eliminates the need for splitting and looping.

If you want to perform the analysis by group, e.g. by gender, you can simply use:

x <- c("01:03:55", "01:03:57", "01:06:18", "01:09:33")
df <- data.frame(Gender = rep(c("M", "F"), each = 4), time = x)
# > df
#   Gender     time
# 1      M 01:03:55
# 2      M 01:03:57
# 3      M 01:06:18
# 4      M 01:09:33
# 5      F 01:03:55
# 6      F 01:03:57
# 7      F 01:06:18
# 8      F 01:09:33

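# Convert the column itself (as.character guards against factor columns in R < 4.0)
df$time <- as.POSIXct(as.character(df$time), format = "%H:%M:%S")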
time_group_by_gender <- split(df$time, df$Gender )
# > time_group_by_gender
# $F
# [1] "2018-07-21 01:03:55 +03" "2018-07-21 01:03:57 +03" "2018-07-21 01:06:18 +03"
# [4] "2018-07-21 01:09:33 +03"
# 
# $M
# [1] "2018-07-21 01:03:55 +03" "2018-07-21 01:03:57 +03" "2018-07-21 01:06:18 +03"
# [4] "2018-07-21 01:09:33 +03"

time_median <- lapply(time_group_by_gender, median)
time_median <- lapply(time_median, format, "%H:%M:%S")

# > time_median
# $F
# [1] "01:05:07"
# 
# $M
# [1] "01:05:07"
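The same approach can be adapted to the layout described in the question, where gender is the first letter of the CAT column. This is a sketch on a small illustrative subset, not the real file: the RESULT values are taken from the question's output, while the CAT values for the women's rows are assumed to be "N". The data frame name runners and the fixed tz = "UTC" are also choices made here for illustration.

```r
# Illustrative subset; CAT values for the women's rows are assumed
runners <- data.frame(
  RESULT = c("01:03:55", "01:03:57", "01:06:18", "01:09:33",
             "01:18:36", "01:20:07", "01:22:52", "01:25:11"),
  CAT    = c("MN", "M", "M", "M", "N", "N", "N", "N"),
  stringsAsFactors = FALSE
)

# Parse the times; a fixed time zone avoids surprises around DST changes
runners$time <- as.POSIXct(runners$RESULT, format = "%H:%M:%S", tz = "UTC")

# Gender is the first character of CAT ('N' => women, 'M' => men)
runners$gender <- substr(runners$CAT, 1, 1)

# Median per gender, formatted back to a time string
medians <- sapply(split(runners$time, runners$gender),
                  function(t) format(median(t), "%H:%M:%S"))
medians
#          M          N
# "01:05:07" "01:21:29"
```

Note that median is applied to a vector extracted from the data frame, not to the data frame itself, which addresses the first problem mentioned in the comment above.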