0
votes

I am hoping to get some help identifying:

  1. The location of missing observations when no NA values are present (entire row is missing).
  2. Create a row for the missing data with mean values based on mean values from certain categories.

My df:

  • Numerical load data from hockey players during practices over the season

  • Occasionally the accelerometers do not work during practice even though the player practiced. So to make sure we can still track the work they did during practice, I would like to insert the average mean value from their position (forward, defense, or goalie) from that practice. (i.e. if a goalie’s accelerometer does not work I would like to take the average loads of the other goalies and insert it into that player’s observational row for the practice).

  • This would be a simpler task if there were rows ALL players each practice and NA values for loads when the accelerometer doesn’t work BUT when I download the data in a csv from the online cloud there are only rows for players who had working units. So that part is out of my control.

> head(DummyLoads)  
 Name       Date Load Position    
1   Jim 2019-10-19  900    2.100  Forward 
2   Bob 2019-10-19  900    2.100  Forward 
3  Dave 2019-10-19  900    2.100  Forward 
4 Steve 2019-10-19  850    2.312  Forward 
5  Fred 2019-10-19  850    2.312  Defense 
6   Ray 2019-10-19  850    2.312  Defense
DummyLoads <- structure(list(Name = structure(c(4L, 1L, 2L, 6L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 4L, 1L, 2L, 6L, 3L, 5L, 2L, 6L, 3L, 5L),
 .Label = c("Bob",  "Dave", "Fred", "Jim", "Ray", "Steve"), 
class = "factor"), 
Date = structure(c(1L,  1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,  4L, 4L, 4L, 4L), 
.Label = c("2019-10-19", "2019-10-20", "2019-10-21",  "2019-10-22"), class = "factor"), 
Load = c(900L, 900L, 900L,  850L, 850L, 850L, 789L, 789L, 789L, 960L, 960L, 909L, 909L, 909L,  991L, 991L, 991L, 720L, 717L, 717L, 717L), 
Load.Min = c(2.1,  2.1, 2.1, 2.312, 2.312, 2.312, 2.22, 2.22, 2.22, 2, 2, 1.88,  1.88, 1.88, 1.99, 1.99, 1.99, 2.1, 2.3, 2.3, 2.3), Position = structure(c(2L,  2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,  2L, 2L, 1L, 1L), 
.Label = c("Defense", "Forward"), class = "factor")), class = "data.frame", row.names = c(NA,  -21L))
ggplot(DummyLoads, aes(x = Name, y = Load, fill = Position))+
  geom_bar(stat = "identity")+
  facet_grid(~Date)

Here is a chart showing players with missing data.

Missing Player Loads

Ideally I would like to be able to identify those missing data points without having to plot it first. I also want to avoid having to manually calculate means each time and then input. Hoping to find an automated solution because I will have a full season of practices to do this with, but understand that might be tricky!

Thank you in advance for any suggestions. I apologize if I didn’t explain things clearly.

2

2 Answers

1
votes

Updated to the current question:

DummyLoads <- DummyLoads %>% 
  ungroup()
full_data <- expand.grid(
  Athlete = DummyLoads %>% 
    pull(Athlete) %>% 
    unique(),
  Date = DummyLoads %>% 
    pull(Date) %>% 
    unique())


full_data %>% 
  # join incomplete data onto full data
  left_join(DummyLoads, by = c("Athlete", "Date")) %>% 
  # assign the position to each player
  # in the example data, some players do ahve different positions
  # if this is true, than it would be unclear which average should be
  # considered. Therefore, I assumed their position is constant
  left_join(DummyLoads %>% 
              select(Athlete, Position) %>% 
              distinct(Athlete, .keep_all=TRUE),
            by = "Athlete") %>% 
  # keep both to check the differences
  rename(Position = Position.y) %>% 
  group_by(Date, Position) %>% 
  # if Load is missing, take the mean of Loads which is grouped
  mutate(Load2 = coalesce(PL_Avg, 
                          mean(PL_Avg, na.rm = T)))

0
votes

Generic Base R solution:

# Mark out NA rows flatten in single observation (each element denoting a vector of the df):

    is_val_na <- apply(data.frame(lapply(DummyLoads,

                                         function(x){is.na(x)})), 1, paste, collapse = ", " )

    # Split up using the grouping var "Name", and impute the mean where NA, coerce list to df: 

    DummyLoads_imputed <- do.call("rbind", lapply(split(DummyLoads, DummyLoads$Position), 
                              function(x){
                                if(is.numeric(x)){
                                  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
                                  }else{x}
                                }
                              )
                            )

    # Bind the data.frame with a factor vector holding the T/F values: 

    DummyLoads_imputed <- cbind(DummyLoads_imputed, row_na = as.factor(is_val_na))

Data Used:

DummyLoads <- structure(list(Name = structure(c(4L, 1L, 2L, 6L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 4L, 1L, 2L, 6L, 3L, 5L, 2L, 6L, 3L, 5L), 
                                          .Label = c("Bob",  "Dave", "Fred", "Jim", "Ray", "Steve"),
                                          class = "factor"), Date = structure(c(1L,  1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
                                                                                3L, 3L, 3L,  4L, 4L, 4L, 4L), .Label = c("2019-10-19", "2019-10-20", "2019-10-21",  "2019-10-22"), 
                                                                              class = "factor"), Load = c(900L, 900L, 900L,  850L, 850L, 850L, 789L, 789L, 789L, 960L, 
                                                                                                          960L, 909L, 909L, 909L,  991L, 991L, 991L, 720L, 717L, 717L, 717L), 
                         Position = structure(c(2L,  2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,  1L, 2L, 1L, 1L), 
                                              .Label = c("Defense", "Forward"), class = "factor")), row.names = c(NA,  -21L), class = "data.frame")