Finding and Filling Missing Observations (Whole Rows not NA Values) with Mean Values

Question

I am hoping to get some help identifying:

The location of missing observations when no NA values are present (entire row is missing).
Create a row for the missing data with mean values based on mean values from certain categories.

My df:

Numerical load data from hockey players during practices over the season
Occasionally the accelerometers do not work during practice even though the player practiced. So to make sure we can still track the work they did during practice, I would like to insert the average mean value from their position (forward, defense, or goalie) from that practice. (i.e. if a goalie’s accelerometer does not work I would like to take the average loads of the other goalies and insert it into that player’s observational row for the practice).
This would be a simpler task if there were rows ALL players each practice and NA values for loads when the accelerometer doesn’t work BUT when I download the data in a csv from the online cloud there are only rows for players who had working units. So that part is out of my control.

> head(DummyLoads)  
 Name       Date Load Position    
1   Jim 2019-10-19  900    2.100  Forward 
2   Bob 2019-10-19  900    2.100  Forward 
3  Dave 2019-10-19  900    2.100  Forward 
4 Steve 2019-10-19  850    2.312  Forward 
5  Fred 2019-10-19  850    2.312  Defense 
6   Ray 2019-10-19  850    2.312  Defense

DummyLoads <- structure(list(Name = structure(c(4L, 1L, 2L, 6L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 4L, 1L, 2L, 6L, 3L, 5L, 2L, 6L, 3L, 5L),
 .Label = c("Bob",  "Dave", "Fred", "Jim", "Ray", "Steve"), 
class = "factor"), 
Date = structure(c(1L,  1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,  4L, 4L, 4L, 4L), 
.Label = c("2019-10-19", "2019-10-20", "2019-10-21",  "2019-10-22"), class = "factor"), 
Load = c(900L, 900L, 900L,  850L, 850L, 850L, 789L, 789L, 789L, 960L, 960L, 909L, 909L, 909L,  991L, 991L, 991L, 720L, 717L, 717L, 717L), 
Load.Min = c(2.1,  2.1, 2.1, 2.312, 2.312, 2.312, 2.22, 2.22, 2.22, 2, 2, 1.88,  1.88, 1.88, 1.99, 1.99, 1.99, 2.1, 2.3, 2.3, 2.3), Position = structure(c(2L,  2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,  2L, 2L, 1L, 1L), 
.Label = c("Defense", "Forward"), class = "factor")), class = "data.frame", row.names = c(NA,  -21L))

ggplot(DummyLoads, aes(x = Name, y = Load, fill = Position))+
  geom_bar(stat = "identity")+
  facet_grid(~Date)

Here is a chart showing players with missing data.

Missing Player Loads

Ideally I would like to be able to identify those missing data points without having to plot it first. I also want to avoid having to manually calculate means each time and then input. Hoping to find an automated solution because I will have a full season of practices to do this with, but understand that might be tricky!

Thank you in advance for any suggestions. I apologize if I didn’t explain things clearly.

mnist mnist · Accepted Answer · 2019-10-27T00:26:58

Updated to the current question:

DummyLoads <- DummyLoads %>% 
  ungroup()
full_data <- expand.grid(
  Athlete = DummyLoads %>% 
    pull(Athlete) %>% 
    unique(),
  Date = DummyLoads %>% 
    pull(Date) %>% 
    unique())


full_data %>% 
  # join incomplete data onto full data
  left_join(DummyLoads, by = c("Athlete", "Date")) %>% 
  # assign the position to each player
  # in the example data, some players do ahve different positions
  # if this is true, than it would be unclear which average should be
  # considered. Therefore, I assumed their position is constant
  left_join(DummyLoads %>% 
              select(Athlete, Position) %>% 
              distinct(Athlete, .keep_all=TRUE),
            by = "Athlete") %>% 
  # keep both to check the differences
  rename(Position = Position.y) %>% 
  group_by(Date, Position) %>% 
  # if Load is missing, take the mean of Loads which is grouped
  mutate(Load2 = coalesce(PL_Avg, 
                          mean(PL_Avg, na.rm = T)))

Finding and Filling Missing Observations (Whole Rows not NA Values) with Mean Values

2 Answers