2
votes

I have this dataframe:

library(dplyr)

set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace=TRUE), levels=1:100),
          year = factor(sample(1950:2019, 10000, replace=TRUE), levels=1950:2019)) %>% unique() %>% arrange(id, year)

I want to plot a heatmap with the ids on the x-axis and the years on the y-axis, where a tile is blue when the data point exists and red when it doesn't. I'm almost there, but I can't figure out how to set the fill argument to get those two colors:

ggplot(df, aes(id, year, fill= year)) + 
  geom_tile()

The reason for plotting both variables as factors is so that every level is shown even when some year doesn't have any id (in which case its whole row should be red).

EDIT:

Two things I forgot to add (hope it's not too late):

  • How do I add alpha transparency to geom_tile() without messing up the plot?
  • I need to sort the ids from the most missing values to the fewest.

3 Answers

5
votes

The complete() function from the tidyr package is useful for filling in missing combinations. First, set a flag variable to TRUE to mark the rows where data is present, then expand the data frame with the missing combinations and fill the flag with FALSE for those new rows:

library(tidyverse)

df <- df %>% 
  mutate(flag = TRUE) %>% 
  complete(id, year, fill = list(flag = FALSE))

ggplot(df, aes(id, year, fill = flag)) + 
  geom_tile()
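
To see why this works for years that never occur, here is a minimal sketch on made-up toy data (the names toy and toy_full are only for illustration). When the columns are factors, complete() expands over all factor levels, not only the values that appear in the data, so an unused level still gets a row for every id:

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): year has three levels, but "2001" never occurs
toy <- data.frame(
  id   = factor(c(1, 1, 2), levels = 1:2),
  year = factor(c(2000, 2002, 2000), levels = 2000:2002)
)

toy_full <- toy %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

# 2 ids x 3 year levels: one row per combination
nrow(toy_full)

# Both rows for the unused level "2001" are present, with flag = FALSE
filter(toy_full, year == "2001")
```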

EDIT1: To add transparency, set alpha inside geom_tile(), for example alpha = 0.5. The value ranges from 0 (fully transparent) to 1 (fully opaque), so the lower the value, the more transparent the tiles.
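
As a sketch of how the colors and the transparency could be combined, the following rebuilds the question's data and adds scale_fill_manual() to map the logical flag to blue and red. The alpha value of 0.5, the legend title and the variable name p are arbitrary choices for illustration, not part of the original answer:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(0)
df <- data.frame(id = factor(sample(1:100, 10000, replace = TRUE), levels = 1:100),
                 year = factor(sample(1950:2019, 10000, replace = TRUE), levels = 1950:2019)) %>%
  unique() %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

p <- ggplot(df, aes(id, year, fill = flag)) +
  # alpha is a constant here, so it goes outside aes()
  geom_tile(alpha = 0.5) +
  # a logical fill is matched by the names "TRUE" and "FALSE"
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"),
                    name = "data present")
p
```

Because alpha is set as a fixed argument rather than mapped in aes(), it applies equally to every tile and does not add a second legend.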

EDIT2: To sort by missingness, add the following code before the ggplot code:

# Determine the order of the IDs: fewest present years (most missing) first
df_order <- df %>% 
  group_by(id) %>% 
  summarize(n_present = sum(flag)) %>% 
  arrange(n_present) %>% 
  mutate(order = row_number()) %>% 
  select(id, order)

# Set the IDs in order on the chart
df <- df %>% 
  left_join(df_order, by = "id") %>% 
  mutate(id = fct_reorder(id, order))
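
A more compact variant of the same idea, shown here as a sketch on made-up toy data rather than as the answer's own method: fct_reorder() from forcats can summarize flag per id directly through its .fun argument. With .fun = sum the levels are ordered by the number of present years in ascending order, so the id with the most missing years comes first; passing .desc = TRUE would reverse that.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Toy data (hypothetical): id 1 has 3 years, id 2 has 1 year, id 3 has 2 years
toy <- data.frame(
  id   = factor(c(1, 1, 1, 2, 3, 3), levels = 1:3),
  year = factor(c(2000, 2001, 2002, 2000, 2000, 2001), levels = 2000:2002)
) %>%
  mutate(flag = TRUE) %>%
  complete(id, year, fill = list(flag = FALSE))

# Order id levels by the count of present years, ascending
# (fewest present = most missing first)
toy <- toy %>% mutate(id = fct_reorder(id, flag, .fun = sum))

levels(toy$id)
```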
3
votes

I think you need to do some pre-processing before plotting. Create a temporary variable (data_exist) that indicates whether data is present for that id and year. Then use complete() to fill in the missing years for each id and plot the result.

library(tidyverse)
df %>%
   # convert the factors back to integers via their labels
   mutate_all(~as.integer(as.character(.))) %>%
   mutate(data_exist = 1) %>%
   # add a row for every id/year pair; new rows get data_exist = 0
   complete(id, year = min(year):max(year), fill = list(data_exist = 0)) %>%
   mutate(data_exist = factor(data_exist)) %>%
   ggplot() + aes(id, year, fill= data_exist) + geom_tile()


1
votes

With expand.grid you can create a data frame with all combinations of ids and years, then left join df onto these combinations to see which of them occur in df:

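  # Build every id/year combination, then mark the ones that occur in df
  all <- expand.grid(id = levels(df$id), year = levels(df$year)) %>% 
              left_join(mutate(df, present = '1'), by = c("id", "year")) %>% 
              mutate(present = ifelse(is.na(present), '0', '1'))
   # Convert the factors via as.character() so the axes show the real ids and years
   ggplot(all, aes(as.numeric(as.character(id)), as.numeric(as.character(year)), fill = present)) + 
            geom_tile() + 
            scale_fill_manual(values = c('0' = 'red', '1' = 'blue')) + # change default colors
            theme(legend.position = "none") # hide legend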
