0
votes

This is first time I am trying to work with time-series plots. I have a dataset with approximately 50k rows with similar structure as below for multiple years;

Year    expense_1   expense_2   expense_3   expense_4
1999    5           NA          NA          31.82
2000    2           NA          NA          4.75
1999    10.49       NA          NA          NA
2000    39.69       NA          NA          NA
2000    NA          NA          10.61       NA
1999    8.08        NA          NA          NA
2000    16          NA          NA          NA
1999    9.32        NA          NA          NA
1999    9.35        NA          NA          NA

Now, I want to plot a time-series with Year on the X-axis, Expense on the Y-axis with different lines each for expense_1, expense_2, expense_3, expense_4. The expense for each category should be summed-up year-wise and NA should be removed.

2

2 Answers

2
votes

You can calculate the sum using summarise_all then convert your data to long format so that it's easier to plot using ggplot

library(tidyverse)
library(scales)

df <- read.table(text = "Year    expense_1   expense_2   expense_3   expense_4
1999    5           NA          NA          31.82
                 2000    2           NA          NA          4.75
                 1999    10.49       NA          NA          NA
                 2000    39.69       NA          NA          NA
                 2000    NA          NA          10.61       NA
                 1999    8.08        NA          NA          NA
                 2000    16          NA          NA          NA
                 1999    9.32        NA          NA          NA
                 1999    9.35        NA          NA          NA",
                 header = TRUE, stringsAsFactors = FALSE)

# define summation function that returns NA if all values are NA
# By default, R returns 0 if all values are NA
sum_NA <- function(x) {
  if(all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
} 

df_long <- df %>% 
  group_by(Year) %>% 
  summarise_all(funs(sum_NA(.))) %>% 
  gather(key = "type", value = "expense", -Year)
df_long

#> # A tibble: 8 x 3
#>    Year type      expense
#>   <int> <chr>       <dbl>
#> 1  1999 expense_1   42.2 
#> 2  2000 expense_1   57.7 
#> 3  1999 expense_2   NA   
#> 4  2000 expense_2   NA   
#> 5  1999 expense_3   NA   
#> 6  2000 expense_3   10.6 
#> 7  1999 expense_4   31.8 
#> 8  2000 expense_4    4.75

ggplot(df_long, aes(x = Year, y = expense, color = type, group = type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 1)) +
  theme_bw()

Created on 2018-05-21 by the reprex package (v0.2.0).

0
votes

You can let ggplot do most of the work for you - just gather, then start plotting:

df %>%
  gather(expense, value, -Year) %>%
  ggplot(aes(x=Year, y=value, color=expense)) +
  geom_line(stat="summary", fun.y="sum")