I want to calculate a difference by group. I referred to the SO thread R: Function “diff” over various groups, but I still cannot get the difference I expect. I have tried three methods: a) tidyr::spread, b) dplyr::mutate with base::diff(), and c) data.table with base::diff(). Method a) works, but I am unsure how to solve this problem using b) and c).
Description of the data:
I have revenue data for products by year. I have categorized years >= 2013 as Period 2 (called P2), and years < 2013 as Period 1 (called P1).
Sample data:
dput(Test_File)
structure(list(Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012,
2017, 2017, 2017, 2016, 2016, 2016, 2011, 2017), Name = c("Apple",
"Apple", "Banana", "Banana", "Banana", "Banana", "Apple", "Apple",
"Apple", "Banana", "Banana", "Banana", "Mango", "Pineapple"),
Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5,
7)), .Names = c("Ship_Date", "Name", "Revenue"), row.names = c(NA,
14L), class = "data.frame")
Expected Output
dput(Diff_Table)
structure(list(Name = c("Apple", "Banana", "Mango", "Pineapple"
), P1 = c(15, 58, 5, NA), P2 = c(75, 7, NA, 7), Diff = c(60,
-51, NA, NA)), .Names = c("Name", "P1", "P2", "Diff"), class = "data.frame", row.names = c(NA,
-4L))
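To make the arithmetic behind the expected output concrete, here is a minimal base R sketch. It assumes the 2013 cutoff from the description above; the names `period`, `totals`, and `Diff_Check` are made up for this illustration and are not part of my actual code.

```r
# Sample data, copied from the dput() output above.
Test_File <- data.frame(
  Ship_Date = c(2010, 2010, 2012, 2012, 2012, 2012,
                2017, 2017, 2017, 2016, 2016, 2016, 2011, 2017),
  Name = c("Apple", "Apple", "Banana", "Banana", "Banana", "Banana",
           "Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
           "Mango", "Pineapple"),
  Revenue = c(5, 10, 13, 14, 15, 16, 25, 25, 25, 1, 2, 4, 5, 7),
  stringsAsFactors = FALSE
)

# Label each row P1 or P2 (assumed cutoff: 2013).
period <- ifelse(Test_File$Ship_Date >= 2013, "P2", "P1")

# Sum revenue per Name x period; combinations with no rows stay NA.
totals <- tapply(Test_File$Revenue, list(Test_File$Name, period), sum)

# One row per Name, with Diff = P2 - P1 (NA if either period is missing).
Diff_Check <- data.frame(
  Name = rownames(totals),
  P1   = unname(totals[, "P1"]),
  P2   = unname(totals[, "P2"]),
  Diff = unname(totals[, "P2"] - totals[, "P1"]),
  stringsAsFactors = FALSE
)
print(Diff_Check)
```

For example, Apple has P1 = 5 + 10 = 15 and P2 = 25 + 25 + 25 = 75, so Diff = 60. Mango and Pineapple each have revenue in only one period, so their Diff is NA.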
Here's my code:
Method 1: Using spread [Works]
data.table::setDT(Test_File)
cutoff<-2013
Test_File[Test_File$Ship_Date>=cutoff,"Ship_Period"]<-"P2"
Test_File[Test_File$Ship_Date<cutoff,"Ship_Period"]<-"P1"
Diff_Table<- Test_File %>%
  dplyr::group_by(Ship_Period,Name) %>%
  dplyr::mutate(Revenue = sum(Revenue)) %>%
  dplyr::select(Ship_Period, Name,Revenue) %>%
  dplyr::ungroup() %>%
  dplyr::distinct() %>%
  tidyr::spread(key = Ship_Period,value = Revenue) %>%
  dplyr::mutate(Diff = `P2` - `P1`)
Method 2: Using dplyr [Doesn't work: NAs are generated in Diff column.]
Diff_Table<- Test_File %>%
  dplyr::group_by(Ship_Period,Name) %>%
  dplyr::mutate(Revenue = sum(Revenue)) %>%
  dplyr::select(Ship_Period, Name,Revenue) %>%
  dplyr::ungroup() %>%
  dplyr::distinct() %>%
  dplyr::arrange(Name,Ship_Period, Revenue) %>%
  dplyr::group_by(Ship_Period,Name) %>%
  dplyr::mutate(Diff = diff(Revenue))
Method 3: Using data.table [Doesn't work: It generates all zeros in Diff column.]
Test_File[,Revenue1 := sum(Revenue),by=c("Ship_Period","Name")]
Diff_Table<-Test_File[,.(Diff = diff(Revenue1)),by=c("Ship_Period","Name")]
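A possible clue for both failures, sketched in base R and not a confirmed diagnosis: both methods group by Ship_Period and Name before calling diff(). In Method 2 each such group holds a single summed value after distinct(), and in Method 3 Revenue1 is the same within each group. The variable names below are made up for this illustration.

```r
# diff() on a single value has nothing to subtract, so the result is empty.
single_value <- diff(5)
length(single_value)   # 0

# diff() on a constant vector returns only zeros, like Revenue1 within a group.
constant_values <- diff(c(25, 25, 25))
constant_values        # 0 0

# With both period totals for one Name in a single vector (Apple: P1 = 15, P2 = 75),
# diff() returns P2 - P1.
apple_totals <- c(15, 75)
apple_diff <- diff(apple_totals)
apple_diff             # 60
```

If this is indeed the cause, grouping by Name alone (so that the P1 and P2 totals land in the same group) might be the direction to look in, but I have not confirmed that.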
Question: Can someone please help me with method 2 and method 3 above? I am fairly new to R so I apologize if my work sounds too basic. I am still learning this language.
P1 for Pineapple be NA? - jogo
NA - number = NA and number - NA = NA. So, I would believe that it doesn't matter much. I created those entries to ensure that the code doesn't blow up when one of P1 or P2 are missing. Does that help? - watchtower