I'm wondering what I'm doing incorrectly with either data.table or dplyr.
The goal of the code snippet below is to calculate, for each observation, the difference between its ROA and the median ROA for its sector and year. The data.table and dplyr versions look like they should produce the same results, but they do not.
library(data.table)
library(dplyr)
set.seed(1)
roa <- rnorm(100000, mean = 0, sd = 1)
sector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
year <- c(2012, 2011, 2010, 2009, 2008, 2007)
sector <- sample(sector, 100000, replace = TRUE)
year <- sample(year, 100000, replace = TRUE)
data <- data.table(roa, sector, year)
rm(roa, sector, year)
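# data.table: subtract the sector/year median, adding the column by reference
data[, roa_ad_chk := roa - median(roa, na.rm = TRUE), by = c("sector", "year")]
# dplyr: the same calculation with group_by() and mutate()
data <- data %>%
  group_by(sector, year) %>%
  mutate(roa_ad = roa - median(roa, na.rm = TRUE))
# Shouldn't these two approaches be equivalent?
sum(data$roa_ad_chk - data$roa_ad)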
rm(data)
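One way to narrow down which of the two columns is off is to compute the same quantity a third way that involves neither package. The following is a minimal sketch in base R, not a diagnosis of the discrepancy: the names `df`, `n`, `roa_ad_base`, and `g` are introduced here for illustration only, and the sample is shrunk to 1,000 rows so it runs quickly.

```r
# Independent cross-check in base R (no data.table or dplyr involved).
# df, n, roa_ad_base and g are illustrative names, not from the question.
set.seed(1)
n <- 1000
df <- data.frame(
  roa    = rnorm(n),
  sector = sample(1:10, n, replace = TRUE),
  year   = sample(2007:2012, n, replace = TRUE)
)

# ave() applies FUN within each sector/year group and returns the results
# in the original row order, so the subtraction lines up row by row.
df$roa_ad_base <- df$roa - ave(df$roa, df$sector, df$year,
                               FUN = function(x) median(x, na.rm = TRUE))

# Verify one group by hand, comparing element-wise instead of summing
# differences (positive and negative errors can cancel in a sum).
g <- df$sector == 1 & df$year == 2012
all.equal(df$roa_ad_base[g], df$roa[g] - median(df$roa[g]))
```

Applying the same `ave()` line to the `data` object from the question would give a third column to compare against `roa_ad_chk` and `roa_ad` with `all.equal()`, which should show which of the two disagrees with the base R result.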
Which versions of dplyr and data.table do you run? - talat