0
votes

I'm wondering what I'm doing incorrectly either with data.table or dplyr.

The code snippet below is meant to calculate each observation's deviation in ROA from the median ROA within its sector and year. The data.table and dplyr versions look like they should produce the same results, but they do not.

library(data.table)
library(dplyr)

# simulate 100,000 ROA observations with random sector and year labels
set.seed(1)
roa <- rnorm(100000, mean = 0, sd = 1)
sector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
year <- c(2012, 2011, 2010, 2009, 2008, 2007)
sector <- sample(sector, 100000, replace = TRUE)
year <- sample(year, 100000, replace = TRUE)
data <- data.table(roa, sector, year)
rm(roa, sector, year)

# data.table: deviation from the median within each sector/year group
data[, roa_ad_chk := roa - median(roa, na.rm = TRUE), by = c("sector", "year")]
# dplyr: the same calculation with group_by() and mutate()
data <- data %>%
  group_by(sector, year) %>%
  mutate(roa_ad = roa - median(roa, na.rm = TRUE))

# shouldn't these functions be equivalent?
sum(data$roa_ad_chk - data$roa_ad)
rm(data)
2
When I run this, the final sum is 0, so they seem to be equivalent. What do you get? And which versions of dplyr and data.table are you running? - talat
I was getting 66.81926, but I just restarted R and now get 0. It looks like it has something to do with having both dplyr and plyr loaded. - rwdvc

2 Answers

1
votes

You don't necessarily need to detach one of the packages in this case. You can keep both loaded, and when you call a function whose name exists in both packages, use the :: operator to say which package's version you mean. For example, to call summarise() from the plyr package:

plyr::summarise()

and if you want to call that function from the dplyr package, you call:

dplyr::summarise()
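
Applied to the code in the question, this might look like the sketch below. It assumes the mismatch came from a bare mutate() resolving to plyr's version; the smaller sample size, the object names dt and res, and the all.equal() check are choices made for this illustration, not part of the original post.

```r
library(data.table)
library(dplyr)

set.seed(1)
n <- 1000  # smaller than the question's 100,000, for illustration only
dt <- data.table(
  roa    = rnorm(n),
  sector = sample(1:10, n, replace = TRUE),
  year   = sample(2007:2012, n, replace = TRUE)
)

# data.table: deviation from the median within each sector/year group
dt[, roa_ad_chk := roa - median(roa, na.rm = TRUE), by = c("sector", "year")]

# dplyr: the same calculation, with every verb qualified by its namespace
# so that a later library(plyr) cannot change which mutate() is called
res <- as.data.frame(dt) %>%
  dplyr::group_by(sector, year) %>%
  dplyr::mutate(roa_ad = roa - median(roa, na.rm = TRUE)) %>%
  dplyr::ungroup()

# compare element by element; a plain sum of differences could hide
# positive and negative discrepancies that cancel out
all.equal(res$roa_ad_chk, res$roa_ad)
```

Comparing with all.equal() rather than sum() is a deliberate choice here: it reports a mismatch even when the individual differences happen to sum to zero.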
1
votes

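The issue was a result of having both dplyr and plyr loaded. Most likely plyr was attached after dplyr, so the bare mutate() call resolved to plyr::mutate(), which ignores the grouping set by group_by() and takes the median over the whole data set instead of per sector and year.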
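
To check for this kind of masking in a session, something like the following sketch may help. The helper name which_pkg is hypothetical, made up for this example; environmentName(), environment() and conflicts() are base R.

```r
library(dplyr)

# which_pkg() is a hypothetical helper: it reports the namespace that a
# function object was defined in
which_pkg <- function(f) environmentName(environment(f))

# shows which package the bare name mutate currently resolves to;
# if plyr had been attached after dplyr, this would report plyr instead
which_pkg(mutate)

# lists every name that exists in more than one place on the search path
masked <- conflicts()
head(masked)
```

If the reported package is not the one you expect, either qualify the call with :: or restart R and attach the packages in the intended order.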