I've used the aggregate() function a bit but always with FUN=mean
rather than sum
. It seems to have a few idiosyncracies which are making things difficult.
I am using the World Health organisation's mortality dataset which is formatted as so:
> head(data)
Country Admin1 SubDiv Year List Cause Sex Frmat IM_Frmat Deaths1 Deaths2 Deaths3 Deaths4 Deaths5 Deaths6 Deaths7 Deaths8 Deaths9 Deaths10 Deaths11 Deaths12 Deaths13 Deaths14 Deaths15 Deaths16 Deaths17
1 1125 NA 2000 103 V72 1 2 8 1 0 0 NA NA NA 0 0 0 0 0 0 1 0 0 0 0
2 1125 NA 2000 103 V77 1 2 8 1 0 0 NA NA NA 0 0 0 0 0 1 0 0 0 0 0
3 1125 NA 2000 103 V78 1 2 8 19 1 0 NA NA NA 0 2 2 0 2 2 1 3 1 1 2
4 1125 NA 2000 103 V58 1 2 8 30 0 1 NA NA NA 1 2 2 1 4 3 2 4 5 1 0
5 1125 NA 2000 103 V58 2 2 8 4 0 0 NA NA NA 1 0 0 0 0 2 1 0 0 0 0
6 1125 NA 2000 103 V74 1 2 8 5 0 0 NA NA NA 1 1 0 0 1 0 1 0 1 0 0
Deaths18 Deaths19 Deaths20 Deaths21 Deaths22 Deaths23 Deaths24 Deaths25 Deaths26 IM_Deaths1 IM_Deaths2 IM_Deaths3 IM_Deaths4 Name
1 0 0 0 0 0 0 NA NA 0 0 NA NA NA Egypt
2 0 0 0 0 0 0 NA NA 0 0 NA NA NA Egypt
3 0 1 0 0 0 1 NA NA 0 1 NA NA NA Egypt
4 2 1 1 0 0 0 NA NA 0 0 NA NA NA Egypt
5 0 0 0 0 0 0 NA NA 0 0 NA NA NA Egypt
6 0 0 0 0 0 0 NA NA 0 0 NA NA NA Egypt
This shows the cause of each for multiple diseases (V72, V77 etc.) for 1 country (1125 == Egypt) in 1 year, for men (Sex=1) and women (Sex=2). The deaths columns are number of deaths per age group.
The thing is I want to group the number of deaths for a few disease codes, per country, per year, per sex. Specifically I want the disease codes I20* - I25*. These corresponds to the ischaemic heart disease deaths.
The first thing I've done is select only those rows:
codes = c("1067","I20","I21","I22","I23","I24","I25",paste("I",200:250,sep="")) #ICD-10 codes for IHD
data_ihd <- subset(data, Cause %in% codes)
And then I try to use aggregate
to sum the rows which match for Country, Year, Sex. I no longer care about the disease code - this dataset has 1 row per disease code for each country/year/sex combination - I want to treat all the diseases as 1 and sum the deaths.
It seems like I should be able to do:
> aggregate(data_ihd, by=list(data_ihd$Name,data_ihd$Year,data_ihd$Sex),FUN=sum)
Error in Summary.factor(3518L, na.rm = FALSE) :
‘sum’ not meaningful for factors
But it seems not. Which is weirdly because it's happy if I use FUN=mean
.
I know it's because I have factors, but I don't see how I can lose factors without actively deleting the columns?
What would you advise?
I'm happy to use a package like dplyr or tidyr but not sure how they would help here...
EDIT: I guess I also need to be careful here that I only sum the Deaths* columns rather than the Country, Year columns! I'm not sure if aggregate makes sure it doesn't sum the columns you supply in by=
EDIT: I've been asked to give a bit more info about what I want. If we take the following dataset:
> head(data)
Country Admin1 SubDiv Year List Cause Sex Frmat IM_Frmat Deaths1 Deaths2 Deaths3 Deaths4 Deaths5 Deaths6 Deaths7 Deaths8 Deaths9 Deaths10 Deaths11 Deaths12 Deaths13 Deaths14 Deaths15 Deaths16 Deaths17
1 1000 NA 2000 103 1 1 2 8 1 0 0 NA NA NA 0 0 0 0 0 0 1 0 0 0 0
2 1000 NA 2000 103 1 2 2 8 1 0 0 NA NA NA 0 0 0 0 0 1 0 0 0 0 0
3 1000 NA 2000 103 2 1 2 8 19 1 0 NA NA NA 0 2 2 0 2 2 1 3 1 1 2
4 1000 NA 2000 103 2 2 2 8 30 0 1 NA NA NA 1 2 2 1 4 3 2 4 5 1 0
5 1000 NA 2001 103 1 1 2 8 4 0 0 NA NA NA 1 0 0 0 0 2 1 0 0 0 0
6 1000 NA 2001 103 1 2 2 8 5 0 0 NA NA NA 1 1 0 0 1 0 1 0 1 0 0
7 1000 NA 2001 103 2 1 2 8 4 0 0 NA NA NA 1 0 0 0 0 2 1 0 0 0 0
8 1000 NA 2001 103 2 2 2 8 5 0 0 NA NA NA 1 1 0 0 1 0 1 0 1 0 0
9 2000 NA 2000 103 1 1 2 8 4 0 0 NA NA NA 1 0 0 0 0 2 1 0 0 0 0
10 2000 NA 2000 103 1 2 2 8 5 0 0 NA NA NA 1 1 0 0 1 0 1 0 1 0 0
Becomes
Country Admin1 SubDiv Year List Sex Frmat IM_Frmat Deaths1 Deaths2 Deaths3 Deaths4 Deaths5 Deaths6 Deaths7 Deaths8 Deaths9 Deaths10 Deaths11 Deaths12 Deaths13 Deaths14 Deaths15 Deaths16 Deaths17
1 1000 NA 2000 103 1 2 8 20 1 0 NA NA NA 0 2 2 0 2 2 3 3 1 1 2
2 1000 NA 2000 103 2 2 8 31 0 1 NA NA NA 1 2 2 1 4 4 2 4 5 1 0
...
. 2000 NA 2000 ...
As you can see, rows 1 and 3 have been summed for the Deaths columns because the Country, Year and Sex are equal. The cause is ignored. The matching columns have not been summed, of course.
EDIT 3: Let's simplyify this a lot, then:
Country Year Sex Cause Deaths1 Deaths2 Deaths3
1 UK 2000 1 A 1 1 1
2 UK 2000 2 A 1 1 1
3 UK 2000 1 B 1 1 1
4 UK 2000 2 B 1 1 1
5 UK 2001 1 A 1 1 1
6 UK 2001 2 A 1 1 1
7 UK 2001 1 B 1 1 1
8 UK 2001 2 B 1 1 1
1 USA 2000 1 A 1 1 1
2 USA 2000 2 A 1 1 1
3 USA 2000 1 B 1 1 1
4 USA 2000 2 B 1 1 1
5 USA 2001 1 A 1 1 1
6 USA 2001 2 A 1 1 1
7 USA 2001 1 B 1 1 1
8 USA 2001 2 B 1 1 1
...
Group all the causes together (sum the deaths), when County, Year and Sex are the same, becoming:
Country Year Sex Deaths1 Deaths2 Deaths3
1 UK 2000 1 2 2 2
2 UK 2000 2 2 2 2
3 UK 2001 1 2 2 2
4 UK 2001 2 2 2 2
5 USA 2000 1 2 2 2
6 USA 2000 2 2 2 2
7 USA 2001 1 2 2 2
8 USA 2001 2 2 2 2
I can't seem to use aggregate here because 1) Country is a factor; 2) it would sum the years
Deaths
columnsaggregate(.~Name+Year+Sex, df[-c(1:3,5:6, 8:9)], FUN=sum, na.rm=TRUE, na.action=na.pass)
- akrun