0
votes

I'm somewhat still new to R. I have a data frame of school absences with columns for dates, individual school ID numbers, day of the week, and counts for that day's number of absent students. Now, my frame has three categories of ID's: elementary school(200), middle school(300), and high school(500). Each school has its own dates and its individual absence counts for those dates, and all schools of each school type combine their totals for that date together.

The head of the data, with an example of totals for all elementary schools:

Count   Date   ID  DOW
500  1/4/2011 200   1
689  1/5/2011 200   2
345  1/6/2011 200   3
100  1/7/2011 200   4
421  1/8/2011 200   5
1211 1/11/2011 200  1

And the tail end with data for one specific elementary school:

Count  Date   ID   DOW
 3   1/4/2011 225   3
 3   1/5/2011 225   4
 3   1/6/2011 225   5
 2   1/7/2011 225   1
 2   1/8/2011 225   2
 2   1/11/2011 225  3

My problem is that I have several schools in my data frame that I want to remove from the time series analysis, but their individual daily counts are added up in their school type's total counts for that day. I'm trying to find a way to loop over the frame and remove those school's rows and to subtract their counts from their respective day's total count. So after the analysis, the head data would look more like:

Count   Date   ID  DOW
497  1/4/2011 200   1
686  1/5/2011 200   2
342  1/6/2011 200   3
98   1/7/2011 200   4
419  1/8/2011 200   5
1209 1/11/2011 200  1

I've looked at using apply() and sweep() and dplyr, but found nothing that combines looping and subtracting inside of the same frame.

EDIT: To clarify, this data frame is much larger then my example. It has several dozen schools, each with their own school-specific IDs and daily absence counts but roughly the same dates in session. I have roughly 90,000 rows of observations, so a process that automatically iterates and checks for each date and school ID was what I was looking for.

So, if using the original subset of data above, where the 200 ID is the total of all elementary school absences for that day, then an example of the individual school's observations would be like:

Count  Date   ID   DOW
 3   1/4/2011 225   3
 8   1/5/2011 225   4
 12  1/6/2011 225   5
 2   1/7/2011 225   1
 6   1/8/2011 225   2
 7   1/11/2011 225  3
 ..... etc for all dates until end of calendar year for that school
 6   1/4/2011 230   3
 3   1/5/2011 230   4
 2   1/6/2011 230   5
 0   1/7/2011 230   1
 11  1/8/2011 230   2
 4   1/11/2011 230  3
 ..... etc for all dates until end of calendar year for that school, and repeat for next school school.
1
Your data is confusing. Can you provide a better representative example. Suppose, if there are many IDs for each 'Date', how will this change?akrun
I should have made clearer that my data frame is much larger then the one I showed, just as an example. It has over 2 years of dates available for several dozen schools, each with their own specific ID, like the tail example. I'll add another example to clarify it.Andrew Miller
You have to make a reproducible example that mimics the dataset rather ethan etc for all dates... and the expected output as we have to rely on testing the code with the example dataakrun

1 Answers

0
votes

We can do this with data.table

library(data.table)
setDT(df)[, .(Count = abs(diff(Count)), ID = ID[1L], DOW = DOW[1L]),
       by = .(Date, ID1 = 100* as.numeric(ID) %/%100)][, ID1 := NULL][]
#          Date Count  ID DOW
#1:  1/4/2011   497 200   1
#2:  1/5/2011   686 200   2
#3:  1/6/2011   342 200   3
#4:  1/7/2011    98 200   4
#5:  1/8/2011   419 200   5
#6: 1/11/2011  1209 200   1

data

df <- structure(list(Count = c(500L, 689L, 345L, 100L, 421L, 1211L, 
3L, 3L, 3L, 2L, 2L, 2L), Date = c("1/4/2011", "1/5/2011", "1/6/2011", 
"1/7/2011", "1/8/2011", "1/11/2011", "1/4/2011", "1/5/2011", 
"1/6/2011", "1/7/2011", "1/8/2011", "1/11/2011"), ID = c(200L, 
200L, 200L, 200L, 200L, 200L, 225L, 225L, 225L, 225L, 225L, 225L
), DOW = c(1L, 2L, 3L, 4L, 5L, 1L, 3L, 4L, 5L, 1L, 2L, 3L)),
.Names = c("Count", 
 "Date", "ID", "DOW"), row.names = c(NA, -12L), class = "data.frame")