I'm analyzing data where I want to count the number of rows that match a particular combination of fields, within a position range (window) defined by another field. I'm relatively new to R and have not been able to figure out sliding windows, which is what this seems to be.
What I want in non-code is:
- The number of occurrences of "m4C" (column 1, type) on the positive strand (column 6, strand) in the first 200 bp (column 2, start)
- Repeat the above for the next 200 bp, and so on, with any window size I choose.
I'll also be modifying it to look at the other types of modifications and both strands once I have the basic snippet working.
I've looked at rollapply from zoo, but don't know how to craft a function for this combination of conditions. I've also seen https://stats.stackexchange.com/questions/3051/mean-of-a-sliding-window-in-r, but I don't know enough to move beyond the simple function used there. I've figured out a way to do this in Excel using COUNTIFS:
COUNTIFS($A:$A,"m4C",$F:$F,"+",$B:$B,">"&$S3,$B:$B,"<"&$T3)
where columns S and T hold the lower and upper bounds of each window, respectively, but I'd like to use R for, well, reasons.
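For a single window, the COUNTIFS formula above could be mirrored in base R roughly as follows. This is only a sketch: the data frame `mods` and the helper `count_in_window` are hypothetical names made up for illustration, and the rows are a subset of the example data. Like the formula, it uses strict inequalities, so a position exactly equal to a bound is not counted.

```r
# Hypothetical data frame with the relevant columns (subset of the example data)
mods <- data.frame(
  type   = c("m6A", "m4C", "m6A", "m4C", "m6A"),
  start  = c(80, 139, 196, 200, 204),
  strand = c("+", "+", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count rows of one modification type on one strand
# whose start lies strictly between `lower` and `upper` (as in COUNTIFS)
count_in_window <- function(df, mod_type, strand_sign, lower, upper) {
  sum(df$type == mod_type &
        df$strand == strand_sign &
        df$start > lower &
        df$start < upper)
}

# One window, equivalent to a single row of the spreadsheet
n_first <- count_in_window(mods, "m4C", "+", 0, 200)
print(n_first)
```

Calling this once per window would work, but it rescans the whole table each time, which is why a binning approach is probably a better fit for files with tens of thousands of lines.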
Example data:
type start end seqid score strand
m6A 2 2 NC_002932.3 44 -
modified_base 20 20 NC_002932.3 41 -
m6A 57 57 NC_002932.3 451 -
modified_base 69 69 NC_002932.3 55 +
m6A 80 80 NC_002932.3 540 +
modified_base 93 93 NC_002932.3 55 +
m4C 139 139 NC_002932.3 37 +
m6A 196 196 NC_002932.3 422 +
m4C 200 200 NC_002932.3 40 +
m6A 204 204 NC_002932.3 571 -
m6A 210 210 NC_002932.3 477 -
m6A 255 255 NC_002932.3 500 -
modified_base 264 264 NC_002932.3 32 +
Desired output from the above:
0-200 1
200-400 1
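To make the desired output concrete, here is a minimal base R sketch of one possible way to bin the positions, using `cut()` and `table()` instead of a true sliding window. The names `mods`, `window_size`, `hits`, `breaks` and `counts` are assumptions chosen for illustration. It also assumes half-open windows [0, 200), [200, 400), so that position 200 falls in the second window, which is what the desired output above implies.

```r
# Example data from the question, read from inline text for a self-contained demo
mods <- read.table(
  header = TRUE,
  colClasses = c("character", "integer", "integer",
                 "character", "integer", "character"),
  text = "
type start end seqid score strand
m6A 2 2 NC_002932.3 44 -
modified_base 20 20 NC_002932.3 41 -
m6A 57 57 NC_002932.3 451 -
modified_base 69 69 NC_002932.3 55 +
m6A 80 80 NC_002932.3 540 +
modified_base 93 93 NC_002932.3 55 +
m4C 139 139 NC_002932.3 37 +
m6A 196 196 NC_002932.3 422 +
m4C 200 200 NC_002932.3 40 +
m6A 204 204 NC_002932.3 571 -
m6A 210 210 NC_002932.3 477 -
m6A 255 255 NC_002932.3 500 -
modified_base 264 264 NC_002932.3 32 +
")

window_size <- 200  # assumed window width in bp; change as needed

# Keep only the rows matching the type and strand of interest
hits <- mods[mods$type == "m4C" & mods$strand == "+", ]

# Window boundaries covering every position in the file
breaks <- seq(0,
              ceiling((max(mods$start) + 1) / window_size) * window_size,
              by = window_size)

# Assign each hit to a half-open window [lower, upper) and count per window
window <- cut(hits$start,
              breaks = breaks,
              right  = FALSE,
              labels = paste(head(breaks, -1), tail(breaks, -1), sep = "-"))
counts <- table(window)
print(counts)
```

Because `cut()` returns a factor with one level per window, windows with no hits still appear in `counts` with a count of 0. For the real tab-separated files, the inline `text =` argument would be replaced by a file path and `sep = "\t"`.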
Real files have tens of thousands of lines. Thanks in advance for any help. I have the data as tab-separated files that I'd happily read into any structure to get what I want, but I have been working with data.frame because that is what I know so far.