I'm analyzing data where I want to count how many rows contain a particular combination of field values within a range defined by another field. I'm relatively new to R and have not been able to figure out sliding windows, which is what this seems to be.

What I want, in non-code terms, is:
- The number of occurrences of "m4C" (column 1) on the positive strand (column 6) in the first 200 bp (column 2)
- The same count repeated for the next 200 bp, and so on, for any window size I choose.
I'll also be modifying this to look at the other modification types and both strands once I have the basic snippet down.

I've looked at rollapply from zoo, but don't know how to craft a function for this combination of conditions. I've also seen https://stats.stackexchange.com/questions/3051/mean-of-a-sliding-window-in-r, but I don't know enough to move beyond the simple function used there. I've figured out a way to do this in Excel using COUNTIFS:

COUNTIFS($A:$A,"m4C",$F:$F,"+",$B:$B,">"&$S3,$B:$B,"<"&$T3)

where the references to columns S and T give the lower and upper bounds of the window, but I'd like to use R for, well, reasons.

Example data:

type    start   end seqid   score   strand
m6A 2   2   NC_002932.3 44  -
modified_base   20  20  NC_002932.3 41  -
m6A 57  57  NC_002932.3 451 -
modified_base   69  69  NC_002932.3 55  +
m6A 80  80  NC_002932.3 540 +
modified_base   93  93  NC_002932.3 55  +
m4C 139 139 NC_002932.3 37  +
m6A 196 196 NC_002932.3 422 +
m4C 200 200 NC_002932.3 40  +
m6A 204 204 NC_002932.3 571 -
m6A 210 210 NC_002932.3 477 -
m6A 255 255 NC_002932.3 500 -
modified_base   264 264 NC_002932.3 32  +

Desired output from the above:
0-200 1
200-400 1

Real files have tens of thousands of lines. Thanks in advance for any help. I have the data as tab separated files that I'd happily read into any form to get what I want, but have been playing with data.frame because that is what I know so far.

1 Answer

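Here's a dplyr approach. Integer division of start by the window size assigns each row to a window, and the count is restricted to m4C on the positive strand:

library(dplyr)
df2 <- df %>%
  # window index: start 0-199 -> 1, 200-399 -> 2, ...
  group_by(grp = start %/% 200 + 1) %>%
  summarize(min = min(start),   # smallest observed start in the window
            max = max(start),   # largest observed start in the window
            count = sum(type == "m4C" & strand == "+"))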
#> df2
## A tibble: 2 x 4
#    grp   min   max count
#  <dbl> <dbl> <dbl> <int>
#1     1     2   196     1
#2     2   200   264     1



# additional step to match the desired output more closely
# (labels use the observed min and max start, not the window bounds)
df2 %>%
  mutate(group = paste0(min, "-", max)) %>%
  select(group, count)

## A tibble: 2 x 2
#  group   count
#  <chr>   <int>
#1 2-196       1
#2 200-264     1
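
The labels above come from the observed minimum and maximum start in each group, so they read 2-196 rather than 0-200, and a window that contains no rows would be missing from the result. Here is a minimal base R sketch of one way around that, assuming half-open windows [0, 200), [200, 400) as implied by start %/% 200. It uses cut() to assign fixed window labels and table() to keep empty windows as zero counts. The names window_size, breaks and labs are illustrative, and the data frame holds only a few rows from the question.

```r
# A few rows from the question (type, start, strand only)
df <- data.frame(
  type   = c("m6A", "m4C", "m6A", "m4C", "m6A", "modified_base"),
  start  = c(80, 139, 196, 200, 204, 264),
  strand = c("+", "+", "+", "+", "-", "+"),
  stringsAsFactors = FALSE
)

window_size <- 200  # assumed window width in bp; change as needed

# Break points that cover every position: 0, 200, 400, ...
breaks <- seq(0, (max(df$start) %/% window_size + 1) * window_size,
              by = window_size)

# Labels such as "0-200"; format() avoids scientific notation on large positions
b    <- format(breaks, scientific = FALSE, trim = TRUE)
labs <- paste0(head(b, -1), "-", tail(b, -1))

# right = FALSE gives [0,200), [200,400), matching start %/% window_size
df$window <- cut(df$start, breaks = breaks, right = FALSE, labels = labs)

# Keep only m4C on the positive strand, then count per window.
# The factor levels survive the subset, so empty windows are reported as 0.
hits   <- df[df$type == "m4C" & df$strand == "+", ]
counts <- as.data.frame(table(window = hits$window), responseName = "count")
counts
#   window count
# 1  0-200     1
# 2 200-400    1
```

Because position 200 falls in the second window here, the result agrees with the desired output in the question.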

Loading data (run this before the code above):

df <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  text = "type    start   end seqid   score   strand
m6A 2   2   NC_002932.3 44  -
modified_base   20  20  NC_002932.3 41  -
m6A 57  57  NC_002932.3 451 -
modified_base   69  69  NC_002932.3 55  +
m6A 80  80  NC_002932.3 540 +
modified_base   93  93  NC_002932.3 55  +
m4C 139 139 NC_002932.3 37  +
m6A 196 196 NC_002932.3 422 +
m4C 200 200 NC_002932.3 40  +
m6A 204 204 NC_002932.3 571 -
m6A 210 210 NC_002932.3 477 -
m6A 255 255 NC_002932.3 500 -
modified_base   264 264 NC_002932.3 32  +")
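
The question also mentions extending this to the other modification types, both strands, and other window sizes. One possible sketch, assuming dplyr is installed, wraps the grouping in a small function. count_by_window is a hypothetical helper, not part of the original answer or of dplyr, and the data frame again holds only a few rows from the question.

```r
library(dplyr)

# A few rows from the question (type, start, strand only)
df <- data.frame(
  type   = c("m6A", "m4C", "m6A", "m4C", "m6A", "modified_base"),
  start  = c(80, 139, 196, 200, 204, 264),
  strand = c("+", "+", "+", "+", "-", "+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per window/type/strand combination that occurs.
# Windows are half-open, [window_start, window_end).
count_by_window <- function(data, window_size = 200) {
  data %>%
    mutate(window_start = (start %/% window_size) * window_size,
           window_end   = window_start + window_size) %>%
    count(window_start, window_end, type, strand, name = "count")
}

res <- count_by_window(df, window_size = 200)
res

# Pull out a single combination, e.g. m4C on the positive strand
filter(res, type == "m4C", strand == "+")
```

Unlike a table() over fixed factor levels, count() only returns combinations that actually occur, so windows with no matching rows are simply absent from the result.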