I currently have a motif search working in a series of for loops and would like to move to a nested tibble to improve speed and simplicity (ish). However, I cannot figure out how to store a tibble within a tibble so I can then unnest it. If that's not possible, tips on how to pass the lists (and an id column) so I could later join it to the original table would be appreciated.

Input: set of coordinates and the corresponding DNA sequence

Goals:
1) Find instances of the motif I care about
2) Combine those with the start or end of the range to create all pairs of starts and ends (where the found position can be either)
3) Determine the type of the pairing

I cannot figure out how to get mutate to accept a tibble (Error in mutate_impl(.data, dots) : Column `pairs` is of unsupported class data.frame). I can't call rowwise here because I need to send the whole list of positions to the function, as well as values from other columns.

test_input = tibble(
  start = c(1,10,15), 
  end = c(9, 14, 25),  
  sequence = c("GAGAGAGTC","CATTT", "TCACAGTTTCC")
)

custom_function = function(start, end, list.of.positions) {
  ## Doesn't include extra math, case specifications, and error handling here for simplicity
  starts = c(start, list.of.positions)
  ends = c(end, list.of.positions)
  pairs = expand.grid(starts, ends) %>% as_tibble %>% 
    mutate(type = case_when(TRUE ~ "a_type")) #Simplified for example to one case 
  return(pairs)
}

test_input %>% 
# for each set of coordinates/string
  rowwise() %>% 
  # find the positions of a given motif
  mutate(match.positions = regexp.match.ends(gregexpr("AG", sequence))) %>% 
  mutate(num.matches = case_when(
    is_logical(match.positions) ~ NA_integer_,
    TRUE ~ length(match.positions) 
  )) %>% 
  # expand and convert to real positions
  unnest %>% rowwise %>% 
  mutate(true.positions = case_when(
    is.na(match.positions) ~ NA_real_, #must be a double-compatible NA
    TRUE ~ start + match.positions - 1)) %>% 
  select(-match.positions) %>% 
  ungroup() %>% 
  # re-"nest" into a list of real positions
  group_by_at(vars(-true.positions)) %>% 
  summarise(true.positions = list(true.positions)) %>% 
  # pass list of real positions to a function that creates pairs of coordinates and determines the type of pair
  mutate(pairs = custom_function(start, end, true.positions))

My final tibble should look like this (after unnesting pairs):

  start   end  sequence      new.start  new.end   type  
  <dbl> <dbl>  <chr>         <dbl>      <dbl>    <chr>   
1     1     9  GAGAGAGTC     1          3        a_type
2     1     9  GAGAGAGTC     1          5        a_type
3     1     9  GAGAGAGTC     1          7        a_type
4     1     9  GAGAGAGTC     1          9        a_type
5     1     9  GAGAGAGTC     3          5        a_type
...
10    1     9  GAGAGAGTC     7          9        a_type
11    10    14 CATTT         10         14       a_type
...

One workaround I thought of was to paste the output values into a string, pass it back as a list (which the tibble tolerates), unnest it, and then separate it, but surely there's a less hacky way to go about this. Many thanks for your help/ideas!

1 Answer

I'm not familiar with the subject matter, but I think I can piece together what you're trying to do. I like using the stringr package, as it handles much of this with simpler syntax.

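library(dplyr)    # tibble(), mutate(), rename(), left_join(), bind_rows(), %>%
library(stringr)  # str_locate_all()

test_input <- tibble(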
  start = c(1,10,15), 
  end = c(9, 14, 25),  
  sequence = c("GAGAGAGTC","CATTT", "TCACAGTTTCC")
)

custom_function <- function(string, pattern, label) {
    string %>%
        str_locate_all(pattern) %>%    # get the start-end pairs (a list holding one matrix).
        as.data.frame() %>%    # make it a data.frame with columns start and end.
        expand.grid() %>%    # all combos. this seemed important.
        mutate(
            sequence = string,
            type = label
            ) %>%    # add the string and label to each row.
        rename(
            new_start = start,    # rename so we don't confuse columns.
            new_end = end         # I prefer not to use dots in my names.
            ) %>%
        left_join(test_input, by = "sequence")    # add the original start and ends
        # returned df has cols: new_start, new_end, sequence, type, start, end.
}

final_out <- data.frame(
    start = numeric(0),
    end = numeric(0),
    sequence = character(0),
    new_start = numeric(0),
    new_end = numeric(0)
    )    # empty dummy DF that we'll add to.

for (string in test_input$sequence) {
    final_out <- custom_function(string = string,
                                 pattern = 'AG',
                                 label = 'a_type') %>%
        bind_rows(final_out)
}    # add the rows of each output to the final DF we made.

print(final_out)

It seemed like you were trying to label the result based on the pattern you supplied, so you can specify 'a_type' or whatever label you want.

There may be a way to do this without the for loop by using a map or apply function. I'd have to tinker around more to figure that out though.
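
As a rough sketch of what the loop-free version might look like, the function can return a tibble per row, `purrr::pmap()` can collect those tibbles into a list-column, and `tidyr::unnest()` can expand them. This assumes purrr and tidyr (1.0 or later) are installed. `make_pairs` is a hypothetical helper written for this example, and the `filter()` that keeps only pairs with `new.start < new.end` is my reading of the expected output in the question, not something the question states.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

test_input <- tibble(
  start = c(1, 10, 15),
  end = c(9, 14, 25),
  sequence = c("GAGAGAGTC", "CATTT", "TCACAGTTTCC")
)

# Hypothetical helper: builds every (new.start, new.end) pair for one row.
make_pairs <- function(start, end, sequence, pattern = "AG") {
  # End position of each motif match, shifted to real coordinates.
  hits <- str_locate_all(sequence, pattern)[[1]][, "end"] + start - 1
  expand.grid(new.start = c(start, hits), new.end = c(end, hits)) %>%
    as_tibble() %>%
    filter(new.start < new.end) %>%   # assumption: keep only forward pairs
    mutate(type = "a_type")
}

result <- test_input %>%
  # pmap() returns a list, so each cell of `pairs` holds a whole tibble.
  mutate(pairs = pmap(list(start, end, sequence), make_pairs)) %>%
  unnest(pairs)

print(result)
```

The point is that `mutate()` accepts a list whose elements are tibbles, even though it rejects a bare data frame returned for the whole column. Because `pmap()` walks the rows itself, no `rowwise()` call is needed, and the original `start`, `end`, and `sequence` columns are carried along by `unnest()` without a join.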

Hopefully that helps, or at least leads you in the right direction. Like I said, I am not familiar with the subject matter.