1
votes

Apologies if the example is not formatted properly.

I have a data set with one sample per row, the data contain two columns with reference numbers of the start value and end value.

  • cell A1 = Sample #1
  • cell B1 = 101-263 (start value)
  • cell C1 = 101-266 (end value)

  • cell A2 = Sample #2

  • cell B2 = 162-186 (start value)
  • cell C2 = 162-187 (end value)

The range of values is a different length of each row of data, with a maximum range of 8 values. I need to fill in the values in the range, with each value in a cell along the row.

So for sample #1 above I need to create the cell values: D1 = 101-264, and E1 = 101-265

While for sample #2 there will be no extra cells needed.

Is there a formula (using Vlookup and If perhaps?) that I can create and drag across all rows and over the 8 needed columns to fill in this data? (I don't mind if there are N/A in the shorter-range rows)

If there is an easier way using R also fine with me.

Thanks for any advice

2

2 Answers

0
votes

The following code:

library(magrittr)
library(plyr)
library(reshape2)

# Create input example
dat = data.frame(
  sample = c("Sample #1", "Sample #2"), 
  start = c("101-263", "162-186"), 
  end = c("101-266", "162-187"),
  stringsAsFactors = FALSE
  )

# Extract 'start' and 'end' values
dat$num1 = dat$start %>% strsplit("-") %>% sapply("[", 1)
dat$start2 = dat$start %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$end2 = dat$end %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$start = NULL
dat$end = NULL

# For each row
for(i in 1:nrow(dat)) {

  # Check if there is any need to add entries
  if((dat$end2[i] - dat$start2[i]) > 1) {

    # For each entry
    for(j in seq(dat$start2[i], dat$end2[i] -1)) {

      # Create entry
      new_entry = data.frame(
        sample = dat$sample[i],
        num1 = dat$num1[i],
        start2 = dat$start2[i],
        end2 = j,
        stringsAsFactors = FALSE
      )

      # Add to table
      dat = rbind(dat, new_entry)

    }

  }

}

# Calculate all values
dat$value = paste0(dat$num1, "-", dat$end2)
dat = dat[, c("sample", "value")]

# Create column labels
dat = ddply(
  dat,
  "sample",
  transform,
  var = paste0("col", rank(value))
)

# Reshape to required format
dat = dcast(dat, sample ~ var, value.var = "value")

Does what you asked on the provided example.

It transforms this table -

     sample   start     end
1 Sample #1 101-263 101-266
2 Sample #2 162-186 162-187

Into this one -

     sample    col1    col2    col3    col4
1 Sample #1 101-263 101-264 101-265 101-266
2 Sample #2 162-187    <NA>    <NA>    <NA>

If there is a larger example for testing will be happy to do so :)

0
votes

Please try in D1 copied across eight columns and then D1:K1 copied down to suit:

=IF(1*RIGHT($C1,3)>RIGHT($B1,3)+COLUMN()-3,LEFT($B1,4)&RIGHT($B1,3)+COLUMN()-3,"")

The condition (IF) checks whether or not to display a result (or 'blank' "", for neater presentation) depending on the result equalling or exceeding the upper limit specified in ColumnC.

There is some text manipulation (RIGHT and LEFT) to get at the part that is to be integer incremented or to add back the static part.

COLUMN() returns the column number (A>1, B>2 etc) so is useful as a kind of stepping function. In D1 COLUMN()-3 is 4-3 or 1 so there 1 is added to the start of the range (shown on the right of B1). When copied across to ColumnE COLUMN()-3 becomes 5-3, so 2 is added to the start of the range.