0
votes

I have a data frame which looks like this:

> df
                   V1         V2    V3 V4       V5                 V6 V7
1  chr1:859582-899582    AHR.pfm 33440  - 9.188581          gcacgcaac NA
2  chr2:859582-899582   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA
3  chr1:859582-899582 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA
4 chr13:859582-899582   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA
5 chr21:859582-899582 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA
6  chrX:859582-899582 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA

I want to append multiple new columns to this dataframe and all of these columns will have information from within this data frame. Details are below:

  1. append column chr having values strsplit(df$V1,":")[[1]] (split first column values on : and assign the first index to this new column)
  2. append column start having values strsplit(df$V1,":")[[2]] => strsplit(df$V1,"-")[[1]] (split first column value firstly on : take 2nd index and then split it on - and assign first index to this new column)
  3. append column end having values df$start + length(df$V6) (add in the value of start, the length of characters in V6 column)
  4. append column TF having values strsplit(df$V2,".")[[1]] (split second column values on . and assign the first index to this new column

So that appended new column looks like:

chr1 859582 859591  AHR
chr2 859582 859600  AIRE
chr1 859582 859600  AIRE
chr13 859582 859594 ALX1
chr21 859582 859594 ALX1
chrX 859582 859594  ALX1
2

2 Answers

3
votes

1) tidyr/dplyr This separates column V1 into chr, start and end columns and then appends a TF column with everything before the first dot in V2. An alternative to the mutate might be separate(V2, "TF", extra = "drop") which would drop V2 and put TF in its place. Add the separate argument remove=FALSE if you want to keep the original columns as well.

library(dplyr)
library(tidyr)

library(dplyr)
library(tidyr)
df %>% 
   separate(V1, c("chr", "start", "end"), convert = TRUE) %>%
   mutate(TF = sub("[.].*", "", V2), end = end + nchar(V6))

giving:

    chr  start    end         V2    V3 V4       V5                 V6 V7   TF
1  chr1 859582 899591    AHR.pfm 33440  - 9.188581          gcacgcaac NA  AHR
2  chr2 859582 899600   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA AIRE
3  chr1 859582 899600 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA AIRE
4 chr13 859582 899594   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA ALX1
5 chr21 859582 899594 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA ALX1
6  chrX 859582 899594 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA ALX1

2) no packages A way to do it with only base R is:

nms <- c("chr", "start", "end")
df2 <- cbind(df[-1], 
 read.table(text = sub("-", ":", df$V1), sep = ":", as.is = TRUE, col.names = nms), 
 TF = sub("[.].*", "", df$V2), 
 stringsAsFactors = FALSE)
 df2 <- transform(df2, end = end + nchar(V6))

giving:

> df2
          V2    V3 V4       V5                 V6 V7   chr  start    end   TF
1    AHR.pfm 33440  - 9.188581          gcacgcaac NA  chr1 859582 899591  AHR
2   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA  chr2 859582 899600 AIRE
3 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA  chr1 859582 899600 AIRE
4   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA chr13 859582 899594 ALX1
5 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA chr21 859582 899594 ALX1
6 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA  chrX 859582 899594 ALX1

Note: df in reproducible form:

Lines <- "
                   V1         V2    V3 V4       V5                 V6 V7
1  chr1:859582-899582    AHR.pfm 33440  - 9.188581          gcacgcaac NA
2  chr2:859582-899582   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA
3  chr1:859582-899582 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA
4 chr13:859582-899582   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA
5 chr21:859582-899582 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA
6  chrX:859582-899582 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA"
df <- read.table(text = Lines, as.is = TRUE)
1
votes

You can write a function fun to do this, with reapeted calls to sapply/strsplit.

fun <- function(DF){
    chr <- sapply(strsplit(DF[[1]], ":"), `[`, 1)
    start <- sapply(strsplit(DF[[1]], ":"), `[`, 2)
    end <- as.integer(sapply(strsplit(start, "-"), `[`, 1)) + nchar(DF[[6]])
    start <- sapply(strsplit(start, "-"), `[`, 1)
    TF <- sapply(strsplit(DF[[2]], "\\."), `[`, 1)
    cbind(DF, data.frame(chr, start, end, TF))
}

fun(df)

Note that the data must be read in with both columns V1 and V2 of class character, not factor. If they are of class factor coerce them to character first or, if applicable, set argument stringsAsFactors = FALSE.

DATA.

df <-
structure(list(V1 = c("chr1:859582-899582", "chr2:859582-899582", 
"chr1:859582-899582", "chr13:859582-899582", "chr21:859582-899582", 
"chrX:859582-899582"), V2 = c("AHR.pfm", "AIRE.pfm", "AIRE.3.pfm", 
"ALX1.pfm", "ALX1.1.pfm", "ALX1.2.pfm"), V3 = c(33440L, 7387L, 
30639L, 11835L, 16260L, 20686L), V4 = c("-", "+", "-", "+", "+", 
"+"), V5 = c(9.188581, 7.982141, 8.127811, 7.48571, 9.529333, 
9.241755), V6 = c("gcacgcaac", "TCTGGTTCAGTTGGATGC", "aaaaccaaacaaacaaaa", 
"GTAATTGTGTTA", "GTAATTAATTTA", "CTAATTAATTTA"), V7 = c(NA, NA, 
NA, NA, NA, NA)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", 
"V7"), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))