R data processing involving strsplit

Question

I have a data frame which looks like this:

> df
                   V1         V2    V3 V4       V5                 V6 V7
1  chr1:859582-899582    AHR.pfm 33440  - 9.188581          gcacgcaac NA
2  chr2:859582-899582   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA
3  chr1:859582-899582 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA
4 chr13:859582-899582   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA
5 chr21:859582-899582 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA
6  chrX:859582-899582 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA

I want to append multiple new columns to this dataframe and all of these columns will have information from within this data frame. Details are below:

append column chr having values strsplit(df$V1,":")[[1]] (split first column values on : and assign the first index to this new column)
append column start having values strsplit(df$V1,":")[[2]] => strsplit(df$V1,"-")[[1]] (split first column value firstly on : take 2nd index and then split it on - and assign first index to this new column)
append column end having values df$start + length(df$V6) (add in the value of start, the length of characters in V6 column)
append column TF having values strsplit(df$V2,".")[[1]] (split second column values on . and assign the first index to this new column

So that appended new column looks like:

chr1 859582 859591  AHR
chr2 859582 859600  AIRE
chr1 859582 859600  AIRE
chr13 859582 859594 ALX1
chr21 859582 859594 ALX1
chrX 859582 859594  ALX1

G. Grothendieck G. Grothendieck · Accepted Answer · 2017-10-30T18:24:34

1) tidyr/dplyr This separates column V1 into chr, start and end columns and then appends a TF column with everything before the first dot in V2. An alternative to the mutate might be separate(V2, "TF", extra = "drop") which would drop V2 and put TF in its place. Add the separate argument remove=FALSE if you want to keep the original columns as well.

library(dplyr)
library(tidyr)

library(dplyr)
library(tidyr)
df %>% 
   separate(V1, c("chr", "start", "end"), convert = TRUE) %>%
   mutate(TF = sub("[.].*", "", V2), end = end + nchar(V6))

giving:

    chr  start    end         V2    V3 V4       V5                 V6 V7   TF
1  chr1 859582 899591    AHR.pfm 33440  - 9.188581          gcacgcaac NA  AHR
2  chr2 859582 899600   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA AIRE
3  chr1 859582 899600 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA AIRE
4 chr13 859582 899594   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA ALX1
5 chr21 859582 899594 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA ALX1
6  chrX 859582 899594 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA ALX1

2) no packages A way to do it with only base R is:

nms <- c("chr", "start", "end")
df2 <- cbind(df[-1], 
 read.table(text = sub("-", ":", df$V1), sep = ":", as.is = TRUE, col.names = nms), 
 TF = sub("[.].*", "", df$V2), 
 stringsAsFactors = FALSE)
 df2 <- transform(df2, end = end + nchar(V6))

giving:

> df2
          V2    V3 V4       V5                 V6 V7   chr  start    end   TF
1    AHR.pfm 33440  - 9.188581          gcacgcaac NA  chr1 859582 899591  AHR
2   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA  chr2 859582 899600 AIRE
3 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA  chr1 859582 899600 AIRE
4   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA chr13 859582 899594 ALX1
5 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA chr21 859582 899594 ALX1
6 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA  chrX 859582 899594 ALX1

Note: df in reproducible form:

Lines <- "
                   V1         V2    V3 V4       V5                 V6 V7
1  chr1:859582-899582    AHR.pfm 33440  - 9.188581          gcacgcaac NA
2  chr2:859582-899582   AIRE.pfm  7387  + 7.982141 TCTGGTTCAGTTGGATGC NA
3  chr1:859582-899582 AIRE.3.pfm 30639  - 8.127811 aaaaccaaacaaacaaaa NA
4 chr13:859582-899582   ALX1.pfm 11835  + 7.485710       GTAATTGTGTTA NA
5 chr21:859582-899582 ALX1.1.pfm 16260  + 9.529333       GTAATTAATTTA NA
6  chrX:859582-899582 ALX1.2.pfm 20686  + 9.241755       CTAATTAATTTA NA"
df <- read.table(text = Lines, as.is = TRUE)

R data processing involving strsplit

2 Answers