4
votes

I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.

The tab-delimited format is the following:

A   1092    -   1093    +   1X
B   1093    HRDCPMRFYT
A   1093    +   1094    -   1X
B   1094    BSZSDFJRVF
A   1094    +   1095    +   1X
B   1095    SSTFCLEPVV
...

There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,

1092    -   1093    +   1X

B consistently has two columns:

1093    HRDCPMRFYT

Question: How do you parse a file with "alternating" rows with different formats?

Let's say that this was a text file which was only of this format, alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea how be to create the following format:

1092    -    1093    +    1X    1093    HRDCPMRFYT
1093    +    1094    -    1X    1094    BSZSDFJRVF
1094    +    1095    +    1X    1095    SSTFCLEPVV
... 
2
just to confirm: you are wanting to append the even rows to the odd (are they always alternating?), to create seven columns?user20650
@user20650 Yes, that's the preferable way to do this. If it's easier to create two data.tables, "A" and "B", that would be useful too I thinkShanZhengYang

2 Answers

5
votes

You can run shell commands using fread. In Win10, you can even run some linux utilities such as sed

Hence, you can simply do

fread("sed '$!N;s/\\n/ /' test.tab")
#      V1 V2   V3 V4      V5         V6
# 1: 1092  - 1093  + 1X 1093 HRDCPMRFYT
# 2: 1093  + 1094  - 1X 1094 BSZSDFJRVF
# 3: 1094  + 1095  + 1X 1095 SSTFCLEPVV

(sed syntax taken from here)


Data

text <- "1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV"

# Saving it as tab separated file on disk
write(gsub(" +", "\t", text), file = "test.tab")
4
votes

One way to go is to read in your data with readLines, pull out the bits you want, and pass to read.table to form the dataframe. So if the rows are alternating then:

txt <- 
'1092    -   1093    +   1X
1093    HRDCPMRFYT
1093    +   1094    -   1X
1094    BSZSDFJRVF
1094    +   1095    +   1X
1095    SSTFCLEPVV'


rd <- readLines(textConnection(txt))
data.frame(read.table(text=rd[c(TRUE, FALSE)]), 
           read.table(text=rd[c(FALSE, TRUE)]))

Change textConnection(txt) to your file path


Another way is to read in only once and then post-process

r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
d[ colSums(is.na(d)) < nrow(d)]