3
votes

I have a large dataset, contained in a .txt file, that is broken into rows, without columns. Unfortunately, the rows are clustered by case. It looks a bit like this:

v1(case1): a   
v2(case1): b
v3(case1): c

v1(case2): d
v2(case2): e
v3(case2): f

…and so on. I tried using read.table to separate the variable names from the data, using this command:

data1 <- read.table("Data.txt", header = FALSE, sep = ":", fill=TRUE)

…but it wasn't completely effective: in some rows the variable names landed in the "V1" column and in others they did not, leading to this situation:

V1            V2
1   v1case1   a
2   v2case1   b 
3   v3case1   c
4   v1case2   d
5   v2case2   e
6   v3case2   f
7            v1case3
8            v2case3
9            v3case3

Any suggestions on a better way of either a) extracting all of the variable names into a separate column (so that I can use them to create new variables that will pull the relevant data for each variable into a column using "if/else") or b) a different way of putting this dataset into row/column format?

All advice much appreciated.


2 Answers

2
votes

stringr and plyr can help here if you start with readLines():

library(stringr)
library(plyr)

dat <- readLines("rows.txt")
print(dat)
## [1] "v1(case1): a" "v2(case1): b" "v3(case1): c" "v1(case2): d" "v2(case2): e" "v3(case2): f"

# Capture groups: variable name, case name (inside the parentheses), value after the colon
x <- ldply(str_match_all(dat, "^([[:alnum:]]+)\\(([[:alnum:]]+)\\): +([[:alnum:]]+)"))[,2:4]
print(x)
##    2     3 4
## 1 v1 case1 a
## 2 v2 case1 b
## 3 v3 case1 c
## 4 v1 case2 d
## 5 v2 case2 e
## 6 v3 case2 f

I'm not entirely sure what you need the resulting data frame to look like, but reshape or reshape2 can get you the rest of the way there.
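As a rough sketch of that last step, assuming the goal is one row per case and one column per variable, base R's reshape() can pivot the long result. The column names variable, case and value are made up here for illustration, and the data frame is rebuilt by hand so the snippet runs on its own:

```r
# Long-format data in the shape produced above.
# Column names are assumptions chosen for this example.
x <- data.frame(
  variable = c("v1", "v2", "v3", "v1", "v2", "v3"),
  case     = c("case1", "case1", "case1", "case2", "case2", "case2"),
  value    = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Pivot to wide format: one row per case, one column per variable
wide <- reshape(x, idvar = "case", timevar = "variable", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL

print(wide)
```

This should give a data frame with columns case, v1, v2 and v3, which is probably closer to the row/column layout asked for in the question.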

0
votes

Using only base R:

dat = as.data.frame(scan('Data.txt', sep = ':', 
                    what = list(case = character(), value = character()), 
                    strip.white = TRUE, blank.lines.skip = TRUE))

The blank.lines.skip option works around the empty-lines problem. If needed, you can further process the case names using the suggestions by @hrbrmstr.
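If you would rather stay in base R for that step too, here is a minimal sketch, assuming every label has the form name(case) as in the question. The data frame below is a hand-built stand-in for what scan() returns, and the new column name variable is just an illustrative choice:

```r
# Stand-in for the scan() result: 'case' holds labels such as "v1(case1)"
dat <- data.frame(
  case  = c("v1(case1)", "v2(case1)", "v3(case1)", "v1(case2)"),
  value = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Group 1 is the text before the parentheses, group 2 the text inside them
pattern <- "^([[:alnum:]]+)\\(([[:alnum:]]+)\\)$"

# Extract the variable name first, then overwrite 'case' with the bare case name
dat$variable <- sub(pattern, "\\1", dat$case)
dat$case     <- sub(pattern, "\\2", dat$case)

print(dat)
```

Labels that do not match the pattern are left unchanged by sub(), so it is worth checking for those before relying on the split columns.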