I have a raw text file in the following format:
RELEASE VERSION: 20150514 (May 14, 2015)
======================================================================== VERSION
STUDY VARIABLE: Version Number Of Release
QUESTION: --------- Version of Cumulative Data File
NOTES: ------ This variable appears in the data as: ANES_cdf_VERSION:YYYY-mmm-DD where mmm is standard 3-character month abbreviation (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec).
TYPE: ----- Character-1
======================================================================== VCF0004
STUDY VARIABLE: Year of Study
QUESTION: --------- Year of study (4-digit)
TYPE: ----- Numeric Dec 0-1
===================================================================== VCF0006 ... and so on
Observations are bounded by "=" row and each of observation has some amount of variables (not all may be presented)
I am trying to create a data table out of it.
I created a vector of observations, in each observations columns are separated by '|'. Then I use fread to make a data table:
dt <- fread(paste(rawObs, collapse = '\n'),sep = '|',header = F, fill = T)
However, this is not really a solution. Fill = T only considers missing columns at the end of the observations and not in between:
In the example we have it should be:
id | study_var | question | notes | type
version | s1 | q1 | notes1 | character-1
VCF0004 | s2 | q2 | NA | numeric
But R creates it as
id | study_var | question | notes | type
version | s1 | q1 | notes1 | character-1
VCF0004 | s2 | q2 | numeric | NA
Type of the second observation is shifted leftward. As a solution, I was thinking to determine a missing columns within each observation and insert NAs explicitly in the input file, using max number of variables found but it might be slow for large files.
Thanks for help. Any comments are appreciated. Here is all code:
library(magrittr)
library(data.table)
path <- 'Downloads/anes_timeseries_cdf_codebook_var.txt'
raw_data <- readLines(path)
head(raw_data)
#remove empty lines
raw_data <- raw_data[raw_data != ""]
#remove header
raw_data <- raw_data[-c(1,2)]
data_entries_index <- grep('^=+', raw_data)+1
#add end position of the last observation
data_entries_index <- c(data_entries_index, length(raw_data))
#opening file shows editor couldn't read two characters - we can ignore it though
data_entries_index
parseRawObservation <- function(singleRawObs, VariableIndex){
count=length(VariableIndex)-1
for (i in 1:count){
start = VariableIndex[i]+2
end = VariableIndex[i+1]-1
varValue <- paste(singleRawObs[start:end],collapse = ' ')
if (i==1)
obsSpaced <- varValue
else
obsSpaced <- paste(obsSpaced,varValue, sep = '|')
}
obsSpaced
}
#create a vector of raw observations
numObs <- length(data_entries_index)
count=numObs-1
rawObs=vector()
for (i in 1:count) {
start <- data_entries_index[i]
end <- data_entries_index[i+1]-2
singleRawObs <-raw_data[start:end]
VariableIndex <- grep("^-+",singleRawObs)-1
#add end of the last variable index
VariableIndex <- c(VariableIndex, length(singleRawObs)+1)
rawObs[i] <- parseRawObservation(singleRawObs,VariableIndex)
#add first two columns separately as they do not have dashes at the next line
rawObs[i] <- paste(singleRawObs[1], singleRawObs[2], rawObs[i], sep = '|')
}
#determine max number of fields
numOfCol <- max(sapply(rawObs, FUN = function(x) length(strsplit(x,'|')[[1]])))
which.max(sapply(rawObs, FUN = function(x) length(strsplit(x,'|')[[1]])))
dt <- fread(textConnection(rawObs),sep = '|',header = F)
dt <- fread(paste(rawObs[1:2], collapse = '\n'),sep = '|',header = F, fill = T)
rawObs[653]