0
votes

I have a raw text file in the following format:

RELEASE VERSION: 20150514 (May 14, 2015)

======================================================================== VERSION

STUDY VARIABLE: Version Number Of Release

QUESTION: --------- Version of Cumulative Data File

NOTES: ------ This variable appears in the data as: ANES_cdf_VERSION:YYYY-mmm-DD where mmm is standard 3-character month abbreviation (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec).

TYPE: ----- Character-1

======================================================================== VCF0004

STUDY VARIABLE: Year of Study

QUESTION: --------- Year of study (4-digit)

TYPE: ----- Numeric Dec 0-1

===================================================================== VCF0006 ... and so on

Observations are bounded by "=" row and each of observation has some amount of variables (not all may be presented)

I am trying to create a data table out of it.

I created a vector of observations, in each observations columns are separated by '|'. Then I use fread to make a data table:

dt <- fread(paste(rawObs, collapse = '\n'),sep = '|',header = F, fill = T)

However, this is not really a solution. Fill = T only considers missing columns at the end of the observations and not in between:

In the example we have it should be:

id      | study_var | question | notes    | type
version | s1        | q1       | notes1   | character-1
VCF0004 | s2        | q2       | NA       | numeric

But R creates it as

id      | study_var | question | notes    | type
version | s1        | q1       | notes1   | character-1
VCF0004 | s2        | q2       | numeric  | NA

Type of the second observation is shifted leftward. As a solution, I was thinking to determine a missing columns within each observation and insert NAs explicitly in the input file, using max number of variables found but it might be slow for large files.

Thanks for help. Any comments are appreciated. Here is all code:

library(magrittr)
library(data.table)
path <- 'Downloads/anes_timeseries_cdf_codebook_var.txt'
raw_data <- readLines(path)
head(raw_data)

#remove empty lines
raw_data <- raw_data[raw_data != ""]

#remove header
raw_data <- raw_data[-c(1,2)]
data_entries_index <-  grep('^=+', raw_data)+1

#add end position of the last observation
data_entries_index <- c(data_entries_index, length(raw_data))

#opening file shows editor couldn't read two characters - we can ignore it though
data_entries_index

parseRawObservation <- function(singleRawObs, VariableIndex){
    count=length(VariableIndex)-1
    for (i in 1:count){
      start = VariableIndex[i]+2
      end = VariableIndex[i+1]-1
      varValue <- paste(singleRawObs[start:end],collapse = ' ')
      if (i==1)
        obsSpaced <- varValue
      else
        obsSpaced <- paste(obsSpaced,varValue, sep = '|')
    }
    obsSpaced
}


#create a vector of raw observations
numObs <- length(data_entries_index)
count=numObs-1
rawObs=vector()
for (i in 1:count) {
  start <- data_entries_index[i]
  end <- data_entries_index[i+1]-2
  singleRawObs <-raw_data[start:end]
  VariableIndex <- grep("^-+",singleRawObs)-1

  #add end of the last variable index
  VariableIndex <- c(VariableIndex, length(singleRawObs)+1)
  rawObs[i] <- parseRawObservation(singleRawObs,VariableIndex) 

  #add first two columns separately as they do not have dashes at the next line
  rawObs[i] <- paste(singleRawObs[1], singleRawObs[2], rawObs[i], sep = '|')
}


#determine max number of fields
numOfCol <- max(sapply(rawObs,  FUN =  function(x) length(strsplit(x,'|')[[1]])))
which.max(sapply(rawObs,  FUN =  function(x) length(strsplit(x,'|')[[1]])))

dt <- fread(textConnection(rawObs),sep = '|',header = F)
dt <- fread(paste(rawObs[1:2], collapse = '\n'),sep = '|',header = F, fill = T)
rawObs[653]
1

1 Answers

0
votes

There is a handy alternative for reading files like this one: read.dcf().

read.dcf() reads files in Debian Control Format (DCF) which consist of regular lines of form tag:value. Records are separated by one or more empty lines.

However, the input file needs to be modified to conform with the DCF format (plus some additional modifications to meet OP's expected result):

  1. Empty rows need to be removed as they would be mistaken as record separator.
  2. The streaks of equal signs = which are used as record separator need to be replaced by multiple empty lines and the missing tag id:.
  3. The streaks of dashes should be removed.
  4. The first row containing RELEASE VERSION: should be removed to be in line with OP's expectations.

The code below assumes that the raw text file is named "raw.txt".

library(data.table)
library(magrittr)
# read raw file, skip first row
raw <- fread("raw.txt", sep = "\n", header = FALSE, skip = 1L)
# replace streaks of "=" and "-"
raw[, V1 := V1 %>% 
      stringr::str_replace("[=]+", "\n\nid:") %>% 
      stringr::str_replace(": [-]+", ": ")][]
# now read the modified data using DCF format skipping empty rows
dt <- as.data.table(read.dcf(textConnection(raw[V1 != "", V1])))
dt
        id            STUDY VARIABLE                        QUESTION
1: VERSION Version Number Of Release Version of Cumulative Data File
2: VCF0004             Year of Study         Year of study (4-digit)
3: VCF0006                        NA                              NA
                                                                                                                                                                        NOTES
1: This variable appears in the data as: ANES_cdf_VERSION:YYYY-mmm-DD [...]
2:                                                                                                                                                                         NA
3:                                                                                                                                                                         NA
              TYPE
1:     Character-1
2: Numeric Dec 0-1
3:              NA