How can I import a file:

  • starting with an undefined number of comment lines
  • followed by a line with headers, some of them containing the comment character which is used to identify the comment lines above?

For example, with a file like this:

# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8 

Then:

myDF = read.table(myfile, sep=',', header=T)

Error in read.table(myfile, sep = ",", header = T) : more columns than column names

The obvious problem is that # is used as the comment character for the comment lines, but it also appears in the headers (which, admittedly, is bad practice, but I have no control over this).

The number of comment lines being unknown a priori, I can't even use the skip argument. Also, I don't know the column names (not even their number) before importing, so I'd really need to read them from the file.

Any solution beyond manually manipulating the file?

1
readLines to import the whole thing as strings, then clean it up into a standard format. - Gregor Thomas
Clean up the file before you bring it into R. Maybe you can go to the source and handle things there. - Tim Biegeleisen

1 Answer

It may be easy enough to count the number of lines that start with a comment, and then skip them.

csvfile <- "# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8"

# return a logical for whether the line starts with a comment.
# remove everything from the first FALSE and afterward
# take the sum of what's left
start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))

# skip the lines that start with the comment character
Data <- read.csv(textConnection(csvfile),
                 skip = start_comment,
                 stringsAsFactors = FALSE)

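Note that this works with read.csv because its default is comment.char = "", so the # in the header line is not treated as the start of a comment. Be aware that, with the default check.names = TRUE, the header c#02 is converted to the syntactic name c.02; pass check.names = FALSE to keep it unchanged. If you must use read.table, or must have comment.char = "#", you need a couple more steps.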

start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))

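# Get the headers by themselves.
# comment.char = "" keeps the '#' inside the header names;
# stringsAsFactors = FALSE keeps them as character strings on R < 4.0.
Head <- read.table(textConnection(csvfile),
                   skip = start_comment,
                   header = FALSE,
                   sep = ",",
                   comment.char = "",
                   stringsAsFactors = FALSE,
                   nrows = 1)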

Data <- read.table(textConnection(csvfile),
                   sep = ",",
                   header = FALSE,
                   skip = start_comment + 1,
                   stringsAsFactors = FALSE)

# apply column names to Data
names(Data) <- unlist(Head)
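
If the data rows themselves never need # treated as a comment, the two steps above can probably be folded into one small helper that reads the file once. The following is only a sketch under that assumption: read_commented_csv is a hypothetical name (not a base R function), and the temporary file simply recreates the example from the question.

```r
# Hypothetical helper, not part of base R: drop the leading comment lines,
# then parse the remaining lines (header + data) in a single read.table() call.
read_commented_csv <- function(path, comment = "#", sep = ",", ...) {
  lines <- readLines(path)

  # Position of the first line that does not start with the comment character.
  is_comment <- startsWith(lines, comment)
  first_body <- which(!is_comment)[1]
  if (is.na(first_body)) {
    stop("no header line found: every line starts with '", comment, "'")
  }

  # comment.char = "" keeps '#' inside header names and data;
  # check.names = FALSE stops "c#02" from being rewritten to "c.02".
  read.table(text = lines[first_body:length(lines)],
             sep = sep,
             header = TRUE,
             comment.char = "",
             check.names = FALSE,
             stringsAsFactors = FALSE,
             ...)
}

# Usage on a temporary copy of the example file from the question.
tmp <- tempfile(fileext = ".csv")
writeLines(c("# comment 1", "# ...", "# comment X",
             "c01,c#02,c03,c04", "1,2,3,4", "5,6,7,8"), tmp)
Data2 <- read_commented_csv(tmp)
names(Data2)
```

Because the column names are not syntactic, a column such as c#02 has to be accessed with Data2[["c#02"]] or with backticks. If # can also start a comment inside the data rows, the two-step read.table approach above is the safer choice.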