Conflict between comment character and headers to import DF with read.table

Question

How can I import a file that:

- starts with an undefined number of comment lines, and
- continues with a header line in which some column names contain the same comment character that marks the lines above?

For example, with a file like this:

# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8

Then:

myDF = read.table(myfile, sep=',', header=T)

Error in read.table(myfile, sep = ",", header = T) : more columns than column names

The obvious problem is that # is used both as the comment character that marks the comment lines and inside the headers (which, admittedly, is bad practice, but I have no control over this).

The number of comment lines being unknown a priori, I can't even use the skip argument. Also, I don't know the column names (not even their number) before importing, so I'd really need to read them from the file.

Any solution beyond manually manipulating the file?

readLines to import the whole thing as strings, then clean it up into a standard format. — Gregor Thomas
Cleanup the file before you bring it into R. Maybe you can go to the source and handle things there. — Tim Biegeleisen

Benjamin · Accepted Answer · 2018-01-23T15:03:27

It may be easy enough to count the number of lines that start with a comment, and then skip them.

csvfile <- "# comment 1
# ...
# comment X
c01,c#02,c03,c04
1,2,3,4
5,6,7,8"

# return a logical for whether the line starts with a comment.
# remove everything from the first FALSE and afterward
# take the sum of what's left
start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))

# skip the lines that start with the comment character
Data <- read.csv(textConnection(csvfile),
                 skip = start_comment,
                 stringsAsFactors = FALSE)

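Note that this works with read.csv because its default is comment.char = "", so the # in the header is not treated as the start of a comment. (With the default check.names = TRUE, the header c#02 is imported as c.02.) If you must use read.table, or must keep comment.char = "#", you may need a couple more steps.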

start_comment <- grepl("^#", readLines(textConnection(csvfile)))
start_comment <- sum(head(start_comment, which(!start_comment)[1] - 1))

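# Get the headers by themselves.
# comment.char = "" keeps the '#' in c#02 from truncating the line;
# stringsAsFactors = FALSE keeps the names as plain character strings.
Head <- read.table(textConnection(csvfile),
                   skip = start_comment,
                   header = FALSE,
                   sep = ",",
                   comment.char = "",
                   nrows = 1,
                   stringsAsFactors = FALSE)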

Data <- read.table(textConnection(csvfile),
                   sep = ",",
                   header = FALSE,
                   skip = start_comment + 1,
                   stringsAsFactors = FALSE)

# apply column names to Data
names(Data) <- unlist(Head)
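
The same idea can be folded into one step for a file on disk. The following is only a sketch, not part of the original answer: the helper name read_commented_csv and the temporary file are made up for illustration, and it assumes the comment lines form a single block at the top of the file. It reads the lines once, drops the leading comment block, and passes the rest to read.table with comment.char = "" and check.names = FALSE so that a header such as c#02 survives unchanged.

```r
# Hypothetical helper (not from the answer above): read a delimited file that
# starts with a block of comment lines and whose header may itself contain
# the comment character.
read_commented_csv <- function(path, sep = ",", comment = "#") {
  lines <- readLines(path)
  # TRUE for every line that begins with the comment character
  is_comment <- startsWith(lines, comment)
  # position of the first non-comment line, assumed to be the header
  header_at <- which(!is_comment)[1]
  if (is.na(header_at)) stop("no header line found")
  read.table(text = lines[header_at:length(lines)],
             sep = sep,
             header = TRUE,
             comment.char = "",     # do not cut the header at '#'
             check.names = FALSE,   # keep names such as "c#02" as written
             stringsAsFactors = FALSE)
}

# Usage with the example from the question, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("# comment 1", "# ...", "# comment X",
             "c01,c#02,c03,c04", "1,2,3,4", "5,6,7,8"), tmp)
Data <- read_commented_csv(tmp)
names(Data)
```

Because the column name contains #, it has to be accessed as Data[["c#02"]] or with backticks rather than with a bare Data$c#02.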
