I'm curious about how read.csv builds a data frame, because I want to add some data integrity checks that fail early in some algorithm work we're doing. Is it true that, by default, loading a data frame from a CSV file will only recognize as factors those columns holding character data? In other words, can anything else also be recognized as a factor by default? I'm guessing not, but the documentation I'm looking at only describes the relation of character data to factors and says nothing about other types, which makes me wary that I may be committing the converse error.

R data.frame documentation:

stringsAsFactors logical: should character vectors be converted to factors? The ‘factory-fresh’ default is TRUE, but this can be changed by setting options(stringsAsFactors = FALSE).

Basically, the check I intend will go something like:

if ( any( sapply( myCsvDataFrame, class ) == "factor" ) ) {
   stop("DataIntegrityError--dataframe contains character data")
}
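A slightly more defensive variant of that check, as a rough sketch only. The helper name check_no_text_columns is made up for illustration, and the sample data is invented. It tests each column with is.factor/is.character instead of comparing class strings, so it behaves the same whether strings come back as factors or as plain character vectors (the latter is the default from R 4.0.0 on).

```r
# Hypothetical helper (not from the question): stop if any column is text-like.
check_no_text_columns <- function(df) {
  is_text <- vapply(df,
                    function(col) is.factor(col) || is.character(col),
                    logical(1))
  if (any(is_text)) {
    stop("DataIntegrityError--non-numeric columns: ",
         paste(names(df)[is_text], collapse = ", "))
  }
  invisible(TRUE)
}

# Invented sample data: one clean frame, one with a text column.
good <- read.csv(text = "a,b\n1,2.5\n2,3.5", stringsAsFactors = TRUE)
bad  <- read.csv(text = "a,b\n1,x\n2,y", stringsAsFactors = TRUE)

check_no_text_columns(good)  # passes silently

# Capture the error message instead of aborting the script.
outcome <- tryCatch(check_no_text_columns(bad),
                    error = function(e) conditionMessage(e))
print(outcome)
```

Naming the offending columns in the message makes the failure easier to act on than a bare "contains character data".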

Further documentation seems to support my guess:

Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate. Quotes are (by default) interpreted in all fields, so a column of values like "42" will result in an integer column.

So this explains more of the behavior

as.is the default behavior of read.table is to convert character variables (which are not converted to logical, numeric or complex) to factors. The variable as.is controls the conversion of columns not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.

Note: to suppress all conversions including those of numeric columns, set colClasses = "character".

Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.
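To make the per-column nature of as.is concrete, here is a small sketch with invented data: a logical vector passed as as.is is matched to the columns in order, so one string column can stay character while another becomes a factor.

```r
# Invented two-column example: both columns hold strings.
csv_text <- "name,group
alice,x
bob,y
carol,x"

# as.is = c(TRUE, FALSE): keep 'name' as character, convert 'group' to factor.
df <- read.csv(text = csv_text, as.is = c(TRUE, FALSE))

print(sapply(df, class))
```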

What I'm taking away from all this is that R first loads everything as character data (which makes sense in the CSV context, a CSV being just a flat text file) and then attempts to convert each column to a logical, numeric, or complex type. Only the columns where that conversion fails remain as character data, and those are then stored as factors, which is what we see in the resulting data frame.
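That two-step reading can be reproduced by hand, which may help when checking the mental model. This is only an approximation of what read.table does internally, using invented data: read every column as character via colClasses, then call type.convert on each column yourself.

```r
# Invented sample data with an integer, a logical, and a string column.
csv_text <- "x,y,z
1,TRUE,a
2,FALSE,b
3,TRUE,c"

# Step 1: suppress all conversion, so every column is a character vector.
raw <- read.csv(text = csv_text, colClasses = "character")
print(sapply(raw, class))

# Step 2: convert each column; as.is = FALSE turns leftover strings into factors.
converted <- lapply(raw, type.convert, as.is = FALSE)
print(sapply(converted, class))
```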

No, CSV files cannot contain factors. - Señor O
I'm not sure what you mean by "cannot contain factors". CSV files certainly can contain character columns that are stored as factors once loaded into R. So I would say yes, CSV files can contain factors, with the implicit understanding that columns become factors in R. - jxramos
CSV files contain text. R can interpret them as factors or characters or numeric. - Señor O
Regarding "can anything else also be recognized as a factor by default?", the answer is yes. In my experience, R won't recognize a numeric column that contains a typo as numeric, and will convert that column to a factor instead. - Rich Scriven
@RichardScriven so basically a single stray typo in one field prevents the whole column from being recognized as a numeric type. Definitely good to know! - jxramos

1 Answer

Building on Richard Scriven's comment, read.table (and its wrapper functions) can create a data.frame with five types of columns:

  • Logical
  • Integer
  • Numeric
  • Character, or factor (depending on the stringsAsFactors argument/option)
  • Complex

Here's a simple example showing these five types of data being read in:

str(read.csv(text = "a,b,c,d,e
TRUE,1,4.0,a,1i
FALSE,2,5.5,b,2i
TRUE,3,6.0,c,3i", header = TRUE))
# 'data.frame':   3 obs. of  5 variables:
#  $ a: logi  TRUE FALSE TRUE
#  $ b: int  1 2 3
#  $ c: num  4 5.5 6
#  $ d: Factor w/ 3 levels "a","b","c": 1 2 3
#  $ e: cplx  0+1i 0+2i 0+3i

Note how the fourth column is a character column, which is read in as a factor. Each column is read in as a character vector and coerced to a specific class using either the colClasses argument or automated type checking via type.convert (as you highlight in your question).
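For the fail-early goal in the question, colClasses can do the checking at read time. A minimal sketch with invented data: declaring every column as "numeric" makes the read itself raise an error when a value cannot be parsed, rather than quietly producing a factor or character column.

```r
# Invented data: the last value contains a typo.
csv_text <- "a
1
2
3
4a"

# With colClasses = "numeric" the bad value triggers an error during reading.
status <- tryCatch({
  read.csv(text = csv_text, colClasses = "numeric")
  "read succeeded"
}, error = function(e) "read failed")

print(status)
```

This only suits files where every column is expected to be numeric; a named or per-column colClasses vector would be needed for mixed files.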

This means that every column stays character unless R can detect that it is something else. If stringsAsFactors = TRUE (the default before R 4.0.0; from 4.0.0 on the default is FALSE), the columns that remain character are returned as factors.
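A quick way to see the effect of that setting, sketched with invented data. Passing stringsAsFactors explicitly keeps the result independent of the R version's default.

```r
# Invented sample data with one numeric and one string column.
csv_text <- "id,label
1,a
2,b
3,c"

with_factors    <- read.csv(text = csv_text, stringsAsFactors = TRUE)
without_factors <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(class(with_factors$label))     # factor
print(class(without_factors$label))  # character
print(class(with_factors$id))        # integer either way
```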

This should be pretty intuitive except that, as Richard Scriven points out, you can sometimes get caught when type.convert cannot quite figure out a column. Here are some examples, all of which are typos or the result of poorly formed columns:

  1. Mixing logical representations (expect logical, get factor):

    str(read.csv(text = "a
    TRUE
    FALSE
    1
    0", header = TRUE))
    # 'data.frame':   4 obs. of  1 variable:
    #  $ a: Factor w/ 4 levels "0","1","FALSE",..: 4 3 2 1
    
  2. Character string in an otherwise numeric column (expect integer, get factor):

    str(read.csv(text = "a
    1
    2
    3
    4a", header = TRUE))
    # 'data.frame':   4 obs. of  1 variable:
    #  $ a: Factor w/ 4 levels "1","2",..: 1 2 3 4
    
  3. Another example of character string in a numeric column (expect numeric, get factor):

    str(read.csv(text = "a
    1.1
    2.1
    3.1
    4.x", header = TRUE))
    # 'data.frame':   4 obs. of  1 variable:
    #  $ a: Factor w/ 4 levels "1.1","2.1",..: 1 2 3 4
    
  4. Saying there isn't a header when there actually is (expect integer, get factor):

    str(read.csv(text = "a
    1
    2
    3
    4", header = FALSE))
    # 'data.frame':   5 obs. of  1 variable:
    #  $ V1: Factor w/ 5 levels "1","2",..: 5 1 2 3 4
    
  5. Accidental spaces in numeric values (expect numeric, get factor):

    str(read.csv(text = "a
    1
    2
    3 .4", header = TRUE))
    # 'data.frame':   3 obs. of  1 variable:
    #  $ a: Factor w/ 3 levels "1","2","3 .4": 1 2 3
    
  6. In R 3.1.0, one could also end up with a factor column if reading in a numeric column would have resulted in a loss of precision (because the column contained more digits than R can represent). This behavior is now controlled by the numerals argument to read.table:

    # default behavior (expect numeric, get numeric)
    str(read.csv(text = "a
    1.1
    2.2
    3.123456789123456789", header = TRUE, numerals = "allow.loss"))
    # 'data.frame':   3 obs. of  1 variable:
    #  $ a: num  1.1 2.2 3.12
    
    # "no.loss" argument (expect numeric, get factor)
    str(read.csv(text = "a
    1.1
    2.2
    3.123456789123456789", header = TRUE, numerals = "no.loss"))
    # 'data.frame':   3 obs. of  1 variable:
    #  $ a: Factor w/ 3 levels "        1.1",..: 1 2 3
    

There are probably some other situations that would result in receiving a factor column, but all of them are going to be due to malformed files or inappropriately used arguments to read.table.
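When a column that should be numeric does come back as a factor, it helps to locate the offending values. A rough sketch with invented data: convert the factor to character first (calling as.numeric directly on a factor returns the level codes, not the values), then look for entries that fail to parse.

```r
# Invented data: a numeric column polluted by one typo.
df <- read.csv(text = "a
1
2
3
4a", stringsAsFactors = TRUE)

# Go through character; as.numeric() on the factor itself would give level codes.
values <- as.character(df$a)
parsed <- suppressWarnings(as.numeric(values))

# Rows that were not missing in the file but could not be parsed as numbers.
bad_rows <- which(is.na(parsed) & !is.na(values))

print(bad_rows)
print(values[bad_rows])
```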