I'm curious about how read.csv builds a data frame, because we want some data integrity checks that fail early in some algorithm work we're doing. Is it true that, by default, loading a data frame from a CSV file will only turn columns holding character data into factors? In other words, can anything else also be recognized as a factor by default? I'm guessing not, but the documentation I'm looking at only describes how character data relates to factors and says nothing about other types, which makes me wary that I may be making the converse error.
From the R data.frame documentation:
stringsAsFactors logical: should character vectors be converted to factors? The ‘factory-fresh’ default is TRUE, but this can be changed by setting options(stringsAsFactors = FALSE).
Basically, the check I intend will go something like this:
# is.factor() is tested per column, so columns with more than one class are handled too
if (any(vapply(myCsvDataFrame, is.factor, logical(1)))) {
    stop("DataIntegrityError--dataframe contains character data")
}
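To make the intent concrete, here is a minimal sketch of that check wrapped in a function and run against a small inline CSV. The helper name `assert_no_factors` and the sample data are made up for illustration. `stringsAsFactors = TRUE` is passed explicitly because the session default may differ (it is FALSE from R 4.0.0 on).

```r
# Hypothetical helper: stop early if any column was loaded as a factor.
assert_no_factors <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  if (any(is_fac)) {
    stop("DataIntegrityError--factor columns found: ",
         paste(names(df)[is_fac], collapse = ", "))
  }
  invisible(df)
}

# Made-up inline CSV standing in for the real file.
csv_text <- "id,score,label\n1,2.5,a\n2,3.5,b"
df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Capture the error message instead of aborting the script.
result <- tryCatch(assert_no_factors(df),
                   error = function(e) conditionMessage(e))
print(result)
```

In this sketch only the `label` column should trip the check, since `id` and `score` convert cleanly to numeric types.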
Further documentation seems to support my guess:
Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate. Quotes are (by default) interpreted in all fields, so a column of values like "42" will result in an integer column.
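The conversion step described there can be tried directly on character vectors. This is only a sketch: the vector names and values are invented, and `as.is = FALSE` is passed explicitly to ask for the factor fallback regardless of R version.

```r
# Character vectors as they would look right after being read from a text file.
raw <- list(
  whole   = c("1", "2", "3"),
  decimal = c("1.5", "2"),
  flag    = c("TRUE", "FALSE"),
  text    = c("a", "b"),
  mixed   = c("1", "2", "x")   # one non-numeric value in the column
)

# Apply type.convert to each vector; as.is = FALSE turns leftover text into factors.
converted <- lapply(raw, type.convert, as.is = FALSE)

classes <- vapply(converted, class, character(1))
print(classes)
```

Here the `mixed` vector is the interesting case: a single value that cannot be parsed as a number should be enough for the whole vector to end up as a factor.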
The description of as.is explains more of the behavior:
as.is the default behavior of read.table is to convert character variables (which are not converted to logical, numeric or complex) to factors. The variable as.is controls the conversion of columns not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
Note: to suppress all conversions including those of numeric columns, set colClasses = "character".
Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.
What I'm taking away from all this is that R first loads everything as character data (which makes sense for a CSV, since it is just a flat text file) and then attempts to convert each column to a logical, integer, numeric, or complex type. Only the columns where that conversion fails remain character data, and those are then stored as factors, which is what we see in the resulting data frame.
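One way to test this reading is to load the same text three ways and compare column classes. The sample CSV and variable names below are assumptions for illustration only; `read.csv`, `as.is`, `stringsAsFactors`, and `colClasses` are the documented arguments quoted above.

```r
# Made-up CSV with one column of each kind.
csv_text <- "id,price,flag,name\n1,2.5,TRUE,apple\n2,3.0,FALSE,pear"

# Factor conversion requested explicitly (the old factory-fresh behavior).
with_factors  <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Leftover text columns kept as character instead of factor.
kept_as_text  <- read.csv(text = csv_text, as.is = TRUE)

# All conversions suppressed, including the numeric and logical ones.
all_character <- read.csv(text = csv_text, colClasses = "character")

class_table <- rbind(
  with_factors  = vapply(with_factors, class, character(1)),
  kept_as_text  = vapply(kept_as_text, class, character(1)),
  all_character = vapply(all_character, class, character(1))
)
print(class_table)
```

If the interpretation is right, only `name` should differ between the first two rows of the table, and the last row should be character throughout.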