Here is a version of a read script that parses the column names from the first row of the file, cleans them with a combination of tidyr::gather()
and gsub()
, and uses them as the input to read::read_csv()
. It then summarizes the Row.Number
field to confirm that its maximum value, 6254267 matches the row number from the last row in the file.
library(readr)
library(tidyr)
# read first row and clean column names
colNamesData <- read_csv("./data/Chicago_Crimes_2005_to_2007.csv",col_names=FALSE,n_max=1)
# set NA to Row Number
colNamesData[1,1] <- "Row Number"
# use tidyr::gather() to turn rows into columns
xColNames <- gather(colNamesData)
# use gsub() to replace blanks with periods so data can be used as column names
xColNames$value <- gsub(" ",".",xColNames$value)
# read with readr::read_csv() and set column names to data extracted from first row
# skip first row because it contains bad column names and is missing the first column name
crimeData <- read_csv("./data/Chicago_Crimes_2005_to_2007.csv",col_names=xColNames$value,skip=1)
# last row in file is row number 6254267
summary(crimeData$Row.Number)
...and the output:
> summary(crimeData$Row.Number)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 235792 471370 1944429 5601310 6254267
>
NOTE: The file does not read all records correctly because at row 533,719 the record appears to end with a redundant list of variable names.
To correct this, one must either hand edit the data to delete the redundant list of variable names or code around the error.
Interestingly, the row number count restarts at 0 in row 533,720 of the raw data file, which indicates that the people who created this data concatenated multiple files incorrectly to create this data file.
data.table::fread()
... my experience it that it sometimes will automatically "repair" strange errors in the source files – WimpelIn data.table::fread("Chicago_Crimes_2005_to_2007.csv", header = TRUE) : Stopped early on line 533719. Expected 23 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line:
– S002<<537288,5601758,HN409865,06/16/2007 08:15:00 PM,020XX E 94TH ST,1330,CRIMINAL TRESPASS,TO LAND,OTHER RAILROAD PROP / TRAIN DEPOT,False,False,413,4.0,8.0,48.0,26,1191237.0,1843038.0,2007,04/15/2016 08:55:02 AM,41.724300463,-87.575094193,"(41.724300463, -87.5,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location>>
– S002