0
votes

when I load this file Chicago_Crimes_2005_to_2007.csv (Link https://www.kaggle.com/currie32/crimes-in-chicago) into RStudio I am always getting an error (Warnmeldung: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,: EOF in Zeichenkette / English: EOF within quoted string) , and not all obervations are included. Do you know how to solve the problem? With the other 3 files I had no problems. I am using this code:

c2 = read.csv("Chicago_Crimes_2005_to_2007.csv", header = TRUE)

I tried to fix it with this code:

c2 = read.csv("Chicago_Crimes_2005_to_2007.csv", header = TRUE, quote = "", row.names = NULL, stringsAsFactors = FALSE). 

Didnt work out. I tried all answers here in stackoverflow with the same error. Nothing helped. Since 1 week no success. Hope somebody can help me. Working with R in RStudio.

2
try reading the file using data.table::fread()... my experience it that it sometimes will automatically "repair" strange errors in the source filesWimpel
@Wimpel thanks for your help. Tried it but getting this error: In data.table::fread("Chicago_Crimes_2005_to_2007.csv", header = TRUE) : Stopped early on line 533719. Expected 23 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line:S002
<<537288,5601758,HN409865,06/16/2007 08:15:00 PM,020XX E 94TH ST,1330,CRIMINAL TRESPASS,TO LAND,OTHER RAILROAD PROP / TRAIN DEPOT,False,False,413,4.0,8.0,48.0,26,1191237.0,1843038.0,2007,04/15/2016 08:55:02 AM,41.724300463,-87.575094193,"(41.724300463, -87.5,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location>>S002
Please edit your question instead of adding comments. Also note that "Zeichenkette" is just string, not quoted string. Clarify the error description --- "doesn't work" is not useful.Robert

2 Answers

0
votes

Here you go:

require(tidyverse)
df <- readr::read_csv("Chicago_Crimes_2005_to_2007.csv")

You may decide to clean up the column names as some have spaces in them, if so:

colnames(df) <- c("rowNo",
                   "ID",
                   "Case.Number",
                   "Date",
                   "Block",
                   "IUCR",
                   "Primary.Type",
                   "Description",
                   "Location.Description",
                   "Arrest",
                   "Domestic",
                   "Beat",
                   "District",
                   "Ward",
                   "Community.Area",
                   "FBI.Code",
                   "X.Coordinate",
                   "Y.Coordinate",
                   "Year",
                   "Updated.On",
                   "Latitude",
                   "Longitude",
                   "Location")
0
votes

Here is a version of a read script that parses the column names from the first row of the file, cleans them with a combination of tidyr::gather() and gsub(), and uses them as the input to read::read_csv(). It then summarizes the Row.Number field to confirm that its maximum value, 6254267 matches the row number from the last row in the file.

library(readr)
library(tidyr)
# read first row and clean column names
colNamesData <- read_csv("./data/Chicago_Crimes_2005_to_2007.csv",col_names=FALSE,n_max=1)
# set NA to Row Number
colNamesData[1,1] <- "Row Number"
# use tidyr::gather() to turn rows into columns
xColNames <- gather(colNamesData)
# use gsub() to replace blanks with periods so data can be used as column names
xColNames$value <- gsub(" ",".",xColNames$value)
# read with readr::read_csv() and set column names to data extracted from first row
# skip first row because it contains bad column names and is missing the first column name 
crimeData <- read_csv("./data/Chicago_Crimes_2005_to_2007.csv",col_names=xColNames$value,skip=1)
# last row in file is row number 6254267
summary(crimeData$Row.Number)

...and the output:

> summary(crimeData$Row.Number)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0  235792  471370 1944429 5601310 6254267 
>

NOTE: The file does not read all records correctly because at row 533,719 the record appears to end with a redundant list of variable names.

enter image description here

To correct this, one must either hand edit the data to delete the redundant list of variable names or code around the error.

Interestingly, the row number count restarts at 0 in row 533,720 of the raw data file, which indicates that the people who created this data concatenated multiple files incorrectly to create this data file.