Problems loading .csv file into RStudio. EOF within quoted string: answers

【Question title】: Problems loading .csv file into RStudio. EOF within quoted string
【Posted】: 2019-05-10 17:41:41
【Question description】:

When I load the file Chicago_Crimes_2005_to_2007.csv (link: https://www.kaggle.com/currie32/crimes-in-chicago) into RStudio, I always get a warning message (Warnmeldung: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF in Zeichenkette / English: EOF within quoted string) and not all observations are included. Do you know how to fix this? I have no problems with the other 3 files. I am using this code:

c2 = read.csv("Chicago_Crimes_2005_to_2007.csv", header = TRUE)

I tried to fix it with this code:

c2 = read.csv("Chicago_Crimes_2005_to_2007.csv", header = TRUE, quote = "", row.names = NULL, stringsAsFactors = FALSE)

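That did not solve it. I tried all the answers on Stack Overflow and still got the same error. Nothing helped, and I have had no success for a week. I hope someone can help me. I am using R in RStudio.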

【Comments】:

Try reading the file with data.table::fread()... in my experience it sometimes automatically "fixes" odd errors in the source file.
@Wimpel Thanks for your help. I tried it but got this warning: In data.table::fread("Chicago_Crimes_2005_to_2007.csv", header = TRUE) : Stopped early on line 533719. Expected 23 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line:
<<537288,5601758,HN409865,06/16/2007 08:15:00 PM,020XX E 94TH ST,1330,CRIMINAL TRESPASS,TO LAND,OTHER RAILROAD PROP / TRAIN DEPOT,False,False,413,4.0,8.0,48.0,26,1191237.0,1843038.0,2007,04/15/2016 08:55:02 AM,41.724300463,-87.575094193,"(41.724300463, -87.5,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location>>
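Please edit your question instead of adding comments. Also note that "Zeichenkette" just means "string", not "quoted string". Clarify the error description; "doesn't work" is not useful.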
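One way to locate a damaged record like the one fread() reports, before attempting a full import, is to count the fields on every line. The sketch below uses base R's count.fields() on a small synthetic file; the three-column layout and the temporary file are made up for illustration and are not taken from the Kaggle data. Passing quote = "" turns off quote handling, so an unterminated quote cannot swallow the lines that follow it.

```r
# Synthetic stand-in for the real file (hypothetical data): line 3 is damaged
# because a second header was appended to an unterminated quoted field.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Block,Location",
  "1,A ST,\"(41.7, -87.5)\"",
  "2,B ST,\"(41.7, -87.5,ID,Block,Location",
  "3,C ST,\"(41.8, -87.6)\""
), tmp)

# Count comma-separated fields per data line, ignoring quotes and comments.
# skip = 1 leaves out the header, whose count differs from the data lines
# because the quoted Location value itself contains a comma.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "", skip = 1)

# Treat the most common field count as the expected one.
expected <- as.integer(names(which.max(table(n_fields))))

# Convert back to line numbers in the file (+1 for the skipped header).
bad_lines <- which(n_fields != expected) + 1L
print(bad_lines)
```

On the real file the same call would be pointed at "Chicago_Crimes_2005_to_2007.csv"; any line it flags can then be inspected with readLines() before deciding how to handle it.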

Tags: r csv import dataset

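【Solution 1】:

Here is a version of a read script that parses the column names from the first line of the file, cleans them with a combination of tidyr::gather() and gsub(), and uses them as input to readr::read_csv(). It then summarizes the Row.Number field to confirm that its maximum value, 6254267, matches the row number on the last line of the file.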

library(readr)
library(tidyr)
# read first row and clean column names
colNamesData <- read_csv("./data/Chicago_Crimes_2005_to_2007.csv",col_names=FALSE,n_max=1)
# the first column name is missing (read as NA); set it to "Row Number"
colNamesData[1,1] <- "Row Number"
# use tidyr::gather() to turn rows into columns
xColNames <- gather(colNamesData)
# use gsub() to replace blanks with periods so data can be used as column names
xColNames$value <- gsub(" ",".",xColNames$value)
# read with readr::read_csv() and set column names to data extracted from first row
# skip first row because it contains bad column names and is missing the first column name 
crimeData <- read_csv("./data/Chicago_Crimes_2005_to_2007.csv",col_names=xColNames$value,skip=1)
# last row in file is row number 6254267
summary(crimeData$Row.Number)

...and the output:

> summary(crimeData$Row.Number)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0  235792  471370 1944429 5601310 6254267 
>

Note: the file is not read correctly for all records, because at line 533,719 the record appears to end with a redundant list of variable names.

To correct this, one must either manually edit the data to remove the redundant list of variable names, or code around the error.
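A minimal sketch of coding around the error, assuming the damage has the shape shown in the fread() message above: a record that is cut off mid-field and immediately followed by a copy of the header line. The file contents and column layout here are synthetic and hypothetical. The idea is to read the raw lines, drop every line after the first that ends with the header text, and parse what remains.

```r
# Synthetic file (hypothetical data) mimicking the damage: line 3 is a
# truncated record with the header line appended to it.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  ",ID,Block,Location",
  "0,10,A ST,\"(41.7, -87.5)\"",
  "1,11,B ST,\"(41.7, -87.5,ID,Block,Location",
  "0,12,C ST,\"(41.8, -87.6)\""
), tmp)

raw_lines <- readLines(tmp)
header <- raw_lines[1]

# A line is damaged if it is not the header itself but ends with the header.
damaged <- seq_along(raw_lines) > 1 & endsWith(raw_lines, header)

# Drop the damaged record (it is truncated anyway) and parse the rest.
clean_lines <- raw_lines[!damaged]
crimes <- read.csv(text = clean_lines, header = TRUE, stringsAsFactors = FALSE)

print(which(damaged))
print(crimes)
```

This discards the truncated record rather than trying to repair it, and readLines() holds the whole file in memory, which may matter for a file with several hundred thousand lines.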

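Interestingly, at line 533,720 of the raw data file the row number count restarts from 0, which suggests that whoever created this data file concatenated multiple files incorrectly.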
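A restart like this can be checked for once the data is loaded. The sketch below uses a made-up vector in place of the Row.Number column and assumes, as a simplification, that row numbers only increase within each original file, so any decrease marks a join point.

```r
# Hypothetical row-number column: the count restarts at position 5.
row_number <- c(0, 3, 7, 12, 0, 4, 9)

# diff() is negative wherever a value is smaller than its predecessor;
# +1 converts that to the position of the first row after the restart.
restarts <- which(diff(row_number) < 0) + 1
print(restarts)
```

If the real column is not monotonic within each source file, this check would flag more positions than there are join points.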

【Comments】:

【Solution 2】:

Here you go:

require(tidyverse)
df <- readr::read_csv("Chicago_Crimes_2005_to_2007.csv")

You may want to clean up the column names, since some of them contain spaces. If so:

colnames(df) <- c("rowNo",
                   "ID",
                   "Case.Number",
                   "Date",
                   "Block",
                   "IUCR",
                   "Primary.Type",
                   "Description",
                   "Location.Description",
                   "Arrest",
                   "Domestic",
                   "Beat",
                   "District",
                   "Ward",
                   "Community.Area",
                   "FBI.Code",
                   "X.Coordinate",
                   "Y.Coordinate",
                   "Year",
                   "Updated.On",
                   "Latitude",
                   "Longitude",
                   "Location")
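As an alternative to typing all 23 names by hand, base R's make.names() can derive syntactically valid names from the originals. This is a sketch on a shortened, illustrative subset of the header, assuming the first name is empty as in the raw file; how the empty name appears after import depends on the reader used, so the first entry is simply overwritten.

```r
# Illustrative subset of the raw header; the first column has no name.
raw_names <- c("", "ID", "Case Number", "Primary Type", "X Coordinate")

# make.names() replaces spaces with dots and fills in the empty name.
clean_names <- make.names(raw_names, unique = TRUE)

# Give the unnamed first column the same name used in the manual list.
clean_names[1] <- "rowNo"
print(clean_names)
```

With the full data frame this would be applied as colnames(df) <- clean_names, after building clean_names from colnames(df).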

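【Comments】:

I ran your code and the file loaded into RStudio. I also ran the column-name code. But when I inspect the dataset, all of the information is in the rowNo column only. It seems the code did not split the values into the correct columns, so the separation did not work. What can I do?
I tried it again, and now the values are in the correct columns, but in every 7th or 8th row the values are wrong: they are shifted 2-3 columns to the left. For example, Community.Area values end up in the X.Coordinate column, and so on. Why?
Can you give me some of the row numbers that have this problem?
This does not happen on my machine. It parses perfectly.
@S002 - check the data error at line 533,719 of the file, which is present even after a correct download.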