【Question title】: All lines not being read while executing read.csv in R
【Posted】: 2013-07-20 14:36:57
【Question】:

Here is the input file: http://www.yourfilelink.com/get.php?fid=841283. I ran

options(stringsAsFactors=FALSE)
x=read.csv("test1.csv", header = FALSE, sep="'")

The result looks like this: http://www.yourfilelink.com/get.php?fid=841284

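I get only 7 rows instead of 135! The number of columns is correct at 13. The contents of the remaining rows are all there too, but they end up inside the single string x[6,10], separated by \n.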

Please help me. I am stuck on this problem! :/

【Question comments】:

    Tags: r csv


    【Solution 1】:

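    The symptom you describe, one very long item containing multiple "\n"s, suggests that you may need to deal with mismatched quotes. If there is a quote character inside a name or address entry, the parser waits for the next quote before it considers the entry finished. Try: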

    x=read.csv("test1.csv", header = FALSE, sep="'", quote="")
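
The effect of a stray quote can be reproduced on a small scale. The following is a minimal sketch using a made-up four-line sample (the data and the names `sample_text`, `bad` and `good` are assumptions for illustration, not taken from the Enron file):

```r
# Hypothetical four-record sample using ' as the separator, as in the question.
# Records 1 and 3 each contain one stray double quote (an inch mark).
sample_text <- paste(
  "1'5\" pipe'a",
  "2'plain'b",
  "3'6\" pipe'c",
  "4'plain'd",
  sep = "\n"
)

# read.csv's default quote = "\"" treats everything between the two
# stray quotes as part of one field, so several records collapse into one row.
bad <- read.csv(text = sample_text, header = FALSE, sep = "'")

# quote = "" switches quote handling off, so every line stays its own record.
good <- read.csv(text = sample_text, header = FALSE, sep = "'", quote = "")

nrow(bad)   # fewer rows than there are lines
nrow(good)  # one row per line
```

Comparing `nrow()` of both results against the number of lines in the file is a quick way to tell whether quoting is the culprit.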
    

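    This did not actually work on the file I downloaded. (Note that read.csv is only a wrapper around read.table with different defaults, so the sep argument is passed through rather than ignored.) I first had to use count.fields with that separator, and then read.table with fill=TRUE. The result is still a bit messy, with several columns filled with nothing but commas, but at least there is something to work with: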

    table( count.fields("~/Downloads/test1.txt", sep="'", quote=""))
    
     10  13 
      5 130 
     x <- read.table("~/Downloads/test1.txt", header = FALSE, sep="'", quote="", stringsAsFactors=FALSE, skip=5)
    #Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
    #  line 6 did not have 13 elements
     x <- read.table("~/Downloads/test1.txt", header = FALSE, sep="'", 
                      quote="", stringsAsFactors=FALSE, fill=TRUE)
     str(x)
     #########################################################
    'data.frame':   135 obs. of  13 variables:
     $ V1 : chr  "INSERT INTO message VALUES (52," "INSERT INTO message VALUES (53," "INSERT INTO message VALUES (54," "INSERT INTO message VALUES (55," ...
     $ V2 : chr  "press.release@enron.com" "office.chairman@enron.com" "office.chairman@enron.com" "press.release@enron.com" ...
     $ V3 : chr  "," "," "," "," ...
     $ V4 : chr  "2000-01-21 04:51:00" "2000-01-24 01:37:00" "2000-01-24 02:06:00" "2000-02-02 10:21:00" ...
     $ V5 : chr  "," "," "," "," ...
     $ V6 : chr  "<12435833.1075863606729.JavaMail.evans@thyme>" "<29664079.1075863606676.JavaMail.evans@thyme>" "<15300605.1075863606629.JavaMail.evans@thyme>" "<10522232.1075863606538.JavaMail.evans@thyme>" ...
     $ V7 : chr  "," "," "," "," ...
     $ V8 : chr  "ENRON HOSTS ANNUAL ANALYST CONFERENCE PROVIDES BUSINESS OVERVIEW AND GOALS FOR 2000" "Over $50 -- You made it happen!" "Over $50 -- You made it happen!" "ROAD-SHOW.COM Q4i.COM CHOOSE ENRON TO DELIVER FINANCIAL WEB CONTENT" ...
     $ V9 : chr  "," "," "," "," ...
     $ V10: chr  "HOUSTON - Enron Corp. hosted its annual equity analyst conference today in==20Houston.  Ken Lay, Enron chairman and chief execu"| __truncated__ "On Wall Street, people are talking about Enron.  At Enron, we re talking=20about people...our people.  You are the driving forc"| __truncated__ "On Wall Street, people are talking about Enron.  At Enron, we re talking=20about people...our people.  You are the driving forc"| __truncated__ "HOUSTON =01) Enron Broadband Services (EBS), a wholly owned subsidiary of E=nron=20Corp. and a leader in the delivery of high-b"| __truncated__ ...
     $ V11: chr  "" "," "," "," ...
     $ V12: chr  "" "Robert_Badeer_Aug2000Notes FoldersPress releases" "Robert_Badeer_Aug2000Notes FoldersPress releases" "Robert_Badeer_Aug2000Notes FoldersPress releases" ...
     $ V13: chr  "" ");" ");" ");" ...
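
The same two-step check can be tried on a small scale. This is only a sketch on a hypothetical three-line file written to a temporary path (the sample lines and the names `tmp`, `field_counts` and `ragged` are assumptions):

```r
# Hypothetical ragged sample: the second line has one field fewer.
sample_lines <- c("a'b'c", "d'e", "f'g'h")
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Step 1: count the fields per line for the chosen separator and quote set.
field_counts <- count.fields(tmp, sep = "'", quote = "")
table(field_counts)

# Step 2: fill = TRUE pads short lines with blank fields instead of
# stopping with "line 2 did not have 3 elements".
ragged <- read.table(tmp, header = FALSE, sep = "'", quote = "",
                     stringsAsFactors = FALSE, fill = TRUE)
str(ragged)

unlink(tmp)  # remove the temporary file
```

If `table()` shows more than one distinct field count, the lines are ragged for that separator and `fill = TRUE` (or a different separator) is needed.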
    

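    Using a comma as the separator and only the single quote as the quote character, instead of read.table's default of both single and double quotes, I get a better result: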

    x2 <- read.table("~/Downloads/test1.txt", header = FALSE, sep=",",
                      quote="'", stringsAsFactors=FALSE, fill=TRUE)
     str(x2)
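
To see why this combination suits SQL-style INSERT lines, here is a minimal sketch on two shortened records. The addresses and timestamps follow the output above, while the final quoted field and the names `insert_lines` and `ids` are made up for illustration:

```r
# Two shortened, partly hypothetical INSERT statements.
# The last quoted field of the first record contains a comma.
insert_lines <- c(
  "INSERT INTO message VALUES (52,'press.release@enron.com','2000-01-21 04:51:00','Hello, world');",
  "INSERT INTO message VALUES (53,'office.chairman@enron.com','2000-01-24 01:37:00','Plain subject');"
)

# sep = "," splits on the SQL value separator; quote = "'" protects
# commas that appear inside a single-quoted value.
x2 <- read.table(text = insert_lines, header = FALSE, sep = ",",
                 quote = "'", stringsAsFactors = FALSE, fill = TRUE)
str(x2)

# The first column still carries the SQL prefix; strip everything up to "(".
ids <- as.integer(sub("^.*\\(", "", x2$V1))
ids
```

The trailing `);` stays attached to the last column, so some clean-up is still required after reading.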
    

    【Comments】:

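    • Using quote="" in my original code gave me the result I wanted! To get rid of the commas, I copied the columns I needed into another data frame.
    • Another argument that sometimes needs fixing is comment.char. The count.fields function accepts changes to that argument, and wrapping the call in table lets you quickly check how regular the field counts are for alternative choices of the two arguments.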
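
The comment.char check mentioned in the last comment can be sketched as follows, assuming a hypothetical three-line file in which one field contains a `#` (the sample and the names `tmp`, `with_comments` and `without_comments` are illustrative):

```r
# Hypothetical sample: the second line contains a "#" inside a field.
sample_lines <- c("a'b'c", "d'e #1'f", "g'h'i")
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Default comment.char = "#": the rest of line 2 is dropped as a comment.
with_comments <- count.fields(tmp, sep = "'", quote = "")
table(with_comments)

# comment.char = "" disables comment handling, so the counts become regular.
without_comments <- count.fields(tmp, sep = "'", quote = "", comment.char = "")
table(without_comments)

unlink(tmp)  # remove the temporary file
```

Note that read.table defaults to `comment.char = "#"` while read.csv defaults to `comment.char = ""`, so the two can disagree on the same file.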
    【Solution 2】:

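    Look at your text and think about what you would expect if you were the computer. The text starts without a separator ('), the parser sees the first (') at press.release, and from there it starts doing silly things. Do not rely on the first entries you read; check the output first.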

    INSERT INTO message VALUES (52,'press.release@enron.com','2000-01-21 04:51:00','<12435833.1075863606729.JavaMail.evans@thyme>','ENRON HOSTS
    

    【Comments】:

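    • The text in every column is correct and as expected, apart from the row-count problem! Also, I just made some edits to the contents of line 6. I noticed it was the longest line, so I deleted part of its "message" section so that it was no longer the longest. When I ran the code again, I got the correct, expected number of rows!!! But I cannot edit my dataset like this :/ It would give incorrect results, and this is the Enron dataset, which is pretty big!
    • Just to understand what you did: did you try deleting line 7, after which everything worked? If not (or even if so): try setting colClasses explicitly (probably all character). In any case, that is the most important trick for speeding up read.table.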
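
The colClasses suggestion from the last comment might look like the sketch below. The two sample lines and the name `x3` are assumptions; `quote = ""` and `fill = TRUE` are carried over from Solution 1:

```r
# Hypothetical sample with values that look numeric but carry leading zeros.
sample_lines <- c("007'first'a", "010'second'b")

# colClasses = "character" is recycled to every column, so read.table
# skips type guessing and keeps each field exactly as written.
x3 <- read.table(text = sample_lines, header = FALSE, sep = "'",
                 quote = "", fill = TRUE, colClasses = "character")
str(x3)
```

Besides saving the type-detection pass on large files, this keeps identifiers such as "007" from being converted to numbers.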