【Posted】: 2018-06-14 07:27:02
【Question】:
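I have a large (8 GB+) comma-separated CSV file that I want to read into R. The file contains three columns:
- date: in 2017-12-27 format
- text: a string
- type: a label for each string (NA, typeA, or typeB)
The problem I run into is that the text column contains various string indicators: ' (single quotation marks), " (double quotation marks), no quotation marks at all, as well as multiple separated strings.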
For example:
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
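To read this large file, I tried:
- base read.csv and readr's read_csv: they partly work but then fail (possibly due to memory) or take a very long time to read
- splitting the data into batches of 1m lines via the Mac terminal: fails because the lines seem to break arbitrarily
- fread (which I hoped would solve the other two problems): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.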
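My idea is to work around these problems by using what I know about the data: every line starts with a date and ends with NA, typeA, or typeB.
How can I implement this (either with plain readLines or by feeding the result into fread)?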
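The idea above (every record starts with a date and ends with NA, typeA, or typeB) could be sketched in base R roughly as follows. This is only a sketch under assumptions the question does not confirm: each record sits on a single physical line, and quotes inside the text are doubled (""). The names `row_pattern`, `parse_lines`, and `read_in_chunks` are hypothetical, and the chunk size is arbitrary.

```r
# Assumed row shape: optional quote, date, optional quote, comma,
# arbitrary text, comma, then NA / typeA / typeB (optionally quoted).
row_pattern <- '^"?([0-9]{4}-[0-9]{2}-[0-9]{2})"?,(.*),"?(NA|typeA|typeB)"?$'

# Hypothetical helper: turn raw lines into a data.frame using the row shape
# above instead of a general CSV parser.
parse_lines <- function(lines) {
  m <- regmatches(lines, regexec(row_pattern, lines))
  ok <- lengths(m) == 4L  # full match + 3 groups; the header does not match
  if (!any(ok)) {
    return(data.frame(date = as.Date(character(0)), text = character(0),
                      type = character(0), stringsAsFactors = FALSE))
  }
  parts <- do.call(rbind, m[ok])
  text <- sub('^"(.*)"$', "\\1", parts[, 3])   # drop one pair of outer quotes
  text <- gsub('""', '"', text, fixed = TRUE)  # collapse doubled quotes
  type <- parts[, 4]
  type[type == "NA"] <- NA_character_
  data.frame(date = as.Date(parts[, 2]), text = text, type = type,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: read the file through a connection, chunk by chunk,
# so the raw lines of the whole file are never held in memory at once.
read_in_chunks <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    chunks[[length(chunks) + 1L]] <- parse_lines(lines)
  }
  do.call(rbind, chunks)
}

# Small demo using two rows from the sample data in this question.
demo_file <- tempfile(fileext = ".csv")
writeLines(c(
  '"date","text","type"',
  '"2018-05-02","i try this, but it doesnt work",NA',
  '"2014-05-02","Tried to remove ""../"" but no success @myid",typeA'
), demo_file)
res <- read_in_chunks(demo_file, chunk_size = 2L)
print(res)
```

Note that lines which do not match the pattern (including the header) are silently dropped in this sketch; on real data it would be safer to collect and inspect them, since a record split over several physical lines would be lost this way.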
Edit: sample data (anonymized), as opened in TextWrangler on the Mac:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid",typeA
Sample data 2:
"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","@myid. I'll change this.",NA
Sample data that reproduces the fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":
"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA
Session info:
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1
loaded via a namespace (and not attached):
[1] compiler_3.5.0 assertthat_0.2.0 R6_2.2.2 cli_1.0.0
[5] hms_0.4.2 tools_3.5.0 pillar_1.2.2 rstudioapi_0.7
[9] tibble_1.4.2 yaml_2.1.19 crayon_1.3.4 Rcpp_0.12.16
[13] utf8_1.1.3 pkgconfig_2.0.1 rlang_0.2.0
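【Comments】:
- I think your problem is that there are commas in your text column, for example please, type "submit". That is why fread fails. We need to see the actual data in its raw line format, e.g. try opening it in Notepad.
- @DavidArenburg I added the data. Do you have an idea for a workaround?
- I can read both of your examples with fread without any problems. Maybe try updating your data.table/R versions.
- Thanks. I have now found the part that fails and added data that reproduces the exact error with fread. What do you think the problem is?
- Your data.table version is very old. Try updating it.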
Tags: r read.table