Reading large data with messy strings and multiple string indicators R

【Question title】: Reading large data with messy strings and multiple string indicators R
【Posted】: 2018-06-14 07:27:02
【Question】:

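I have a large (8GB+) csv file (comma-separated) that I want to read into R. The file contains three columns:

date #in 2017-12-27 format
text #a string
type #a label for each string (NA, typeA, or typeB)

The problem I encounter is that the text column contains various string indicators: ' (single quotation marks), " (double quotation marks), no quotation marks at all, as well as multiple separated strings.

For example: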

date        text                        type
2016-01-01  great job!                  NA
2016-01-02  please, type "submit"       typeA
2016-01-02  "can't see the "error" now" typeA
2016-01-03  "add \\"/filename.txt\\""   NA

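To read in this large data, I tried:

Base read.csv and readr's read_csv: they work for part of the file but then fail (probably because of memory) or take ages to read
Batching the data into chunks of 1m lines through the Mac Terminal: fails because lines seem to be broken arbitrarily
Using fread (which I hoped would solve the other two problems): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.

My idea is to work around these problems by using what I know about the data, i.e., that each line starts with a date and ends with NA, typeA, or typeB.

How could I implement this (either with pure readLines or by feeding the result into fread)?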
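
The idea described above can be sketched roughly as follows. This is only an illustration, not a tested solution for the full 8GB file: parse_lines() is a hypothetical helper name, and the sketch assumes every record sits on a single physical line (which may not hold, given that batching by line count appeared to break records). It anchors a regular expression on the leading date and the trailing label and treats everything in between as the text field.

```r
# parse_lines() is a hypothetical helper, not part of any package.
# Assumption: one record per physical line (no embedded newlines).
parse_lines <- function(lines) {
  # leading date, trailing label, everything in between is the text field
  pattern <- '^"?([0-9]{4}-[0-9]{2}-[0-9]{2})"?,(.*),"?(NA|typeA|typeB)"?$'
  m  <- regmatches(lines, regexec(pattern, lines))
  ok <- lengths(m) == 4                      # full match + 3 capture groups
  if (!any(ok)) {
    return(data.frame(date = character(0), text = character(0),
                      type = character(0), stringsAsFactors = FALSE))
  }
  parts <- do.call(rbind, m[ok])             # one row per matching line
  type  <- parts[, 4]
  type[type == "NA"] <- NA                   # literal NA -> missing value
  data.frame(date = parts[, 2],
             text = sub('^"(.*)"$', "\\1", parts[, 3]),  # drop enclosing quotes
             type = type,
             stringsAsFactors = FALSE)
}

lines <- c(
  '"date","text","type"',
  '"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA',
  '"2016-04-01","Fiex this now. Please check.","typeA"',
  '"2014-05-02","Tried to remove ""../"" but no success @myid",typeA'
)
parsed <- parse_lines(lines)
str(parsed)
```

Lines that do not fit the pattern, such as the header, are silently dropped here; for real data it would probably be safer to collect them and inspect them separately. Doubled quotes inside the text field are left untouched.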

Edit: Sample data (anonymized) as opened with Mac TextWrangler:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid",typeA

Sample data 2:

"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","@myid. I'll change this.",NA

Sample data that reproduces the fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":

"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA

Session info:

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1        

loaded via a namespace (and not attached):
[1] compiler_3.5.0   assertthat_0.2.0 R6_2.2.2         cli_1.0.0       
[5] hms_0.4.2        tools_3.5.0      pillar_1.2.2     rstudioapi_0.7  
[9] tibble_1.4.2     yaml_2.1.19      crayon_1.3.4     Rcpp_0.12.16    
[13] utf8_1.1.3       pkgconfig_2.0.1  rlang_0.2.0

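【Comments】:

I think your problem is that you have commas within your text column, e.g. please, type "submit", and that is why fread fails. We need to see the actual data in raw format. Try opening it with Notepad, for example.
@DavidArenburg I added the data. Do you have an idea for a workaround?
I can read both of your examples with fread without any problem. Maybe try updating your data.table/R version.
Thanks. I have now found the part that breaks and added data that reproduces the exact error with fread. What do you think the problem is?
Your data.table version is very old. Try updating it.

Tags: r read.table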

【Solution 1】:

A readLines approach could be

infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)   # read (and discard) the header line
df <- NULL

# change this value as per your requirement
chunksize <- 1

# read chunksize lines at a time until readLines() returns character(0)
while (length(txt)) {
  txt <- readLines(infile, warn = FALSE, n = chunksize)
  df  <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
                              text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
                              type = gsub(".*\\s", "", txt),
                              stringsAsFactors = FALSE))
}
close(infile)   # release the file connection

which gives

> df
        date                          text  type
1 2016-01-01                    great job!    NA
2 2016-01-02         please, type "submit" typeA
3 2016-01-02   "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\""    NA
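
For a file of several gigabytes, calling rbind() on a growing data frame inside the loop copies the accumulated rows on every iteration. A possible variation, shown here only as a sketch, collects each chunk in a list and binds once at the end. The regexes are the ones from the code above; the in-memory textConnection() input and chunksize <- 2 are illustrative assumptions so that the example runs without test.txt.

```r
# Demo input: lines in the layout of test.txt, served from memory.
con <- textConnection(c(
  'date        text                        type',
  '2016-01-01  great job!                  NA',
  '2016-01-02  please, type "submit"       typeA',
  '2016-01-02  "can\'t see the "error" now" typeA'
))
invisible(readLines(con, n = 1))   # skip the header line

chunksize <- 2                     # illustrative; use a much larger value for real data
chunks <- list()

repeat {
  txt <- readLines(con, n = chunksize, warn = FALSE)
  if (length(txt) == 0) break      # end of input reached
  chunks[[length(chunks) + 1]] <- data.frame(
    date = gsub("\\s.*", "", txt),
    text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
    type = gsub(".*\\s", "", txt),
    stringsAsFactors = FALSE
  )
}
close(con)

df <- do.call(rbind, chunks)       # bind once, after the loop
print(df)
```

With a real file, file("test.txt", "r") would take the place of the textConnection() call.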

Sample data: test.txt contains

date        text                        type
2016-01-01  great job!                  NA
2016-01-02  please, type "submit"       typeA
2016-01-02  "can't see the "error" now" typeA
2016-01-03  "add \\"/filename.txt\\""   NA

Update: To parse the other set of sample data, you can modify the code above by swapping in the regex parser below

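df  <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),                    # 10 characters after the opening quote
                            text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),  # between "," and ", followed by " or NA
                            type = gsub(".*\\,|\"", "", txt),                        # after the last comma, quotes removed
                            stringsAsFactors = FALSE))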
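
The regexes in this update return every column as character: type holds the literal string "NA" rather than a missing value, and quotes that were doubled by the CSV writer stay doubled in text. A possible clean-up step, sketched here on two lines in the format of the second sample data set, could look like this (the clean-up is an assumption about what is wanted downstream, not part of the original answer):

```r
txt <- c(
  '"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA',
  '"2014-05-02","Tried to remove ""../"" but no success @myid","typeA"'
)

# Same regexes as in the update.
df <- data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
                 text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
                 type = gsub(".*\\,|\"", "", txt),
                 stringsAsFactors = FALSE)

df$date <- as.Date(df$date)                        # character -> Date
df$text <- gsub('""', '"', df$text, fixed = TRUE)  # undo CSV quote doubling
df$type[df$type == "NA"] <- NA                     # literal "NA" -> missing value
str(df)
```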

Another set of sample data:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid","typeA"

【Comments】:

Thanks @Prem. I have posted sample data. If I use my sample data 2 with your approach, I get lines like: "2018-05-02","Thank[SEPARATEDHERE]you very much. Cheers[SEPARATEDHERE]!!",NA