fread: reading the top n rows from a large file (answers)

【Question title】: fread to read top n rows from a large file
【Posted】: 2019-02-28 18:30:36
【Question description】:

I get an error when using fread to read the top n rows from a large file (about 50 GB). It looks like a memory problem. I tried nrows=1000, but no luck. I am on Linux.

file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.

Could the code below be replaced with read.csv while keeping all the options it uses? Would that help?

  rdata <- fread(
    file = csvfile, sep = "|", header = FALSE, col.names = colsinfile,
    select = colstoselect, key = "keycolname", na.strings = c("", "NA"),
    nrows = 500
  )

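【Question discussion】:

What happens if you replace csvfile with paste('head -n 500', csvfile)?
@mt1022: I get the error File 'head -n 500 /csvfile' doesnt exist
The argument should end up looking like input = "head -n 500 /path/to/csvfile". Use the input argument rather than the file argument so that a shell command is allowed. I don't have a large file to test with, but I hope this works.
@mt1022: Great. It works when used with input! You should post it as an answer.

Tags: r data.table fread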

【Solution 1】:

One workaround is to use a shell command to fetch the first 500 lines:

rdata<- fread(
    cmd = paste('head -n 500', csvfile),
    sep= "|", header=FALSE, col.names= colsinfile,
    select= colstoselect, key = "keycolname", na.strings= c("", "NA")
)

I don't know why nrows doesn't work, though.
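A minimal, self-contained sketch of the same workaround, run against a small temporary file so it can be tried without the 50 GB original. The sample data and the column names id, name and flag are made up for illustration. It assumes a data.table version whose fread has the cmd argument and that head is on the PATH. A plausible reason this works where nrows did not (an assumption based on the error text, not something confirmed in this thread) is that fread maps the whole file before nrows is applied, whereas here it only ever sees the output of head.

```r
library(data.table)

# Hypothetical sample file standing in for the large pipe-delimited original
csvfile <- tempfile(fileext = ".csv")
writeLines(sprintf("%d|name%d|%s", 1:2000, 1:2000, "x"), csvfile)

n <- 500

# head stops after n lines, so fread only receives those lines.
# shQuote() protects paths that contain spaces or shell metacharacters.
rdata <- fread(
  cmd = paste("head -n", n, shQuote(csvfile)),
  sep = "|", header = FALSE,
  col.names = c("id", "name", "flag"),   # assumed column names
  na.strings = c("", "NA")
)

nrow(rdata)
```

Options such as select and key can be added back to the fread call exactly as in the original code.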

【Discussion】:

Newer versions (somewhere between 3.4.0 and 3.6.0) recommend using cmd = instead of input =.

【Solution 2】:

Maybe this will help you:

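processFile = function(filepath) {
  con = file(filepath, "r")
  on.exit(close(con))               # close the connection even on error
  while ( TRUE ) {
    line = readLines(con, n = 1)    # read one line at a time
    if ( length(line) == 0 ) {      # zero-length result means end of file
      break
    }
    print(line)
  }
}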

See "reading a text file in R line by line". In your case, you may want to replace while ( TRUE ) with for(i in 1:1000).
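As a sketch of that idea in base R only: readLines can take the first n lines in a single call, and read.table can then parse them from memory. The helper name read_first_lines, the sample file and the column names id and value are hypothetical.

```r
# Read at most n lines from a text file; the rest of the file is never read.
read_first_lines <- function(filepath, n) {
  con <- file(filepath, "r")
  on.exit(close(con))          # always release the connection
  readLines(con, n = n)
}

# Hypothetical pipe-delimited sample file standing in for the large original
path <- tempfile()
writeLines(sprintf("%d|v%d", 1:50, 1:50), path)

first_lines <- read_first_lines(path, 10)

# Parse the lines that were read, using options that mirror the fread call
df <- read.table(
  text = first_lines, sep = "|", header = FALSE,
  col.names = c("id", "value"),          # assumed column names
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)

nrow(df)
```

Calling read.table (or read.csv) directly on the file with nrows = n should also stop after n rows. Note that fread's select and key have no direct counterpart there, so column selection and keying would have to be done afterwards.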

【Discussion】: