有没有比 fread() 更快的方法来读取大数据？答案

【问题标题】：Is there a faster way than fread() to read big data?有没有比 fread() 更快的方法来读取大数据？
【发布时间】：2019-10-17 04:52:15
【问题描述】：

您好，首先我已经在 stack 和 google 上搜索并找到了这样的帖子： Quickly reading very large tables as dataframes。虽然这些很有帮助并且得到了很好的回答，但我正在寻找更多信息。

我正在寻找读取/导入高达 50-60GB 的“大”数据的最佳方式。我目前正在使用来自data.table 的fread() 函数，它是我目前知道的最快的函数。我工作的 pc/server 有一个很好的 cpu（工作站）和 32 GB 的 RAM，但仍然有超过 10 GB 的数据，有时接近数十亿的观测值需要很长时间才能读取。

我们已经有了 sql 数据库，但由于某些原因，我们必须在 R 中处理大数据。当涉及到像这样的大文件时，有没有办法加速 R 或者比fread() 更好的选择？

谢谢。

编辑：fread("data.txt", verbose = TRUE)

omp_get_max_threads() = 2
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=2, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file C://somefolder/data.txt
  File opened, size = 1.083GB (1163081280 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<ID,Dat,No,MX,NOM_TX>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 5 fields using quote rule 0
  Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<ID,Dat,No,MX,NOM_TX>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (1163081278 bytes from row 1 to eof) / (2 * 5778 jump0size) == 100647
  Type codes (jump 000)    : 5A5AA  Quote rule 0
  Type codes (jump 100)    : 5A5AA  Quote rule 0
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10054 sample rows
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 1163081249
  Line length: mean=56.72 sd=20.65 min=25 max=128
  Estimated number of rows: 1163081249 / 56.72 = 20506811
  Initial alloc = 41013622 rows (20506811 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5A5AA
[10] Allocate memory for the datatable
  Allocating 5 column slots (5 - 0 dropped) with 41013622 rows
[11] Read the data
  jumps=[0..1110), chunk_size=1047820, total_size=1163081249
|--------------------------------------------------|
|==================================================|
Read 20935277 rows x 5 columns from 1.083GB (1163081280 bytes) file in 00:31.484 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         3 : string    'A'
=============================
   0.007s (  0%) Memory map 1.083GB file
   0.739s (  2%) sep=',' ncol=5 and header detection
   0.001s (  0%) Column type detection using 10054 sample rows
   1.809s (  6%) Allocation of 41013622 rows x 5 cols (1.222GB) of which 20935277 ( 51%) rows used
  28.928s ( 92%) Reading 1110 chunks (0 swept) of 0.999MB (each chunk 18860 rows) using 2 threads
   +   26.253s ( 83%) Parse to row-major thread buffers (grown 0 times)
   +    2.639s (  8%) Transpose
   +    0.035s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  31.484s        Total

【问题讨论】：

你真的需要 R 中的所有数据吗？我建议预先使用例如转换、过滤或创建子集。 awk、sed 和/或 cat 在 unix 环境中。另一种方法是使用furrr:future_map 读取垃圾数据以进行并行化。
...或者由于您已经在 sql 数据库中拥有数据，只需连接到该数据库并拉入子样本即可使用。
如果您事先知道数据集的维度，您可以预先分配所需的空间并自己编写 Rccp 函数（用于导入），它应该会快一点（但不要期望有很大的改进） .
@Jimbou 谢谢，我会看看furrr:future_map。 @joran这是不切实际的，但我无法直接连接到sql db，这就是我在这里问这个的原因。 @JacobJacox 谢谢，已经尝试过了，但并没有让它更快！
您提到您的工作站具有良好的 cpu 和 32 gb 内存，如果它是 SSD、HDD，您没有说明存储子系统的任何内容。 SDD当然会比HDD好得多。甚至比大多数 SSD 更快的是使用 Intel Optane 内存。鉴于您正在使用的数据集的大小，我会将系统内存增加到 64 GB。

标签： r data.table bigdata fread

【解决方案1】：

您可以使用select = columns 仅加载相关列而不会使您的内存饱和。例如：

dt <- fread("./file.csv", select = c("column1", "column2", "column3"))

我使用read.delim() 读取了fread() 无法完全加载的文件。因此，您可以将数据转换为 .txt 并使用 read.delim()。

但是，为什么不打开与要从中提取数据的 SQL 服务器的连接。您可以使用library(odbc) 打开与 SQL 服务器的连接，并像往常一样编写查询。您可以通过这种方式优化内存使用。

查看this short introduction 至odbc。

【讨论】：

是的，我已经用过odbc，实际上我没有用过select = columns，我应该想到这一点。就像我在 cmets 中所说的那样，我无法直接连接到 SQL 服务器（这就是我问这个问题的原因）。我知道这确实不切实际，但我必须在 R 中执行此操作。我会接受您的回答，因为如果我无法直接连接到 SQL 服务器，fread() 似乎仍然是 R 中最快的选项。谢谢！
很高兴我能帮上忙。或者，您可以尝试在 Python 中使用 pandas 加载它。就个人而言，我认为data.table 在语法方面是最好的包，但是pandas 读取文件非常快。它还有一个usecols 参数。例如：pd,read_csv("./file.csv", usecols = ["column1", "column2"]).
好主意，我对pandas 了解不多，但值得一试。我将尝试在这两者之间运行基准测试。再次感谢！
我经常在 50GB 上使用 fread，它运行良好。为什么你认为这个尺寸不能很好地工作？它是为如此大的数据集设计的，甚至更大。 Pandas 无济于事，由于 pandas 的高内存使用率，它无法在 128GB 机器上加载 50GB csv。最终，python 数据表可能会有所帮助。
@Jangorecki 我不知道数据表是为如此大的数据集制作的。我这么说是因为我最近在加载文件时遇到了问题。 data.table 无法加载所有行。

【解决方案2】：

假设您希望文件完全读入 R，使用数据库或选择列/行的子集不会有太大帮助。

在这种情况下有帮助的是：
- 确保您使用的是最新版本的 data.table
- 确保设置了最佳线程数
使用setDTthreads(0L) 使用所有可用线程，默认情况下data.table 使用50% 的可用线程。
- 检查fread(..., verbose=TRUE) 的输出，并可能在此处将其添加到您的问题中
- 将您的文件放在快速磁盘或 RAM 磁盘上，然后从那里读取

如果你的数据有很多不同的字符变量，你可能无法获得很好的速度，因为填充 R 的内部全局字符缓存是单线程的，因此解析可以很快，但创建字符向量将是瓶颈。

【讨论】：

谢谢，我会调查的！
@Gainz 我会说磁盘读取速度，尝试使用一些外部工具测量您的驱动器读取速度，与 fread 的速度进行比较。如答案中所述，如果有许多不同的字符，CPU 可能绝对是一个问题。向问题添加添加详细输出。
@Gainz 看起来最简单的加速方法是使用更多内核。在工作站机器上应该有超过 2 个线程。有关内核的更多详细信息可以从getDTthreads(verbose=TRUE) 获得
你如何访问机器？ SSH？只需从命令行检查应该有多少线程。 getDTthreads 只报告 2 个。服务器上可能有配置为每个用户分配最多 2 个线程。
是的 ssh，我认为您对分配的线程是正确的，看来我的同事也只能访问 2 个线程。我会试着和 TI 谈谈这件事。谢谢 jangorecki，非常感谢您的帮助！