Search in large csv files: answers

【Question title】: Search in large csv files
【Posted】: 2018-05-29 18:02:18
【Question】:

Problem

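I have thousands of csv files in one folder. Each file has 128,000 entries, with four columns per line. From time to time (twice a day) I need to compare a list (10,000 entries) against all the csv files. If one of the entries is identical to the third or fourth column of a line in one of the csv files, I need to write the whole csv line to an extra file.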

Possible solutions

Grep

#!/bin/bash
getArray() {
    array=()
    while IFS= read -r line
    do
        array+=("$line")
    done < "$1"
}

getArray "entries.log"
for e in "${array[@]}"
do
    echo "$e"
    /bin/grep $e ./csv/* >> found
done

This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 of the roughly 10,000 entries.

MySQL

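My next attempt was to import all the csv files into a MySQL database. But my table ran into problems at around 50,000,000 entries, so I wrote a script that creates a new table after every 49,000,000 entries, and that way I was able to import all the csv files. I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import was not feasible either: it slowed the import down from a few hours to several days. The select statement was horrible, but it worked. It was much faster than the "grep" solution, but still slow.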

My question

What else can I try in order to search within the csv files? To speed things up I copied all the csv files to an SSD, but I hope there are other ways.

【Comments on the question】:

Could you give an example of what the CSV files and the list look like?

Tags: mysql shell csv search grep

【Solution 1】:

This is unlikely to give you a meaningful benefit, but here are some improvements to your script.

Use the builtin mapfile to read a file into an array:
```
mapfile -t array < entries.log
```
Use grep with a pattern file and the appropriate flags.

I assume you want to match the items in entries.log as fixed strings, not as regular expression patterns.
I also assume you want to match whole words.
```
grep -Fwf entries.log ./csv/*
```
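This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). Actually, this alone should give you a real, meaningful performance improvement.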

It also removes the need to read entries.log into an array at all.
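
To see what these flags do on a small scale, here is a minimal sketch on made-up sample data (the `demo_grep` directory and its file contents are hypothetical, not taken from the question). `-F` treats each line of the list as a fixed string, `-w` requires a whole-word match, and `-f` reads the patterns from a file; `-h` is added here only to suppress the file name prefix in the output.

```shell
#!/bin/bash
# Hypothetical sample data, for illustration only.
mkdir -p demo_grep/csv
printf '%s\n' 'a,b,foo,d' 'a,b,c,foobar' > demo_grep/csv/one.csv
printf '%s\n' 'bar,b,c,d' 'a,b,c,d'      > demo_grep/csv/two.csv
printf '%s\n' 'foo' 'bar'                > demo_grep/entries.log

# A single pass over all files, testing every entry at once.
grep -hFwf demo_grep/entries.log demo_grep/csv/* > demo_grep/found
cat demo_grep/found
```

In this sketch `a,b,c,foobar` is skipped because of `-w`, while `bar,b,c,d` is reported even though `bar` sits in the first column: the command matches a word anywhere in the line and does not restrict the match to a particular column.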

【Comments】:

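Thanks for your help. This reduces the grep script to a one-liner! The test is running...
This doesn't satisfy the "If one of the entries is identical with the third or fourth column" requirement in the OP's question, right?
Exactly right. This is "this line has a word that is in the list".
Hi Glenn, sorry for the late reply. Here are the test results for 11.000.000.000 (11 billion) csv lines: Start: Mo 11. Jun 03:46:30 CEST 2018... End: Mo 11. Jun 12:24:31 CEST 2018... So, about 8.5 hours. That is much faster than before, but I will test the other solutions too. Thanks again for your help!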

【Solution 2】:

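Here is one in awk. It assumes that all the csv files change between runs; otherwise it would be wise to keep track of the files that have already been checked. But first, some test material: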

$ mkdir test        # the csvs go here
$ cat > test/file1  # has a match in 3rd
not not this not
$ cat > test/file2  # no match
not not not not
$ cat > test/file3  # has a match in 4th
not not not that
$ cat > list        # these we look for
this
that

Then the script:

$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that

Explained:

$ awk '                   # awk
NR==FNR {                 # process the list file
    a[$1]                 # hash list entries to a
    next                  # next list item
} 
($3 in a) || ($4 in a) {  # if 3rd or 4th field entry in hash
    print >> "out"        # append whole record to file "out"
}' list test/*            # first list then the rest of the files

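The script hashes all the list entries into the array a, then reads the csv files and looks up the 3rd and 4th field of each record in that hash, writing the record to the output file on a match.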

If you test it, let me know how long it ran.
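
The test material above is whitespace-separated, which is awk's default field splitting. For comma-separated files the field separator has to be set explicitly. A minimal sketch, assuming simple CSV with no quoted fields that contain commas and no Windows line endings (the `demo_awk` directory and its contents are made up for illustration):

```shell
#!/bin/bash
# Hypothetical sample data, for illustration only.
mkdir -p demo_awk/csv
printf '%s\n' 'n,n,this,n' 'this,n,n,n' > demo_awk/csv/file1.csv
printf '%s\n' 'n,n,n,that' 'n,n,n,n'    > demo_awk/csv/file2.csv
printf '%s\n' 'this' 'that'             > demo_awk/list

# -F, splits records on commas.
# NR==FNR holds only while the first file (the list) is being read;
# $0 is used as the key so that a whole list line is one entry.
awk -F, 'NR==FNR { a[$0]; next }
         ($3 in a) || ($4 in a)' demo_awk/list demo_awk/csv/*.csv > demo_awk/out
cat demo_awk/out
```

Here `this,n,n,n` is not reported, because only the third and fourth fields are looked up in the hash.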

【Comments】:

【Solution 3】:

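You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution: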

# prepare patterns file
while read -r line; do
  printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"       # find value in third column
  printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"       # find value in fourth column
done < entries.log > patterns.dat

find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat

find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep in batches
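
As a small check of what the generated patterns match, here is a sketch on made-up data (the `demo_pat` directory and its contents are hypothetical). It assumes the entries contain no regular expression metacharacters such as `.` or `*`; with `-E` those would be interpreted as regex syntax rather than literally. The `sort -z` step (a GNU/BSD option) is added only to make the output order of this demo deterministic.

```shell
#!/bin/bash
# Hypothetical sample data, for illustration only.
mkdir -p demo_pat/csv
printf '%s\n' 'a,b,foo,d' 'foo,b,c,d' > demo_pat/csv/one.csv
printf '%s\n' 'a,b,c,bar' 'a,bar,c,d' > demo_pat/csv/two.csv
printf '%s\n' 'foo' 'bar'             > demo_pat/entries.log

# Two anchored patterns per entry: one for column 3, one for column 4.
while IFS= read -r line; do
  printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"
  printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"
done < demo_pat/entries.log > demo_pat/patterns.dat

# Search all files in batches; -h drops the file name prefix.
find demo_pat/csv -type f -name '*.csv' -print0 | sort -z |
  xargs -0 grep -hEf demo_pat/patterns.dat > demo_pat/found.dat
cat demo_pat/found.dat
```

Because the patterns are anchored to whole lines with exactly four fields, `foo,b,c,d` and `a,bar,c,d` are not reported: the entry appears there only in the first or second column.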

【Comments】: