Conditionally grep and concatenate lines in CSVs based on keyword

【Question title】: Conditionally grep and concatenate lines in CSVs based on keyword
【Posted】: 2021-07-12 14:02:53
【Question】:

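I have two CSVs that are too large (20M rows each) to load fully into Pandas and then filter down to the subset I actually need. (The final output needs to be a Pandas DataFrame.) I am using subprocess to run grep for a string that appears in every row I need, and then I read the output with BytesIO.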

from io import BytesIO
import subprocess as sub
import pandas as pd

output = sub.check_output(f'grep -i "{string}" /path/for/csv/file.csv', shell=True)

df = pd.read_csv(BytesIO(output))

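I do this for both files and then concatenate them with Pandas. The problem is that in some cases the string appears in only one of the two files. Currently I am using this if statement to keep the script from throwing an error: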

output = sub.check_output(f'grep -i "{string}" /path/for/csv/file.csv | wc -l', shell=True)

if int(output) > 0:  # wc -l output is bytes; convert before comparing
     ...

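What I am trying to work out is how to combine the results conditionally: if the string appears anywhere in a CSV, that file's output is saved to a variable, and the outputs are concatenated only if the string appears in both CSVs.

The most concise approach I can think of at the moment is to check both CSVs and save the counts to two different variables: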

output_1 = sub.check_output(f'grep -i "{string}" /path/for/csv/file1.csv | wc -l', shell=True)
output_2 = sub.check_output(f'grep -i "{string}" /path/for/csv/file2.csv | wc -l', shell=True)

and then write a series of conditional statements:

# wc -l output arrives as bytes, so convert it before comparing
n1, n2 = int(output_1), int(output_2)

if n1 > 0 and n2 > 0:

    output = sub.check_output(f'grep -i "{string}" /path/for/csv/file1.csv', shell=True)
    df1 = pd.read_csv(BytesIO(output))

    output = sub.check_output(f'grep -i "{string}" /path/for/csv/file2.csv', shell=True)
    df2 = pd.read_csv(BytesIO(output))

    df = pd.concat([df1, df2], axis=0)

elif n1 > 0 and n2 < 1:

    output = sub.check_output(f'grep -i "{string}" /path/for/csv/file1.csv', shell=True)
    df = pd.read_csv(BytesIO(output))

elif n1 < 1 and n2 > 0:

    output = sub.check_output(f'grep -i "{string}" /path/for/csv/file2.csv', shell=True)
    df = pd.read_csv(BytesIO(output))

else:

    print(f'{string} does not appear in the files.')

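This seems unnecessarily clunky, and it does not scale to cases where the same kind of thing has to be done with three or more CSVs. Is there a more efficient/concise way to handle two CSVs, or one that handles 3+ CSVs as easily as 2?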

Edit:

I also tried the following (@Shawn's suggestion), but it currently runs about 3x slower than grep (8.7 s vs. 2.6 s):

%%time
file_path = '/path/for/csv/file.csv'

with open(file_path) as csvfile:
    filtered = list(filter(lambda row: ('nonadmd' in row), csvfile))

Then I do the following to convert it to DataFrame format:

df = pd.DataFrame(filtered)
df.columns = ['all_data']
df = pd.DataFrame(df.all_data.apply(lambda x: x.split(',')).tolist(), index=df.index)

Sample data:

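import random
# random.choices samples with replacement; random.sample would raise
# ValueError here because k exceeds the 100-value population
df1 = pd.DataFrame({
    'col1': random.choices(['A', 'B', 'C'], k=4*10**6),
    'col2': random.choices(range(0, 100), k=4*10**6)
})
df2 = pd.DataFrame({
    'col1': random.choices(['A', 'B', 'C'], k=4*10**6),
    'col2': random.choices(range(0, 100), k=4*10**6)
})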

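【Comments】:

I don't understand why you are dragging an external program like grep into something that can be done very easily in Python.
@Shawn, the files are too large, so some processing is done in the shell before reading into pandas. The OP explains this at the start of the question.
Please create reproducible sample data. I can give it a try and see whether there is a solution.
@sammywemmy I mean, straight-up Python. Open the file, read it line by line, look for the substring. None of this pandas stuff.
@KristianCanler Yes. Though from the looks of it, even the csv library might be overkill.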

Tags: python linux pandas command-line subprocess

【Solution 1】:

A simple approach is to cat all the CSVs before running grep:

$ cat *.csv | grep -i ...
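
The same idea can be driven from Python. The following is only a minimal sketch, not a tested drop-in: the helper name `grep_files` is made up for illustration, and `-F` (fixed-string matching) is an assumption about what the keyword search needs. Handing several files to a single grep call behaves like `cat files | grep`, and `subprocess.run` without `check=True` avoids the exception that `check_output` raises when grep exits with status 1 because nothing matched.

```python
import subprocess
from typing import Sequence


def grep_files(pattern: str, paths: Sequence[str]) -> bytes:
    """Return every line in `paths` that contains `pattern`, ignoring case.

    Hypothetical helper: one grep call over all files, in the order given.
    -h drops the file-name prefix grep adds when it searches several files;
    -F treats the pattern as a fixed string (remove it to use a regex).
    """
    result = subprocess.run(
        ["grep", "-i", "-h", "-F", "-e", pattern, "--", *paths],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # grep exits with 0 when something matched, 1 when nothing matched,
    # and a value above 1 on a real error such as an unreadable file.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.stdout
```

If the returned bytes are non-empty, they can be passed to `pd.read_csv(BytesIO(out), header=None)`; the `header=None` matters because grep output normally does not include the CSV header row. An empty result should be checked first (`if out:`), since `read_csv` raises `EmptyDataError` on empty input. Because the list of paths can have any length, the two-file and three-or-more-file cases go through the same code.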

【Comments】: