How to filter a CSV file? Answers

【Question title】: How to filter csv file?
【Posted】: 2018-04-18 09:54:08
【Question description】:

I have a CSV file containing random data, and I want to filter it. I want to filter all the content that starts with $ and ends with #.

2017-09-07 03:11:03,5,hello
2017-09-07 03:11:16,6,yellow
2017-09-07 03:11:22,28,some other stuff with spaces
2017-09-08 20:24:36,157,
        2017-10-28 04:39:25,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
        2017-10-28 04:39:48,108,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
        2017-10-28 04:40:26,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
        2017-10-28 04:40:29,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#

【Question comments】:

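It's not clear to me what you are after. Could you show the expected output for the sample input above?
I just want to filter lines like this: 2017-10-28 04:39:25,54,$SITE0011,1654,0000,0000,0000,00000000,000000^A^A^A^A^A^A^@^@#
When you say filter, do you mean "exclude" these lines or "include" them?
Include only these lines in my CSV file.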

Tags: python django csv

【Solution 1】:

I think this is a good use case for a filtering generator function:

import re
import csv


def filter_lines(f):
    """This generator function uses a regular expression
    to include only lines that have a `$` and end with a `#`.
    """
    filter_regex = r'.*\$.*\#$'
    for line in f:
        line = line.strip()
        m = re.match(filter_regex, line)
        if m:
            yield line


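CSV_FILENAME = 'data.csv'  # placeholder: path to your CSV file

with open(CSV_FILENAME) as f:
    filter_generator = filter_lines(f)
    # csv.reader accepts any iterable of lines, including a generator
    csv_reader = csv.reader(filter_generator)
    for row in csv_reader:
        pass  # process each filtered row here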
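To see what this filter keeps, here is a minimal sketch that runs the same regular expression over a few lines shortened from the question's sample. It assumes Python 3, and the shortened payloads (with the control characters left out) are only for illustration. Note that each whole line is yielded, so the leading timestamp and length fields stay in the row, and a line holding two `$...#` segments still comes out as a single row.

```python
import csv
import re

# Same pattern as in the answer: a `$` somewhere in the line, `#` as the last character.
FILTER_REGEX = re.compile(r'.*\$.*\#$')


def filter_lines(lines):
    """Yield only the stripped lines that contain a `$` and end with a `#`."""
    for line in lines:
        line = line.strip()
        if FILTER_REGEX.match(line):
            yield line


# Shortened stand-ins for the sample lines in the question (illustrative only).
sample = [
    "2017-09-07 03:11:03,5,hello\n",
    "2017-09-08 20:24:36,157,\n",
    "        2017-10-28 04:39:25,54,$SITE0011,1654,0000#\n",
    "        2017-10-28 04:39:48,108,$SITE0011,1654,0000#$SITE0011,1654,0000#\n",
]

rows = list(csv.reader(filter_lines(sample)))
for row in rows:
    print(row)
```

With these four lines, only the last two pass the filter.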

Edit:

I now realize that in your example a single "line" can contain more than one match (as on line 6). This slightly modified version handles that as well:

import re
import csv


def filter_lines(f):
    """This generator function uses a regular expression
    to yield every segment that starts with a `$` and ends with a `#`;
    a single line may produce several segments.
    """
    filter_regex = r'(\$[^#]*\#)'
    for line in f:
        line = line.strip()
        matches = re.findall(filter_regex, line)
        for m in matches:
            yield m


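CSV_FILENAME = 'data.csv'  # placeholder: path to your CSV file

with open(CSV_FILENAME) as f:
    filter_generator = filter_lines(f)
    csv_reader = csv.reader(filter_generator)
    for row in csv_reader:
        print(row)  # parenthesized form works in both Python 2 and 3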

Output generated from your sample input:

['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']
['$SITE0011', '1654', '0000', '0000', '0000', '00000000', '000000^A^A^A^A^A^A^@^@#']

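【Comments】:

Hey @smassey, my CSV file contains a lot of unwanted data, and when I use your code it fails with UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 826: invalid start byte.
Hi @vipulgangwar, encoding errors are the biggest PITA I have run into in Python 2. On which line is the UnicodeDecodeError raised? I suspect that changing yield m to yield unicode(m).encode('utf-8') will solve your problem.
Hey @smassey, your suggested code works fine, but I also want the two fields 2017-10-28 04:39:25,54 together with the other fields.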
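The two follow-ups in these comments (the UnicodeDecodeError and keeping the leading timestamp and length fields) were not answered in the thread. One possible way to handle both in Python 3 is sketched below; it is not the answerer's code. The names `iter_records` and `read_records` are hypothetical, and the sketch assumes that the first two comma-separated fields never contain a `$` and that the segments contain no quoted commas, so a plain `split(',')` is enough. Opening the file with `errors="replace"` substitutes U+FFFD for bytes that are not valid UTF-8 instead of raising.

```python
import io
import re

# "timestamp,length,rest": the first two fields must not contain `,` or `$` (assumption).
LINE_RE = re.compile(r'^([^,$]*),([^,$]*),(.*)$')
# One segment: from a `$` up to and including the next `#`.
SEGMENT_RE = re.compile(r'\$[^#]*#')


def iter_records(lines):
    """Yield [timestamp, length, *segment_fields] for every `$...#` segment."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # line does not have the expected two leading fields
        timestamp, length, rest = m.groups()
        for segment in SEGMENT_RE.findall(rest):
            # Plain split: assumes segments never contain quoted commas.
            yield [timestamp, length] + segment.split(',')


def read_records(path, encoding="utf-8"):
    """Read `path` leniently and yield the records found in it."""
    # errors="replace" turns undecodable bytes (such as 0x98) into U+FFFD.
    with open(path, encoding=encoding, errors="replace", newline="") as f:
        for record in iter_records(f):
            yield record


# Usage on in-memory text shaped like the question's sample (shortened payloads).
demo = io.StringIO(
    "2017-09-07 03:11:03,5,hello\n"
    "        2017-10-28 04:39:48,108,$SITE0011,1654#$SITE0011,1654#\n"
)
records = list(iter_records(demo))
for record in records:
    print(record)
```

A line with two segments produces two records that share the same timestamp and length. If the file's real encoding is known, passing it as `encoding` is preferable to replacing bytes.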