Counting phrases EXCEPT when they are preceded by another phrase in Python

【Question title】: Counting phrases EXCEPT when they are preceded by another phrase in Python
【Posted】: 2015-12-07 11:54:26
【Question description】:

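Using pandas in Python 2.7, I am trying to count the number of times a phrase (e.g., "very good") appears in pieces of text stored in a CSV file. I have multiple phrases and multiple pieces of text. I got this first part working with the following code: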

for row in df_book.itertuples():
    index, text = row
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()

for row in df_phrase.itertuples():
    index, phrase = row
    count = sum(1 for x in re.finditer(r"\b%s\b" % (re.escape(phrase)), normed))
    file.write("%s," % (count))

However, I don't want to count the phrase if it is preceded by a different phrase (e.g., "it is not"). So I used a negative lookbehind assertion:

for row in df_phrase.itertuples():
    index, phrase = row
    for row in df_negations.itertuples():
        index, negation = row
        count = sum(1 for x in re.finditer(r"(?<!%s )\b%s\b" % (negation, re.escape(phrase)), normed))

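The problem with this approach is that it records a value for every single negation pulled from the df_negations data frame. So, if finditer doesn't find "it is not very good", it records a 0, and so on for every possible negation.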

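What I really want is a single overall count of the number of times a phrase is used without a preceding negation. In other words, I want to count every occurrence of "very good", but only when it is not preceded by any negation (such as "it is not") from my list of negations.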

I would also be happy to hear suggestions for making the process run faster. I have 100+ phrases, 100+ negations, and over 10,000 pieces of text.

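【Question discussion】:

I believe you should read this: Regex Pattern to Match, Excluding when… / Except between
This looks like exactly what I need. Do you have any suggestions for how I could use this approach with a separate CSV file, where my negations are stored one per row?

Tags: python regex python-2.7 pandas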

【Solution 1】:

I don't really do pandas, but this cheesy non-pandas version gives some results with the data you sent me.

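The main complication is that the Python re module does not allow variable-width negative look-behind assertions. So this example looks for matching phrases, saving the starting location and text of each one, and then, if any were found, looks for negations in the same source string, saving the ending location of each negation. To make sure a negation's ending location is the same as the starting location of the phrase it negates, we capture the whitespace after each negation along with the negation itself.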
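To illustrate the restriction, here is a minimal sketch that is not part of the original answer. The sample phrase, negations, and text are made up, and it assumes exactly one space separates a negation from the phrase, as in the question's own pattern. It shows that a single look-behind whose alternatives differ in length is rejected, while chaining one fixed-width look-behind per negation is accepted.

```python
import re

# Hypothetical sample data, not taken from the asker's CSV files.
negations = ["it is not", "not"]
phrase = "very good"

# One look-behind whose alternatives differ in length is rejected by re.
try:
    re.compile(r"(?<!(?:%s) )\b%s\b"
               % ("|".join(map(re.escape, negations)), re.escape(phrase)))
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Possible workaround: one fixed-width look-behind per negation, chained.
lookbehinds = "".join(r"(?<!\b%s )" % re.escape(n) for n in negations)
pattern = re.compile(lookbehinds + r"\b" + re.escape(phrase) + r"\b")

text = "it is very good but that is not very good and it is not very good"
count = sum(1 for _ in pattern.finditer(text))
print(variable_width_ok, count)
```

With 100+ negations the chained pattern gets long, and it cannot tolerate a varying amount of whitespace, which is one reason to prefer the position-based approach described above.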

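Repeatedly calling functions in the re module is fairly expensive. If you have as much text as you say, you might want to batch it up, e.g. by using 'non-matching-string'.join() on some of your source strings.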
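One way the batching idea might look is sketched below. This is an illustration under assumptions, not the answerer's code: the function name count_unnegated and the separator value are hypothetical, and the separator is assumed never to occur inside any phrase or negation. Because the separator contains word characters, a negation at the end of one text cannot line up with a phrase at the start of the next.

```python
import re
from collections import Counter

# Assumption: this token never occurs in any phrase or negation.
SEPARATOR = " zzsepzz "

def count_unnegated(texts, phrases, negations):
    """Count phrases not immediately preceded by a negation, in one pass."""
    # Compile each pattern once instead of on every row.
    phrase_re = re.compile(
        r"\b(%s)\b" % "|".join(re.escape(p) for p in phrases))
    negation_re = re.compile(
        r"\b(?:%s)\W+" % "|".join(re.escape(n) for n in negations))

    # Normalise every text the same way as the answer, then join them.
    normed = SEPARATOR.join(
        re.sub(r"[^\sa-zA-Z0-9]", "", t).lower().strip() for t in texts)

    # End positions of negations (including their trailing whitespace).
    negated_ends = set(m.end() for m in negation_re.finditer(normed))

    counts = Counter()
    for m in phrase_re.finditer(normed):
        if m.start() not in negated_ends:
            counts[m.group()] += 1
    return counts

sample = ["It is very good.", "This is not", "very good indeed",
          "It is not very good!"]
result = count_unnegated(sample, ["very good"], ["it is not", "not"])
print(result)
```

Joining everything at once gives totals across all texts only; if per-text counts are needed, joining in smaller batches and tracking the separator positions would be required.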

import re
from collections import defaultdict
import csv

def read_csv(fname):
    with open(fname, 'r') as csvfile:
        result = list(csv.reader(csvfile))
    return result

df_negations = read_csv('negations.csv')[1:]
df_phrases = read_csv('phrases.csv')[1:]
df_book = read_csv('test.csv')[1:]

negations = (str(row[0]) for row in df_negations)
phrases = (re.escape(str(row[1])) for row in df_phrases)

# Capture the whitespace that follows each negation so that the match's
# end position lines up with the start of the phrase that follows it.
negation_pattern = r"\b((?:%s)\W+)" % '|'.join(negations)
phrase_pattern = r"\b(%s)\b" % '|'.join(phrases)

counts = defaultdict(int)

for row in df_book:
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', row[0]).lower().strip()

    # Find the location and text of any matching good groups
    phrases = [(x.start(), x.group()) for x in
                    re.finditer(phrase_pattern, normed)]
    if not phrases:
        continue

    # If we had matches, record the end location of every matching
    # negation
    negated = set(x.end() for x in re.finditer(negation_pattern, normed))

    for start, text in phrases:
        if start not in negated:
            counts[text] += 1
        else:
            print("%r negated and ignored" % text)

for phrase, count in sorted(counts.items()):
    print("%d %s" % (count, phrase))

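【Discussion】:

Since this is my first time posting, what other helpful information can I provide? I tried running the code but got an error: 'Traceback (most recent call last): File "C:\...\Extract.py", line 28, in <module> phrase_pattern = '|'.join(phrases) File "C:\...\Extract.py", line 26, in <genexpr> adjectives = (re.escape(row[1]) for row in df_phrases.itertuples()) File "C:\Python27\lib\re.py", line 210, in escape s = list(pattern) TypeError: 'numpy.int64' object is not iterable'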
I imagine one of your rows contains a number rather than a string. Try wrapping row[1] as str(row[1]).
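A minimal sketch of that suggestion follows. The rows here are hypothetical stand-ins for the (index, value) pairs that itertuples() yields, with one value deliberately numeric; the point is only that the value is converted to text before it reaches re.escape.

```python
import re

# Hypothetical rows in the (index, value) shape that itertuples() yields;
# the second value is a number rather than a string.
rows = [(0, "very good"), (1, 42)]

# Convert each value to text before escaping it for use in a pattern.
escaped = [re.escape(str(value)) for _, value in rows]
phrase_pattern = r"\b(%s)\b" % "|".join(escaped)
print(phrase_pattern)
```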