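Counting Specific Phrases Using Python

【Question Title】: Counting Specific Phrases Using Python
【Posted】: 2020-06-24 22:12:24
【Question】:

I am trying to count how often specific phrases occur in a string I built in Python. I have been able to list specific single words, but never anything involving two-word phrases. I just want to be able to build a list of two-word items and get a count for each one.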

import pandas as pd
import numpy as np
import re
import collections
import plotly.express as px

# read_excel has no separator option and already returns a DataFrame
df = pd.read_excel("Datasets/realDonaldTrumprecent2020.xlsx",
                   names=["Tweet_ID", "Date", "Text"])
df.head()

tweets = df["Text"]

# Join with a space so the last word of one tweet does not fuse with the first word of the next
raw_string = ' '.join(tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
no_capital_letters = no_special_characters.lower()

# split() with no argument also drops the empty strings left by repeated spaces
words_list = no_capital_letters.split()

phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

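I was initially able to get a list containing only single words, but I would like to get a list of the instances where the phrases occur. Is there a way to do this?

【Comments】:

Tags: python python-3.x pandas list

【Solution 1】:

Pandas provides some good tools for this kind of task.

For example, if your DataFrame looks like this: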

import pandas as pd

df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    'What does Sleepy Joe read? Webster\'s Dictionary? Fake News!',
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})

...you can select the tweets containing the phrase 'fake news' like this:

selector = df.text.str.lower().str.contains('fake news')

This produces the following boolean Series:

0     True
1     True
2    False
3     True
Name: text, dtype: bool

You can count how many of them are True with sum:

sum(selector)

and use it to index the DataFrame to get an array of the matching tweets:

df.text[selector].values
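
The same idea can be applied to every phrase in the question's list at once. The following is a minimal sketch, not part of the original answer: it reuses the sample DataFrame above, and the names `lowered`, `tweet_counts` and `occurrence_counts` are made up for illustration. Note that these are plain substring matches, so 'radical left' also matches 'radical leftist'.

```python
import re
import pandas as pd

df = pd.DataFrame({'text': [
    'Encyclopedia Britannica is FAKE NEWS!',
    "What does Sleepy Joe read? Webster's Dictionary? Fake News!",
    'Sesame Street is lamestream media by radical leftist Big Bird!!!',
    '1788 was a rigged election! Landslide for King George! Fake News',
]})
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

# Lowercase once so the lowercase phrases match regardless of the tweet's casing.
lowered = df['text'].str.lower()

# Number of tweets that contain each phrase; regex=False treats the phrase literally.
tweet_counts = {p: int(lowered.str.contains(p, regex=False).sum()) for p in phrases}

# Total occurrences of each phrase; str.count expects a regex, so the phrase is escaped.
occurrence_counts = {p: int(lowered.str.count(re.escape(p)).sum()) for p in phrases}

print(tweet_counts)
print(occurrence_counts)
```

The two dictionaries differ only when a phrase appears more than once in the same tweet, which does not happen in this sample.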

【Comments】:

【Solution 2】:

If you want to count how many times these phrases occur in the text, the code below should work.

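for phrase in phrases:
    # Search the cleaned text itself: words_list holds single words,
    # so a two-word phrase can never equal one of its items
    count = no_capital_letters.count(phrase)
    print(phrase, count)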
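
If whole-word matches are wanted, one alternative is to count adjacent word pairs (bigrams) and look the phrases up in that tally. This is a sketch under assumptions rather than the answerer's method: `text` is a hypothetical stand-in for the cleaned string `no_capital_letters` from the question, and `bigrams` and `phrase_counts` are illustrative names.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned, lowercased text built in the question.
text = "fake news again the radical leftist media spreads fake news about sleepy joe"
phrases = ['fake news', 'lamestream media', 'sleepy joe', 'radical left', 'rigged election']

words = text.split()

# Pair each word with its successor to form every two-word phrase in the text.
bigrams = Counter(' '.join(pair) for pair in zip(words, words[1:]))

# A Counter returns 0 for missing keys, so absent phrases need no special handling.
phrase_counts = {phrase: bigrams[phrase] for phrase in phrases}
print(phrase_counts)
```

Unlike a substring count, this does not report 'radical left' for the words 'radical leftist', because the pair is compared as two whole words. It only works for phrases of exactly two words.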

As for the "list of instances where the phrases occur", you should be able to modify the for loop above slightly:

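phrase_list = []
for phrase in phrases:
    for tweet in tweets:
        # Check that the phrase occurs in the tweet, not the reverse;
        # lowercase the tweet because the phrases are lowercase
        if phrase in tweet.lower():
            phrase_list.append(tweet)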

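【Comments】:

When I use the bottom code I get this error: AttributeError: 'str' object has no attribute 'contains'. I get the same error when I try replacing tweets with no_capital_letters.
I changed "contains" to "in". I hope this works for you!
Hi, I no longer get an error, but I get an empty list.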
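
An empty list like the one reported in the last comment is what a reversed membership test such as `tweet in phrase` produces, since a whole tweet is almost never a substring of a two-word phrase; comparing lowercase phrases against tweets in their original casing has the same effect. The sketch below shows one possible way to collect the matching tweets per phrase. The sample tweets are made-up stand-ins for `df["Text"]`, and `tweets_by_phrase` is an illustrative name, not something from the thread.

```python
# Hypothetical sample tweets standing in for df["Text"] from the question.
tweets = [
    'Encyclopedia Britannica is FAKE NEWS!',
    'What does Sleepy Joe read? Fake News!',
    'Nothing to see here.',
]
phrases = ['fake news', 'sleepy joe', 'rigged election']

# Map each phrase to the tweets that mention it, ignoring case.
tweets_by_phrase = {
    phrase: [tweet for tweet in tweets if phrase in tweet.lower()]
    for phrase in phrases
}

for phrase, matches in tweets_by_phrase.items():
    print(phrase, len(matches), matches)
```

Keeping the matches in a dictionary keyed by phrase, rather than one flat list, preserves which phrase each tweet was found for.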