Python WordCloud not removing stopwords

【Question title】: Python WordCloud not removing Stopwords
【Posted】: 2020-07-02 03:41:59
【Question】:

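I am trying to build a word cloud that automatically pulls the words out of a job description. With stopwords=None, WordCloud is supposed to drop the words in its built-in stopword list, but my program does not. I believe this may have to do with how I pull the job description with Beautiful Soup. Either I need to extract the words differently with BeautifulSoup, or I am not using stopwords correctly, and I need help working out which.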

import requests
# pip install bs4
from bs4 import BeautifulSoup
# pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Goes to a job description
url = "https://career.benteler.jobs/job/Paderborn-Head-of-Finance-&-Controlling-North-America-NW/604307901/?locale=en_US"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')

# Goes through all the words in the beautiful soup text
combinedWords = ''

for words in soup.find_all('span'):
    separatedWords = words.text.split(' ')
    combinedWords += " ".join(separatedWords) + ' '

# creates wordcloud
resumeCloud = WordCloud(stopwords=None, background_color='white', max_words=75, max_font_size=75, random_state=1).generate(combinedWords)

plt.figure(figsize=(8, 4))
plt.imshow(resumeCloud)
plt.axis('off')
plt.show()

【Comments】:

Does this answer your question? Why are stop words not being excluded from the word cloud when using Python's wordcloud library?
Duplicate of stackoverflow.com/questions/61953788/…
@barny, the second one definitely helped. Setting collocations=False worked. Thanks.
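
For readers who land here first, the fix from the comment above can be sketched roughly as follows. This is only an illustration under assumptions: cloud_kwargs and build_cloud are hypothetical helper names, the other settings are copied from the question, and whether stopwords leak through with the default collocations=True may depend on the installed wordcloud version.

```python
def cloud_kwargs(collocations=False):
    """Keyword arguments for WordCloud, mirroring the question's settings.

    Hypothetical helper, not part of the wordcloud library.
    """
    return dict(
        stopwords=None,             # None: fall back to the library's built-in STOPWORDS
        collocations=collocations,  # False: count single words only, no two-word phrases
        background_color='white',
        max_words=75,
        max_font_size=75,
        random_state=1,
    )


def build_cloud(text):
    """Build the word cloud from plain text using the settings above."""
    # Imported here so cloud_kwargs() can be inspected without wordcloud installed.
    from wordcloud import WordCloud
    return WordCloud(**cloud_kwargs()).generate(text)
```

In the question's script this would replace the WordCloud(...).generate(combinedWords) line with build_cloud(combinedWords); the plotting code stays the same.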

Tags: python beautifulsoup word-cloud

【Solution 1】:

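The main problem is that all the code sits in one block. Try splitting the logic into functions and testing each piece on its own. The request also does not check for errors (for example, the server might be unavailable, although that should not be the problem right now).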

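BeautifulSoup is pulling every span element on the page, which means the menu and footer are included too. If you only want the job description, you probably need to select the span whose class name is jobdescription. You can then call text on it to strip out the HTML. I'm not sure whether you also need to remove other things such as commas and full stops.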
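
On the punctuation point, one option is to clean and count the words yourself before WordCloud sees them. The following is a rough sketch using only the standard library, not a definitive approach: word_frequencies and EXAMPLE_STOPWORDS are hypothetical names, and the stopword set is deliberately tiny (wordcloud ships a much larger one as wordcloud.STOPWORDS).

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list, for illustration only.
EXAMPLE_STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def word_frequencies(text, stopwords=EXAMPLE_STOPWORDS):
    """Count words in text, ignoring case, punctuation and stopwords."""
    # Lowercase first, then keep runs of letters and apostrophes;
    # commas, full stops and digits are dropped by the pattern.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


freqs = word_frequencies("Head of Finance, and Controlling. Finance for the region.")
print(freqs.most_common(3))
```

The resulting counts could then be handed to WordCloud's generate_from_frequencies method instead of generate, which as far as I know uses the counts as given rather than re-tokenising the text.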

I don't have any experience with WordCloud, but the code below returns something that looks like a result.

import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get_job_html(url):
    response = requests.get(url)
    response.raise_for_status() # check for 4xx & 5xx errors
    return response.text

def extract_combined_words(html):
    soup = BeautifulSoup(html, 'html.parser')
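    # Target the span with class "jobdescription"; .text strips out the HTML.
    span = soup.find("span", {"class": "jobdescription"})
    if span is None:
        raise ValueError("no span with class 'jobdescription' found on the page")
    job_description = span.text.replace('\n', ' ')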
    print(job_description) # TODO - Check this is the results you expect?
    return job_description

def create_resume_cloud(combinedWords):
    return WordCloud(stopwords=None, background_color='white', max_words=75, max_font_size=75, random_state=1).generate(combinedWords)

def plot_resume_cloud(resumeCloud):
    plt.figure(figsize=(8, 4))
    plt.imshow(resumeCloud)
    plt.axis('off')
    plt.show()

def run(url):
    html = get_job_html(url)
    combinedWords = extract_combined_words(html)
    resumeCloud = create_resume_cloud(combinedWords)
    plot_resume_cloud(resumeCloud)  # shows the figure; the function itself returns None
    return resumeCloud  # TODO - not sure how the result gets consumed; returning the cloud object for now

if __name__ == '__main__':
    run("https://career.benteler.jobs/job/Paderborn-Head-of-Finance-&-Controlling-North-America-NW/604307901/?locale=en_US")

【Comments】:

This is exactly how I wanted to clean up the data. Someone else also gave me the WordCloud solution. Thanks!