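【Posted】: 2020-07-02 03:41:59
【Question】:
I am trying to build a word cloud that automatically pulls the words out of a job description. With stopwords=None, WordCloud should fall back to its built-in list of known stopwords and remove them, but my program does not. I suspect this is related to how I extract the job description with Beautiful Soup. I need help: either I should extract the words differently with BeautifulSoup, or I am not using stopwords correctly.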
import requests
# pip install bs4
from bs4 import BeautifulSoup
# pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Goes to a job description
url = "https://career.benteler.jobs/job/Paderborn-Head-of-Finance-&-Controlling-North-America-NW/604307901/?locale=en_US"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
# Goes through all the words in the beautiful soup text
combinedWords = ''
for words in soup.find_all('span'):
    separatedWords = words.text.split(' ')
    combinedWords += " ".join(separatedWords) + ' '
# creates wordcloud
resumeCloud = WordCloud(stopwords=None, background_color='white', max_words=75, max_font_size=75, random_state=1).generate(combinedWords)
plt.figure(figsize=(8, 4))
plt.imshow(resumeCloud)
plt.axis('off')
plt.show()
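On the extraction side, looping over every `span` can miss text that sits outside a `span` and can count the text of nested spans twice. Below is a minimal sketch of one alternative, assuming the visible page text is what should feed the word cloud. The `sample_html` string and the `combined_words` name are hypothetical stand-ins, not the real job page; it is not confirmed that this changes the stopword behaviour.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<html><head><style>p { color: red; }</style><script>var tracking = 1;</script></head>
<body>
  <span>Head of Finance</span>
  <p>You will lead the <span>controlling</span> team.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Remove elements whose contents are code rather than visible text.
for tag in soup(["script", "style"]):
    tag.decompose()

# get_text() joins every remaining text node once; split()/join collapses whitespace.
combined_words = " ".join(soup.get_text(separator=" ").split())
print(combined_words)
```

In the original script, `combined_words` would take the place of `combinedWords` in the call to `generate()`.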
【Comments】:
-
@barny, the second one definitely helped. Setting collocations=False worked. Thanks.
Tags: python beautifulsoup word-cloud