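【Posted】: 2017-04-26 23:10:09
【Question】:
I am writing a program that fetches three poems from the internet, parses the HTML with Python, and reports information such as word counts and the kinds of noun phrases used. In my frequency_counter function I am trying to count the most common words across the three poems, and I only want to count words longer than 3 characters (so that words like "a" and "the" are excluded). I think I made a mistake in the list comprehension (item = [item for item in total_library if len(item) >= 3]). I have included my imports and the first two functions for context, but the problem I am running into is only in the last, small function. Any tips on how my list comprehension should look?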
import requests
from bs4 import BeautifulSoup
import html2text
from textblob import TextBlob
from collections import Counter
def get_text(*args):
    text_list = []
    total_list = []
    for link in args:
        url = link
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        title = soup.find('title')  # finds title
        # print(title)
        text = html2text.html2text(soup.prettify())
        lines = text.split("\n")
        for word in lines:  # for every item in text
            text_tuple = [title, word]  # makes tuple
            text_list.append(text_tuple)  # append tuple to empty list
        # print(text_list)
        for item in text_list:
            title_dictionary = {"title": title, "text": item[1]}
            total_list.append(title_dictionary)
    # print(total_list)
    return total_list
def big_index(text_list):
    each_text = []
    for entry in text_list:  # for every entry in text_list, creates smaller
    total_text = ""
    for x in each_text:
        y = str(x)
        total_text = total_text + y
    total_library = total_text.split("text title:")
    # print(total_text)
    return total_library
#problem I ran into here: this gives me the books twice, not once. I plan
#to solve this by taking any counts I get in the future functions and
#dividing them by two. Ugly, but I can't figure out where the problem is.
def frequency_counter(total_library):
    words = []
    for item in total_library:
        item = [item for item in total_library if len(item) >= 3]
        blob1 = TextBlob(item)
        count = blob1.word_counts
        frequency = Counter(count).most_common(10)  # 10 most common words
        words.append(frequency)
    print(words)
    return words
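For comparison, here is a minimal sketch of how the filtering step might look. It rests on several assumptions: each entry of total_library is a plain string, the counts should be combined across all texts, and words are split with a simple regular expression from the standard library rather than TextBlob so that the example runs on its own. The min_length and top_n parameters are hypothetical names introduced for illustration. The key difference from the code above is that the length test is applied to individual words, and the result handed on is never a list where a string is expected.

```python
import re
from collections import Counter


def frequency_counter(total_library, min_length=4, top_n=10):
    """Return the top_n most common words with at least min_length letters."""
    counts = Counter()
    for text in total_library:
        # Split each text into lowercase words first ...
        words = re.findall(r"[a-z]+", text.lower())
        # ... then filter the words (not the texts) by length.
        counts.update(word for word in words if len(word) >= min_length)
    return counts.most_common(top_n)


poems = [
    "The rose is red and the violet is blue",
    "A rose by any other name would smell as sweet",
]
print(frequency_counter(poems, top_n=3))
```

With min_length=4 the test is equivalent to len(word) > 3, so "the" is dropped; the original len(item) >= 3 would have kept it.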
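【Comments】:
- If you don't want words like the to end up in the list, the list comprehension should be item = [item for item in total_library if len(item) > 3]
- You should probably test word not in stop_words instead of testing the length, where stop_words is a set like the one at ranks.nl/stopwords
- And you definitely should not be doing that inside a loop
- I should also say: the error I get when I run it is TypeError: The text argument passed to __init__(text) must be a string, not .
- Please update your question with the error message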
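The stop-word suggestion from the comments could be sketched roughly as follows. The STOP_WORDS set here is a hypothetical, deliberately tiny stand-in for a real list such as the one at ranks.nl/stopwords, and count_without_stop_words is an illustrative name, not part of any library. The input is assumed to be an already tokenized list of words.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only;
# a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "by", "as"}


def count_without_stop_words(words, stop_words=STOP_WORDS, top_n=10):
    """Count words case-insensitively, skipping those found in stop_words."""
    kept = (w.lower() for w in words if w.lower() not in stop_words)
    return Counter(kept).most_common(top_n)


sample = "The rose is red and the rose is sweet".split()
print(count_without_stop_words(sample, top_n=1))
```

Unlike a length test, this keeps short but meaningful words such as "red" while still dropping "the" and "and".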
Tags: python