【Question title】:How to skip words under a certain length when counting frequency
【Posted】:2017-04-26 23:10:09
【Question description】:

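I'm writing a program that fetches three poems from the internet, parses the HTML with Python, and reports information such as word counts and the kinds of noun phrases. In my function frequency_counter, I'm trying to count the most common words across the three poems, and I only want to count words longer than 3 characters (so that words like "a" and "the" are excluded), but I think I made a mistake in my list comprehension (item = [item for item in total_library if len(item) >= 3]). I've included my imports and the first two functions for context, but the problem I'm having is only in the last small function. Any tips on what my list comprehension should look like?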

import requests
from bs4 import BeautifulSoup
import html2text
from textblob import TextBlob
from collections import Counter


def get_text(*args):
    text_list =[]
    total_list=[]
    for link in args:
        url = link
        r = requests.get(url)
        soup = BeautifulSoup(r.content,'html.parser')
        title = soup.find('title') #finds title 
        #print(title)
        text = html2text.html2text(soup.prettify())
        lines = text.split("\n")
        for word in lines: #for every item in text
            text_tuple = [title, word] #makes tuple
            text_list.append(text_tuple) #append tuple to empty list
           # print(text_list)
        for item in text_list:  
            title_dictionary = {"title": title, "text": item[1]}
            total_list.append(title_dictionary)
    #print(total_list)
    return total_list

def big_index(text_list):
    each_text = []
    for entry in text_list: #for every entry in text_list, creates smaller 

    total_text = ""
    for x in each_text:
        y = str(x)
        total_text = total_text + y
    total_library = total_text.split("text title:")
    #print(total_text)
    return total_library
    #problem I ran into here: this gives me the books twice, not once. I plan
    #to solve this by taking any counts I get in the future functions and 
    #dividing them by two. Ugly, but I can't figure out where the problem is. 

def frequency_counter(total_library):
    words = []
    for item in total_library:
        item = [item for item in total_library if len(item) >= 3]
        blob1 = TextBlob(item)
        count = blob1.word_counts
        frequency = Counter(count).most_common(10) #10 most common words
        words.append(frequency)
    print(words) 
    return words

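【Question comments】:

  • If you don't want words like "the" in the list, the list comprehension should be item = [item for item in total_library if len(item) > 3]
  • You should probably test word not in stop_words rather than testing length, where stop_words is a set like the one at ranks.nl/stopwords
  • And you definitely shouldn't do that inside a loop
  • I should also say: the error I get when I run it is TypeError: The text argument passed to __init__(text) must be a string, not
  • Please update the question with your error message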
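
The stop-word suggestion in the comments can be sketched roughly as follows. This is only an illustration: STOP_WORDS is a tiny hand-picked sample (a real list such as the one at ranks.nl/stopwords is much longer), and the helper name count_without_stop_words and the regex tokenizer are assumptions that do not appear in the original code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def count_without_stop_words(text, top_n=10):
    """Return the top_n most common words in text, ignoring stop words."""
    # Lowercase, then split into runs of letters/apostrophes (a simple tokenizer).
    words = re.findall(r"[a-z']+", text.lower())
    # Membership tests against a set are fast and independent of word length.
    kept = [word for word in words if word not in STOP_WORDS]
    return Counter(kept).most_common(top_n)


sample = "The rose and the thorn, the rose of the garden"
print(count_without_stop_words(sample, top_n=3))
```

Filtering on membership rather than length keeps short but meaningful words (for example "sea" or "sun") that a length cutoff would discard.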

Tags: python


【Solution 1】:
def frequency_counter(total_library):
    words = []
    items = [item1 for item1 in total_library if len(item1) > 3]
    for item in items:
        blob1 = TextBlob(item)
        count = blob1.word_counts
        frequency = Counter(count).most_common(10) #10 most common words
        words.append(frequency)
    print(words) 
    return words

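【Comments】:

  • Hmm. For some reason, although your answer makes sense to me, it still gives me the most common words of any length, including words like "a" and "of".
  • Is your character encoding UTF-8?
  • I honestly don't know what that means, sorry. I googled it, but it was mostly gibberish to me; I'm just a beginner. I know there are cases where you have to use it, but I don't know why.
  • Does your "total_library" contain only English words?
  • Add this line to the top of your .py file: # -*- coding: utf-8 -*-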
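
A likely reason the answer still reports words like "a" and "of": each item in total_library appears to be a whole block of text rather than a single word, so len(item) > 3 measures the length of the text, not of the words inside it. (In the question's version, item is also rebound to a list before being passed to TextBlob, which is consistent with the reported TypeError.) Below is a minimal sketch that filters at the word level instead, assuming total_library is a list of strings. It swaps TextBlob for a simple regex tokenizer plus collections.Counter so that it runs with only the standard library; the min_length and top_n parameters are additions for illustration.

```python
import re
from collections import Counter


def frequency_counter(total_library, min_length=4, top_n=10):
    """For each text, return its top_n most common words of at least min_length characters."""
    results = []
    for text in total_library:
        # Split the text into lowercase words first...
        words = re.findall(r"[a-z']+", text.lower())
        # ...then filter on the length of each word, not of the whole text.
        long_words = [word for word in words if len(word) >= min_length]
        results.append(Counter(long_words).most_common(top_n))
    return results


poems = ["A rose is a rose of the garden", "The night of the long night"]
print(frequency_counter(poems))
```

If you would rather keep TextBlob, the same length test could be applied to the keys of blob1.word_counts before handing the counts to Counter.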