Count the frequency of a specific word on a specific URL - Python

【Question title】: Count the frequency of a specific word on a specific URL - Python
【Posted】: 2022-01-03 17:42:36
【Question】:

I want to count how often a specific word appears on a given URL. I currently have a way to do this for a small set of URLs and a single word:

import requests
from bs4 import BeautifulSoup

url_list = ["https://www.example.org/","https://www.example.com/"]

#the_word = input()
the_word = 'Python'

total_words = []
for url in url_list:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    words = soup.find_all(text=lambda text: text and the_word.lower() in text)
    count = len(words)
    words_list = [ ele.strip() for ele in words ]
    for word in words:
        total_words.append(word.strip())

    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
    print(words_list)


#print(total_words)
total_count = len(total_words)

However, I would like to be able to map a set of words to their respective URLs, as in the dataframe below.

Target Word	Target URL
word1	www.example.com/topic-1/
word2	www.example.com/topic-2/

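Ideally, the output would give me a new column containing the frequency with which each word appears on its associated URL. For example, how often 'word1' appears on 'www.example.com/topic-1/'.

Any and all help is much appreciated!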

【Comments】:

Have you tried using str.count()?

Tags: python dataframe web-scraping beautifulsoup word-count
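
As a rough illustration of what this comment suggests, the sketch below applies str.count() to a made-up sentence (the sample text is an assumption, not taken from any page). It shows two properties worth knowing before using it for word frequency: the method is case-sensitive, and it counts substrings, so "Pythonic" also counts as a hit for "Python".

```python
text = "Python is fun. I like python; Pythonic code is nice."
word = "Python"

# str.count is case-sensitive and counts non-overlapping substrings,
# so "Python" and "Pythonic" match here but "python" does not.
case_sensitive = text.count(word)

# Lowercasing both sides makes the count case-insensitive.
case_insensitive = text.lower().count(word.lower())

print(case_sensitive, case_insensitive)
```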

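【Solution 1】:

Simply iterate over your structure, whether it is a dict, a list of dicts, or something similar. The following example only points in a direction, because your question is not entirely clear and the exact expected result is missing. I am sure you can adapt it to your specific needs.

Example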

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = [
    {'word':'Python','url':'https://stackoverflow.com/questions/tagged/python'},
    {'word':'Question','url':'https://stackoverflow.com/questions/tagged/python'}
]

for item in data:
    r = requests.get(item['url'], allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    count = soup.body.get_text(strip=True).lower().count(item['word'].lower())
    item['count'] = count

pd.DataFrame(data)

Output

word	url	count
Python	https://stackoverflow.com/questions/tagged/python	93
Question	https://stackoverflow.com/questions/tagged/python	13

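Note: Depending on what kind of word frequency you want to determine, you should consider the following:

The human-readable text should be extracted separately from the HTML, for example with BeautifulSoup.
The tool has to be chosen according to whether the page content is served statically or dynamically. For dynamic content, selenium is the better choice because, unlike requests, it also renders JavaScript.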
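
To make the first point concrete, here is a minimal sketch of counting whole words in the visible text only. To keep it runnable without extra dependencies, it swaps in the standard library's html.parser for BeautifulSoup and uses a regular expression with word boundaries instead of str.count(). The class name VisibleText, the function count_whole_word, and the sample HTML are all hypothetical.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self._skip_depth:
            self.parts.append(data)


def count_whole_word(html, word):
    """Count case-insensitive whole-word matches in the visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up page: the script content and "Pythonic" should not be counted.
sample_html = """
<html><head><script>var python = "python";</script></head>
<body><h1>Python</h1><p>python and Pythonic code</p></body></html>
"""
result = count_whole_word(sample_html, "Python")
print(result)
```

Whether substring counting or whole-word matching is the right choice depends on what "frequency" should mean for your data, so the two approaches can legitimately give different numbers for the same page.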

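【Comments】:

Thanks! This is very helpful. The output you show is exactly what I am looking for. I am working with a csv that has been turned into a list of dicts similar to yours above. However, I cannot iterate over the list of dicts to get the same output. Thoughts?
Happy to help - this would be a good candidate for asking a new question, which keeps the scope of the actual question about count() clean. We will notice it there and help as well; an example of your list of dicts and of where you got stuck would be great. -- If this or any other answer solved your problem, please mark it as accepted - someone-answers
Thanks again! Updated the question here: stackoverflow.com/questions/70581444/…
@AlexFuss Saw that you found an answer, great - I also added a note to my answer for some context, in case you deal with dynamically served content.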
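
For the csv case raised in these comments, the following is only a sketch of one way to go from csv rows to a list of dicts with a count per row. The column names follow the table in the question; the csv content, the fake_pages dict, and the fetch_text helper are hypothetical stand-ins so that the example runs offline.

```python
import csv
import io

# Hypothetical CSV content; column names follow the table in the question.
csv_text = """Target Word,Target URL
word1,www.example.com/topic-1/
word2,www.example.com/topic-2/
"""

# Stand-in for the pages' visible text, so the sketch runs without network.
fake_pages = {
    "www.example.com/topic-1/": "word1 appears here, and word1 again",
    "www.example.com/topic-2/": "nothing relevant on this page",
}


def fetch_text(url):
    """Hypothetical helper: return the page text for a URL."""
    return fake_pages[url]


# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    text = fetch_text(row["Target URL"]).lower()
    row["count"] = text.count(row["Target Word"].lower())

for row in rows:
    print(row["Target Word"], row["Target URL"], row["count"])
```

In real use, fetch_text would be replaced by the requests and BeautifulSoup steps from the answer above, and passing rows to pd.DataFrame would give a table with the extra count column.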

【Solution 2】:

You should try the string count() method. Using your code, it would look like this:

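# Count in the page text; url.count(the_word) would only search the URL string itself.
count = soup.get_text().lower().count(the_word.lower())
print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))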

【Comments】: