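【Question title】: Getting data from sites and making a summary of it
【Posted】: 2021-02-03 04:25:27
【Question】:

Hello, I am writing two different scripts: one fetches data with Selenium, and the other produces a summary of that data. Fetching the data from the sites works fine, but when I pass that data on to be summarized, it does not make it into my summary. Please let me know where I am going wrong and how to fix it. I am new to Python and Selenium.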

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

"""
Taking input from user
"""

search_input = input("Input the keyword you want to search for:")
search_input = search_input.replace(' ', '+')

driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe")

for i in range(1):
    matched_elements = driver.get("https://www.google.com/search?q=" +
                                     search_input + "&start=" + str(i))

print(driver.title)
driver.maximize_window()
time.sleep(5)

links_url = driver.find_elements_by_xpath("//div[@class='yuRUbf']/a[@href]")
links = []


for x in links_url:
    links.append(x.get_attribute('href'))

link_data = []

for new_url in links:
    # print('\nnew url : ', new_url)

    driver.get(new_url)

    #Getting the data from the site

    try:
        link = driver.find_elements(By.TAG_NAME, "p")

        for p in link:
            datas = p.get_attribute("innerText")
            print(datas)
    except:
        continue


driver.quit()

#getting summary of data

print("\nOriginal text:")
print(datas)
textWordCount = len(datas.split())
print("The number of words in Original text are : " + str(textWordCount))


stopWords = set(stopwords.words("english"))
words = word_tokenize(datas)


freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1


sentences = sent_tokenize(datas)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

average = int(sumValues / len(sentenceValue))

summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print("\nSummary:")
print(summary)
summaryWordCount = len(summary.split())
print("\nThe number of words in summary are : " + str(summaryWordCount))

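【Comments】:

  • That is a lot of code for us to review. Could you narrow it down to the part of the script that fails and show us the error message you are getting?
  • Up to driver.quit it works fine and prints all the scraped data. But after that, when I build the summary, the whole data is not passed through to be summarized. You can see that when I call print(datas) after driver.quit, it does not print the whole text.

Tags: python python-3.x selenium summarization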


【Answer 1】:

The problem is in this line:

datas = p.get_attribute("innerText")

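This overwrites the value of datas on every iteration of the loop, so once the loop finishes only the text of the last paragraph is left. I am guessing you actually want to append to a list, or extend a string with spaces between the pieces?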
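
As a minimal sketch of the list-based approach, the snippet below collects every paragraph's text and joins it into one string before summarizing. The helper name collect_text and the sample paragraph lists are hypothetical, added only for illustration; in the real script each list would come from calling p.get_attribute("innerText") on the elements returned by driver.find_elements(By.TAG_NAME, "p").

```python
def collect_text(paragraph_texts):
    """Join the text of every paragraph into one space-separated string.

    Hypothetical helper: skips None and whitespace-only entries, which
    get_attribute("innerText") can return for empty <p> elements.
    """
    collected = []
    for text in paragraph_texts:
        if text and text.strip():
            collected.append(text.strip())
    return " ".join(collected)


# Stand-in data (an assumption for this sketch). Each list plays the role of
# the innerText values scraped from one visited page.
page_one = ["First paragraph.", "   ", "Second paragraph."]
page_two = ["Third paragraph.", None]

all_texts = []
for page in (page_one, page_two):
    # Extend instead of reassigning, so earlier pages are not lost.
    all_texts.extend(page)

datas = collect_text(all_texts)
print(datas)
```

With this change, datas holds the text of all paragraphs from all visited pages, so the word count and the tokenizers further down see the whole scraped text rather than just the final paragraph.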
