Selenium Python not able to extract text within all span tags: answers

【Question title】: Selenium Python not able to extract text within all span tags
【Posted】: 2023-04-01 07:25:01
【Question description】:

I am writing a small Python program that automates 10fastfingers. To do that, I first have to extract all the words I need to type. All of these words are stored in span tags.

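When I run my code, it extracts only the first 20-30 words instead of all of them; the rest come back as empty strings. Why does this happen? Here is my code: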

from selenium import webdriver
import time

url = "https://10fastfingers.com/typing-test/english"

browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")

browser.get(url)

time.sleep(10)

count = 1

wordlst = []

while True:
    
    try:
        word = browser.find_element_by_xpath(f'//*[@id="row1"]/span[{count}]')
        wordlst.append(word.text)
        count += 1
        
    except:
        break

print(wordlst)

Output:

['them', 'how', 'said', 'light', 'show', 'seem', 'not', 'two', 'under', 'hear', 'them', 'there', 'about', 'face', 'us', 'change', 'year', 'only', 'leave', 'number', 'found', 'father', 'people', 'house', 'really', 'my', 'spell', 'when', 'look', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

How can I fix this? Any help would be appreciated. Thanks!

【Question discussion】:

Tags: python selenium selenium-chromedriver screen-scraping

【Solution 1】:

BeautifulSoup can do this:

from selenium import webdriver
import time
from bs4 import BeautifulSoup

url = "https://10fastfingers.com/typing-test/english"

browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")
browser.get(url)
time.sleep(3)  # give the page time to load the word list
# Parse the page source, which also contains the words that are not visible yet
html_soup = BeautifulSoup(browser.page_source, 'html.parser')
div = html_soup.find_all('div', id='row1')
# Take all text inside the row and split it on whitespace into words
wordlst = div[0].get_text().split()
browser.quit()
print(wordlst)

Or,

continuing with your own approach:

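from selenium import webdriver
import time

url = "https://10fastfingers.com/typing-test/english"
browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")
browser.get(url)
time.sleep(6)  # give the page time to load the word list
# Grab every span under the row in a single call instead of one by one
spans = browser.find_elements_by_xpath('//div[@id="row1"]/span')
# innerHTML is read from the DOM, so it is filled in even where .text is ''
wordlst = [x.get_attribute("innerHTML") for x in spans]
browser.quit()
print(wordlst)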
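
For context on why the original loop returned empty strings: Selenium's `.text` only returns text that is rendered as visible, whereas `get_attribute("innerHTML")` reads the DOM directly. The likely cause here (an assumption, not something confirmed in this thread) is that the site keeps the later words hidden until you reach them. Below is a minimal sketch of the same idea written against the `find_elements(by, value)` call that Selenium 4 uses. `extract_words`, `FakeSpan` and `FakeDriver` are hypothetical names, the fake classes only imitate the visibility behaviour so the snippet runs without a browser, and the sample words are made up.

```python
def extract_words(driver, row_id="row1"):
    """Return the innerHTML of every <span> directly under the row element.

    `driver` is any object exposing Selenium 4's find_elements(by, value).
    """
    # "xpath" is the string value of selenium.webdriver.common.by.By.XPATH
    spans = driver.find_elements("xpath", f'//div[@id="{row_id}"]/span')
    words = []
    for span in spans:
        # innerHTML comes from the DOM, so spans that are not displayed
        # still yield their word; .text would give "" for them.
        words.append((span.get_attribute("innerHTML") or "").strip())
    return words


class FakeSpan:
    """Hypothetical stand-in for a WebElement (assumption: models visibility only)."""

    def __init__(self, inner_html, displayed):
        self._inner_html = inner_html
        self._displayed = displayed

    @property
    def text(self):
        # Mimics Selenium: text that is not displayed is reported as ""
        return self._inner_html if self._displayed else ""

    def get_attribute(self, name):
        return self._inner_html if name == "innerHTML" else None


class FakeDriver:
    """Hypothetical stand-in for a WebDriver that returns canned spans."""

    def __init__(self, spans):
        self._spans = spans

    def find_elements(self, by, value):
        return list(self._spans)


# Made-up sample: two visible words and one that is still hidden
driver = FakeDriver([
    FakeSpan("them", True),
    FakeSpan("look", True),
    FakeSpan("water", False),
])

print([s.text for s in driver.find_elements("xpath", "")])  # hidden word is ""
print(extract_words(driver))                                # hidden word is kept
```

With a real browser you would pass the `webdriver.Chrome(...)` instance instead of `FakeDriver`. Note that recent Selenium 4 releases no longer provide the `find_elements_by_xpath` helper used above, so `find_elements(By.XPATH, ...)` is the form to use there.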

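【Comments】:

Hey! Thank you! That worked! Would you mind explaining what wordlst=div[0].get_text().split() actually does? I haven't worked with BeautifulSoup much, so I can't quite follow it.
Sure. div is a list containing all the elements whose id is "row1". get_text() returns all the text inside the div tag, including the text from the spans, with the tags themselves stripped out. Since you wanted a list of words, I added split(). BeautifulSoup is commonly used to work with HTML easily in Python.
OK... thanks for the explanation! But what was wrong with my approach?
Nothing is wrong with your approach, and it has a very simple fix: replace wordlst.append(word.text) with wordlst.append(word.get_attribute("innerHTML")). You can also make your code much shorter by using find_elements_by_xpath.
OK... I followed the same logic to extract all the words to type in the multiplayer game, but I got a different output. Would you be willing to help me? The URL for all multiplayer games is 10ff.net/login. After logging in I tried this: html_soup = soup(browser.page_source, 'html.parser') div = html_soup.find_all('div', id = 'game') wordlst=div[0].get_text().split() print(wordlst)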
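
To make the explanation of div[0].get_text().split() above more concrete, here is a small sketch that imitates it using only the standard library's `html.parser`, so it runs without BeautifulSoup or a browser. It is a stand-in for illustration, not BeautifulSoup's actual implementation; `RowTextCollector`, `words_in` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class RowTextCollector(HTMLParser):
    """Collect every piece of text nested inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # > 0 while the parser is inside the target element
        self.chunks = []  # raw text pieces found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def words_in(html, target_id="row1"):
    # Simplification: assumes every tag inside the target has a closing tag
    parser = RowTextCollector(target_id)
    parser.feed(html)
    parser.close()
    # Like get_text(): concatenate the text with the tags stripped out.
    # Like split(): break on any run of whitespace and drop empty pieces.
    return "".join(parser.chunks).split()


# Made-up sample markup shaped like the row of word spans
sample_html = """
<div id="row1">
  <span class="highlight">them</span> <span>how</span>
  <span>said</span>
</div>
<div id="other"><span>ignored</span></div>
"""

print(words_in(sample_html))
```

One thing this makes visible: the text pieces are joined without any separator, so the split only yields separate words because there is whitespace between the spans in the markup. If another container (such as the multiplayer page's 'game' div) holds extra text or spans that are not separated by whitespace, the same call can give a different-looking list.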