Scraping text written by JavaScript from a website

【Question title】: Scrape a text that was written by javascript from website
【Posted】: 2019-02-11 19:27:08
【Question】:

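I am using BeautifulSoup to scrape character information from a website. When I try to get a character's win rate, BeautifulSoup cannot find it.

When I inspect the text in the browser, it is listed in the rendered page (my code below looks for it in a div with the class "win-rate okay-tier"). All I can find in the site's page source, and all BeautifulSoup finds, is "ranking-stats-placeholder".

Here is the code I am currently using.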

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://u.gg/lol/champions/darius/build/?role=top"

#opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#champion name
champ_name = page_soup.findAll("span", {"class":"champion-name"})[0].text

#champion win rate
champ_wr = page_soup.findAll("div", {"class":"win-rate okay-tier"})

I believe the win rate text is added by JavaScript, but I don't know how to get at it. The code I currently have returns "None" for champ_wr.

【Comments】:

stackoverflow.com/questions/13960567/…

Tags: javascript python web-scraping beautifulsoup

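【Solution 1】:

Although this text may technically be in the JavaScript itself, my first guess is that the JS pulls it in via an AJAX request. Have your program simulate that request and you will probably get all the data you need directly, without any scraping at all!

It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XMLHttpRequests.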
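
If the traffic log does reveal a JSON request behind the page, the approach above might look roughly like the sketch below. The endpoint URL and the `wins` and `matches` field names are hypothetical placeholders, not the site's real API; substitute whatever you actually observe in the network log.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Download a URL and decode the response body as JSON."""
    # Some sites reject urllib's default user agent, so send a browser-like one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def win_rate_percent(payload):
    """Compute a win rate from a payload with hypothetical 'wins' and 'matches' keys."""
    return round(100.0 * payload["wins"] / payload["matches"], 2)


# Hypothetical usage; replace the URL with the request you found:
# payload = fetch_json("https://example.com/api/champions/darius?role=top")
# print(win_rate_percent(payload))
```

Keeping the download and the calculation in separate functions means the parsing step can be checked against a saved response without hitting the site again.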

Good luck!

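【Comments】:

I couldn't find any XMLHttpRequests, but I managed to work out that everything I need is in a .js file. I don't know how to handle that, though...
If you found what you need, couldn't you just parse it straight out of the .js file, perhaps with a regular expression?
Chrome's developer tools (Ctrl+Shift+J on Linux) have a "Network" tab that lists these requests…
@SamMason Yes, but I didn't find any XMLHttpRequests in it.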
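
The regular-expression suggestion from the comments could be sketched as follows. The variable name `window.__DATA__` and the sample `js_text` are made up for illustration; the real .js file will use its own names and layout, and this only works when the embedded literal is valid JSON (quoted keys, no trailing commas).

```python
import json
import re

# Hypothetical stand-in for the text of the downloaded .js file.
js_text = 'window.__DATA__ = {"champion": "Darius", "win_rate": 50.12};'


def extract_embedded_json(source, variable):
    """Return the JSON object assigned to `variable` in JavaScript source, or None."""
    # Capture the shortest {...} that is followed by a semicolon.
    pattern = re.escape(variable) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_embedded_json(js_text, "window.__DATA__")
print(data["win_rate"])
```

The non-greedy match stops at the first `};` it sees, so a string value containing that sequence would cut the capture short; for anything more complicated a real JavaScript parser is the safer choice.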

【Solution 2】:

I'm not sure how attached you are to BeautifulSoup, but I was able to get selenium to do something useful:

# load code from selenium package
from selenium.webdriver import Remote
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# start an instance of Chrome up
chrome = Service('/usr/local/bin/chromedriver')
chrome.start()
driver = Remote(chrome.service_url)

# get the page loading
driver.get("https://u.gg/lol/champions/darius/build/?role=top")

# wait for the win rate to be populated
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "win-rate")))

# get the values you wanted
# (find_element_by_class_name was removed in Selenium 4.3; By locators work in 3 and 4)
name = driver.find_element(By.CLASS_NAME, "champion-name").text
winrate = driver.find_element(By.CLASS_NAME, "win-rate").text

# display them
print(f"name: {repr(name)}, winrate: {winrate.split()[0]}")

# clean up a bit
driver.quit()
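
The `winrate.split()[0]` step above suggests that the element's text holds the percentage followed by a label. As a small sketch of turning that text into a number, assuming the text looks like the hypothetical `sample_text` below (the real element text may differ):

```python
def parse_win_rate(element_text):
    """Return the leading percentage in the element text as a float."""
    first_token = element_text.split()[0]  # first whitespace-separated token
    return float(first_token.rstrip("%"))


# Hypothetical example of what `.text` might return for the win-rate element.
sample_text = "50.12%\nWin Rate"
print(parse_win_rate(sample_text))
```

Converting to a float makes it easy to compare or sort champions later; if the element were empty, `split()[0]` would raise an `IndexError`, so the explicit wait before reading the text still matters.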

【Comments】: