How can I scrape from a webpage that uses JavaScript to load in elements as you scroll? Answers

【Question title】: How can I scrape from a webpage that uses JavaScript to load in elements as you scroll?
【Posted】: 2020-02-01 13:42:47
【Question description】:

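A friend asked me whether I could write a web scraping script to collect Pokémon data from a specific website.

I wrote the following code to render the JavaScript and select a specific class, in order to collect data from the website (https://www.smogon.com/dex/ss/pokemon/).

The problem is that the page loads more entries as you scroll down. Is there a way to scrape from this? I'm new to web scraping, so I'm not entirely sure how it all works.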

from requests_html import HTMLSession

def getPokemon(link):
    session = HTMLSession()
    r = session.get(link)
    r.html.render()
    for pokemon in r.html.find("div.PokemonAltRow"):
        print(pokemon)
    quit()

getPokemon('https://www.smogon.com/dex/ss/pokemon/')

【Question comments】:

You can use Selenium to achieve this.
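
As a rough illustration of what the Selenium route might look like, here is a minimal sketch. It assumes the list scrolls with the window (not inside an inner container) and that rows still match the `div.PokemonAltRow` selector from the question. The names `collect_while_scrolling` and `scrape_with_selenium`, and the step, pause, and limit values, are hypothetical choices for illustration. Row texts are gathered on every step because pages like this may only keep the visible rows in the DOM.

```python
import time


def collect_while_scrolling(driver, css_selector, step_px=800, pause_s=0.5, max_steps=500):
    """Scroll the page in steps and collect the text of matching elements.

    `driver` is expected to behave like a Selenium WebDriver (it needs
    `find_elements` and `execute_script`). Texts are de-duplicated and
    returned in first-seen order.
    """
    seen = {}  # dicts keep insertion order; the keys are the row texts
    for _ in range(max_steps):
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR
        for element in driver.find_elements("css selector", css_selector):
            text = element.text.strip()
            if text:
                seen.setdefault(text, None)
        before = driver.execute_script("return window.pageYOffset;")
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        time.sleep(pause_s)  # give the page time to render new rows
        after = driver.execute_script("return window.pageYOffset;")
        if after == before:  # scroll position stopped changing: bottom reached
            break
    return list(seen)


def scrape_with_selenium(url, css_selector="div.PokemonAltRow"):
    """Hypothetical entry point; needs Selenium and a local Chrome install."""
    from selenium import webdriver  # imported lazily so the helper above has no hard dependency

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return collect_while_scrolling(driver, css_selector)
    finally:
        driver.quit()
```

Calling `scrape_with_selenium('https://www.smogon.com/dex/ss/pokemon/')` would open a real browser window. Rows with identical text are merged by this sketch, and a row that is removed from the DOM while its text is being read can raise a stale-element error, so treat it as a starting point only.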

Tags: javascript python web-scraping python-requests-html

【Solution 1】:

The data is actually present in the page source. See view-source:https://www.smogon.com/dex/ss/pokemon/ (it exists as a JavaScript variable inside a script tag).

import requests
import re
import json


response = requests.get('https://www.smogon.com/dex/ss/pokemon/')

# This regex pulls the JSON text assigned to `dexSettings` out of the page source.
data = "".join(re.findall(r'dexSettings = (\{.*\})', response.text))

# The match is still a string; json.loads() parses it into Python dicts and lists.
data = json.loads(data)

# The parsed object can now be indexed like any nested dict/list structure.
data = data.get('injectRpcs', [])[1][1].get('items', [])

for row in data:
  print(row.get('name', ''))
  print(row.get('description', ''))

See it in action here
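
The greedy `\{.*\}` pattern above relies on the whole object sitting on one line with no other `}` after it. A slightly more defensive sketch, assuming the assignment still has the form `dexSettings = {...}`, is to locate the start with a regex and let `json.JSONDecoder.raw_decode` find where the object ends. The helper name `extract_js_object` is hypothetical.

```python
import json
import re


def extract_js_object(html, variable_name):
    """Return the JSON object assigned to `variable_name` in inline script text.

    Raises ValueError if no assignment of the form `name = {` is found.
    """
    # The lookahead leaves match.end() pointing exactly at the opening brace.
    match = re.search(re.escape(variable_name) + r"\s*=\s*(?=\{)", html)
    if match is None:
        raise ValueError(f"no assignment to {variable_name!r} found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a trailing `;`, more script, closing tags).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj
```

With the response from the answer, `extract_js_object(response.text, 'dexSettings')` would stand in for the `re.findall` and `json.loads` steps; the `injectRpcs` lookup that follows stays the same. This only works while the value is strict JSON, which the answer's use of `json.loads` already assumes.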

【Comments】:

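Thanks for the solution, it's much more concise than what I wrote! I knew the data was in the page's source, but wondered whether it was possible to scrape it from the page by loading all the JavaScript. I guess that would just be overcomplicating things unnecessarily.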