Crawling a site that has infinite scrolling using Python

【Question title】: Crawl site that has infinite scrolling using Python
【Posted】: 2014-05-07 07:00:21
【Question】:

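I have been doing some research, and so far I have found the Python package I plan to use: Scrapy. Now I am trying to find a good way to build a Scrapy crawler for a website with infinite scrolling. After digging around, I found a package called Selenium, which has a Python module. I have a feeling that someone has already used Scrapy and Selenium together to crawl a site with infinite scrolling. It would be great if someone could point me to an example.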

【Comments】:

One way is to send some down-arrow key presses so that the browser scrolls down.
Take a look at: stackoverflow.com/questions/17975471/…

Tags: python selenium web-crawler scrapy

【Solution 1】:

Here is a short piece of code that worked for me:

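import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumption: the original answer does not define `driver` or the page URL;
# Firefox and the URL below are placeholders.
driver = webdriver.Firefox()
driver.get("https://example.com")

SCROLL_PAUSE_TIME = 20  # seconds to wait for new content after each scroll

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # The page stopped growing, so assume everything is loaded
        break
    last_height = new_height

# find_elements_by_class_name is gone from recent Selenium 4 releases
posts = driver.find_elements(By.CLASS_NAME, "post-text")

for block in posts:
    print(block.text)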

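【Comments】:

It would be helpful to add all the needed imports and definitions (e.g. driver) to your script.
I am using this code, but it only returns the last elements that were scrolled into view, not every element on the page.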
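
One possible explanation for the behavior in the comment above is that the site removes off-screen items from the DOM as you scroll, so a single query at the end only sees the last batch. A minimal sketch of a workaround, under that assumption, is to harvest items after every scroll and de-duplicate them. The function name and the `read_items` callback are hypothetical and not part of the original answer; only `execute_script` is a real Selenium call here.

```python
import time


def collect_while_scrolling(driver, read_items, pause=1.0, max_scrolls=50):
    """Scroll to the bottom repeatedly, harvesting items after every scroll.

    `read_items(driver)` is a hypothetical callback that returns the text of
    the items currently present in the DOM.
    """
    seen = {}  # a dict keeps insertion order and drops duplicates
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        for text in read_items(driver):
            seen.setdefault(text, None)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # the page stopped growing
        last_height = new_height
    # Pick up whatever the final scroll brought in
    for text in read_items(driver):
        seen.setdefault(text, None)
    return list(seen)
```

With a real driver, `read_items` could be something like `lambda d: [e.text for e in d.find_elements(By.CLASS_NAME, "post-text")]`. De-duplicating on text alone assumes that no two distinct posts have identical text.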

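【Solution 2】:

With infinite scrolling, the data is requested through Ajax calls. Open the web browser --> Network tab --> click the stop icon to clear the previous request history --> scroll the page --> you will now see the new requests triggered by the scroll event --> open the request headers --> find the request URL --> copy and paste the URL into a separate tab --> you will see the result of the Ajax call --> simply construct the request URLs yourself and fetch the data page by page until you reach the last page.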
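
The last step above, building the request URLs and fetching pages until the data runs out, might be sketched as follows. This assumes the endpoint returns a JSON list and takes a `page` query parameter; the URL, the parameter name, and the function names are all hypothetical and should be replaced with what the Network tab actually shows.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and parameter name; copy the real ones from the Network tab.
AJAX_URL = "https://example.com/api/items"
PAGE_PARAM = "page"


def fetch_page(page):
    """Fetch one page of results from the (assumed) JSON endpoint."""
    url = f"{AJAX_URL}?{urlencode({PAGE_PARAM: page})}"
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_all(fetch=fetch_page, max_pages=1000):
    """Request page 1, 2, 3, ... until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:  # assumption: an empty list marks the end of the data
            break
        items.extend(batch)
    return items
```

Real sites often page with an offset or a cursor token instead of a page number, in which case only `fetch_page` and the stop condition need to change.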

【Comments】:

I agree. In my experience, browser automation is never the best way to implement a crawler.

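【Solution 3】:

You can use Selenium to scrape infinite-scrolling websites such as Twitter or Facebook.

Step 1: Install Selenium using pip

pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source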

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def tearDown(self):
        # Close the browser even if the test fails
        self.driver.quit()

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        # find_element_by_link_text is gone from recent Selenium 4 releases
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom 99 times, pausing so new results can load
        for _ in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode("utf-8")


if __name__ == "__main__":
    unittest.main()

The for loop works through the infinite scroll; after it finishes, you can extract the loaded data.

Step 3: Print the data if needed.
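
The answer does not say how to get data out of the page source. As one possible sketch using only the standard library, the following collects link targets from an HTML string such as `html_source`. The class and function names are hypothetical, and collecting links is just an example of what "the data" might be.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_source):
    """Return all link targets found in the given page source string."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.links
```

Inside `test_sel`, this could be called as `extract_links(html_source)` before the encoding step, since the parser expects a `str` rather than bytes.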

【Comments】:

【Solution 4】:

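import selenium.webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = selenium.webdriver.Firefox()
driver.get("http://www.something.com")
# find_elements_by_id is gone from recent Selenium 4 releases
lastElement = driver.find_elements(By.ID, "someId")[-1]
# Sending a null key to the element scrolls it into view
lastElement.send_keys(Keys.NULL)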

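This opens a page, finds the bottom-most element with the given id, and scrolls that element into view. As the page loads more content, you have to keep querying the driver for the last element, and I have found this to be quite slow as the page grows. Most of the time is spent in the calls to driver.find_element_*, because I do not know of a way to query explicitly for the last element on the page.

Through experimentation, you may find that there is an upper limit on the number of elements the page loads dynamically. It is best to write something that loads that many elements and only then calls driver.find_element_*.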
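
One way to avoid fetching every matching element just to reach the last one, sketched here as an alternative rather than as the answer's own method, is to let the browser pick the last match in JavaScript and return only that element. `execute_script` does accept extra arguments and can return an element; the helper name and the CSS selector are hypothetical.

```python
LAST_MATCH_JS = """
var matches = document.querySelectorAll(arguments[0]);
return matches.length ? matches[matches.length - 1] : null;
"""


def scroll_last_into_view(driver, css_selector):
    """Scroll the last element matching css_selector into view.

    Returns the element, or None if nothing matches.
    """
    last = driver.execute_script(LAST_MATCH_JS, css_selector)
    if last is not None:
        driver.execute_script("arguments[0].scrollIntoView(true);", last)
    return last
```

For the example above, the call might look like `scroll_last_into_view(driver, '[id="someId"]')`. Whether this is actually faster on a given page is something to measure rather than assume.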

【Comments】: