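Web scraping when scrolling down is needed

【Question title】: Web scraping when scrolling down is needed
【Posted】: 2019-03-05 02:47:30
【Question】:

I want to scrape the titles of the first 200 questions on the page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code: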

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
    print("url")
    print(url)
    r = requests.get(url) # HTTP request
    print("r")
    print(r)
    html_doc = r.text # Extracts the html
    print("html_doc")
    print(html_doc)
    soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
    print("soup")
    print(soup)

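It gave me this text output: https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. The problem is that there are not enough of them; on the actual web page, the user has to scroll down manually to trigger additional loading.

Does anyone know how I can mimic "scrolling down" programmatically to load more of the page's content?

【Comments】:

Possible duplicate of How can I scroll a web page using selenium webdriver in python?

Tags: python web-scraping python-requests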

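【Solution 1】:

Infinite scrolling on a web page is driven by JavaScript. So, to find out which URL we need to request and which parameters to send, we either have to study the JS code running in the page thoroughly or, preferably, inspect the requests the browser makes while we scroll down the page. We can examine those requests with the browser's developer tools. See example for quora

The more you scroll down, the more requests are generated. So your requests now go to that URL instead of the normal page URL, but remember to send the correct headers and payload.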
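
As a rough sketch of what replaying such a request could look like: the endpoint URL, the header values, and the `page` payload field below are placeholders, not Quora's real API, so copy the actual values from the request shown in the Network tab of the developer tools. The standard library is used only to keep the sketch free of dependencies; the same headers and payload could be passed to requests.post instead.

```python
import json
import urllib.request

# Hypothetical values: replace them with what the browser's Network tab shows.
ENDPOINT = "https://www.example.com/ajax/load_more"
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/json",
    "Referer": "https://www.quora.com/topic/Stack-Overflow-4/all_questions",
}


def build_scroll_request(page, endpoint=ENDPOINT, headers=HEADERS):
    """Build (but do not send) the POST request for one 'scroll' batch."""
    payload = {"page": page}  # hypothetical payload field
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint, data=body, headers=headers, method="POST"
    )


if __name__ == "__main__":
    req = build_scroll_request(page=2)
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req).read()
```

Building the request separately from sending it makes it easy to compare the headers and body against what the browser sent before hitting the server.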

Another, simpler solution is to use Selenium.

【Comments】:

【Solution 2】:

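I couldn't find a way to do this with requests, but you can use Selenium. First print the number of questions on the initial load, then send the End key to simulate scrolling down. You can see the number of questions go from 20 to 40 after the End key is sent.

I have the driver wait 5 seconds before reading the DOM again, in case the script runs ahead of the page and reads the DOM before the new items have loaded. You can improve on this by using EC (expected conditions) with Selenium.

The page loads 20 questions per scroll. So if you want to collect 100 questions, you need to send the End key 5 times.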
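
To make the arithmetic explicit, here is a small sketch. The helper name `scrolls_needed` and its parameters are my own and not part of Selenium. It assumes 20 questions per scroll as described above; if the first page load already shows 20 questions, pass that as `already_loaded` and one fewer key press is needed.

```python
import math


def scrolls_needed(target, per_scroll=20, already_loaded=0):
    """Number of End key presses needed to reach `target` questions."""
    remaining = max(target - already_loaded, 0)
    return math.ceil(remaining / per_scroll)


print(scrolls_needed(100))                     # 5
print(scrolls_needed(200))                     # 10
print(scrolls_needed(200, already_loaded=20))  # 9
```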

To use the code below, you need to install chromedriver: http://chromedriver.chromium.org/downloads

    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    CHROMEDRIVER_PATH = ""  # path to the chromedriver executable
    CHROME_PATH = ""  # path to the Chrome binary
    WINDOW_SIZE = "1920,1080"

    chrome_options = Options()
    # chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
    chrome_options.binary_location = CHROME_PATH
    # Block image loading to speed up page loads
    prefs = {'profile.managed_default_content_settings.images': 2}
    chrome_options.add_experimental_option("prefs", prefs)

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

    def scrape(url, times):
        if not url.startswith('http'):
            raise Exception('URLs need to start with "http"')

        # executable_path is the Selenium 3 style; newer Selenium 4 releases
        # expect service=Service(CHROMEDRIVER_PATH) instead
        # (from selenium.webdriver.chrome.service import Service).
        driver = webdriver.Chrome(
            executable_path=CHROMEDRIVER_PATH,
            options=chrome_options
        )
        driver.get(url)

        for _ in range(times):
            q_list = driver.find_element(By.CLASS_NAME, 'TopicAllQuestionsList')
            # The leading "." keeps the XPath search inside q_list
            questions = q_list.find_elements(By.XPATH, './/div[@class="pagedlist_item"]')
            print(len(questions))  # count before scrolling

            # Pressing End jumps to the bottom and triggers the next batch
            html = driver.find_element(By.TAG_NAME, 'html')
            html.send_keys(Keys.END)

            time.sleep(5)  # crude fixed wait for the new items to load

            questions2 = q_list.find_elements(By.XPATH, './/div[@class="pagedlist_item"]')
            print(len(questions2))  # count after scrolling

        driver.quit()

    if __name__ == '__main__':
        scrape(url, 5)

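【Comments】:

Thanks for the full code... but are you sure driver.implicitly_wait(5) works? In my tests the browser closes immediately, and questions2 comes out the same as questions.
Also, we need to scroll down to trigger the additional loading, and I can't see a scroll down in your code.
Sending the End key is what simulates scrolling down. I have updated the code to use wait and time.sleep. This is probably not the best approach, but I don't know how to use EC to wait for elements to appear in the DOM.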
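
One possible replacement for the fixed sleep is to poll until the item count actually grows. The sketch below is plain Python so it runs without a browser; `wait_for_more` and its parameters are hypothetical names of my own. With Selenium, `count_items` could be something like `lambda: len(driver.find_elements(By.XPATH, '//div[@class="pagedlist_item"]'))`, and Selenium's own `WebDriverWait(driver, 5).until(...)` accepts a callable that takes the driver and expresses the same kind of wait.

```python
import time


def wait_for_more(count_items, previous, timeout=5.0, poll=0.5):
    """Poll until count_items() exceeds `previous`; return the new count.

    Raises TimeoutError if nothing new appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = count_items()
        if current > previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("no new items after %.1f seconds" % timeout)
        time.sleep(poll)
```

Compared with time.sleep(5), this returns as soon as the new batch is visible and fails loudly when the page has nothing more to load.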

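【Solution 3】:

I suggest using selenium instead of bs.
selenium can both control the browser and parse the page: scrolling down, clicking buttons, and so on.

This example scrolls down to get all the users who liked a post on Instagram.
https://stackoverflow.com/a/54882356/5611675

【Comments】: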

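【Solution 4】:

If content is only loaded when you "scroll down", this probably means the page is using JavaScript to load the content dynamically.

You can try loading the page with a web client such as PhantomJS that executes the JavaScript in it, and simulate scrolling by injecting some JS such as document.body.scrollTop = sY; (Simulate scroll event using Javascript).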
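
A minimal sketch of the JS-injection idea, assuming a Selenium WebDriver object is used to run the script rather than PhantomJS's own scripting API. The function name `scroll_to_bottom` and its parameters are my own; `execute_script` is the WebDriver method that runs JavaScript in the page. It keeps scrolling until the page height stops growing, which is only a heuristic for "nothing more to load".

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing; return rounds used."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # height unchanged: assume nothing more loaded
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap guards against pages that never stop loading; a slow connection may need a longer `pause`.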

【Comments】: