Scraping the Android Store: Answers

【Question title】: Scraping Android Store
【Posted】: 2019-04-12 02:42:13
【Question description】:

I am trying to scrape an Android Store page with Beautiful Soup to produce a file containing a list of package names. Here is my code:

from requests import get
from bs4 import BeautifulSoup
import json
import time

url = 'https://play.google.com/store/apps/collection/topselling_free'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
# Only 60 cards are present in the static HTML; indexing past them raised
# "IndexError: list index out of range", so iterate over what was found.
with open("applications.txt", "w") as file:
    for app in app_container:
        print(app.div['data-docid'])
        file.write(app.div['data-docid'] + "\n")

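The problem is that I can only collect 60 package names, because the JavaScript is never executed; to load more apps I would have to scroll down the page. How can I reproduce this behavior in Python to get more than 60 results?

【Comments】:

Tags: javascript python beautifulsoup

【Solution 1】:

My suggestion is to use Scrapy together with Splash:

http://splash.readthedocs.io/en/stable/scripting-tutorial.html

Splash is a headless browser that can render JavaScript and execute scripts. Here is an example script that scrolls the page a fixed number of times: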

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end        
    return splash:html()
end

To run this script, use the execute endpoint instead of the render.html endpoint:

script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute', 
                            args={'wait':2, 'lua_source': script}, ...)
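To make the execute endpoint more concrete, here is a minimal sketch that builds the same kind of request against Splash's HTTP API directly, without Scrapy. It assumes a Splash instance listening at http://localhost:8050; the helper names `build_execute_request` and `fetch_rendered_html` are hypothetical and not part of any library.

```python
import json
import urllib.request

# Assumption: a local Splash instance on port 8050.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"


def build_execute_request(lua_source, target_url, wait=2.0):
    """Build a POST request for Splash's execute endpoint (hypothetical helper).

    Every key besides lua_source is exposed to the Lua script as
    splash.args.<key>, which is how the script above reads
    splash.args.url and splash.args.wait.
    """
    payload = {"lua_source": lua_source, "url": target_url, "wait": wait}
    return urllib.request.Request(
        SPLASH_EXECUTE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(lua_source, target_url, wait=2.0, timeout=90):
    """Send the request and return whatever string the Lua script returns."""
    request = build_execute_request(lua_source, target_url, wait)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_rendered_html(script, url)` with the Lua script above should return the page HTML after scrolling, which can then be passed to BeautifulSoup as in the question.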

I use Scrapy for crawling, and I assume you will need to run the scrape periodically. You can use Scrapyd to run Scrapy spiders.

I got this code from here
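As a rough illustration of the Scrapyd suggestion, the sketch below builds a request for Scrapyd's schedule.json endpoint, which starts one run of a deployed spider. It assumes Scrapyd is listening at http://localhost:6800; the project and spider names are placeholders, and `build_schedule_request` is a hypothetical helper. Scrapyd only runs a spider when asked, so the periodic trigger (cron or similar) would live outside this code.

```python
import urllib.parse
import urllib.request


def build_schedule_request(project, spider, scrapyd_url="http://localhost:6800"):
    """Build a POST request asking Scrapyd to run a spider once.

    Assumption: the project has already been deployed to this Scrapyd server.
    """
    form = urllib.parse.urlencode({"project": project, "spider": spider})
    return urllib.request.Request(
        f"{scrapyd_url}/schedule.json",
        data=form.encode("ascii"),
        method="POST",
    )


def schedule_spider(project, spider, scrapyd_url="http://localhost:6800"):
    """Send the request and return Scrapyd's raw JSON reply as text."""
    request = build_schedule_request(project, spider, scrapyd_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```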

【Discussion】:

【Solution 2】:

Would you consider using a more full-featured scraper? Scrapy is built specifically for this kind of job: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016

Selenium is like driving a browser with code: if you can do something by hand, you can probably do it in Selenium too: scrape websites with infinite scrolling
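A minimal sketch of that idea follows: keep scrolling to the bottom until the page height stops growing. It relies only on the WebDriver method `execute_script`; the function name `scroll_until_stable`, the pause length, and the scroll limit are assumptions chosen for illustration, and the right pause depends on how quickly the page loads new items.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_scrolls=10):
    """Scroll to the bottom until the page height stops changing.

    `driver` is any object with an execute_script(script) method, such as a
    Selenium WebDriver. Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so assume we reached the end
        last_height = new_height
    return scrolls
```

With a real browser, this would be used roughly as: create a driver (for example `webdriver.Chrome()`), call `driver.get(url)`, call `scroll_until_stable(driver)`, and then hand `driver.page_source` to BeautifulSoup in place of `response.text`.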

Others have concluded that bs4 and requests alone are not enough for this task: How to load all entries in an infinite scroll at once to parse the HTML in python

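Also note that scraping can be a gray area, and you should always try to understand and respect the site's policies. Its TOS and robots.txt are always good places to start reading.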
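One way to check robots.txt programmatically is the standard library's `urllib.robotparser`. The sketch below parses robots.txt text and asks whether a URL may be fetched; the sample rules, user agent, and URLs are made up for illustration and are not taken from any real site. It covers robots.txt only, so the TOS still has to be read by a person.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented rules, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/public/page"))
```

For a live site, `RobotFileParser` can also download the file itself via `set_url(...)` followed by `read()`.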

【Discussion】: