How to get data past the "Show More" button that doesn't change the URL? Answers

【Question title】: How to get data past the "Show More" button that DOESN'T change the URL?
【Posted】: 2021-09-14 01:17:02
【Question】:

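I am trying to scrape article titles and links from Vogue using a site search keyword. I can't get the top 100 results because they are hidden behind the "Show More" button. I have worked around this on other sites by changing the URL, but Vogue's URL does not change to include a page number, result number, or anything similar.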

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.vogue.com/search?q=HARRY+STYLES&sort=score+desc'
r = requests.get(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs(r.content, 'html.parser')

# Only the results present in the initial HTML are found here
links = soup.find_all('a', {'class': "summary-item-tracking__hed-link summary-item__hed-link"})
titles = soup.find_all('h2', {'class': "summary-item__hed"})

# Pair each title with its link
res = []
for title, link in zip(titles, links):
    entry = {'Title': title.text.strip(), 'Link': 'https://www.vogue.com' + link['href'].strip()}
    res.append(entry)

Any tips on how to scrape the data past the "Show More" button?

【Comments】:

Tags: python web-scraping beautifulsoup python-requests

【Solution 1】:

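You have to inspect the network traffic with the browser's developer tools and work out how the site requests its data. The screenshot shows the request and the response.

As you can see, the site uses a page parameter.

Each page has 8 titles, so you need a loop over 13 pages (13 × 8 = 104) to collect 100 titles.

Code: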
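The number of pages to request follows from simple arithmetic. A minimal sketch, assuming the page size of 8 observed above stays constant (the constant names are illustrative, not part of the original answer):

```python
import math

PAGE_SIZE = 8   # items per page, as observed in the JSON response
TARGET = 100    # number of titles wanted

# Round up: 100 / 8 = 12.5, so a 13th page is needed for the last 4 titles
pages_needed = math.ceil(TARGET / PAGE_SIZE)
print(pages_needed)  # 13
```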

import html
import json

import cloudscraper

MAX_RESULTS = 100
# Create the scraper once and reuse it for every page
scraper = cloudscraper.create_scraper(browser={'browser': 'firefox', 'platform': 'windows', 'mobile': False}, delay=10)

counter = 0
for page in range(1, 14):  # 13 pages x 8 items = 104, enough for 100 results
    url = f'https://www.vogue.com/search?q=HARRY%20STYLES&page={page}&sort=score%20desc&format=json'
    json_data = json.loads(scraper.get(url).content)
    # Iterate over the items actually returned instead of assuming exactly 8
    for item in json_data['search']['items']:
        counter += 1
        title_url = 'https://www.vogue.com' + html.unescape(item['url'])
        title = html.unescape(item['source']['hed'])
        print(f'{counter} - {title} - {title_url}')
        if counter == MAX_RESULTS:
            break
    # Leave the outer loop as well once the limit is reached
    if counter == MAX_RESULTS:
        break


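【Comments】:

【Solution 2】:

You can use your browser's web developer tools to inspect the requests the site makes and find out whether it issues a specific request for the data you are interested in. In this case, the site loads more results by making a GET request to a URL like this: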

https://www.vogue.com/search?q=HARRY STYLES&page=<page_number>&sort=score desc&format=json

where <page_number> is greater than 1, because page 1 is what you see by default when you visit the site.
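The URL above contains literal spaces, which need to be percent-encoded before the request is sent. A minimal sketch of building it with the standard library, assuming the parameter names shown in the URL are all the endpoint needs (`build_search_url` is a hypothetical helper name):

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.vogue.com/search"

def build_search_url(query, page):
    """Return the JSON search URL for one page of results."""
    params = {"q": query, "page": page, "sort": "score desc", "format": "json"}
    # quote_via=quote encodes spaces as %20 instead of the default '+'
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"

print(build_search_url("HARRY STYLES", 2))
```

Keeping the parameters in a dict makes it easy to change the query or the page number without editing the URL string by hand.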

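Assuming you will request a limited number of pages: since the data format is JSON, you have to convert it to a dict() or another data structure to extract the data you want. Target the "search.items" key of the JSON object specifically, since it holds the array of article data for the requested page.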

The "title" is then search.items[i].source.hed, and you can assemble the link from search.items[i].url.
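The extraction described above can be sketched as follows. The key names `search`, `items`, `source`, `hed` and `url` come from the answers in this thread; the sample payload is made up for illustration and `extract_articles` is a hypothetical helper name:

```python
import html
import json

SITE = "https://www.vogue.com"

def extract_articles(payload):
    """Turn one decoded JSON page into a list of title/link dicts."""
    # Fall back to an empty list if the expected keys are missing
    items = payload.get("search", {}).get("items", [])
    return [
        {
            "Title": html.unescape(item["source"]["hed"]),
            "Link": SITE + html.unescape(item["url"]),
        }
        for item in items
    ]

# Made-up sample with the same shape as the response described above
sample = json.loads(
    '{"search": {"items": [{"url": "/article/example-one", '
    '"source": {"hed": "Example &amp; One"}}]}}'
)
articles = extract_articles(sample)
print(articles)
```

With a real response, the decoded JSON body of each page would be passed to the function in place of `sample`.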

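As a tip, I think it is good practice to work out manually how a site behaves first and only then try to automate the process. If you do request data from that URL, make sure to add some delay between requests so that you don't get kicked out or blocked.

【Comments】: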
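One way to add the delay mentioned above is to pause between page requests. A minimal sketch, assuming a fixed pause is acceptable (the 2-second default is an arbitrary choice, and `fetch_pages` and `fake_fetch` are hypothetical names; the real fetcher would perform the HTTP request):

```python
import time

def fetch_pages(fetch_page, first_page, last_page, delay_seconds=2.0):
    """Call fetch_page(n) for each page number, pausing between requests."""
    results = []
    for page in range(first_page, last_page + 1):
        results.append(fetch_page(page))
        # No need to wait after the final request
        if page < last_page:
            time.sleep(delay_seconds)
    return results

# Stand-in fetcher so the sketch runs without network access
def fake_fetch(page):
    return {"page": page}

pages = fetch_pages(fake_fetch, 1, 3, delay_seconds=0.1)
print(pages)
```

Passing the fetcher in as an argument keeps the throttling logic separate from whichever HTTP client is used.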