Web Scraper: I need help pulling out the text in between the attribute... Any help would be appreciated

【Question title】: Web Scraper: I need help pulling out the text in between the attribute... Any help would be appreciated
【Posted】: 2021-01-07 19:37:46
【Question】:

Link = https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&after=1&ref_=adv_nxt

My goal

I need to collect the name, genre, description, type, and release year of every video game on each page.

My problem: https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&start=9951&ref_=adv_nxt

total_games = 26,215

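On the next page, "start=9951" changes to "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D".

I originally planned to loop with pages = np.arange(1, total_games, 50), stepping from 1 to 26215 at 50 entries per page, but then I ran into this problem.

HTML: Next »

How do I pull out part of the href and add it to the full link so that I can loop?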

Result:

"https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&" + "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D" + "&ref_=adv_nxt"

Bold: this is the part of the href that I want to grab on each page in order to move to the next page; it is the part of the href that changes.
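
For illustration, here is a minimal sketch of the string handling described above, using only the standard library. It assumes the "Next" link's href is a relative URL of the form shown in the sample below (that exact href is an assumption, not copied from the page), and the helper name `next_page_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://www.imdb.com/search/title/"


def next_page_url(href):
    """Rebuild the full next-page URL from the 'after' token in an href."""
    query = parse_qs(urlsplit(href).query)
    token = query["after"][0]  # parse_qs percent-decodes the value for us
    params = {
        "title_type": "video_game",
        "sort": "user_rating,desc",
        "after": token,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the comma in "user_rating,desc" unescaped;
    # the token is percent-encoded again by urlencode.
    return BASE + "?" + urlencode(params, safe=",")


# Assumed shape of the relative href found on the "Next" link.
href = ("/search/title/?title_type=video_game&sort=user_rating,desc"
        "&after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D&ref_=adv_nxt")
print(next_page_url(href))
```

Parsing the query string rather than slicing the href by position should keep this working even if the order of the parameters changes.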

Any solution would be appreciated!

【Comments】:

Tags: python html css web-scraping beautifulsoup

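【Solution 1】:

You can save yourself the trouble and simply check whether a "Next" button exists in the HTML. If it does, extract its href and follow the link; if it does not, you have reached the last page.

Assuming you are using BeautifulSoup and already have the soup ready: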

from urllib.parse import urljoin

next_link_tag = soup.find('a', {'class': 'next-page'})  # Find the <a> tag with the class "next-page"
if next_link_tag:  # A next page exists
    # The href is relative, so resolve it against the site root
    next_link = urljoin('https://www.imdb.com/', next_link_tag.get('href'))
else:
    next_link = None  # There's no next page. Act accordingly
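
Building on that check, the whole crawl could be sketched as a loop that keeps following the "Next" link until none is found. This is only a sketch under a few assumptions: the names `fetch_html` and `iter_pages` are hypothetical, the User-Agent value is an arbitrary placeholder, the `max_pages` cap is a safety choice rather than anything the site requires, and the "next-page" class is taken from the snippet above and may not match the live markup.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download one page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def iter_pages(start_url, fetch=fetch_html, max_pages=1000):
    """Yield (url, soup) for each result page, following the "Next" link."""
    url = start_url
    for _ in range(max_pages):  # hard cap so a markup change cannot loop forever
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        next_link_tag = soup.find("a", class_="next-page")
        if next_link_tag is None or not next_link_tag.get("href"):
            break  # no "Next" link: this was the last page
        url = urljoin(url, next_link_tag["href"])  # resolve the relative href
```

A caller would iterate with `for url, soup in iter_pages(start_url):` and pull the game fields out of each `soup`. Because the next URL always comes from the page itself, the loop does not need to know whether the site paginates with `start=` or `after=`.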

【Comments】: