Error while scraping website using BeautifulSoup

【Question title】: Error while scraping website using BeautifulSoup
【Posted】: 2020-12-22 15:00:19
【Question】:

I'm trying to scrape some song lyrics from Genius. I wrote the following function:

import requests
from bs4 import BeautifulSoup

def get_song_lyrics(link):

    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    lyrics = soup.find("div",attrs={'class':'lyrics'}).find("p").get_text()
    return [i for i in lyrics.splitlines()]

I don't understand why this call:

get_song_lyrics('https://genius.com/Kanye-west-black-skinhead-lyrics')

fails with AttributeError: 'NoneType' object has no attribute 'find'

while this one:

get_song_lyrics('https://genius.com/Kanye-west-hold-my-liquor-lyrics')

correctly returns the song's lyrics. Both pages have the same layout. Can someone help me figure this out?

【Question comments】:

Tags: python beautifulsoup screen-scraping

【Solution 1】:

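I'm not sure what causes this, but it looks like BeautifulSoup sometimes finds the lyrics and sometimes doesn't, and the difference isn't due to your code. One workaround is to run the function again whenever it fails: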

import requests
from bs4 import BeautifulSoup

def get_song_lyrics(link):

    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    try:
        lyrics = soup.find("div", attrs={'class': 'lyrics'}).find("p").get_text()
        return lyrics.splitlines()
    except AttributeError:
        # The expected <div class="lyrics"> was missing from this response;
        # fetch the page again and retry.
        return get_song_lyrics(link)

get_song_lyrics('https://genius.com/Kanye-west-black-skinhead-lyrics')
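
The recursive retry above never gives up, so a page that keeps coming back without the expected tag would keep sending requests until Python's recursion limit is hit. A minimal sketch of a capped alternative is shown here. The helper name `call_with_retries` and its `max_attempts`, `delay`, and `retry_on` parameters are assumptions made for illustration and are not part of the original answer.

```python
import time


def call_with_retries(func, max_attempts=5, delay=1.0, retry_on=(AttributeError,)):
    """Call func() until it succeeds or max_attempts calls have failed.

    Hypothetical helper: only the exception types listed in retry_on
    trigger another attempt; anything else propagates immediately.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as error:
            last_error = error
            if attempt < max_attempts:
                # Brief pause before asking the server again.
                time.sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

It could be used as `call_with_retries(lambda: parse_once(link))`, where `parse_once` stands for a hypothetical version of `get_song_lyrics` without the `except` branch. The pause between attempts is a courtesy to the server and can be tuned or set to 0.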

【Comments】:

【Solution 2】:

The server returns one of two different versions of the HTML for this page. You can use this script to handle both:

import requests
from bs4 import BeautifulSoup


url = 'https://genius.com/Kanye-west-black-skinhead-lyrics'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

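# Match either layout: the first selector targets the "Lyrics__Container" divs,
# the second the paragraphs inside ".song_body-lyrics".
for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):

    # Unwrap <i> tags and merge the adjacent strings so that italic text
    # is not split onto separate lines by get_text().
    for i in tag.select('i'):
        i.unwrap()
    tag.smooth()

    t = tag.get_text(strip=True, separator='\n')
    if t:
        print(t)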

Prints:

[Produced By Daft Punk & Kanye West]
[Verse 1]
For my theme song (Black)
My leather black jeans on (Black)
My by-any-means on

...and so on.
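
To return the lines from a function, as the question's `get_song_lyrics` does, the same selectors can be wrapped up roughly as follows. This is only a sketch: the name `extract_lyrics` is hypothetical, it takes the HTML as a string so it can be tried without a network request, and it uses the built-in `html.parser` instead of `lxml` to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Selectors taken from the script above; they cover both HTML versions.
LYRICS_SELECTOR = 'div[class^="Lyrics__Container"], .song_body-lyrics p'


def extract_lyrics(html):
    """Return the lyrics found in html as a list of non-empty lines."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.select(LYRICS_SELECTOR):
        # Unwrap <i> tags and merge adjacent strings so that italic
        # fragments stay on the same line as the surrounding text.
        for italic in tag.select("i"):
            italic.unwrap()
        tag.smooth()
        text = tag.get_text(strip=True, separator="\n")
        lines.extend(line for line in text.split("\n") if line)
    return lines
```

An empty list means neither version of the markup was recognised, which the caller can treat as a signal to retry or to inspect the response.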

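【Comments】:

Is the version that gets returned chosen at random?
@7koFnMiP Yes, it seems to be some form of anti-bot protection.
How did you find out that the page returns two versions of HTML?
@Porridge I put print(soup) inside the try..except and searched the output to see which tags were missing. Sometimes the script ran fine, but other times the tags were completely different.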
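
The manual check described in the last comment can also be done in code. The sketch below reports which version of the markup a given response contains, using the class names that appear in this thread. The function name `detect_layout` and the labels "old", "new", and "unknown" are made up for illustration.

```python
from bs4 import BeautifulSoup


def detect_layout(html):
    """Return a label for the lyrics markup found in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Markup the question's code expects: <div class="lyrics">.
    if soup.find("div", attrs={"class": "lyrics"}) is not None:
        return "old"
    # Markup with generated class names that start with "Lyrics__Container".
    if soup.select_one('div[class^="Lyrics__Container"]') is not None:
        return "new"
    return "unknown"
```

Logging this label for each request is one way to confirm that the same URL really does alternate between versions.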