Getting song lyrics from Genius with BeautifulSoup | Python 3.8: answers

【Question title】: Getting lyrics of song from genius lyrics with beautifulsoup | python 3.8
【Posted】: 2020-12-05 17:27:07
【Question description】:

I am trying to get the lyrics of a song from Genius using BeautifulSoup, but when I try to print the lyrics I get no output. Here is my code:

import requests 
from bs4 import BeautifulSoup
songURL = requests.get("https://genius.com/Marshmello-and-bastille-happier-lyrics")
song = songURL.content
soup = BeautifulSoup(song, 'lxml')
lyrics = soup.find_all("section")
for lyr in lyrics:
    for lyr1 in lyrics.select("p"):
        print(lyr1.text)

Why doesn't this work? Could someone take a look? I have been trying to get this working for a while.

【Question discussion】:

Tags: python html python-3.x beautifulsoup python-requests

【Solution 1】:

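The server appears to return two versions of the page: in one, the lyrics sit inside a tag with class="song_body-lyrics"; in the other, inside tags with class="Lyrics__Container...".

This script tries to handle both cases: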

import requests 
from bs4 import BeautifulSoup

url = 'https://genius.com/Marshmello-and-bastille-happier-lyrics'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
    t = tag.get_text(strip=True, separator='\n')
    if t:
        print(t)

Prints:

[Intro]
Lately, I've been, I've been thinking
I want you to be happier, I want you to be happier
[Verse 1]

...and so on.

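【Discussion】:

So it selects whichever one matches? Either a div tag whose class starts with Lyrics__Container, or a p tag under the song_body-lyrics class? Wow, that's powerful.
@politicalscientist Yes, it is a CSS selector with a comma , (w3schools.com/cssref/sel_element_comma.asp). It's nice that BeautifulSoup supports it; very handy.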
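
The comma-separated selector can be tried offline. The sketch below is only an illustration: `OLD_LAYOUT` and `NEW_LAYOUT` are made-up snippets standing in for the two page variants, `extract_lyrics` is a hypothetical helper, and the built-in `html.parser` is used instead of `lxml` to avoid an extra dependency. The real Genius markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the two page variants.
OLD_LAYOUT = '<div class="song_body-lyrics"><p>[Intro]<br>Line one</p></div>'
NEW_LAYOUT = '<div class="Lyrics__Container-abc123">[Intro]<br>Line one</div>'

# A selector list: an element is returned if it matches either branch.
SELECTOR = 'div[class^="Lyrics__Container"], .song_body-lyrics p'


def extract_lyrics(html):
    """Return the text of every element matched by either selector branch."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True, separator="\n") for tag in soup.select(SELECTOR)]


old_result = extract_lyrics(OLD_LAYOUT)
new_result = extract_lyrics(NEW_LAYOUT)
print(old_result)
print(new_result)
```

Both calls should yield the same single block of text, which is why one selector can cover either response.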

【Solution 2】:

import requests
from bs4 import BeautifulSoup

songURL = requests.get("https://genius.com/Marshmello-and-bastille-happier-lyrics")
song = songURL.content
soup = BeautifulSoup(song, 'lxml')
final_lyrics = []
# The lyrics sit inside <div class="lyrics">, one <p> per block
lyrics = soup.find('div', {'class': "lyrics"})
lyrics = lyrics.find_all('p')
for i in lyrics:
    final_lyrics.append(i.text)
    # Print the text, not the raw tag
    print(i.text)
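
One caveat with this approach: `soup.find` returns `None` when the response has no `<div class="lyrics">` (the other page variant described in Solution 1), and calling `find_all` on `None` raises `AttributeError`. A minimal guarded sketch follows, assuming a hypothetical helper `lyrics_paragraphs` and made-up HTML snippets rather than real Genius responses:

```python
from bs4 import BeautifulSoup


def lyrics_paragraphs(html):
    """Return the text of each <p> inside <div class="lyrics">, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "lyrics"})
    if container is None:  # page variant without div.lyrics
        return []
    return [p.get_text(separator="\n", strip=True) for p in container.find_all("p")]


# Hypothetical snippets used only for illustration.
with_div = '<div class="lyrics"><p>[Intro]<br>Line one</p></div>'
without_div = '<div class="Lyrics__Container-xyz">[Intro]</div>'

found = lyrics_paragraphs(with_div)
missing = lyrics_paragraphs(without_div)
print(found)
print(missing)
```

Returning an empty list lets the caller decide whether to retry the request or fall back to another selector.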

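【Discussion】:

【Solution 3】:

You should get all the text inside one specific div. You can locate that div in your browser with devtools or view-source. Here it is <div class='lyrics'>; what makes this div unique is its class, 'lyrics'. So we find this div in the HTML and then print all the text inside it.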

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://alirezaarabi.com/view-source_https___genius.com_Alessia-cara-ready-lyrics.html').read()

soup = bs.BeautifulSoup(source,'lxml')
print(soup.title.string)

for div in soup.find_all('div', class_='lyrics'):
    print(div.text)

【Discussion】:

【Solution 4】:

If you look at the actual HTML source, there is no section tag. This is what the structure actually looks like:

<div class="song_body column_layout" initial-content-for="song_body">
  <div class="column_layout-column_span column_layout-column_span--primary">
    <div class="song_body-lyrics">
      
        <h2 class="text_label text_label--gray text_label--x_small_text_size u-top_margin">Happier Lyrics</h2>
      
      <div initial-content-for="lyrics">
        <div class="lyrics">
          
            <!--sse-->
            <p>[Intro]<br>
Lately, I've been, I've been thinking<br>
I want you to be happier, I want you to be happier<br>
<br>
...
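
To see why the question's loop printed nothing, the structure above can be parsed directly. This is a rough sketch on a trimmed copy of that markup: the closing tags are added here so the snippet is complete, and `html.parser` is an assumption chosen to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the quoted structure; closing tags added for this example.
html = """
<div class="song_body-lyrics">
  <div initial-content-for="lyrics">
    <div class="lyrics">
      <p>[Intro]<br>
Lately, I've been, I've been thinking<br>
I want you to be happier, I want you to be happier</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# No <section> exists, so find_all returns an empty list and a loop over it never runs.
sections = soup.find_all("section")

# Selecting by the div's class does find the paragraph.
paragraphs = soup.select("div.lyrics p")

# stripped_strings yields each text fragment between the <br> tags.
lines = list(paragraphs[0].stripped_strings)

print(len(sections))
print(len(paragraphs))
print(lines)
```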

【Discussion】: