【Question title】: Getting lyrics of a song from Genius lyrics with BeautifulSoup | Python 3.8
【Posted】: 2020-12-05 17:27:07
【Question】:

I am trying to get the lyrics of a song from Genius lyrics using BeautifulSoup, but when I try to print the lyrics I get no output. Here is my code:

import requests 
from bs4 import BeautifulSoup
songURL = requests.get("https://genius.com/Marshmello-and-bastille-happier-lyrics")
song = songURL.content
soup = BeautifulSoup(song, 'lxml')
lyrics = soup.find_all("section")
for lyr in lyrics:
    for lyr1 in lyrics.select("p"):
        print(lyr1.text)      

Why doesn't this work? Could someone take a look? I have been trying to get this working for a while.

【Question comments】:

    Tags: python html python-3.x beautifulsoup python-requests


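    【Solution 1】:

    The server appears to return two versions of the page: one in which the tags have class="song_body-lyrics", and another in which they have class="Lyrics__Container..."

    This script tries to handle both cases: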

    import requests 
    from bs4 import BeautifulSoup
    
    url = 'https://genius.com/Marshmello-and-bastille-happier-lyrics'
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    
    for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
        t = tag.get_text(strip=True, separator='\n')
        if t:
            print(t)
    

    Prints:

    [Intro]
    Lately, I've been, I've been thinking
    I want you to be happier, I want you to be happier
    [Verse 1]
    
    ...and so on.
    

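    【Comments】:

    • So it selects whichever one matches? Either a div tag whose class starts with Lyrics__Container, or a p tag under the song_body-lyrics class? Wow, that's powerful.
    • @politicalscientist Yes, it is a CSS selector with a comma , (w3schools.com/cssref/sel_element_comma.asp). BeautifulSoup supports it, which is very convenient.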
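The comma-separated selector discussed in these comments can be tried offline. The sketch below is a minimal illustration, not the answer author's code: the two HTML strings are hypothetical, heavily simplified stand-ins for the two page variants, `extract_lyrics` is a name made up for this example, and the built-in `html.parser` is used in place of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-ins for the two page variants (assumption).
OLD_LAYOUT = (
    '<div class="song_body-lyrics"><div class="lyrics">'
    "<p>[Intro]<br>Old layout line</p></div></div>"
)
NEW_LAYOUT = '<div class="Lyrics__Container-abc123 xyz">[Intro]<br>New layout line</div>'

# A selector list: an element is returned if it matches either branch.
SELECTOR = 'div[class^="Lyrics__Container"], .song_body-lyrics p'


def extract_lyrics(html):
    """Return the text of every element matched by SELECTOR, one line per string."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.select(SELECTOR):
        text = tag.get_text(strip=True, separator="\n")
        if text:
            blocks.append(text)
    return "\n".join(blocks)


print(extract_lyrics(OLD_LAYOUT))
print(extract_lyrics(NEW_LAYOUT))
```

Only one branch of the selector matches in each variant, so the same function handles both layouts without checking which one was served.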
    【Solution 2】:
    import requests
    from bs4 import BeautifulSoup

    songURL = requests.get("https://genius.com/Marshmello-and-bastille-happier-lyrics")
    song = songURL.content
    soup = BeautifulSoup(song, 'lxml')
    final_lyrics = []
    # find() returns None when the page has no <div class="lyrics">
    lyrics = soup.find('div', {'class': "lyrics"})
    if lyrics is not None:
        # collect and print the text of each <p> inside the lyrics div
        for i in lyrics.find_all('p'):
            final_lyrics.append(i.text)
            print(i.text)
    

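    【Comments】:

      【Solution 3】:

      You should get all the text inside one specific div. You can find that div in your browser with devtools or view-source. Here the div is <div class='lyrics'>. What makes it unique is its class, 'lyrics', so we find this div in the HTML and then print all the text inside it.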

      import bs4 as bs
      import urllib.request
      
      source = urllib.request.urlopen('https://alirezaarabi.com/view-source_https___genius.com_Alessia-cara-ready-lyrics.html').read()
      
      soup = bs.BeautifulSoup(source,'lxml')
      print(soup.title.string)
      
      for div in soup.find_all('div', class_='lyrics'):
          print(div.text)
      

      【Comments】:

        【Solution 4】:

        If you look at the actual HTML source, there is no section tag. This is what the structure actually looks like:

        <div class="song_body column_layout" initial-content-for="song_body">
          <div class="column_layout-column_span column_layout-column_span--primary">
            <div class="song_body-lyrics">
              
                <h2 class="text_label text_label--gray text_label--x_small_text_size u-top_margin">Happier Lyrics</h2>
              
              <div initial-content-for="lyrics">
                <div class="lyrics">
                  
                    <!--sse-->
                    <p>[Intro]<br>
        Lately, I've been, I've been thinking<br>
        I want you to be happier, I want you to be happier<br>
        <br>
        ...
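To see why the question's loop prints nothing, the fragment above can be parsed directly. This is a sketch under assumptions: the HTML below is a trimmed copy of that fragment with closing tags added by hand (the original is truncated), and `html.parser` stands in for `lxml`. `find_all("section")` returns an empty list, so a `for` loop over it never runs, while the lyrics are reachable through the `lyrics` class.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment; the closing tags are added by hand (assumption).
HTML = """
<div class="song_body-lyrics">
  <div initial-content-for="lyrics">
    <div class="lyrics">
      <p>[Intro]<br>
Lately, I've been, I've been thinking<br>
I want you to be happier, I want you to be happier<br>
      </p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# There is no <section> tag, so this is an empty result and a loop over it does nothing.
sections = soup.find_all("section")
print(len(sections))

# The lyrics live in <div class="lyrics">; stripped_strings skips whitespace-only text.
lyrics_div = soup.find("div", class_="lyrics")
lines = list(lyrics_div.stripped_strings)
print("\n".join(lines))
```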
        

        【Comments】:
