【问题标题】:Extracting data from script tag using BeautifulSoup in Python在 Python 中使用 BeautifulSoup 从脚本标签中提取数据
【发布时间】:2018-06-10 08:58:30
【问题描述】:

我想在 Python 中使用 BeautifulSoup 从“脚本”标签中的代码中提取“SNG_TITLE”和“ART_NAME”值。 (整个脚本太长,无法粘贴)

<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276641","UPLOAD_ID":0,"SNG_TITLE":"Heathens","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots","ART_PICTURE":"259dcf52853363d79753ec301377645d","SMARTRADIO":"1","RANK":"487762","LOCALES":[],"__TYPE__":"artist"}],"ALB_ID":"13371165","ALB_TITLE":"Heathens","TYPE":0,"MD5_ORIGIN":"5cea723b83af1ff0a62d65d334b978d4","VIDEO":false,"DURATION":"195","ALB_PICTURE":"3dfc8c9e406cf1bba8ce0695a44a9b7e","ART_PICTURE":"259dcf52853363d79753ec301377645d","RANK_SNG":"967143","SMARTRADIO":"1","FILESIZE_AAC_64":0,"FILESIZE_MP3_64":"0","FILESIZE_MP3_128":"3135946","FILESIZE_MP3_256":0,"FILESIZE_MP3_320":"7839868","FILESIZE_FLAC":"21777150","FILESIZE":"3135946","GAIN":"-12","MEDIA_VERSION":"4","DISK_NUMBER":"1","TRACK_NUMBER":"1","VERSION":"","EXPLICIT_LYRICS":"0","RIGHTS":{"STREAM_ADS_AVAILABLE":true,"STREAM_ADS":"2000-01-01","STREAM_SUB_AVAILABLE":true,"STREAM_SUB":"2000-01-01"},"ISRC":"USAT21601930","DATE_ADD":1497886149,"HIERARCHICAL_TITLE":"","SNG_CONTRIBUTORS":{"mainartist":["Twenty One Pilots"],"engineer":["Adam Hawkins"],"mixer":["Adam Hawkins"],"masterer":["Chris Gehringer"],"drums":["Josh Dun"],"producer":["Mike Elizondo","Tyler Joseph"],"programmer":["Mike Elizondo","Tyler Joseph"],"vocals":["Tyler Joseph"],"writer":["Tyler Joseph"]},"LYRICS_ID":30553991,"__TYPE__":"song"},{"SNG_ID":"99976952","PRODUCT_TRACK_ID":"171067651","UPLOAD_ID":0,"SNG_TITLE":"Stressed Out","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots", ...</script>

代码的想法是打印出用户名、可以在给定页面上找到的所有歌曲和艺术家姓名。

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

user_name = soup.find(class_='user-name')
print(user_name.text)

这会打印用户名。

for script in soup.find_all('script'):
    print(script.contents) 

如果我理解正确的话,我需要的脚本是一个字典,所以我只需要找到它并获取它的内容。问题是我不知道如何专门找到正是这个“脚本”。它没有任何属性或任何使其独特的东西。所以我尝试了一个循环来查找页面上的所有脚本并打印出它们的内容,但不知道如何进一步进行。

如何在页面上只找到这个特定的“脚本”?我可以通过其他方式访问这些值吗?

【问题讨论】:

  • 您是否要提取其中包含“window.__DZR_APP_STATE__”的脚本元素?
  • 计算代码中的脚本 - 它们不会改变位置 - 并使用索引来获得正确的 - 即。第三个脚本soup.find_all('script')[2]
  • 顺便说一句:脚本是普通字符串,因此您可以使用标准字符串函数来检查它 - 即if "loved" in script.contents:

标签: python beautifulsoup deezer


【解决方案1】:

脚本不会更改代码中的位置,因此您可以计算它们并使用索引来获取正确的脚本。

all_scripts[6]

脚本是普通字符串,因此您也可以使用标准字符串函数,即。

if '{"loved"' in script.text:

两种方法的代码 - 我使用[:100] 只显示字符串的一部分。

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

print('--- first method ---')
print(all_scripts[6].text[:100])

print('--- second method ---')
for number, script in enumerate(all_scripts):
    if '{"loved"' in script.text:
        print(number, script.text[:100])

结果:

--- first method ---
window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
--- second method ---
6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276

编辑:当您有正确的脚本时,您可以使用切片仅获取JSON 字符串并使用模块json 将其转换为python 字典,然后您可以获取数据

import requests
from bs4 import BeautifulSoup
import json

base_url = 'https://www.deezer.com/en/profile/1589856782/loved'

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

all_scripts = soup.find_all('script')

data = json.loads(all_scripts[6].get_text()[27:])

print('key:', data.keys())
print('key:', data['TAB'].keys())
print('key:', data['DATA'].keys())
print('---')

for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
    print('---')

结果:

key: dict_keys(['TAB', 'DATA'])
key: dict_keys(['loved'])
key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING'])
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Heathens
---
ART_NAME: Twenty One Pilots
SNG_TITLE: Stressed Out
---
ART_NAME: Linkin Park
SNG_TITLE: Numb
---
ART_NAME: Three Days Grace
SNG_TITLE: Animal I Have Become
---
ART_NAME: Three Days Grace
SNG_TITLE: Painkiller
---
ART_NAME: Slipknot
SNG_TITLE: Before I Forget
---
ART_NAME: Slipknot
SNG_TITLE: Duality
---
ART_NAME: Skrillex
SNG_TITLE: Make It Bun Dem
---
ART_NAME: Skrillex
SNG_TITLE: Bangarang (feat. Sirah)
---
ART_NAME: Limp Bizkit
SNG_TITLE: Break Stuff
---
ART_NAME: Three Days Grace
SNG_TITLE: I Hate Everything About You
---
ART_NAME: Three Days Grace
SNG_TITLE: Time of Dying
---
ART_NAME: Three Days Grace
SNG_TITLE: I Am Machine
---
ART_NAME: Three Days Grace
SNG_TITLE: Riot
---
ART_NAME: Three Days Grace
SNG_TITLE: So What
---
ART_NAME: Three Days Grace
SNG_TITLE: Pain
---
ART_NAME: Three Days Grace
SNG_TITLE: Tell Me Why
---
ART_NAME: Three Days Grace
SNG_TITLE: Chalk Outline
---
ART_NAME: Three Days Grace
SNG_TITLE: Gone Forever
---
ART_NAME: Slipknot
SNG_TITLE: The Devil In I
---
ART_NAME: Linkin Park
SNG_TITLE: No More Sorrow
---
ART_NAME: Linkin Park
SNG_TITLE: Bleed It Out
---
ART_NAME: The Doors
SNG_TITLE: Roadhouse Blues
---
ART_NAME: The Doors
SNG_TITLE: Riders On The Storm
---
ART_NAME: The Doors
SNG_TITLE: Break On Through (To The Other Side)
---
ART_NAME: The Doors
SNG_TITLE: Alabama Song (Whisky Bar)
---
ART_NAME: The Doors
SNG_TITLE: People Are Strange
---
ART_NAME: My Chemical Romance
SNG_TITLE: Welcome to the Black Parade
---
ART_NAME: My Chemical Romance
SNG_TITLE: Teenagers
---
ART_NAME: My Chemical Romance
SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na]
---
ART_NAME: My Chemical Romance
SNG_TITLE: Famous Last Words
---
ART_NAME: The Doors
SNG_TITLE: Soul Kitchen
---
ART_NAME: The Black Keys
SNG_TITLE: Lonely Boy
---
ART_NAME: Katy Perry
SNG_TITLE: I Kissed a Girl
---
ART_NAME: Katy Perry
SNG_TITLE: Hot N Cold
---
ART_NAME: Katy Perry
SNG_TITLE: E.T.
---
ART_NAME: Linkin Park
SNG_TITLE: Given Up
---
ART_NAME: My Chemical Romance
SNG_TITLE: Dead!
---
ART_NAME: My Chemical Romance
SNG_TITLE: Mama
---
ART_NAME: My Chemical Romance
SNG_TITLE: The Sharpest Lives
---

【讨论】:

    【解决方案2】:

    如果我的理解是正确的,您只需要其中包含“SNG_TITLE”的脚本元素。

    您可以使用re 并仅获取包含您感兴趣的字段的脚本元素,如下所示:

    import requests
    from bs4 import BeautifulSoup
    import re
    
    base_url = 'https://www.deezer.com/en/profile/1589856782/loved'
    
    r = requests.get(base_url)
    
    soup = BeautifulSoup(r.text, 'html.parser')
    
    user_name = soup.find(class_='user-name')
    print(user_name.text)
    
    for script in soup(text=re.compile(r'SNG_TITLE' )):
        print(script.parent)
    

    编辑:

    @furas 答案是使用json 查找“SNG_TITLE”和“ART_TITLE”的完整解决方案。我的回答可帮助您仅找到带有“SNG_TITLE”的脚本。您可以将两者结合起来以获得更好的代码。

    【讨论】:

      猜你喜欢
      • 2023-03-03
      • 2019-07-26
      • 1970-01-01
      • 2020-12-07
      • 1970-01-01
      • 1970-01-01
      • 2020-06-18
      • 2020-02-19
      • 2014-05-22
      相关资源
      最近更新 更多