【Question title】: Extract JSON data in Python with BeautifulSoup
【Posted】: 2020-02-12 18:43:23
【Question】:

I am trying to scrape content in JSON format from a specific website (https://paytm.com/movies/coimbatore/) using Python and the Beautiful Soup library, but I don't know how to get at the specific data.

import requests
from bs4 import BeautifulSoup
import json

URL = "https://paytm.com/movies/coimbatore/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')

movies_showing_now_div = soup.find_all('div', attrs={'class': '_1ZMxg'})
# The next line overwrites the result above with the JSON-LD script tags
movies_showing_now_div = soup.find_all('script', attrs={"type": "application/ld+json"})
for title in movies_showing_now_div:
    print(title.text)

【Question comments】:

    Tags: python-3.x web-scraping beautifulsoup


    【Answer 1】:

    First, take all the script tags inside the body tag; then use json.loads() to access the data you want.

    You can try this:

    Code:

    import requests
    from bs4 import BeautifulSoup
    import json

    URL = "https://paytm.com/movies/coimbatore/"
    r = requests.get(URL)

    soup = BeautifulSoup(r.content, 'html.parser')

    # JSON-LD metadata lives in <script type="application/ld+json"> tags
    movies_showing_now_div = soup.find('body').find_all('script', attrs={"type": "application/ld+json"})

    movies = []

    for script in movies_showing_now_div:
        jsonscript = json.loads(script.text)
        # .get() avoids a KeyError when a block has no '@type' key
        if jsonscript.get('@type') == 'Movie':
            movie = {
                'title': jsonscript['name'],
                'genre': jsonscript['genre']
            }
            movies.append(movie)

    print(movies)
    

    Result:

    [{'genre': 'drama', 'title': ' Vaanam Kottatum'},
     {'genre': 'drama', 'title': 'Seeru'},
     {'genre': 'drama, thriller', 'title': 'Psycho'},
     {'genre': 'action, adventure, crime', 'title': 'Birds of Prey'},
     {'genre': 'horror, romance', 'title': 'Malang'},
     {'genre': 'action, drama', 'title': 'Darbar'},
     {'genre': 'drama', 'title': '1917'},
     {'genre': 'drama, comedy', 'title': 'Naadodigal 2'},
     {'genre': 'drama, historical, romantic', 'title': 'Shikara'},
     {'genre': 'drama', 'title': 'Jaanu'},
     {'genre': 'drama', 'title': 'Ala Vaikunthapurramuloo'},
     {'genre': 'drama', 'title': 'Little Women'},
     {'genre': 'action, drama', 'title': 'Pattas'},
     {'genre': 'thriller, crime, mystery', 'title': 'Anjaam Pathiraa'},
     {'genre': 'action, thriller, crime', 'title': 'Bad Boys For Life'},
     {'genre': 'drama', 'title': 'Anveshanam'},
     {'genre': 'drama', 'title': 'Dagaalty'},
     {'genre': 'horror, comedy', 'title': 'Sandimuni '},
     {'genre': 'action, thriller, crime', 'title': 'Bad Boys For Life'}]
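
    The result above contains titles with stray whitespace (' Vaanam Kottatum', 'Sandimuni ') and a repeated 'Bad Boys For Life' entry. A minimal sketch of one way to tidy the list follows. The helper name `clean_movies` is hypothetical, and dropping repeats is an assumption: the two entries may be separate listings on the site that you want to keep.

    ```python
    def clean_movies(movies):
        """Strip whitespace from titles and drop repeated (title, genre) pairs."""
        seen = set()
        cleaned = []
        for movie in movies:
            title = movie["title"].strip()
            key = (title.casefold(), movie["genre"])
            if key in seen:
                continue  # already kept an entry with this title and genre
            seen.add(key)
            cleaned.append({"title": title, "genre": movie["genre"]})
        return cleaned


    # A few rows copied from the result above
    sample = [
        {"genre": "drama", "title": " Vaanam Kottatum"},
        {"genre": "horror, comedy", "title": "Sandimuni "},
        {"genre": "action, thriller, crime", "title": "Bad Boys For Life"},
        {"genre": "action, thriller, crime", "title": "Bad Boys For Life"},
    ]
    result = clean_movies(sample)
    print(result)
    ```

    The first occurrence of each entry is kept, so the original order is preserved.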
    

    【Comments】:

    • Thanks. I need some more help: I want to insert this data into a MySQL table, using the script below.
    • for script in movies_showing_now_div: jsonscript = json.loads(script.text) if jsonscript['@type'] and jsonscript['@type'] == 'Movie': cursor.execute("INSERT INTO tbl_movies (movie_name) VALUES (%s)",(movie_name))
    【Answer 2】:

    Try this:

    import requests
    from bs4 import BeautifulSoup
    import json
    
    URL = "https://paytm.com/movies/coimbatore/"
    movies = []
    
    r = requests.get(URL)
    
    soup = BeautifulSoup(r.content, 'lxml')  # lxml is usually faster than html.parser, but must be installed separately

    movies_showing_now_div = soup.find('body').find_all('script', attrs={"type": "application/ld+json"})

    for div in movies_showing_now_div:
        movie_dict = {}
        data = json.loads(div.text)
        # .get() avoids a KeyError when a block has no "@type" key
        if data.get("@type") == "Movie":
            movie_dict["movie_name"] = data["name"]
            movie_dict["genre"] = data["genre"]
            movies.append(movie_dict)
    
    print(movies)
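
    The loop above assumes every script body is a single JSON object with an "@type" key. The sketch below is a more defensive variant of the same parsing step, written against plain strings so it runs without a network request. The function name `extract_movies` and the sample strings are made up for illustration; whether the real page ever emits lists or malformed blocks is not shown in this thread.

    ```python
    import json


    def extract_movies(json_texts):
        """Return name/genre dicts for every JSON-LD object whose @type is Movie."""
        movies = []
        for text in json_texts:
            try:
                data = json.loads(text)
            except (TypeError, json.JSONDecodeError):
                continue  # skip missing (None) or malformed script bodies
            # A JSON-LD block may hold one object or a list of objects.
            items = data if isinstance(data, list) else [data]
            for item in items:
                if not isinstance(item, dict):
                    continue
                types = item.get("@type")
                types = types if isinstance(types, list) else [types]
                if "Movie" in types:
                    movies.append({"name": item.get("name"), "genre": item.get("genre")})
        return movies


    # Hypothetical script bodies covering the cases handled above
    samples = [
        '{"@type": "Movie", "name": "Seeru", "genre": "drama"}',
        '{"@context": "http://schema.org", "name": "no type here"}',
        '[{"@type": ["Movie"], "name": "Psycho", "genre": "drama, thriller"}]',
        "not json",
        None,
    ]
    found = extract_movies(samples)
    print(found)
    ```

    With BeautifulSoup, the input could be built from the tags found above, for example `[tag.string for tag in movies_showing_now_div]`; `tag.string` is None for an empty tag, which the function skips.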
    

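    【Comments】:

      【Answer 3】:

      I use title.text to extract the text inside each script tag.
      That text is JSON data, so I simply use json.loads to convert it into a dictionary, then extract the fields you need and put them in a list for later use.

      Given the requirements (extracting name, genre, and image), here is my code: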

      import requests
      from bs4 import BeautifulSoup
      import json
      
      URL = "https://paytm.com/movies/coimbatore/"
      r = requests.get(URL)
      soup = BeautifulSoup(r.content, 'html.parser')
      
      movies = []
      # JSON-LD metadata lives in <script type="application/ld+json"> tags
      movies_showing_now_div = soup.find_all('script', attrs={"type": "application/ld+json"})
      for title in movies_showing_now_div:
          json_data = json.loads(title.text)
          # .get() returns None when '@type' is missing, so no KeyError
          if json_data.get('@type') == "Movie":
              movie = {"name": json_data["name"],
                       "genre": json_data["genre"],
                       "image": json_data["image"]
                       }
              movies.append(movie)
      
      for movie in movies:
          print("Name:\t{}\nGenre:\t{}\nImage:\t{}\n".format(movie['name'], movie['genre'], movie['image']))
      

      Sample output

      Name:   Vaanam Kottatum
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Vaanam-Kottatum-Tamil-Web-poster-705x750-213b1eaf-2e77-4825-9ee8-ae117d354592.jpg
      
      Name:   Seeru
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/seeru_web_705x750_psd-1028fe75-3147-4732-95f4-4e05e558bce5.jpg
      
      Name:   Naan Sirithal
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Naan-Sirithal-705x750-2042f291-c470-43db-a3ab-21400202a090.jpg
      
      Name:   Psycho
      Genre:  drama, thriller
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/psycho_web_705x750_psd-e1000de5-d47e-455c-a294-9309a725e30b.jpg
      
      Name:   Malang
      Genre:  horror, romance
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/malang-poster_web_705x750_psd-25688127-de49-4bed-94a1-a69bc69e00c4.jpg
      
      Name:   Darbar
      Genre:  action, drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Darbar-tamil-Web-poster-705x750-7ace9f7d-1fe8-4506-b920-f1c72a4d552f.jpg
      
      Name:   World Famous Lover
      Genre:  drama, romance
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/World-Famous-Lover-Telugu-Web-poster-705x750-04e70194-d75a-4309-89a4-2d636c71a08b.jpg
      
      Name:   1917
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/1917-Web-poster-705x750-35a92d72-89f8-4ee3-9da3-da1e35ebdef9.jpg
      
      Name:   Harley Quinn: Birds Of Prey
      Genre:  action, adventure, crime
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/_Birds-of-Prey-Web-poster-705x750-41265fe7-ea32-49ae-b3d8-72fbca2e7970.jpg
      
      Name:   Parasite
      Genre:  drama, thriller
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Parasite-Korean-Web-poster-705x750-4ca2d1d5-3f94-4af0-9c41-ab564dc455d8.jpg
      
      Name:   Ayyappanum Koshiyum
      Genre:  action, comedy
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Ayyappanum-Koshiyum-malyalam-Web-poster-705x750-7c94eb30-b197-4a1c-9dd3-e920dbb85592.jpg
      
      Name:   Varane Avashyamund
      Genre:  action, drama, family
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Varane-Avashyamund-Malayalam-Web-poster-705x750-83264d5e-6fa8-40b6-8527-50eb44b4b8c8.jpg
      
      Name:   Naadodigal 2
      Genre:  drama, comedy
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/NAADODIGAL-2-Tamil-Web-poster-705x750-1f6c24ff-80bd-41b2-8b41-aae87070eff8.jpg
      
      Name:   Shikara
      Genre:  drama, historical, romantic
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/SHIKARA--705x750-3fbabda6-3093-493e-876c-47fc8100d9f4.jpg
      
      Name:   Jaanu
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Jaanu-Web-poster-705x750-0f3cc028-7ea1-4410-a5c1-dff19844b5c3.jpg
      
      Name:   Ala Vaikunthapurramuloo
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Ala-Vaikunthapuramulo-Web-poster-705x750-f916ecdc-d07b-4b0e-959c-1bbcf8e2cd40.jpg
      
      Name:   Anjaam Pathiraa
      Genre:  thriller, crime, mystery
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/aanjam-pathiraa_web_705x750_psd-4b4922dd-0877-4db1-ac7e-733bda22ccf9.jpg
      
      Name:   Pattas
      Genre:  action, drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Pattas-Tamil-Web-poster-705x750-6c6b62fa-2590-44ea-918b-cce68a1ac5f0.jpg
      
      Name:   Anveshanam
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/anveshanam_web_705x750_psd-495caeb6-2b1e-4001-8e37-6bf925ea075d.jpg
      
      Name:   Little Women
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/little-women_web_705x750_jpg-77420f75-8e39-4db5-bf03-06298ec93c91.jpg
      
      Name:   Dagaalty
      Genre:  drama
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Dagaalty-Tamil-Web-poster-705x750-0b4c383d-506c-478c-8979-0654a73bd357.jpg
      
      Name:   Sandimuni
      Genre:  horror, comedy
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Sandimuni-Tamil-Web-poster-705x750-ac0e354d-e3a6-4f00-a5e6-88f7f2695b4e.jpg
      
      Name:   Bad Boys For Life
      Genre:  action, thriller, crime
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Bad-Boys-for-Life-Web-poster-705x750-4bbd571f-ca3d-4667-b01d-f38c175511bb.jpg
      
      Name:   Bad Boys For Life
      Genre:  action, thriller, crime
      Image:  https://s3-ap-southeast-1.amazonaws.com/assets.paytm.com/images/cinema/Bad-Boys-for-Life-Web-poster-705x750-b828b281-afe3-4eb1-9f33-42ecddc70496.jpg
      

      【Comments】:

      • cursor.execute("INSERT INTO tbl_movies (movie_name,genre,image,language,duration) VALUES (%s,%s,%s,%s,%s)",(movie[' name'],movie['genre'],movie['image'],movie['inLanguage'],movie['duration']))
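      • I am trying to insert the data into MySQL, but it is not working. Please help me.
      • Hi @Arunsankar! What error are you getting? It might be a good idea to open a new question for this...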
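
      The thread never reports the actual error. One possible cause is visible in the comment's own call: it reads movie[' name'] (with a leading space), movie['inLanguage'] and movie['duration'], but the movie dict built in this answer only has the keys name, genre and image, so those lookups would raise a KeyError. The sketch below shows a parameterized insert of the three available fields. It uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; this is an assumption for illustration, not the asker's setup. The image URLs are placeholders, and sqlite3 uses "?" placeholders where MySQL drivers use "%s" as in the comment.

      ```python
      import sqlite3

      # Sample rows shaped like the `movies` list built in the answer above;
      # the image URLs are placeholders.
      movies = [
          {"name": "Seeru", "genre": "drama", "image": "https://example.com/seeru.jpg"},
          {"name": "Psycho", "genre": "drama, thriller", "image": "https://example.com/psycho.jpg"},
      ]

      # sqlite3 stands in for MySQL here so the sketch runs without a server.
      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE tbl_movies (movie_name TEXT, genre TEXT, image TEXT)")

      # One tuple per row, with keys that actually exist in each dict.
      rows = [(m["name"], m["genre"], m["image"]) for m in movies]
      conn.executemany(
          "INSERT INTO tbl_movies (movie_name, genre, image) VALUES (?, ?, ?)", rows
      )
      conn.commit()  # without a commit the inserted rows may not be persisted

      inserted = conn.execute(
          "SELECT movie_name, genre FROM tbl_movies ORDER BY rowid"
      ).fetchall()
      print(inserted)
      ```

      To store language and duration as well, those values would first have to be added to each movie dict when it is built; whether the page's JSON-LD provides inLanguage and duration fields is not shown in this thread.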