How do I scrape the "description" of movies on the IMDB website using BeautifulSoup? Answers

【Question title】: How do I scrape "description" of movies in the IMDB website using BeautifulSoup?
【Posted】: 2020-08-29 13:55:38
【Question】:

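I am using BeautifulSoup to scrape movies from the IMDB website. I can successfully scrape each movie's name, genre, duration, and rating, but not its description. Looking at the classes, the description has the class "text-muted", and that same class also holds other data several times, such as the rating, genre, and duration. Those other fields also have inner classes, so they were easy to scrape, but the description has no inner class. As a result, selecting on "text-muted" alone also returns the other data. How do I get only the movie's description?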

Code and a screenshot are attached for reference:

The sample code I used to scrape the genre is as follows:

# `data` is the BeautifulSoup object for the listing page
genre_tags = data.select(".text-muted .genre")
genre = [g.get_text() for g in genre_tags]
Genre = [item.strip() for item in genre if item.strip()]  # drop empty entries
print(Genre)
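
To make the problem concrete, here is a minimal sketch that reproduces it. The HTML fragment is a simplified, hypothetical stand-in for one IMDB list item (it is not copied from the site), and the summary text is a placeholder. It shows that selecting on "text-muted" alone returns the paragraph holding the runtime and genre as well as the description.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup modelled on one list item (an assumption,
# not the real IMDB page source)
SAMPLE_HTML = """
<div class="lister-item mode-detail">
  <p class="text-muted">
    <span class="runtime">142 min</span>
    <span class="genre"> Drama </span>
  </p>
  <div class="ratings-bar"><strong>9.3</strong></div>
  <p class="text-muted"> A short plot summary goes here. </p>
</div>
"""

data = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selecting on the shared class alone matches both paragraphs
muted = data.select(".text-muted")
texts = [" ".join(tag.stripped_strings) for tag in muted]
print(len(muted))
print(texts)
```

Only the second paragraph is the description; the first one is matched as well because it carries the same class.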

【Comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

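In general, I find lxml much better than BeautifulSoup. With XPath you can pick the description by its position: it is the "text-muted" paragraph that follows the ratings bar.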

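import requests
from lxml import html

url = "xxxx"  # URL of the IMDB list page (left out in the original answer)

r = requests.get(url)
tree = html.fromstring(r.text)

# Each movie sits in its own "lister-item" container
rows = tree.xpath('//div[@class="lister-item mode-detail"]')

for row in rows:
    # The description is the text-muted <p> that follows the ratings bar
    description = row.xpath('.//div[@class="ratings-bar"]/following-sibling::p[@class="text-muted"]/text()')[0].strip()
    print(description)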
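
Since the question asks about BeautifulSoup specifically, the same idea can probably be expressed there with a CSS sibling selector. The sketch below is an assumption-laden illustration: the HTML is a simplified, hypothetical fragment, `extract_descriptions` is a name made up for this example, and the class names are taken from the XPath above. It relies on `select_one`, which needs a Beautiful Soup version with CSS selector support (4.7 or later).

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: the second item has no ratings bar
# and no description
SAMPLE_HTML = """
<div class="lister-item mode-detail">
  <p class="text-muted"><span class="genre">Drama</span></p>
  <div class="ratings-bar"><strong>8.1</strong></div>
  <p class="text-muted">
    First plot summary.</p>
</div>
<div class="lister-item mode-detail">
  <p class="text-muted"><span class="genre">Comedy</span></p>
</div>
"""


def extract_descriptions(html_text):
    """Return one description per list item, or None where none is found."""
    soup = BeautifulSoup(html_text, "html.parser")
    descriptions = []
    for item in soup.select("div.lister-item"):
        # "~" selects a later sibling, like following-sibling:: in XPath
        tag = item.select_one("div.ratings-bar ~ p.text-muted")
        descriptions.append(tag.get_text(strip=True) if tag else None)
    return descriptions


print(extract_descriptions(SAMPLE_HTML))
```

Returning None instead of indexing with [0] avoids an IndexError on items that have no description.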

【Comments】:

【Solution 2】:

You can use this :) If it helps, please upvote my solution. Thanks.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

URL = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm'  # URL of Most Popular Movies on IMDB

session = HTMLSession()  # reuse one session for every request
page = BeautifulSoup(session.get(URL).html.html, 'html.parser')

movies = page.find_all("tbody", "lister-list")  # table body of Most Popular Movies
for cell in movies[0].find_all("td", "titleColumn"):
    a = cell.find('a')  # link to the movie page
    href = 'https://www.imdb.com' + a.get('href')
    moviepage = BeautifulSoup(session.get(href).html.html, 'html.parser')  # request and parse each movie page
    heading = list(moviepage.find_all('h1')[0].stripped_strings)
    title = heading[0]  # parse title
    year = heading[2]  # parse year
    try:
        score = list(moviepage.find_all('div', 'ratingValue')[0].stripped_strings)[0]  # parse score if available
    except IndexError:
        score = '-'  # fill in '-' when no score is available
    description = list(moviepage.find_all('div', 'summary_text')[0].stripped_strings)[0]  # parse description
    print(f'TITLE: {title}      YEAR: {year}       SCORE: {score}\nDESCRIPTION:{description}\n')
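
The code above guards only the score against a missing element; the title, year, and description lookups would raise an IndexError if the page layout differs. One way to apply the same '-' fallback everywhere is a small helper like the sketch below. `first_text` is a hypothetical name introduced for this example, and the HTML is a made-up fragment that reuses the "summary_text" class from the code above.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, class_name, default='-'):
    """Return the first stripped string of the first matching tag, or `default`."""
    tag = soup.find(name, class_name)
    if tag is None:
        return default
    # stripped_strings is a generator; next() falls back to `default` if it is empty
    return next(tag.stripped_strings, default)


# Hypothetical fragment: a summary block is present, a rating block is not
SAMPLE_HTML = (
    '<div class="summary_text">\n  A short plot summary. '
    '<a href="#">See more</a></div>'
    '<div class="empty"></div>'
)
page = BeautifulSoup(SAMPLE_HTML, 'html.parser')

description = first_text(page, 'div', 'summary_text')
score = first_text(page, 'div', 'ratingValue')
print(description, score)
```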

小萨尔达尼亚 @UmSaldanha

【Comments】: