【Question title】: BeautifulSoup selector returning empty list
【Posted】: 2021-04-01 19:25:48
【Question】:

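First of all, I'm new to web scraping, so I apologize if my terminology is off. I'm trying to get four items (movie title, runtime, genre, and year) from this IMDb Top 1000 movies page into a Pandas DataFrame. I'm following a tutorial (https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/) that first breaks the process down: you pull a single element (here, one movie) out of a list of HTML elements and use HTML tags to get the attributes you want (title, runtime, genre, and year). However, when I continue with the tutorial and use a selector to get the elements under the main tag for every movie, I end up with an empty list.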

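Here is the first part of the process (extracting a single element from the list of HTML elements and getting the desired attributes for that element, i.e. one movie):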

# Let's get the html from https://www.imdb.com/search/title/?groups=top_1000&sort=runtime,asc. 
# We’ll need to first download it using the requests.get method.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib as mpl

page = requests.get("https://www.imdb.com/search/title/?groups=top_1000&sort=runtime,asc")

# create an instance of the BeautifulSoup class to parse our document
soup = BeautifulSoup(page.content, 'html.parser')

top_1000 = soup.find(id = "main") # Find outermost element containing all relevant movie info 
film_items = top_1000.find_all(class_='lister-item mode-advanced') # Get the element containing the list of films
first_film = film_items[0]     # Get first film in list
print(first_film.prettify())

tags = first_film.find_all('a') # Get all <a> tags
title = tags[1].text # Title is embedded in the second item in this list
genre = first_film.find(class_='genre').get_text() 
year = first_film.find(class_="lister-item-year text-muted unbold").get_text() 
runtime = first_film.find(class_="runtime").get_text() 

print(title)
print(genre)
print(year)
print(runtime)

Output:

Sherlock Jr.

Action, Comedy, Romance
(1924)
45 min

But... when I use a selector to get the data for all the movies, it returns an empty list:

# Select all items with the class genre inside an item with the class lister-item mode-advanced in top_1000.
# Use a list comprehension to call the get_text method on each BeautifulSoup object.
genre = top_1000.select(".lister-item mode-advanced .genre")
genres = [g.get_text() for g in genre]
print(genres)

Output:

[]

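I thought I might have to include every nested element when calling the selector, but I tried calling the elements nested below "lister-item mode-advanced" and that also returned an empty list. In fact, even when I include only "lister-item mode-advanced" in the selector, I get an empty list. I'm following the tutorial to the letter, but it doesn't seem to work. I'd appreciate any help with this, and again I apologize for any terminology mix-ups; I'm new to working with HTML.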

【Comments】:

    Tags: python html beautifulsoup


    【Solution 1】:

    Class names cannot contain spaces. A space-separated string in a class attribute means the element has multiple classes.

    So an element like this:

    <div class="lister-item mode-advanced"></div>
    

    has two classes: lister-item and mode-advanced.

    You can reference multiple classes with dot syntax, e.g. .lister-item.mode-advanced

    Try this:

    genre = top_1000.select(".lister-item.mode-advanced .genre")
    genres = [g.get_text().strip() for g in genre]
    print(genres)
    
    ['Action, Comedy, Romance', 'Drama, Horror', 'Action, Adventure, Comedy', 'Comedy, Drama, Family', 'Comedy, Musical, War', 'Drama, Horror, Sci-Fi', 'Horror, Sci-Fi', 'Animation, Adventure, Family', 'Comedy, Romance', 'Animation, Sci-Fi', 'Drama, History, Thriller', 'Drama, Horror, Sci-Fi', 'Fantasy, Horror, Mystery', 'Animation, Action, Crime', 'Animation, Family, Fantasy', 'Animation, Adventure, Family', 'Comedy', 'Crime, Drama, Mystery', 'Drama, Horror, Sci-Fi', 'Drama', 'Comedy', 'Animation, Comedy, Drama', 'Drama, Romance', 'Animation, Adventure, Comedy', 'Crime, Drama, Thriller', 'Animation, Crime, Mystery', 'Animation, Comedy, Fantasy', 'Comedy, Music', 'Comedy, Fantasy, Romance', 'Animation, Family, Fantasy', 'Animation, Action, Crime', 'Crime, Drama, Film-Noir', 'Comedy, Horror', 'Animation, Family, Fantasy', 'Drama, Thriller, Western', 'Drama, Thriller', 'Comedy, War', 'Comedy, Crime', 'Comedy, Drama, Family', 'Crime, Mystery, Thriller', 'Drama, Romance', 'Animation, Family, Fantasy', 'Animation, Action, Adventure', 'Drama, Music, Romance', 'Comedy, Horror', 'Comedy, Drama, Romance', 'Comedy, Drama, Family', 'Comedy, Musical', 'Action, Crime, Comedy', 'Animation, Action, Drama']
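To see why the original selector matched nothing, here is a small offline sketch. The HTML is made up for illustration and only mimics the class names from the question; it is not IMDb's actual markup. In a CSS selector a space is the descendant combinator, so `.lister-item mode-advanced` asks for a `<mode-advanced>` tag nested inside a `.lister-item` element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names from the question.
html = """
<div id="main">
  <div class="lister-item mode-advanced">
    <span class="genre">
Action, Comedy, Romance            </span>
  </div>
  <div class="lister-item mode-advanced">
    <span class="genre">
Drama, Horror            </span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="main")

# Space = descendant combinator: this looks for a <mode-advanced> tag,
# which does not exist, so nothing matches.
broken = main.select(".lister-item mode-advanced .genre")

# Chained dots = a single element that carries both classes.
fixed = main.select(".lister-item.mode-advanced .genre")

# find_all(class_=...) with a space compares against the exact attribute
# string, which is why it worked in the question. It is order-sensitive.
exact = main.find_all(class_="lister-item mode-advanced")
reordered = main.find_all(class_="mode-advanced lister-item")

print(len(broken), len(fixed), len(exact), len(reordered))
print([g.get_text().strip() for g in fixed])
```

This also explains why the `find_all` call in the question worked while `select` did not: the two methods interpret a space in the class string differently.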
    

    Note: I also added the .strip() method, which removes any leading and trailing whitespace, because your raw "genre" data looks like this:

    '\nAction, Comedy, Romance            '
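As an alternative to calling `.strip()` on the result, `get_text()` accepts a `strip=True` argument. A minimal sketch on a made-up `<span>` that reproduces the raw value above:

```python
from bs4 import BeautifulSoup

# Hypothetical element reproducing the raw genre string shown above.
html = '<span class="genre">\nAction, Comedy, Romance' + " " * 12 + "</span>"
tag = BeautifulSoup(html, "html.parser").find(class_="genre")

raw = tag.get_text()                # keeps the newline and trailing spaces
manual = tag.get_text().strip()     # the approach used in this answer
builtin = tag.get_text(strip=True)  # same result for a single text node

print(repr(raw))
print(manual == builtin)
```

When an element contains several nested text nodes, `strip=True` strips each fragment and joins the pieces with `separator` (empty by default), so the two approaches can differ in that case.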
    

    【Discussion】:

    • Thanks! This is very helpful. How would I get the title (e.g. Sherlock Jr.) from the HTML code with the href? --> Sherlock Jr.
    • @j.c.hayes82 Here's a one-liner: print([s.find('a').get_text() for s in top_1000.select(".lister-item-header")])
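Putting the pieces from this thread together, the sketch below collects title, genre, year, and runtime per film into a list of dictionaries. The HTML is a hypothetical stand-in built from the class names mentioned above (the second film and both `href` values are invented), and `text_of` is a helper introduced only for this example; the real page layout may differ or may have changed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from this thread.
html = """
<div id="main">
  <div class="lister-item mode-advanced">
    <h3 class="lister-item-header">
      <a href="/title/example-1/">Sherlock Jr.</a>
      <span class="lister-item-year text-muted unbold">(1924)</span>
    </h3>
    <span class="runtime">45 min</span>
    <span class="genre">
Action, Comedy, Romance            </span>
  </div>
  <div class="lister-item mode-advanced">
    <h3 class="lister-item-header">
      <a href="/title/example-2/">Example Film</a>
      <span class="lister-item-year text-muted unbold">(2000)</span>
    </h3>
    <span class="runtime">50 min</span>
    <span class="genre">
Drama            </span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_of(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


# One dictionary per film; each selector is scoped to that film's element.
rows = [
    {
        "title": text_of(item, ".lister-item-header a"),
        "genre": text_of(item, ".genre"),
        "year": text_of(item, ".lister-item-year"),
        "runtime": text_of(item, ".runtime"),
    }
    for item in soup.select("#main .lister-item.mode-advanced")
]

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would give the DataFrame the question is after. Scoping each lookup to a single film keeps the columns aligned even if one film lacks a field, since a missing field becomes `None` instead of shifting the list.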