Webscraping an IMDb page using BeautifulSoup: answers

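【Question title】: Webscraping an IMDb page using BeautifulSoup
【Posted】: 2015-05-08 20:07:08
【Question description】:

I'm new to web scraping, Python, and BeautifulSoup, and I'm having trouble getting my code to work.

I want to scrape the URL http://m.imdb.com/feature/bornondate to get the following:

Celebrity name
Celebrity image
Profession
Best work

for the ten celebrities on that page. I'm not sure what I'm doing wrong.

Here is my code: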

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

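Currently nothing gets printed at all. Also, if this is too vague, could you explain how to extract just the celebrity names?

In case it helps, here is the HTML for the first celebrity: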

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>

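【Question comments】:

Tags: python html web-scraping beautifulsoup html-parsing

【Solution 1】:

First of all, the IMDb "Conditions of Use" explicitly prohibit screen scraping:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

Try exploring the IMDb JSON API instead of the web-scraping approach.

Your immediate problem is that the list of people born on a given date is loaded by a separate call to the IMDb API, driven by JavaScript logic. The HTML that urllib2 downloads therefore does not contain the posters yet.

For now, the simplest option is to switch to the selenium browser automation tool. A working example using the headless PhantomJS browser: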

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
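Each line of this output still carries the profession and the best-known work together in one detail string, which is what the commented-out `.split(,)` lines in the question were trying to separate. A minimal sketch of that step, written in Python 3 syntax and assuming the detail text always has the form `Profession, "Title"` as in the output above. The helper names `split_detail` and `full_size_image` are made up for illustration and are not part of any library:

```python
def split_detail(detail):
    """Split text such as 'Actor, "Ozymandias"' into (profession, best_work)."""
    # Split on the first comma only, so titles that contain commas stay intact.
    profession, _, best_work = detail.partition(',')
    return profession.strip(), best_work.strip().strip('"')


def full_size_image(src):
    """Replace everything after '._V1.' with the larger-image suffix used in the answer."""
    return src.split('._V1.')[0] + '._V1_SX214_AL_.jpg'


profession, best_work = split_detail('Actor, "Ozymandias"')
print(profession)
print(best_work)
```

Using `partition` rather than `split` keeps the result at exactly two parts even when the title itself contains a comma.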

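【Comments】:

Thank you. Is there no way to do this in BeautifulSoup?
@PatrickLee Yes, if you prefer, you can pass page_source from selenium to BeautifulSoup once the posters have loaded: soup = BeautifulSoup(driver.page_source).
I'll be sure to try selenium, but unfortunately for this problem I need to use BeautifulSoup alone, since it's the tool I'm supposed to use :(
@PatrickLee Well, in that case you can still use BeautifulSoup for the HTML parsing part, but selenium is the safest and easiest way to get the HTML source exactly as you see it in the browser.
I was able to use BeautifulSoup and similar code to scrape movies from this page: imdb.com/search/…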
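
The BeautifulSoup parsing step suggested in these comments might look roughly like the sketch below (Python 3 syntax). To keep it runnable without a browser, it parses a static copy of the markup from the question; that string is an assumption standing in for driver.page_source, and the closing section tag was added so the fragment is complete. The variable names `people` and `names` are illustrative only:

```python
from bs4 import BeautifulSoup

# Static copy of the markup shown in the question. With selenium, this string
# would instead come from driver.page_source after the posters have loaded.
html = '''
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
</section>
'''

# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')

people = []
for a in soup.select('section.posters a.poster'):
    people.append({
        'name': a.find('span', class_='title').get_text(strip=True),
        'image': a.find('img')['src'],
        'detail': a.find('div', class_='detail').get_text(strip=True),
    })

# Extracting only the names, as asked in the question.
names = [p['name'] for p in people]
print(names)
```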

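【Solution 2】:

I'm working on the same task. The urllib library only loads the static content of a web URL. Use selenium to get the complete HTML, including the dynamic content. If you use the urllib2 library, the resulting HTML will be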

<span class="loading"></span>

Hope this helps.
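
One way to confirm this before parsing is to check whether the downloaded markup contains any poster links at all. A small sketch in Python 3 syntax; `posters_loaded` is a hypothetical helper, and both sample strings are assumptions modelled on the snippets in this thread rather than real responses:

```python
from bs4 import BeautifulSoup


def posters_loaded(html):
    """Return True if the markup contains at least one a.poster link."""
    soup = BeautifulSoup(html, 'html.parser')
    return bool(soup.select('a.poster'))


# What a plain HTTP download is described as returning: only a placeholder.
static_html = '<span class="loading"></span>'

# What the browser shows after JavaScript has run (shortened sample).
rendered_html = (
    '<section class="posters list">'
    '<a href="/name/nm0186505/" class="poster ">'
    '<span class="title">Bryan Cranston</span></a>'
    '</section>'
)

print(posters_loaded(static_html))
print(posters_loaded(rendered_html))
```

A False result on the downloaded page would explain why the loop in the question never prints anything: there are no a tags to iterate over.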

【Comments】: