【Question title】: Webscraping an IMDb page using BeautifulSoup
【Posted】: 2015-05-08 20:07:08
【Question】:

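I'm new to web scraping, Python, and BeautifulSoup, and I'm having a hard time getting my code to work.

I want to scrape the URL http://m.imdb.com/feature/bornondate to get:

  • Celebrity name
  • Celebrity image
  • Profession
  • Best work

for the ten celebrities on that page. I'm not sure what I'm doing wrong.

Here is my code: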

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

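Currently nothing is printed. Also, if this is too vague, could you explain how to get just the celebrity names?

In case it helps, here is the HTML for the first celebrity: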

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>

【Comments】:

    Tags: python html web-scraping beautifulsoup html-parsing


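    【Answer 1】:

    First of all, the IMDb "Conditions of Use" explicitly prohibit screen scraping:

    Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

    Try exploring the IMDb JSON API instead of the web-scraping approach.


    Your current problem is that the list of people born on a specific date is loaded through a separate call to the IMDb API, with JavaScript logic involved, so the HTML that urllib2 downloads does not contain the posters yet.

    The easiest option for now is to switch to the selenium browser automation tool. Here is a working example using the headless PhantomJS browser: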

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.PhantomJS()
    driver.get("http://m.imdb.com/feature/bornondate")
    
    # waiting for posters to load
    wait = WebDriverWait(driver, 10)
    posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))
    
    # extracting the data poster by poster
    for a in posters.find_elements_by_css_selector('a.poster'):
        img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    
        person = a.find_element_by_css_selector('div.detail').text
        title = a.find_element_by_css_selector('span.title').text
    
        print img, person, title
    

    Prints:

    http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
    http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
    http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
    http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
    http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
    http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
    http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
    http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
    http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
    http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
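
Each printed line carries the detail text in the form `Actor, "Ozymandias"`, while the question asked for the profession and the best work as separate values. A minimal sketch of that split follows, assuming Python 3 and assuming the detail text always has the shape "profession, quoted title" seen in the output above. The helper name `split_detail` is hypothetical and is not part of the original answer.

```python
def split_detail(detail):
    """Split text like 'Actor, "Ozymandias"' into (profession, best_work)."""
    # Split on the first comma only, so commas inside the title stay intact.
    profession, _, best_work = detail.partition(",")
    # Drop surrounding whitespace, then the double quotes around the title.
    return profession.strip(), best_work.strip().strip('"')


print(split_detail('Actor, "Ozymandias"'))
```

If a detail string has no comma, the second element comes back as an empty string rather than raising an error.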
    

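    【Comments】:

    • Thank you. Can't this be done in BeautifulSoup?
    • @PatrickLee Yes. If you want, you can pass page_source from selenium to BeautifulSoup once the posters have loaded: soup = BeautifulSoup(driver.page_source)
    • I'll make sure to try selenium, but unfortunately for this problem I need to use BeautifulSoup alone, since it is the tool I'm supposed to use :(
    • @PatrickLee Well, in that case you can still use BeautifulSoup for the HTML-parsing part, but selenium is the safest and easiest way to get the HTML source as you see it in the browser.
    • I was able to scrape movies from this page, imdb.com/search/…, using BeautifulSoup and similar code.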
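
The suggestion in the comments, handing the rendered HTML to BeautifulSoup for the parsing part, could look roughly like the sketch below. It assumes Python 3 with the `bs4` package installed, and it uses the poster markup quoted in the question (with the `style` attribute trimmed) as a stand-in for `driver.page_source`, so it runs without a browser. The function name `parse_posters` and the dictionary keys are illustrative choices, not something from the thread.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source: the poster markup quoted in the question.
rendered_html = '''
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
</section>
'''


def parse_posters(html):
    """Return one dict per <a class="poster"> element in the rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    people = []
    for a in soup.select("section.posters a.poster"):
        people.append({
            "name": a.find("span", class_="title").get_text(strip=True),
            "detail": a.find("div", class_="detail").get_text(strip=True),
            # Same thumbnail-to-larger-image rewrite used in the answer's code.
            "image": a.find("img")["src"].split("._V1.")[0] + "._V1_SX214_AL_.jpg",
        })
    return people


# Printing only the names, as asked at the end of the question.
for person in parse_posters(rendered_html):
    print(person["name"])
```

In real use the `rendered_html` string would be replaced by `driver.page_source`, read after the wait for `section.posters` has succeeded.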
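    【Answer 2】:

    I was working on the same task. The urllib library only loads the static content of a web URL. Use selenium to get the full HTML, including the dynamic content. If you use the urllib2 library, the resulting HTML will be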

    <span class="loading"></span>
    

    Hope this helps.
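
One way to confirm that a downloaded page is still the static placeholder is to check whether it contains any poster links before trying to parse them. The sketch below assumes Python 3 with `bs4` installed and relies on the `a.poster` class shown in the question's HTML. The helper name `posters_loaded` is hypothetical.

```python
from bs4 import BeautifulSoup


def posters_loaded(html):
    """Return True if the HTML already contains at least one poster link."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one("a.poster") is not None


# The placeholder that a plain urllib2 download is reported to return.
static_html = '<span class="loading"></span>'
print(posters_loaded(static_html))
```

A check like this separates "the parsing code is wrong" from "the content never arrived", which is the situation described in the question.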
