【Question title】: Web scraping using Beautiful Soup, scraping multiple elements without class
【Posted】: 2021-01-26 02:19:08
【Question】:

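I want to scrape the directors from the snippet below. Looking at the page, I can see that this movie has two directors, Danny Boyle and Loveleen Tandan. But find_all('a') alone won't work, because it also picks up the names of actors such as Dev Patel and Freida Pinto.

I can't use find_all('a')[1] and find_all('a')[2] either, because other movies might have only one director. The only thing separating the actors from the directors is a span tag with class ghost. Given that there could be one, two, or three directors, how should I scrape this data?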

<p class="">
             Directors:
             <a href="/name/nm0000965/">
              Danny Boyle
             </a>
             ,
             <a href="/name/nm0849164/">
              Loveleen Tandan
             </a>
             <span class="ghost">
              |
             </span>
             Stars:
             <a href="/name/nm2353862/">
              Dev Patel
             </a>
             ,
             <a href="/name/nm2951768/">
              Freida Pinto
             </a>
             ,
             <a href="/name/nm0795661/">
              Saurabh Shukla
             </a>
             ,
             <a href="/name/nm0438463/">
              Anil Kapoor
             </a>
            </p>

The URL of the page is: https://www.imdb.com/search/title/?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc

【Comments】:

  • Why not get the p tag's text and split it on |?
  • Check whether my answer meets your requirements
  • Oh! That didn't cross my mind. Thanks for the help.
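
The first comment's idea (take the paragraph text and split on |) could look roughly like the sketch below. The helper name directors_from_text is made up for illustration, the HTML is a shortened copy of the snippet in the question, and the sketch assumes that no director's name contains a comma or a colon.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<p class="">
    Directors:
    <a href="/name/nm0000965/">Danny Boyle</a>,
    <a href="/name/nm0849164/">Loveleen Tandan</a>
    <span class="ghost">|</span>
    Stars:
    <a href="/name/nm2353862/">Dev Patel</a>,
    <a href="/name/nm2951768/">Freida Pinto</a>
</p>
"""


def directors_from_text(p_tag):
    """Hypothetical helper: return the names that appear before the first '|'."""
    text = p_tag.get_text(" ", strip=True)
    # Everything before the first "|" is the director part of the line.
    director_part = text.split("|", 1)[0]
    # Drop the "Director:" / "Directors:" label, then split the names on commas.
    names = director_part.split(":", 1)[-1]
    return [name.strip() for name in names.split(",") if name.strip()]


soup = BeautifulSoup(html, "html.parser")
print(directors_from_text(soup.find("p")))  # ['Danny Boyle', 'Loveleen Tandan']
```

Because the split happens on plain text, this works the same for one, two, or three directors, but it relies on the "|" and "," separators staying as they are in the markup.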

Tags: python web-scraping beautifulsoup


【Solution 1】:

This should help you:

from bs4 import BeautifulSoup

html = """
<p class="">
             Directors:
             <a href="/name/nm0000965/">
              Danny Boyle
             </a>
             ,
             <a href="/name/nm0849164/">
              Loveleen Tandan
             </a>
             <span class="ghost">
              |
             </span>
             Stars:
             <a href="/name/nm2353862/">
              Dev Patel
             </a>
             ,
             <a href="/name/nm2951768/">
              Freida Pinto
             </a>
             ,
             <a href="/name/nm0795661/">
              Saurabh Shukla
             </a>
             ,
             <a href="/name/nm0438463/">
              Anil Kapoor
             </a>
            </p>
"""  # The HTML provided in the question

soup = BeautifulSoup(html,'html5lib')

p_tag = soup.find('p')

span = p_tag.find('span',class_ = "ghost")

prev = list(span.previous_siblings)  # All nodes before the ghost span, nearest first (reverse document order)

prev = [str(x) for x in prev]

prev = ''.join(prev) #Converts the list to a string

soup2 = BeautifulSoup(prev,'html5lib') #Creates a new BeautifulSoup object with the newly formed string

a_tags = soup2.find_all('a')

for a in a_tags:
    txt = a.text.strip()
    print(txt)

Output (the directors come out in reverse page order):

Loveleen Tandan
Danny Boyle

Hope this helps!
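
A shorter variant of the same idea, as a sketch rather than a replacement for the answer above: find_previous_siblings('a') returns only the a tags before the ghost span, so the second parsing pass is not needed. The helper name directors_before_ghost is hypothetical, and the fallback for a paragraph without a ghost span (treat every link as a director) is an assumption about the page, not something the question confirms.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<p class="">
    Directors:
    <a href="/name/nm0000965/">Danny Boyle</a>,
    <a href="/name/nm0849164/">Loveleen Tandan</a>
    <span class="ghost">|</span>
    Stars:
    <a href="/name/nm2353862/">Dev Patel</a>,
    <a href="/name/nm2951768/">Freida Pinto</a>
</p>
"""


def directors_before_ghost(p_tag):
    """Hypothetical helper: names of the links that precede the ghost span."""
    ghost = p_tag.find("span", class_="ghost")
    if ghost is None:
        # Assumption: with no separator, every link in the paragraph is a director.
        links = p_tag.find_all("a")
    else:
        # find_previous_siblings yields the nearest sibling first,
        # so reverse the result to restore page order.
        links = list(reversed(ghost.find_previous_siblings("a")))
    return [a.get_text(strip=True) for a in links]


soup = BeautifulSoup(html, "html.parser")
print(directors_before_ghost(soup.find("p")))  # ['Danny Boyle', 'Loveleen Tandan']
```

This version uses the built-in html.parser, so it does not need html5lib installed.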

【Comments】:

    【Solution 2】:

    I'll just leave this here:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.imdb.com/search/title/?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    directors_and_stars = soup.find_all(text=lambda t: 'Director' in t)
    
    for d in directors_and_stars:
        movie_name = d.find_previous('h3').a.get_text(strip=True)
        directors = [t.strip() for t in d.find_previous('p').find_all(text=True)[1:] if t.strip() and t.strip() != ',']
        directors = directors[:directors.index('|')]
        print('{:<50} {}'.format(movie_name, directors))
    

    Prints:

    Parazit                                            ['Bong Joon Ho']
    Zelená kniha                                       ['Peter Farrelly']
    The Shape of Water                                 ['Guillermo del Toro']
    Moonlight                                          ['Barry Jenkins']
    Spotlight                                          ['Tom McCarthy']
    Birdman or (The Unexpected Virtue of Ignorance)    ['Alejandro G. Iñárritu']
    12 rokov otrokom                                   ['Steve McQueen']
    Argo                                               ['Ben Affleck']
    The Artist                                         ['Michel Hazanavicius']
    Králova rec                                        ['Tom Hooper']
    Slumdog Millionaire                                ['Danny Boyle', 'Loveleen Tandan']
    Smrt' caká vsade                                   ['Kathryn Bigelow']
    Táto krajina nie je pre starých                    ['Ethan Coen', 'Joel Coen']
    
    ...and so on.
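
One thing to watch in the loop above: directors.index('|') raises ValueError when a paragraph has no '|' entry. A minimal sketch of the same string-filtering idea that tolerates this, using itertools.takewhile, is shown below. The helper name directors_from_strings is made up for illustration, and the HTML is a shortened copy of the snippet in the question rather than the live page.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<p class="">
    Directors:
    <a href="/name/nm0000965/">Danny Boyle</a>,
    <a href="/name/nm0849164/">Loveleen Tandan</a>
    <span class="ghost">|</span>
    Stars:
    <a href="/name/nm2353862/">Dev Patel</a>,
    <a href="/name/nm2951768/">Freida Pinto</a>
</p>
"""


def directors_from_strings(p_tag):
    """Hypothetical helper: collect the strings between the label and the '|'."""
    strings = list(p_tag.stripped_strings)
    # Skip the leading "Director(s):" label, then stop at the "|" separator.
    # If there is no "|", takewhile simply runs to the end of the list.
    before_bar = takewhile(lambda s: s != "|", strings[1:])
    # Drop the comma separators that sit between the links.
    return [s for s in before_bar if s != ","]


soup = BeautifulSoup(html, "html.parser")
print(directors_from_strings(soup.find("p")))  # ['Danny Boyle', 'Loveleen Tandan']
```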
    

    【Comments】:
