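【Question title】: Web scraping using Beautiful Soup, scraping multiple elements without class
【Posted】: 2021-01-26 02:19:08
【Question】:
I want to scrape the directors from the snippet below. Looking at the page, I know this movie has two directors, Danny Boyle and Loveleen Tandan. But find_all('a') alone does not work, because it also picks up the names of actors such as Dev Patel and Freida Pinto.
I can't use find_all('a')[1] and find_all('a')[2] either, because other movies may have only one director. The only thing that separates the actors from the directors is the span tag with class ghost.
Given that there may be one, two, or three directors, how should I scrape this data?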
<p class="">
Directors:
<a href="/name/nm0000965/">
Danny Boyle
</a>
,
<a href="/name/nm0849164/">
Loveleen Tandan
</a>
<span class="ghost">
|
</span>
Stars:
<a href="/name/nm2353862/">
Dev Patel
</a>
,
<a href="/name/nm2951768/">
Freida Pinto
</a>
,
<a href="/name/nm0795661/">
Saurabh Shukla
</a>
,
<a href="/name/nm0438463/">
Anil Kapoor
</a>
</p>
The URL of the page is:
https://www.imdb.com/search/title/?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc
【Comments】:
Tags:
python
web-scraping
beautifulsoup
【Solution 1】:
This should help you:
from bs4 import BeautifulSoup
html = """
<p class="">
Directors:
<a href="/name/nm0000965/">
Danny Boyle
</a>
,
<a href="/name/nm0849164/">
Loveleen Tandan
</a>
<span class="ghost">
|
</span>
Stars:
<a href="/name/nm2353862/">
Dev Patel
</a>
,
<a href="/name/nm2951768/">
Freida Pinto
</a>
,
<a href="/name/nm0795661/">
Saurabh Shukla
</a>
,
<a href="/name/nm0438463/">
Anil Kapoor
</a>
</p>
""" # The HTML snippet from the question
soup = BeautifulSoup(html, 'html5lib') # Needs the html5lib package installed
p_tag = soup.find('p')
span = p_tag.find('span', class_="ghost")
prev = list(span.previous_siblings) # Every node before the ghost span, nearest first (reverse page order)
prev = [str(x) for x in prev]
prev = ''.join(prev) # Join the nodes back into one HTML string
soup2 = BeautifulSoup(prev, 'html5lib') # Parse that string as a new BeautifulSoup object
a_tags = soup2.find_all('a')
for a in a_tags:
    txt = a.text.strip()
    print(txt)
Output (in reverse page order, because previous_siblings walks backwards from the span):
Loveleen Tandan
Danny Boyle
Hope this helps!
【Solution 2】:
I'll just leave this here:
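A possible variation on the approach above, as a minimal sketch rather than a definitive implementation: instead of re-parsing the siblings as a new document, ask the ghost span directly for the `<a>` tags that precede it with `find_previous_siblings('a')`, then reverse the result to restore page order. The function name `extract_directors`, the shortened HTML sample, the use of the built-in `html.parser`, and the fallback for a block with no ghost span are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question (two of the four stars kept).
HTML = """
<p class="">
Directors:
<a href="/name/nm0000965/">Danny Boyle</a>,
<a href="/name/nm0849164/">Loveleen Tandan</a>
<span class="ghost">|</span>
Stars:
<a href="/name/nm2353862/">Dev Patel</a>,
<a href="/name/nm2951768/">Freida Pinto</a>
</p>
"""


def extract_directors(p_tag):
    """Return the director names from one <p> block, in page order."""
    ghost = p_tag.find("span", class_="ghost")
    if ghost is None:
        # Assumption: without a separator, the block lists directors only
        # if its text starts with the "Director"/"Directors" label.
        if not p_tag.get_text().strip().startswith("Director"):
            return []
        links = p_tag.find_all("a")
    else:
        # find_previous_siblings returns the nearest sibling first,
        # so reverse it to get the order shown on the page.
        links = ghost.find_previous_siblings("a")[::-1]
    return [a.get_text(strip=True) for a in links]


soup = BeautifulSoup(HTML, "html.parser")
directors = extract_directors(soup.find("p"))
print(directors)
```

Because the function only looks at what comes before the ghost span, it should behave the same way whether a film lists one, two, or three directors.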
import requests
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/search/title/?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
directors_and_stars = soup.find_all(text=lambda t: 'Director' in t)
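for d in directors_and_stars:
    # The movie title is the link inside the nearest preceding <h3>
    movie_name = d.find_previous('h3').a.get_text(strip=True)
    # Non-empty text nodes of the enclosing <p>, minus the first node (the label) and the commas
    directors = [t.strip() for t in d.find_previous('p').find_all(text=True)[1:] if t.strip() and t.strip() != ',']
    # Keep only the names before the '|' separator (the text of the ghost span)
    directors = directors[:directors.index('|')]
    print('{:<50} {}'.format(movie_name, directors))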
Prints:
Parazit ['Bong Joon Ho']
Zelená kniha ['Peter Farrelly']
The Shape of Water ['Guillermo del Toro']
Moonlight ['Barry Jenkins']
Spotlight ['Tom McCarthy']
Birdman or (The Unexpected Virtue of Ignorance) ['Alejandro G. Iñárritu']
12 rokov otrokom ['Steve McQueen']
Argo ['Ben Affleck']
The Artist ['Michel Hazanavicius']
Králova rec ['Tom Hooper']
Slumdog Millionaire ['Danny Boyle', 'Loveleen Tandan']
Smrt' caká vsade ['Kathryn Bigelow']
Táto krajina nie je pre starých ['Ethan Coen', 'Joel Coen']
...and so on.
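One thing to watch in the loop above is that `directors.index('|')` raises `ValueError` for any block that has no `|` separator. The following is a minimal offline sketch of the same text-node idea with that case guarded. The `LISTING` markup is a hypothetical, simplified stand-in for the real page (the `div` wrapper, the `#` placeholder links, and the second block without a separator are invented for the example), and `directors_by_title` is an illustrative name, not part of either answer.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified listing markup: an <h3> title followed by a <p>
# with the credits. The second block deliberately has no '|' separator.
LISTING = """
<div>
  <h3><a href="#">Slumdog Millionaire</a></h3>
  <p class="">Directors: <a href="/name/nm0000965/">Danny Boyle</a>, <a href="/name/nm0849164/">Loveleen Tandan</a> <span class="ghost">|</span> Stars: <a href="/name/nm2353862/">Dev Patel</a>, <a href="/name/nm2951768/">Freida Pinto</a></p>
</div>
<div>
  <h3><a href="#">Moonlight</a></h3>
  <p class="">Director: <a href="#">Barry Jenkins</a></p>
</div>
"""


def directors_by_title(soup):
    """Map each movie title to its list of directors."""
    results = {}
    for p in soup.find_all("p"):
        # stripped_strings yields every non-blank text node, already stripped.
        tokens = [t for t in p.stripped_strings if t != ","]
        if not tokens or not tokens[0].startswith("Director"):
            continue  # not a credits block that starts with directors
        names = tokens[1:]
        if "|" in names:
            # Cut at the separator only when it is actually present.
            names = names[: names.index("|")]
        title = p.find_previous("h3").a.get_text(strip=True)
        results[title] = names
    return results


soup = BeautifulSoup(LISTING, "html.parser")
credits = directors_by_title(soup)
for title, names in credits.items():
    print("{:<50} {}".format(title, names))
```

Returning a dictionary rather than printing inside the loop keeps the parsing testable on saved HTML before pointing it at the live page.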