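To get the associated images, you need to grab the posterColumn cells. From each one you can extract the img src entry and fetch the jpg image. The file can then be saved under the movie's title, taking care to strip any characters that are not valid in a file name, for example: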
from lxml.html import parse
import requests
import string

# Characters allowed in the saved file names
valid_chars = "-_.() " + string.ascii_letters + string.digits

tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')

for p, m in zip(posters, movies):
    # iterlinks() yields (element, attribute, link, pos) for each link in the poster cell
    for element, attribute, link, pos in p.iterlinks():
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            # Keep only whitelisted characters so the title is a valid file name
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)
            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                for chunk in poster_jpg:
                    f_jpg.write(chunk)
So at the moment you would see output like the following:
The Shawshank Redemption https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
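To make the file name clean-up step easier to see in isolation, here is a minimal sketch of it pulled out into a helper. The function name `safe_filename` and its `extension` and `fallback` parameters are my own additions for illustration, not part of the snippet above; the whitelist is the same one used there.

```python
import string

# Same whitelist as in the snippet above.
VALID_CHARS = "-_.() " + string.ascii_letters + string.digits


def safe_filename(title, extension=".jpg", fallback="untitled"):
    """Reduce `title` to whitelisted characters and append an extension.

    `fallback` is a hypothetical placeholder used when nothing survives
    the filtering; the original snippet has no such safeguard.
    """
    cleaned = "".join(c for c in title if c in VALID_CHARS).strip()
    # A title made up only of disallowed characters would otherwise
    # give a bare ".jpg", so substitute the placeholder name.
    return (cleaned or fallback) + extension


print(safe_filename("The Godfather: Part II"))  # The Godfather Part II.jpg
print(safe_filename("???"))                     # untitled.jpg
```

Note that a whitelist like this also drops accented and other non-ASCII letters, so two different titles could in principle end up with the same file name; if that matters, you may want to check whether the file already exists before writing it.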