【Posted】:2019-11-01 01:00:51
【Question】:
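I have a web scraper that uses bs4 to scrape a page containing many sections of information. Because many of the sections reuse the same div class, the page is hard to scrape. I'm trying to find a way to make it start searching the lxml-parsed HTML only after a specific phrase. Is there a way to do this?
Below is a small example of what I'm working with; I'm trying to get something like table_soup to start after a specific phrase.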
```
import csv
import re

import requests
from bs4 import BeautifulSoup

# Make the GET request
r = requests.get('https://m.the-numbers.com/movie/Black-Panther')
# Create the BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')
# Locate the table in the BS object
table_soup = soup.find('div', class_='row').find('div', class_='table-responsive').find('table', id='movie_finances')
website = 'https://m.the-numbers.com/'
# Not defined in the original snippet; added so the loop below can run
details = []
# Iterate through all trs in the table except the first (header) and the last two (summary) rows
for tr in table_soup.find_all('tr')[1:6]:
    tds = tr.find_all('td')
    title = tds[0].text.strip()
    # Make sure that home market performance doesn't check the second one
    if title != 'Home Market Performance':
        details.append({
            'title': title,
            'amount': tds[1].text.strip(),
        })
summary_soup = soup.find('div', id='summary').find('div', class_='table-responsive').find('table', class_='table table-sm')
summaryList = []
for tr in summary_soup.find_all('tr')[1:4]:
    tdmd = tr.find_all('td')
    summaryList.append({
        'unit': tdmd[1].text.strip(),
    })
```
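To make the goal concrete, here is a minimal sketch of what "start after a specific phrase" could look like. It assumes the phrase appears as a text node somewhere before the table of interest. The HTML, the phrase, and the helper name table_after_phrase are all made up for illustration and are not taken from the real page. The sketch finds the text node with find(string=...) and then calls find_next('table') on it, so tables earlier in the document are skipped even though they share the same class. It uses html.parser only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in HTML: two tables share the same class, as on the real page.
HTML = """
<div class="row">
  <h2>Domestic</h2>
  <table class="table table-sm"><tr><td>Box Office</td><td>$100</td></tr></table>
  <h2>Home Market Performance</h2>
  <table class="table table-sm"><tr><td>DVD Sales</td><td>$20</td></tr></table>
</div>
"""


def table_after_phrase(soup, phrase):
    """Return the first <table> after the text node containing `phrase`, or None.

    Hypothetical helper for illustration; not part of bs4.
    """
    # With only string= given, find() walks text nodes and passes each one to the callable.
    marker = soup.find(string=lambda s: s is not None and phrase in s)
    if marker is None:
        return None
    # find_next() searches forward in document order from the marker.
    return marker.find_next('table')


soup = BeautifulSoup(HTML, 'html.parser')
table = table_after_phrase(soup, 'Home Market Performance')
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')]
print(rows)
```

Whether this works on the actual page depends on the phrase really being present in the served HTML, which has not been checked here.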
【Discussion】:
Tags: python html web-scraping beautifulsoup