【Posted】: 2014-04-22 07:35:22
【Question】:
I'm not sure whether mechanize is the reason I'm not grabbing the whole table.
This works:
from bs4 import BeautifulSoup
import requests

page = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
r = requests.get(page)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text)
# The data table sits in the second div inside div.mainRight
div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)
# Walk only the table's own rows and cells, not those of nested tables
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.split())
But this does not:
import mechanize
from bs4 import BeautifulSoup

URL = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
control_year = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']
control_month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

br = mechanize.Browser()
r = br.open(URL)
# Select the form named "exl" and set its month and year controls
br.select_form("exl")
control_m = br.form.find_control('month')
control_y = br.form.find_control('year')
br[control_m.name] = ['06']
br[control_y.name] = ['2012']
response = br.submit()

soup = BeautifulSoup(response, 'html.parser')
#div = soup.find('div', class_='mainRight')
div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.strip())
The mechanize version produces only the following, even though in Firebug I can see all the tr and td elements:
Jun 2012
% change vs Jun 2011
% change vs May 2012
Cumulative Jun 2012
% cumulative change
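【Comments】:
- A tbody element may have been added inside the table automatically. Try looping over all the tbody elements inside the table before looking for tr.
- @Wolph. I tried table.find_all('tbody'), but it returns [].
- I believe it may be related to the html.parser you are using; see my answer for a working version.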
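The tbody and parser points raised in the comments can be checked offline. The following is only a minimal sketch: the HTML string is a made-up stand-in for the real page, and table_rows and summarize are hypothetical helper names, not part of BeautifulSoup. It reports how many tbody elements each installed parser produces and collects the rows in a way that works whether or not a tbody wrapper exists.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in markup, NOT the real Air China page:
# a table whose source contains no <tbody>.
HTML = """
<div class="mainRight">
  <table>
    <tr><td>Jun 2012</td><td>% change vs Jun 2011</td></tr>
    <tr><td>1</td><td>2</td></tr>
  </table>
</div>
"""


def table_rows(table):
    """Return the table's own rows whether or not a <tbody> wrapper exists."""
    rows = []
    # Rows sit either directly under <table> or one level down in a <tbody>.
    parents = [table] + list(table.find_all('tbody', recursive=False))
    for parent in parents:
        rows.extend(parent.find_all('tr', recursive=False))
    return rows


def summarize(parser):
    """Parse HTML with the given parser; return (tbody count, cell texts)."""
    soup = BeautifulSoup(HTML, parser)
    table = soup.find('div', class_='mainRight').find('table')
    cells = [[td.get_text(strip=True)
              for td in tr.find_all('td', recursive=False)]
             for tr in table_rows(table)]
    return len(table.find_all('tbody')), cells


# html.parser ships with Python; the other two are optional installs.
for name in ('html.parser', 'html5lib', 'lxml'):
    try:
        print(name, summarize(name))
    except FeatureNotFound:
        print(name, 'is not installed, skipped')
```

With html.parser the tbody count is 0, because that parser builds the tree from the source as written. Firebug shows the browser's DOM, where tbody is inserted during parsing, and html5lib (if installed) follows the same browser rules. That difference would explain why table.find_all('tbody') returned [] even though tbody was visible in Firebug. Whether it also explains the missing rows on the real page depends on that page's actual markup, which this sketch does not reproduce.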
Tags: python beautifulsoup mechanize