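【Posted】: 2015-04-16 07:02:03
【Question】:
I came across this excellent post on scraping with Beautiful Soup, and decided to try it out by scraping some data from the internet myself.
I am using flight data from Flight Radar 24, and following the approach described in the blog post to crawl the pages automatically and collect the flight data.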
import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    # Collect the relative link to each flight's page from the airline index page.
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    try:
        table = soup.find('table')
        rows = table.find_all('tr')
        for row in rows:
            flight_data = {}
            flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
            flight_data['tr'] = row  # error here
            print(flight_data)
    except AttributeError as e:
        raise ValueError("No valid table found")
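An example of a flight data page.
I got as far as the table, and then realized I don't know how to traverse the table rows to get the data embedded in each column.
Does any kind soul have a clue, or even an introductory tutorial, so that I can read up on how to extract the data?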
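To make the question concrete, here is a minimal sketch of the kind of row-and-column traversal I am after. It runs against a small inline table rather than the live site; `SAMPLE_HTML`, its column names, and the helper `table_to_dicts` are made up for illustration, and the real page markup may well differ.

```python
import bs4

# Hypothetical markup standing in for a flight page; the real page may differ.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Date</th><th>From</th><th>To</th></tr></thead>
  <tbody>
    <tr><td>2015-04-16</td><td>Singapore (SIN)</td><td>Bangkok (BKK)</td></tr>
    <tr><td>2015-04-15</td><td>Bangkok (BKK)</td><td>Singapore (SIN)</td></tr>
  </tbody>
</table>
"""


def table_to_dicts(html):
    """Return one dict per body row, keyed by the header text (illustrative helper)."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        raise ValueError("No valid table found")
    # Header cells give the column names.
    heads = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        # Walk across the row: one <td> per column.
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        rows.append(dict(zip(heads, cells)))
    return rows


rows = table_to_dicts(SAMPLE_HTML)
for row in rows:
    print(row)
```

Pairing header text with cell text via `zip` keeps the column names out of the code, which seems useful when the exact columns are not known in advance.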
P.S.: Thanks to Miguel Grinberg for the excellent tutorial.
Added:
    try:
        table = soup.find('table')
        rows = table.find_all('tr')
        heads = [i.text.strip() for i in table.select('thead th')]
        for tr in table.select('tbody tr'):
            flight_data = {}
            flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
            flight_data['From'] = tr.select('td.From')
            flight_data['To'] = tr.select('td.To')
            print(flight_data)
    except AttributeError as e:
        raise ValueError("No valid table found")
I changed the last part of the code to build a data object, but I can't seem to get the data out.
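One thing worth checking here: `select` returns a list of matching tags, not text, and the list is empty when nothing matches. The sketch below shows the difference on a single made-up row. It assumes the cells carry `From` and `To` classes, which is only a guess about the real page, and `cell_text` is a hypothetical helper.

```python
import bs4

# Hypothetical row; the class names are assumptions, not taken from the real page.
SAMPLE_ROW = (
    '<table><tbody><tr>'
    '<td class="From">Singapore</td><td class="To">Bangkok</td>'
    '</tr></tbody></table>'
)

soup = bs4.BeautifulSoup(SAMPLE_ROW, 'html.parser')
tr = soup.select_one('tbody tr')

as_list = tr.select('td.From')      # a list of Tag objects, possibly empty
missing = tr.select_one('td.Via')   # select_one gives None when nothing matches


def cell_text(row, css):
    """Return the stripped text of the first match, or None (illustrative helper)."""
    cell = row.select_one(css)
    return cell.get_text(strip=True) if cell is not None else None


flight_data = {
    'From': cell_text(tr, 'td.From'),
    'To': cell_text(tr, 'td.To'),
}
print(flight_data)
```

If the selectors match nothing on the real page, the values come back as `None`, which would suggest the data lives somewhere other than classed cells.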
Final edit:
import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    # Collect the relative link to each flight's page from the airline index page.
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    try:
        table = soup.find('table')
        # Loop over body rows only; the values are read from data-* attributes on each <tr>.
        for tr in table.select('tbody tr'):
            flight_data = {}
            flight_data['flight_number'] = tr['data-flight-number']
            flight_data['from'] = tr['data-name-from']
            print(flight_data)
    except AttributeError as e:
        raise ValueError("No valid table found")
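A small caveat on the code above: `tr['data-flight-number']` raises `KeyError` when a row lacks that attribute, and the `except AttributeError` clause does not catch it. A slightly more defensive sketch uses `tr.get(...)`, which returns `None` instead. The sample markup, the flight number `TR0001`, and the helper `extract_flights` are all made up for illustration.

```python
import bs4

# Hypothetical markup; only the attribute names come from the code above.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Flight</th><th>From</th></tr></thead>
  <tbody>
    <tr data-flight-number="TR0001" data-name-from="Singapore"><td>details</td></tr>
    <tr><td>row without data attributes</td></tr>
  </tbody>
</table>
"""


def extract_flights(html):
    """Collect flight dicts from rows that carry the data-* attributes (illustrative helper)."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    flights = []
    for tr in soup.select('table tbody tr'):
        # Tag.get returns None for a missing attribute instead of raising KeyError.
        number = tr.get('data-flight-number')
        if number is None:
            continue  # skip rows that do not describe a flight
        flights.append({
            'flight_number': number,
            'from': tr.get('data-name-from'),
        })
    return flights


flights = extract_flights(SAMPLE_HTML)
print(flights)
```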
P.P.S.: Thanks to @amow for the great help :D
【Comments】:
- "I don't know how to transverse down the table attributes to get the data that was embedded in each column." Please post the source code here.
Tags: python beautifulsoup