[Posted]: 2017-09-20 04:27:59
[Question]:
I'm trying to scrape some schedules from ESPN: http://www.espn.com/nba/schedule/_/date/20171001
import requests
import bs4

response = requests.get('http://www.espn.com/nba/schedule/_/date/20171001')
soup = bs4.BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

# Collect the non-empty cell text of every row in every table on the page.
table = soup.find_all('table')
data = []
for i in table:
    rows = i.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [col.text.strip() for col in cols]
        data.append([col for col in cols if col])
My code works, except that the date information is missing from the output:
[
"Phoenix PHX",
"Utah UTAH",
"394 tickets available from $6"
],
[],
[
"Miami MIA",
"Orlando ORL",
"1,582 tickets available from $12"
]
After some digging, I realized that the date and time information is stored inside the tag itself, like this:
<td data-behavior="date_time" data-date="2017-10-07T23:00Z"><a data-dateformat="time1" href="/nba/game?gameId=400978807" name="&lpos=nba:schedule:time"></a></td>
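I've seen this pattern on other sites from time to time, but never really understood why they do it. How do I extract the value from inside the opening tag so that "2017-10-07T23:00Z" shows up in my output?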
[Discussion]:
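The value in question is an HTML attribute (data-date) on the td element, not text, so col.text comes back empty for that cell. The anchor inside it is empty in the raw HTML, presumably because the page's JavaScript fills in a localized time in the browser, and requests does not run JavaScript. Below is a minimal sketch of one way to read the attribute with BeautifulSoup's Tag.get. The sample markup is modeled on the cell quoted in the question (the team cells are simplified placeholders, not ESPN's real markup), and cell_value and extract_rows are hypothetical helper names introduced here for illustration. It uses the built-in 'html.parser' so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the cell quoted in the question. The team cells
# are simplified placeholders (an assumption), not ESPN's actual markup.
html = """
<table>
  <tr>
    <td>Phoenix PHX</td>
    <td>Utah UTAH</td>
    <td data-behavior="date_time" data-date="2017-10-07T23:00Z"><a data-dateformat="time1" href="/nba/game?gameId=400978807"></a></td>
  </tr>
</table>
"""


def cell_value(td):
    """Return the cell's data-date attribute if present, else its visible text."""
    # Tag.get returns None when the attribute is missing, so we fall back to text.
    return td.get('data-date') or td.text.strip()


def extract_rows(markup):
    """Build one list per table row, skipping cells that end up empty."""
    soup = BeautifulSoup(markup, 'html.parser')
    data = []
    for row in soup.find_all('tr'):
        cols = [cell_value(td) for td in row.find_all('td')]
        data.append([col for col in cols if col])
    return data


rows = extract_rows(html)
print(rows)  # [['Phoenix PHX', 'Utah UTAH', '2017-10-07T23:00Z']]
```

In the original loop, the same idea amounts to replacing col.text.strip() with a call that checks col.get('data-date') first. Indexing with col['data-date'] also works, but it raises KeyError on cells that lack the attribute, which is why the sketch uses get.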
Tags: python-2.7 python-3.x web-scraping beautifulsoup