【Posted】:2013-12-23 00:10:00
【Question】:
I have an HTML table that I need to parse into a CSV file.
import urllib2, datetime
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
from BeautifulSoup import BeautifulSoup
print("dates,location,name,url")
def genqry(arga,argb,argc,argd):
    return arga + "," + argb + "," + argc + "," + argd
part = 1
row = 1
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
table = soup.find('table', attrs={'class': 'catalog-listing'})
rows = table.findAll('tr')
for tr in rows:
    try:
        if row != 1:
            cols = tr.findAll('td')
            for td in cols:
                if part == 1:
                    keep = 0
                    dates = td.find(text=True)
                    part = 2
                if part == 2:
                    location = td.find(text=True)
                    part = 2
                if part == 3:
                    name = td.find(text=True)
            for a in tr.findAll('a', href=True):
                url = a['href']
            # Compare Dates
            if len(dates) < 6:
                newdate = datetime.datetime.strptime(dates, "%m/%d/%y")
                if newdate > olddate:
                    keep = 1
                else:
                    keep = 0
            else:
                newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y")
                if newdate > olddate:
                    keep = 1
                else:
                    keep = 0
            if keep == 1:
                qry = genqry(dates, location, name, url)
                print(qry)
            row = row + 1
            part = 1
        else:
            row = row + 1
    except (RuntimeError, TypeError, NameError):
        print("Error: " + name)
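I need to get every VEX event in that table dated after 5/01/13. So far, this code gives me a date-related error that I can't seem to fix. Maybe someone better at this than me can fix the code? Thanks in advance, Smith.
Edit #1: The error I get is: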
Value Error: '\n10/5/13' does not match format '%m/%d/%y'
I think I need to strip the newline from the start of the string first.
Edit #2: I got it to run, but it produces no output. Any help?
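For reference, here is a minimal sketch of the date handling and CSV output. It is not the poster's code. It assumes the rows have already been pulled out of the table as four strings each (dates, location, name, url), that dates are written m/d/yy with a two-digit year, and that when a cell holds a range only its first date matters. The names `first_date`, `filter_events`, `to_csv` and the sample rows are hypothetical, and it targets Python 3 rather than the Python 2 (`urllib2`) used above.

```python
import csv
import io
import re
from datetime import datetime

CUTOFF = datetime.strptime("5/01/13", "%m/%d/%y")
# Assumption: dates appear as m/d/yy with a two-digit year.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}\b")


def first_date(cell_text):
    """Return the first m/d/yy date found in a cell, or None if there is none."""
    match = DATE_RE.search(cell_text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%m/%d/%y")


def filter_events(rows, cutoff=CUTOFF):
    """Yield cleaned [dates, location, name, url] rows dated after cutoff."""
    for dates, location, name, url in rows:
        start = first_date(dates)
        if start is not None and start > cutoff:
            yield [dates.strip(), location.strip(), name.strip(), url]


def to_csv(rows):
    """Serialize rows with csv.writer so commas inside fields get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["dates", "location", "name", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows standing in for the scraped table cells.
sample = [
    ["\n10/5/13", " Austin, TX ", "Fall Qualifier", "/event/1"],
    ["\n4/20/13 - 4/21/13", "Reno, NV", "Spring Open", "/event/2"],
]
print(to_csv(filter_events(sample)))
```

Searching for the date with a regular expression sidesteps both the leading newline and the fixed-width slice: `dates[:6]` would cut `10/5/13` (seven characters once stripped) short. Also note that in the loop as posted, `part` is set to 2 in both of the first two branches, so the `part == 3` branch never runs and `name` is never assigned.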
【Comments】:
-
You don't have to use Beautiful Soup for this. You can use the Python 3 HTMLParser: github.com/schmijos/html-table-parser-python3/blob/master/…
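As a rough illustration of the standard-library route, the sketch below subclasses `html.parser.HTMLParser` to collect cell text row by row. It is not the code from the linked project. The class name `TableRows` and the sample markup are hypothetical, and the real page's structure may differ. It ignores nested tables and does not capture link `href` values.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that falls inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # strip() removes stray newlines such as the one in '\n10/5/13'.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup; the real page's structure may differ.
html = """
<table class="catalog-listing">
  <tr><th>Date</th><th>Location</th><th>Event</th></tr>
  <tr><td>
10/5/13</td><td>Austin, TX</td><td><a href="/event/1">Fall Qualifier</a></td></tr>
</table>
"""
parser = TableRows()
parser.feed(html)
print(parser.rows)
```

Each row comes back as a list of already-stripped strings, so it can be passed straight to date parsing or to `csv.writer`.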
Tags: python html-parsing beautifulsoup