【Question title】: Parse the HTML Table
【Posted】: 2013-12-23 00:10:00
【Question description】:

I have an HTML table that I need to parse into a CSV file.

import urllib2, datetime
olddate = datetime.datetime.strptime('5/01/13', "%m/%d/%y")
from BeautifulSoup import BeautifulSoup
print("dates,location,name,url")
def genqry(arga,argb,argc,argd):
    return arga + "," + argb + "," + argc + "," + argd
part = 1
row = 1
contenturl = "http://www.robotevents.com/robot-competitions/vex-robotics-competition"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
table = soup.find('table', attrs={'class': 'catalog-listing'})
rows = table.findAll('tr')
for tr in rows:
    try:
        if row != 1:
            cols = tr.findAll('td')
            for td in cols:
                if part == 1:
                    keep = 0
                    dates = td.find(text=True)
                    part = 2
                if part == 2:
                    location = td.find(text=True)
                    part = 2
                if part == 3:
                    name = td.find(text=True)
                    for a in tr.findAll('a', href=True):
                        url = a['href']
                # Compare Dates
                if len(dates) < 6:
                    newdate = datetime.datetime.strptime(dates, "%m/%d/%y")
                    if newdate > olddate:
                        keep = 1
                    else:
                        keep = 0
                else:
                    newdate = datetime.datetime.strptime(dates[:6], "%m/%d/%y")
                    if newdate > olddate:
                        keep = 1
                    else:
                        keep = 0
                if keep == 1:
                    qry = genqry(dates, location, name, url)
                    print(qry)
                row = row + 1
                part = 1
        else:
            row = row + 1
    except (RuntimeError, TypeError, NameError):
        print("Error: " + name)

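I need to get every VEX event in that table that falls after 5/01/13. So far this code gives me an error about the date that I can't seem to fix. Maybe someone better at this than me can fix the code? Thanks in advance, Smith.

Edit #1: The error I get is:

Value Error: '\n10/5/13' does not match format '%m/%d/%y'

I think I need to strip the newline from the start of the string first.

Edit #2: I got it to run, but it produces no output. Any help?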

【Question discussion】:

Tags: python html-parsing beautifulsoup


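【Solution 1】:

Your question is poorly posed. Without knowing the exact error, my guess is that the problem lies in your if len(dates) < 6: block. A single date such as '11/9/13' is already 7 characters long, so the else branch always runs, and dates[:6] cuts off the last digit of the year. Consider the following: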

>>> date = '10/5/13 - 12/14/13'
>>> len(date)
18
>>> date = '11/9/13'
>>> len(date)
7
>>> date[:6]
'11/9/1'
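
One way around the fixed-width slice is to split on the hyphen and parse whatever comes before it. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used in the question; the helper name first_date is hypothetical and not part of the original code.

```python
from datetime import datetime

def first_date(text):
    """Return the start date of a cell such as '10/5/13 - 12/14/13'."""
    # Take everything before the first '-' so the width of the date
    # (7 or 8 characters, depending on the day and month) does not matter.
    start = text.split("-")[0].strip()
    return datetime.strptime(start, "%m/%d/%y")

print(first_date("11/9/13"))
print(first_date("10/5/13 - 12/14/13"))
```

Because the dates use slashes, splitting on the hyphen only ever separates the two ends of a range, and a cell holding a single date passes through unchanged.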

One suggestion to make your code more Pythonic: instead of using row = row + 1, use enumerate.

Update: tracing through your code, I get the following value for dates:
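
As a rough illustration of the enumerate suggestion, here is a small sketch in Python 3. The list rows is stand-in data, assumed here only so the example runs without fetching the page; in the question it would be the result of table.findAll('tr').

```python
rows = ["header", "event A", "event B"]  # stand-in for table.findAll('tr')

kept = []
# enumerate yields (index, item) pairs, so no manual counter is needed.
for row_number, tr in enumerate(rows, start=1):
    if row_number == 1:
        continue  # skip the header row, as `if row != 1` did
    kept.append((row_number, tr))

print(kept)
```

If the row number is only needed to skip the header, iterating over rows[1:] would be another option.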

>>> dates
u'\n10/5/13 - 12/14/13            \xa0\n        '
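
Given a value like that, the surrounding whitespace has to go before strptime can match the format. Below is a hedged sketch in Python 3 of one way to do it; the names CUTOFF and parse_start_date are made up for this example, and it assumes the start date is always the first whitespace-separated token in the cell.

```python
from datetime import datetime

# Same cutoff as olddate in the question.
CUTOFF = datetime.strptime("5/01/13", "%m/%d/%y")

def parse_start_date(raw):
    """Parse the first date found in a raw table cell."""
    # split() with no arguments ignores leading whitespace (including the
    # newline) and breaks on whitespace runs, so the first token is the
    # start date whether or not an end date follows it.
    first_token = raw.split()[0]
    return datetime.strptime(first_token, "%m/%d/%y")

raw = "\n10/5/13 - 12/14/13            \xa0\n        "
start = parse_start_date(raw)
print(start, start > CUTOFF)
```

Comparing the parsed value against the cutoff then replaces the keep flag from the question: a row is written out only when start > CUTOFF.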

【Discussion】:
