【Question Title】: Python: parse HTML and produce a tabular text file
【Posted】: 2017-10-10 08:03:56
【Question】:

Problem: I want to parse some HTML and produce a tabular text file, for example:

East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...

What I get: only East Counties appears in the txt file, so the for loop fails to print each new region. My attempted code comes after the HTML.

HTML code: the source can be found on this html page; the excerpt for the table above is:

<h2>
                                    East Counties</h2>

                                        <table>
                                            <thead>
                                                <tr>
                                                    <th>
                                                        <span id="listRegions_lvFiles_0_titleLAName_0">Local authority</span>
                                                    </th>
                                                    <th>
                                                        <span id="listRegions_lvFiles_0_titleUpdate_0">Last update</span>
                                                    </th>
                                                    <th>
                                                        <span id="listRegions_lvFiles_0_titleEstablishments_0">Number of businesses</span>
                                                    </th>
                                                    <th>
                                                        <span id="listRegions_lvFiles_0_titleCulture_0">Download</span>
                                                    </th>
                                                </tr>
                                            </thead>

                                        <tr>
                                            <td>
                                                <span id="listRegions_lvFiles_0_laNameLabel_0">Babergh</span>
                                            </td>
                                            <td>
                                                <span id="listRegions_lvFiles_0_updatedLabel_0">04/05/2017 </span>
                                                at
                                                <span id="listRegions_lvFiles_0_updatedTime_0"> 12:00</span>
                                            </td>
                                            <td>
                                                <span id="listRegions_lvFiles_0_establishmentsLabel_0">876</span>
                                            </td>
                                            <td>
                                                <a id="listRegions_lvFiles_0_fileURLLabel_0" title="Babergh: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a>
                                            </td>
                                        </tr>

                                        <tr>
                                            <td>
                                                <span id="listRegions_lvFiles_0_laNameLabel_1">Basildon</span>
                                            </td>
                                            <td>
                                                <span id="listRegions_lvFiles_0_updatedLabel_1">06/05/2017 </span>
                                                at
                                                <span id="listRegions_lvFiles_0_updatedTime_1"> 12:00</span>
                                            </td>
                                            <td>
                                                <span id="listRegions_lvFiles_0_establishmentsLabel_1">1,134</span>
                                            </td>
                                            <td>
                                                <a id="listRegions_lvFiles_0_fileURLLabel_1" title="Basildon: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a>
                                            </td>
                                        </tr>

My attempt:

from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup

url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')

regions=[]
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]: #Skip 6 h2 lines 
        region=h2.text.strip() #Get the text of each h2 without the white spaces
        regions.append(str(region))
        f.write(region+'\n')
        for tr in soup.find_all('tr')[1:]: # Skip headers
            tds = tr.find_all('td')
            if len(tds)==0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                              (str(tds[0].text)[1:-1], link, places)+'\n')

How can I fix this?

【Question Discussion】:

Tags: python html beautifulsoup html-parsing text-files


【Solution 1】:

I'm not familiar with the Beautiful Soup library, but judging from the code inside each h2 loop, you are iterating over all the tr elements in the document. You should only iterate over the rows of the table that belongs to that particular h2 element.

Edited: after a quick look at the Beautiful Soup docs, it seems you can use .next_sibling, since an h2 is always followed by a table, i.e. table = h2.next_sibling.next_sibling (called twice because the first sibling is a string containing whitespace). You can then iterate over all the rows of that table.

The reason you get Wales duplicated is that the duplicate actually exists in the source.
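The fix above can be sketched as follows. This is a minimal sketch, assuming the layout shown in the question (each h2 immediately followed by its table, with the authority name in the first data cell, the business count in the third, and the XML link in the fourth); it uses find_next_sibling('table'), which makes the same hop as the two .next_sibling calls but reads more clearly, and parse_regions is a hypothetical helper name, not part of the original code:

```python
from bs4 import BeautifulSoup

def parse_regions(html):
    """Return [(region, [(authority, url, businesses), ...]), ...]."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for h2 in soup.find_all('h2'):
        # find_next_sibling('table') skips the whitespace text node
        # and lands on the table belonging to this region heading.
        table = h2.find_next_sibling('table')
        if table is None:
            continue  # an h2 with no table after it (e.g. a page heading)
        rows = []
        for tr in table.find_all('tr'):  # only this table's rows
            tds = tr.find_all('td')
            if not tds:
                continue  # the header row holds <th>, not <td>
            name = tds[0].get_text(strip=True)
            a = tr.find('a')
            link = a['href'] if a else ''
            # Third cell is the business count, e.g. "1,134"
            businesses = int(tds[2].get_text(strip=True).replace(',', ''))
            rows.append((name, link, businesses))
        results.append((h2.get_text(strip=True), rows))
    return results
```

Writing the text file then reduces to looping over the returned pairs and printing the region name followed by one comma-separated line per row.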

【Discussion】:

• Did you nest the search for h2 inside the search for table?
• That broke the deadlock. Thanks a lot!