How to extract contents from an HTML file in Python (answers)

【Question title】: Python how to extract contents from html file
【Posted】: 2016-09-09 14:47:45
【Question description】:

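I have a test report file in HTML format generated by Nose. I would like to extract some parts of its text in Python. I will be sending this text in the message body of an email.

I have the following sample: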

    <!DOCTYPE html>
<html>
<head>
    <title>Unit Test Report</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

<style>
body {
    font-family: Calibri, "Trebuchet MS", sans-serif;
}
* {
    word-break: break-all;
}
table, td, th, .dataid {
    border: 1px solid #aaa;
    border-collapse: collapse;
    background: #fff;
}
section {
    background: rgba(0, 0, 0, 0.05);
    margin: 2ex;
    padding: 1ex;
    border: 1px solid #999;
    border-radius: 5px;
}
h1 {
    font-size: 130%;
}
h2 {
    font-size: 120%;
}
h3 {
    font-size: 100%;
}
h4 {
    font-size: 85%;
}
h1, h2, h3, h4, a[href] {
    cursor: pointer;
    color: #0074d9;
    text-decoration: none;
}
h3 strong, a.failed {
    color: #ff4136;
}
.failed {
    color: #ff4136;
}
a.success {
    color: #3d9970;
}
pre {
    font-family: 'Consolas', 'Deja Vu Sans Mono',
                 'Bitstream Vera Sans Mono', 'Monaco',
                 'Courier New', monospace;
}

.test-details,
.traceback {
    display: none;
}
section:target .test-details {
    display: block;
}

</style>
</head>
<body>
    <h1>Overview</h1>
    <section>
        <table>
            <tr>
                <th>Class</th>
                <th class="failed">Fail</th>
                <th class="failed">Error</th>
                <th>Skip</th>
                <th>Success</th>
                <th>Total</th>
            </tr>
                <tr>
                    <td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
                    <td class="failed">1</td>
                    <td class="failed">9</td>
                    <td>0</td>
                    <td>219</td>
                    <td>229</td>
                </tr>
            <tr>
                <td><strong>Total</strong></td>
                <td class="failed">1</td>
                <td class="failed">9</td>
                <td>0</td>
                <td>219</td>
                <td>229</td>
            </tr>
        </table>
    </section>
    <h1>Failure details</h1>
            <section>
                <h2>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2 (1 failures, 9 errors)</h2>
                <div>
                        <section id="Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2:test_00010_import_user_invalid_credentials">
                            <h3>test_00010_import_user_invalid_credentials: <strong>selenium.common.exceptions.NoSuchElementException</strong></h3>
                            <div class="test-details">
                                <h4>Traceback</h4>
                                <pre class="traceback">Traceback (most recent call last):
  File "C:\Python27\lib\unittest\case.py", line 329, in run
    testMethod()
  File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Regression_TestCase\RegressionProject_TestCase2.py", line 221, in test_00010_import_user_invalid_credentials
    Globals.login_password_invalid)
  File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Pages\security.py", line 51, in enter_invalid_userid_and_password
    self.enter_user_id(userid)
  File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Pages\security.py", line 32, in enter_user_id
    user_id_element = self.get_element(*MainPageLocators.security_user_id_textfield_xpath)
  File "C:\test_runners\selenium_regression_test_5_1_1\ClearCore - Regression Test\Pages\base.py", line 40, in get_element
    element = self.driver.find_element(by=how, value=what)
  File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 712, in find_element
    {'using': by, 'value': value})['value']
  File "C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 201, in execute
    self.error_handler.check_response(response)
  File "C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
NoSuchElementException: Message: Message: Unable to find element with xpath == //span[@class="gwt-InlineLabel marginbelow myinlineblock" and contains(text(), "User ID (including domain)")]/following-sibling::input

-------------------- >> begin captured stdout << ---------------------
*** Test import_invalid_user_credentials ***
05_12_1616_49_42
//span[@class="gwt-InlineLabel marginbelow myinlineblock" and contains(text(), "User ID (including domain)")]/following-sibling::input
Element not found 
Message: Unable to find element with xpath == //span[@class="gwt-InlineLabel marginbelow myinlineblock" and contains(text(), "User ID (including domain)")]/following-sibling::input

05_12_1616_51_54

--------------------- >> end captured stdout << ----------------------
----
 # There is more html below. I have not included everything. It will be too long otherwise.

If I open the file in a browser, it is formatted as shown below. This is the text I would like to extract from the HTML file.

    Class             Fail Error    Skip    Success     Total
Regression_TestCase     1    9       0      219         229

How can I do this? It would be good to keep it in table format. Thanks, Riaz

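【Comments】:

Have you tried using an XML parsing library? (e.g. docs.python.org/2.7/library/…)
I am looking at Beautiful Soup: stackoverflow.com/questions/16835449/…
What format do you want the output in? Should it look like a table in Excel (e.g. csv), or do you want a text file containing those rows, columns, and spacing?

Tags: python-2.7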

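【Solution 1】:

Your sample html code contains unclosed tags and closing tags without opening tags. I am assuming that you only showed an excerpt, and that the file you are extracting from looks like this: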

<body>
    <h1>Overview</h1>
    <section>
        <table>
            <tr>
                <th>Class</th>
                <th class="failed">Fail</th>
                <th class="failed">Error</th>
                <th>Skip</th>
                <th>Success</th>
                <th>Total</th>
            </tr>
                <tr>
                    <td>Regression_TestCase</td>
                    <td class="failed">1</td>
                    <td class="failed">9</td>
                    <td>0</td>
                    <td>219</td>
                    <td>229</td>
                </tr>
            <tr>
                <td><strong>Total</strong></td>
                <td class="failed">1</td>
                <td class="failed">9</td>
                <td>0</td>
                <td>219</td>
                <td>229</td>
            </tr>
        </table>
     </section>
</body>

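You can use the ElementTree (etree) module to parse the code as xml. Edit: changed the method used to find the table to use xpath, and made it so that the "Total" row is not printed.

Edit 2: I have now used regular expressions to extract all of the tables in the code. Use this with care, because it is a very fragile solution. If there is an opening table tag without a closing table tag, it will extract all of the text after the opening table tag and crash, because the resulting string will not be well-formed xml.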

import csv
import re
import xml.etree.ElementTree as ET

# Extract well-formed tables: keep only the lines from <table> to </table>.
start = re.compile(r"<table>", re.IGNORECASE)
end = re.compile(r"</table>", re.IGNORECASE)
html_code = ""
in_table = False
with open('sample2.xml') as xmlfile:
    for line in xmlfile:
        if not in_table:
            # Start copying at the first line that opens a table.
            if start.search(line):
                in_table = True
                html_code += line
        else:
            closing = end.search(line)
            if closing:
                # Keep the text up to and including </table>, then stop copying.
                html_code += line[:closing.end()]
                in_table = False
            else:
                html_code += line
print(html_code)

# Parse the extracted html code into an ElementTree Element object.
root = ET.fromstring(html_code)
elements = root.findall(".//tr")
print(elements)
# 'wb' is the Python 2 csv convention; on Python 3 use open('output.csv', 'w', newline='').
with open('output.csv', 'wb') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    for tablerow in elements:
        cells = list(tablerow)
        # Only write the row if its first cell has direct text. This skips the
        # "Total" row, whose first cell wraps its text in <strong>.
        if cells[0].text:
            row = [col.text for col in cells]
            csvwriter.writerow(row)
            print(row)

If you open "output.csv" in Excel, you will have your table. If you use this approach, note the security warning in the documentation (linked in zezollo's comment).
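The question asks for the result to stay in table form in an email body rather than in a CSV file. The following is a minimal sketch of one way to do that, assuming the rows have already been collected as lists of strings. The helper name `format_table` and the two-space column gap are illustrative choices, not part of the answer above.

```python
def format_table(rows):
    """Return rows (lists of strings) as left-aligned, space-padded text columns."""
    # The width of each column is the length of its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Strip the padding after the last column so lines have no trailing spaces.
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)


# Rows as they might come out of the parsing step (values from the sample report).
rows = [
    ["Class", "Fail", "Error", "Skip", "Success", "Total"],
    ["Regression_TestCase", "1", "9", "0", "219", "229"],
]
text_table = format_table(rows)
print(text_table)
```

The columns only line up when the email is displayed in a monospaced font, so this suits plain-text messages better than HTML ones.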

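Alternatively, you could use regular expressions, but I am too tired to write another solution. Maybe tomorrow, or someone else may provide an alternative.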
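As one possible alternative, here is a sketch that avoids both regular expressions and strict xml parsing by using the standard library's event-based HTML parser, which does not require every tag to be closed. It assumes Python 3 (in Python 2.7 the module is named `HTMLParser`). The class name `TableExtractor` and the shortened sample are hypothetical, and unlike the code above this version keeps the "Total" row.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <strong> also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# A fragment with unclosed <section>, <div> and <pre> tags, like the truncated report.
sample = """
<section><table>
<tr><th>Class</th><th class="failed">Fail</th><th>Total</th></tr>
<tr><td>Regression_TestCase</td><td class="failed">1</td><td>229</td></tr>
<tr><td><strong>Total</strong></td><td class="failed">1</td><td>229</td></tr>
</table>
<div><pre class="traceback">Traceback (most recent call last):
"""

parser = TableExtractor()
parser.feed(sample)
parser.close()
for row in parser.rows:
    print(row)
```

Because the parser only reacts to the tags it sees, the unclosed tags after the table do not cause an error here. The sketch does not distinguish between several tables in one file; all rows end up in a single list.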

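【Discussion】:

When I parse my html file, I get the error xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 125, column 47
The code that parses the html file is: tree = ET.parse(r"E:\SeleniumTestReport.html")
Do all of the opening tags in your html code have closing tags (and vice versa)? As for your second comment, my code assumes the html file is in the same directory as the python script. If you put your html file in a different directory, then the path you use to parse it naturally needs to be different.
As I wrote in my answer, my approach assumes well-formed, complete html code. The file you provided is incomplete. I could use regular expressions to extract the well-formed xml from your code, but without a table ID I am hesitant to do so, because it would be a very fragile solution for an already very fragile system. Is there any possibility that your program returns code containing multiple tables?
I have updated the code. If I run it on your entire sample code (including all of the errors and any other random text), it extracts what you are after. However, be careful if there is html code anywhere with an opening tag that is not paired with a closing tag.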
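The discussion raises two concerns: files that contain several tables, and fragments that are not well-formed xml. The sketch below shows one way both could be handled, assuming the whole file fits in memory and that row and cell tags are lowercase. The names `TABLE_RE` and `extract_tables` are illustrative, and this is a variation on the regex idea from the answer rather than the answerer's own code.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from each <table ...> to the nearest </table>.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table>", re.IGNORECASE | re.DOTALL)


def extract_tables(html_text):
    """Return one list of rows per table that parses as well-formed xml."""
    tables = []
    for fragment in TABLE_RE.findall(html_text):
        try:
            root = ET.fromstring(fragment)
        except ET.ParseError:
            # Skip fragments that are not well formed instead of crashing.
            continue
        rows = []
        for tr in root.iter("tr"):
            # itertext() also picks up text inside nested tags such as <strong>.
            rows.append(["".join(cell.itertext()).strip() for cell in tr])
        tables.append(rows)
    return tables


# Hypothetical input: two good tables, stray markup, and one malformed table.
sample = """
<h1>Overview</h1>
<table><tr><th>Class</th><th>Fail</th></tr>
<tr><td>Regression_TestCase</td><td>1</td></tr></table>
<p>unrelated text <br>
<table><tr><td>second</td><td>table</td></tr></table>
<table><tr><td>broken<td></tr></table>
"""

tables = extract_tables(sample)
for rows in tables:
    print(rows)
```

This remains fragile in the way the answer describes: an opening table tag with no closing tag makes the match run on to the next `</table>`, so the resulting fragment is either skipped or merged with the following table.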