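Extracting data from HTML table: answers

【Question title】: Extracting data from HTML table
【Posted】: 2012-08-01 04:42:16
【Question】:

I am looking for a way to get certain information from HTML in a Linux shell environment.

This is the bit that I'm interested in: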

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
</table>

I would like to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

Tests         : 103
Failures      : 24
Success Rate  : 76.70 %
and so on..

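What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems like overhead, because the runnable jar would have to be included in the "wrapper" script you want to execute.

I'm sure there must be "shell" languages that can do the same thing, e.g. Perl, Python, Bash, etc.

My problem is that I have zero experience with these. Can somebody help me resolve this "fairly easy" issue?

Quick update:

I forgot to mention that I have more tables and more rows in the .html document. Sorry about that (early morning).

Update #2:

I tried to install Bsoup like this, since I don't have root access: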

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # (paste code from ) Tichodromas' answer, just in case this (http://pastebin.com/4Je11Y9q) is what I pasted
$ run file (python htmlParse.py)

Error:

$ python htmlParse.py
Traceback (most recent call last):
  File "htmlParse.py", line 1, in ?
    from bs4 import BeautifulSoup
  File "/home/gdd/setup/py/bs4/__init__.py", line 29
    from .builder import builder_registry
         ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last):
  File "test.py", line 27, in ?
    headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable

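Any ideas?

【Question comments】:

There is a nice Python library that might help: BeautifulSoup -> crummy.com/software/BeautifulSoup/bs4/doc .
@Jakob S. Thanks for your comment. As I said, I'm a newbie, so I downloaded the tarball and tried to install it with python setup.py install, and got this permission error: error: could not create '/usr/lib/python2.4/site-packages/bs4': Permission denied. How can I specify the directory to install it in? Is there something like -prefix, as when installing other commands?
I have to admit that I'm not sure how to achieve this if you don't have root access, and I don't currently have a Linux box to try it on. In principle, it should be possible to simply copy the package into the right directory relative to your source .py file so that the interpreter can find it.
Check out the docs: "If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all." (crummy.com/software/BeautifulSoup/bs4/doc/…)
You can/should install bs4 in a separate virtualenv. You will have pseudo-root rights in there.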
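
The "copy the package next to your script" idea from the comments above can be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the folder name `vendor` and the helper `add_vendor_dir` are hypothetical, and it assumes the copied library actually supports the Python version in use.

```python
import os
import sys


def add_vendor_dir(script_path, dirname="vendor"):
    """Put <directory of script_path>/<dirname> at the front of sys.path.

    `dirname` is a hypothetical folder holding copied packages (e.g. bs4).
    Returns the absolute path that was added.
    """
    base = os.path.dirname(os.path.abspath(script_path))
    vendor = os.path.join(base, dirname)
    if vendor not in sys.path:
        # Front of the list, so the copied package wins over system-wide ones.
        sys.path.insert(0, vendor)
    return vendor
```

A script would call `add_vendor_dir(__file__)` before its first `import` of the copied package. No root access is needed, because nothing is written outside the script's own directory.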

Tags: python linux perl bash

【Solution 1】:

Use pandas.read_html:

import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
                   0
Tests            103
Failures          24
Success Rate  76.70%
Average Time   71 ms

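【Discussion】:

【Solution 2】:

Below is a Python regex-based solution that I have tested on Python 2.7. It does not rely on an xml module, so it works even when the markup is not fully well-formed XML.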

import re
# input args: html string
# output: tables as a list, column max length
def extract_html_tables(html):
  tables=[]
  maxlen=0
  rex1=r'<table.*?/table>'
  rex2=r'<tr.*?/tr>'
  rex3=r'<(td|th).*?/(td|th)>'
  s = re.search(rex1,html,re.DOTALL)
  while s:
    t = s.group()  # the table
    s2 = re.search(rex2,t,re.DOTALL)
    table = []
    while s2:
      r = s2.group() # the row 
      s3 = re.search(rex3,r,re.DOTALL)
      row=[]
      while s3:
        d = s3.group() # the cell
        #row.append(strip_tags(d).strip() )
        row.append(d.strip() )

        r = re.sub(rex3,'',r,1,re.DOTALL)
        s3 = re.search(rex3,r,re.DOTALL)

      table.append( row )
      if maxlen<len(row):
        maxlen = len(row)

      t = re.sub(rex2,'',t,1,re.DOTALL)
      s2 = re.search(rex2,t,re.DOTALL)

    html = re.sub(rex1,'',html,1,re.DOTALL)
    tables.append(table)
    s = re.search(rex1,html,re.DOTALL)
  return tables, maxlen

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""
print extract_html_tables(html)
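
The function above returns each cell with its surrounding tags still attached, and the `strip_tags` call it leaves commented out is not defined anywhere in the answer. One possible version of such a helper is sketched below. The implementation is an assumption (only the name comes from the commented-out line), it targets Python 3 rather than the 2.7 used above, and it only handles simple markup inside a cell.

```python
import re
from html import unescape

# Anything between "<" and the next ">" is treated as a tag.
_TAG = re.compile(r"<[^>]+>")


def strip_tags(cell):
    """Remove tag-like substrings, decode entities, and trim whitespace."""
    return unescape(_TAG.sub("", cell)).strip()


cells = ["<th>Success Rate</th>", "<td> 76.70% </td>", "<td>a &amp; b</td>"]
print([strip_tags(c) for c in cells])
# prints ['Success Rate', '76.70%', 'a & b']
```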

【Discussion】:

【Solution 3】:

Here is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace in cells:

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table")

# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]

print(headings)

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

print(datasets)
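
To get from one of these dictionaries to the "Tests : 103" layout asked for in the question, something along these lines could be used. It is a small sketch rather than part of the original answer: `format_pairs` is a hypothetical helper, and the sample dictionary simply repeats three values from the table.

```python
def format_pairs(dataset, width=None):
    """Return 'key : value' lines with the keys padded to a common width."""
    if width is None:
        # Default to the longest key so that the colons line up.
        width = max((len(key) for key in dataset), default=0)
    return ["{0:<{1}} : {2}".format(key, width, value)
            for key, value in dataset.items()]


sample = {"Tests": "103", "Failures": "24", "Success Rate": "76.70%"}
print("\n".join(format_pairs(sample)))
```

The output order relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onwards.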

【Discussion】:

【Solution 4】:

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: using class="details" to select the table):

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets

The result looks like this:

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

Edit2: To produce the desired output, use something like this:

for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
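
The question also asks for the values to end up in shell variables. One way to bridge from Python to the shell is to print `NAME=value` lines that the calling script can evaluate. The sketch below is not from the original answer: `to_shell_assignments` is a hypothetical helper, and the naming scheme (upper case, underscores for anything that is not an ASCII letter or digit) is an assumption.

```python
import re
import shlex


def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into shell-safe NAME=value lines."""
    lines = []
    for heading, value in pairs:
        # Collapse every run of non-identifier characters into one underscore.
        name = re.sub(r"[^A-Za-z0-9_]+", "_", heading.strip()).strip("_").upper()
        if not name or name[0].isdigit():
            name = "_" + name  # shell variable names cannot start with a digit
        # shlex.quote adds quotes only where the shell needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines


pairs = [("Tests", "103"), ("Success Rate", "76.70%"), ("Average Time", "71 ms")]
print("\n".join(to_shell_assignments(pairs)))
# prints:
# TESTS=103
# SUCCESS_RATE=76.70%
# AVERAGE_TIME='71 ms'
```

If this lived in a script (say a hypothetical `table_vars.py`), a wrapper could run `eval "$(python3 table_vars.py)"` and then read `$TESTS` and the other variables.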

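【Discussion】:

Thanks for your answer, and in reply to your comment above: can I use a class as the identifier, since I don't have an ID? The class would be details
@GandalfStormCrow Yes, this can be done. I have edited my answer.
Are you sure this answer actually works in Python 2.4? @Gandalf, you said in a comment that you had installed "an older version of bsoup" (BeautifulSoup 3, I suppose). And the line saying "I'm using Python 2.4.3" is gone. So this is a bit confusing.
Python 2.4.3 was released on March 29, 2006! I think an update would be advisable.
I get: print(datasets) [, ] while the headings are fine.

【Solution 5】:

Assuming your HTML code is stored in a file named mycode.html, here is a bash way: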

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

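Note: the output is not perfectly aligned

【Discussion】:

Thanks for your answer. I need a specific table; there is more than one table
I have heard that parsing HTML or XML with regular expressions is broken by definition.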
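
For readers who share the concern about regular expressions but cannot install extra packages, the standard library's `html.parser` is one alternative to the grep/sed pipeline. The sketch below is a swapped-in technique, not something proposed in the thread; it targets Python 3, the names `TableExtractor` and `extract_tables` are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # current row, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None  # drop a cell left unclosed by sloppy markup


def extract_tables(html_text):
    """Parse html_text and return the tables in document order."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables
```

Because every table is returned, a specific one can be picked by index, for example `extract_tables(text)[1]` for the second table in the file.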

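【Solution 6】:

A Python solution that uses only the standard library (it takes advantage of the fact that the HTML happens to be well-formed XML). It can handle more than one row of data.

(Tested with Python 2.6 and 2.7. The question was updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case. ElementTree was added in Python 2.5.)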

from xml.etree.ElementTree import fromstring

HTML = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
  <tr valign="top">
    <th>Tests</th>
    <th>Failures</th>
    <th>Success Rate</th>
    <th>Average Time</th>
    <th>Min Time</th>
    <th>Max Time</th>
  </tr>
  <tr valign="top" class="Failure">
    <td>103</td>
    <td>24</td>
    <td>76.70%</td>
    <td>71 ms</td>
    <td>0 ms</td>
    <td>829 ms</td>
  </tr>
  <tr valign="top" class="whatever">
    <td>A</td>
    <td>B</td>
    <td>C</td>
    <td>D</td>
    <td>E</td>
    <td>F</td>
  </tr>
</table>"""

tree = fromstring(HTML)
rows = tree.findall("tr")
headrow = rows[0]
datarows = rows[1:]

for num, h in enumerate(headrow):
    data = ", ".join([row[num].text for row in datarows])
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests           : 103, A
Failures        : 24, B
Success Rate    : 76.70%, C
Average Time    : 71 ms, D
Min Time        : 0 ms, E
Max Time        : 829 ms, F

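【Discussion】:

Thanks for your answer. Instead of reading from a specific HTML string, can I specify something like: from this HTML file, give me the table with class="details", and then do what you just did?
Now it can handle multiple rows of data. I have tested this with Python 2.6 and 2.7, but now I see that you are using 2.4.3 (which I don't have). So it may not help you. Anyway, I wanted to show that this kind of thing can be done without extra libraries.
The string formatting syntax that I (and @Tichodroma) use does not work in 2.4.
"From this HTML file, give me the table with class="details"": yes, this can be done using ElementTree (but not with Python 2.4). ElementTree was added in Python 2.5.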
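
The last comment says that selecting the class="details" table from a file is possible with ElementTree but does not show how. A rough sketch of one way to do it follows. The helper names `details_tables` and `table_rows` are hypothetical, the code targets Python 3, and like the answer itself it only works when the file is well-formed XML.

```python
from xml.etree import ElementTree


def details_tables(source, css_class="details"):
    """Return every <table> whose class attribute equals css_class.

    `source` is a filename or an open file object. The comparison is exact,
    so class="details wide" would not match.
    """
    root = ElementTree.parse(source).getroot()
    # iter() also visits the root element itself, in case it is the table.
    return [t for t in root.iter("table") if t.get("class") == css_class]


def table_rows(table):
    """Return the text of every cell, one list per <tr>."""
    return [[(cell.text or "").strip() for cell in tr] for tr in table.iter("tr")]
```

With the question's markup saved to a file, `table_rows(details_tables("mycode.html")[0])` would give the header row followed by the data rows; the filename is borrowed from Solution 5 and is only an example.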

【Solution 7】:

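undef $/;          # slurp mode: read everything after __DATA__ in one go
$text = <DATA>;

# Capture the body of every table, then print heading/value pairs per table.
@tabs = $text =~ m!<table.*?>(.*?)</table>!gms;
for (@tabs) {
    @th = m!<th>(.*?)</th>!gms;
    @td = m!<td>(.*?)</td>!gms;
    # Printing inside the loop keeps earlier tables from being overwritten.
    for $i (0..$#th) {
        printf "%-16s\t: %s\n", $th[$i], $td[$i];
    }
}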

__DATA__
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>

The output is as follows:

Tests               : 103
Failures            : 24
Success Rate        : 76.70%
Average Time        : 71 ms
Min Time            : 0 ms
Max Time            : 829 ms

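【Discussion】:

@cdtits Thanks for your reply. Would this work if my file contains multiple tables?
If you're going to use Perl, I recommend HTML::TableExtract... IMO it even beats Python's ugly soup solution.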