Parsing an HTML table in Perl: answers

【Question title】: Parsing a HTML table in Perl
【Posted】: 2015-07-02 08:26:36
【Question】:

I am trying to parse the following HTML table:

<table cellspacing="0" border="1" width="100%">
 <tr bgcolor="#d0d0d0">
  <th style="COLOR: #ff6600">number</th>
  <th style="COLOR: #ff6600">id</th>
  <th style="COLOR: #ff6600">result</th>
  <th style="COLOR: #ff6600">reason</th>
 </tr>
 <tr>
  <td>1027</td>
  <td><a href="<url>">21cs_337</a></td>
  <td>0</td>
  <td>catch-all caught </td>
  <td>reason</td>  
 </tr>
 <tr>
  <td>10288</td>
  <td><a href="<url>">21cs_437</a></td>
  <td>0</td>
  <td>badfetch</td>
  <td>reason</td>
 </tr>
</table>

I am trying to read the values from this HTML file in my Perl script. I am using HTML::TagParser for this, and I am able to get the text of each row:

$table_old = ($html_old->getElementsByTagName("tr"))[1]->innerText();

But I want to get the value of each column within each row. I tried this:

$table_new = ($html_new->getElementsByTagName("tr"))[1];  
my $temp  = ($table_new->getElementsByTagName("td"))[2]->innerText();

This does not work. Any suggestions on how to parse the column elements effectively?

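【Comments】:

This module might be a better fit: search.cpan.org/~djerius/HTML-TableParser-0.40/lib/HTML/…
Thanks, but I already use TagParser for most of the other parsing in my script, so I would like to stick with it. I am looking into TableParser as well, but suggestions that work with TagParser would be preferable.
HTML::TableExtract is very, very useful.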

Tags: html perl html-table html-parsing

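【Solution 1】:

You need to use subTree. An element returned by getElementsByTagName is not itself searchable; calling subTree on it gives you an object on which getElementsByTagName can be called again, limited to that element's contents.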

#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TagParser;

my $html = HTML::TagParser->new( 'foo.html' ); # Change this to your file

my $nrow = 0;
for my $tr ( $html->getElementsByTagName("tr" ) ) {
    my $ncol = 0;
    for my $td ( $tr->subTree->getElementsByTagName("td") ) {
        print "Row [$nrow], Col [" . $ncol++ . "], Value [" . $td->innerText() . "]\n";
    }
    $nrow++;
}

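This produces the following output. Note that the header row is omitted: it contains only th cells, so the inner loop over td prints nothing for it, while $nrow is still incremented. That is why the numbering starts at Row [1]: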

Row [1], Col [0], Value [1027]
Row [1], Col [1], Value [21cs_337]
Row [1], Col [2], Value [0]
Row [1], Col [3], Value [catch-all caught]
Row [1], Col [4], Value [reason]
Row [2], Col [0], Value [10288]
Row [2], Col [1], Value [21cs_437]
Row [2], Col [2], Value [0]
Row [2], Col [3], Value [badfetch]
Row [2], Col [4], Value [reason]
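
If you need to index a particular row and column, as the question attempted, one option is to collect the cells into an array of arrays first. The following is only a minimal sketch under a few assumptions: it needs a version of HTML::TagParser that provides subTree, the inline `$source` string and the `@rows` variable are made up for illustration, and the markup is a shortened version of the table from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TagParser;

# Hypothetical inline sample; in a real script, pass your file name
# to HTML::TagParser->new() as in the answer above.
my $source = <<'HTML';
<html><body><table>
 <tr><th>number</th><th>id</th><th>result</th></tr>
 <tr><td>1027</td><td><a href="#">21cs_337</a></td><td>0</td></tr>
 <tr><td>10288</td><td><a href="#">21cs_437</a></td><td>0</td></tr>
</table></body></html>
HTML

my $html = HTML::TagParser->new;
$html->parse($source);

# Build a list of rows, each row being a reference to a list of cell texts.
my @rows;
for my $tr ( $html->getElementsByTagName('tr') ) {
    my @cells = map { $_->innerText } $tr->subTree->getElementsByTagName('td');
    push @rows, \@cells if @cells;    # skip rows without td cells (the th header row)
}

# First data row, second column (both indices are zero-based).
print "id of first data row: $rows[0][1]\n";
```

Because the header row is skipped here, `$rows[0]` is the first data row, unlike the `$nrow` counter above, which also counts the header row.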

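【Comments】:

Thanks, this works, but it still does not help me, because I am using Perl v5.6.1, which supports HTML-TagParser-0.16 (search.cpan.org/~kawasaki/HTML-TagParser-0.16/lib/HTML/…), and that version does not support HTML::TagParser::Element (which subTree requires).
The best advice is to stop using a 14-year-old, unsupported version of Perl. What is preventing you from using a newer version?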