Perl HTML::TokeParser: get raw HTML between tags

【Question title】: Perl HTML::TokeParser get raw html between tags
【Posted】: 2014-01-14 03:24:47
【Question description】:

I am using TokeParser to extract tag contents.

```
...
$text = $p->get_text("/td");
...
```

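Normally it returns the cleaned-up text. What I want is everything between td and /td, including all the other HTML elements. How can I do that?

I am using the example from this tutorial. Thanks.

In the example there is this line: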

```
my ( $tag, $attr, $attrseq, $rawtxt ) = @{ $token };
```

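I believe there is some trick involving $rawtxt.

【Question discussion】:

I don't believe you can do this with HTML::TokeParser. Why not use a regular expression to capture your data?
Try get_tag("td") and then dump the result. I think it will have the data you are looking for, but I'm not sure.
This is indeed a challenging task. You would be better off trying a DOM parser.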

Tags: perl html-parsing

【Solution 1】:

HTML::TokeParser has no built-in feature for this. However, you can look at each token between the <td>s individually.

```
#!/usr/bin/perl
use strictures;
use HTML::TokeParser;
use 5.012;

# dispatch table with subs to handle the different types of tokens;
# each sub returns the raw source text stored in the token
my %dispatch = (
  S  => sub { $_[0]->[4] }, # Start tag
  E  => sub { $_[0]->[2] }, # End tag
  T  => sub { $_[0]->[1] }, # Text
  C  => sub { $_[0]->[1] }, # Comment
  D  => sub { $_[0]->[1] }, # Declaration
  PI => sub { $_[0]->[2] }, # Processing instruction
);

# create the parser
my $p = HTML::TokeParser->new( \*DATA ) or die "Can't open: $!";

# fetch all the <td>s
TD: while ( $p->get_tag('td') ) {
  # go through all tokens ...
  while ( my $token = $p->get_token ) {
    # ... but stop at the end of the current <td>
    next TD if ( $token->[0] eq 'E' && $token->[1] eq 'td' );
    # call the sub corresponding to the current type of token
    print $dispatch{$token->[0]}->($token);
  }
} continue {
  # each time next TD is called, print a newline
  print "\n";
}

__DATA__
<html><body><table>
<tr>
<td><strong>foo</strong></td>
<td><em>bar</em></td>
<td><font size="10"><font color="#FF0000">frobnication</font></font>
<p>Lorem ipsum dolor set amet fooofooo foo.</p></td>
</tr></table></body></html>
```

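This program parses the HTML document in the __DATA__ section and prints everything between <td> and </td>, including the HTML. It prints one line per <td>. Let's go through it step by step.

After reading the documentation, I learned that every token in HTML::TokeParser has a type associated with it. There are six types: S, E, T, C, D and PI. The doc says:
"This method will return the next token found in the HTML document, or undef at the end of the document. The token is returned as an array reference. The first element of the array will be a string denoting the type of this token: "S" for start tag, "E" for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process instructions. The rest of the token array depends on the type like this:"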
```
["S",  $tag, $attr, $attrseq, $text]
["E",  $tag, $text]
["T",  $text, $is_data]
["C",  $text]
["D",  $text]
["PI", $token0, $text]
```
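We want access to the $text stored in these tokens, because that is the only place where the original markup of a tag is kept. So I created a dispatch table in %dispatch to handle them. It stores a bunch of code refs, one per token type, that are called later.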
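As a side note, the layouts quoted above suggest a shorter alternative to the dispatch table: text tokens keep their source text at index 1, and every other type keeps it in the last slot. Below is a minimal sketch of that idea. The helper name `raw_text` and the hand-built tokens are my own illustration, not part of HTML::TokeParser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TokeParser): return the raw
# source text of one token as returned by get_token.
# Assumption, taken from the layouts quoted above: "T" tokens keep the
# text at index 1, all other types keep it in the last element.
sub raw_text {
    my ($token) = @_;
    return $token->[0] eq 'T' ? $token->[1] : $token->[-1];
}

# Hand-built tokens in the documented shapes, for illustration only.
my @tokens = (
    [ 'S', 'em', {}, [], '<em>' ],
    [ 'T', 'bar', '' ],
    [ 'E', 'em', '</em>' ],
);

# Concatenating the raw text of each token rebuilds the markup.
print join( '', map { raw_text($_) } @tokens ), "\n";
```

The dispatch table is more explicit about which index belongs to which type, so it is probably the safer choice if the token layouts ever change.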
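We read the document from __DATA__, which is convenient for this example.
First, we need to get the <td>s using the get_tag method. @nrathaus's comment pointed me to that. It moves the parser to the next token after the opening <td>. We don't care what get_tag returns, because we only want the tokens after the <td>.
We use the get_token method to fetch the next token and do something with it:
- But we only want to do that until we find the corresponding closing </td>. If we see it, we call next on the outer while loop, which is labeled TD.
- At that point, the continue block is called and prints a newline.
- If we have not reached the end yet, the magic happens: the dispatch table. As we saw earlier, the first element of the token array ref holds the type. %dispatch has a code ref for each of these types. We call it with $coderef->(@args), passing the complete array ref $token, and print the result on the current line.
  
  Each iteration produces one piece such as <strong>, foo, </strong> and so on.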

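Note that this only works for a single table. If there is a table inside the table (e.g. <td> ... <td></td> ... </td>), this will break, because the loop stops at the first </td> it sees. You would have to adjust it to keep track of how deep it is.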
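Here is a rough sketch of what tracking the depth could look like. It is not tested against real-world pages, and the sub names `raw_text` and `outer_td_contents` are made up for this example. The idea: count every nested <td> start tag and only stop when the counter drops back to zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: raw source text of a token. Text tokens keep it
# at index 1, all other token types keep it in the last element.
sub raw_text {
    my ($token) = @_;
    return $token->[0] eq 'T' ? $token->[1] : $token->[-1];
}

# Hypothetical helper: return the raw inner HTML of every outermost <td>.
sub outer_td_contents {
    my ($html) = @_;
    # a reference to a scalar is parsed as the literal document
    my $p = HTML::TokeParser->new( \$html ) or die "Can't parse input";
    my @cells;
    while ( $p->get_tag('td') ) {
        my $depth = 1;     # we are now inside one <td>
        my $inner = '';
        while ( my $token = $p->get_token ) {
            my ( $type, $name ) = @{$token}[ 0, 1 ];
            if ( $type eq 'S' && $name eq 'td' ) {
                $depth++;                  # a nested <td> opens
            }
            elsif ( $type eq 'E' && $name eq 'td' ) {
                last if --$depth == 0;     # our own </td>: stop here
            }
            $inner .= raw_text($token);
        }
        push @cells, $inner;
    }
    return @cells;
}

my $html = '<table><tr><td>a <table><tr><td><b>b</b></td></tr></table> c</td></tr></table>';
print "$_\n" for outer_td_contents($html);
```

With this input, the inner table stays part of the outer cell instead of cutting it short. The sketch assumes every <td> has an explicit closing </td>; markup that omits the closing tags would throw the counter off.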

Another approach is to use miyagawa's excellent Web::Scraper. That way, we need a lot less code:

```
#!/usr/bin/perl
use strictures;
use Web::Scraper;
use 5.012;

my $s = scraper {
  process "td", "foo[]" => 'HTML'; # grab the raw HTML for all <td>s
  result 'foo'; # return the array foo where the raw HTML is stored
};

my $html = do { local $/ = undef; <DATA> }; # read HTML from __DATA__
my $res = $s->scrape( $html ); # scrape

say for @$res; # print each line of HTML
```

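This approach also handles multi-dimensional (nested) tables like a charm.

【Discussion】:

You're welcome. It was a fun task. :) What are you trying to achieve with this?
Well, I kind of like TokeParser. There are so many modules like this on CPAN that choosing one really gives me a headache, so I am sticking to a single parsing tool, at least for now. I am trying to turn the tags nested inside a td into separators.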
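For that last goal (nested tags turned into separators), something along these lines might work with plain get_token. This is only a sketch under my own assumptions: the sub name `td_text_with_separator` is invented, nested <td>s are not handled, and as far as I know the text in "T" tokens is the raw source text, so entities such as &amp; are not decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: for each <td>, join its text pieces with $sep and
# drop the nested tags themselves.
sub td_text_with_separator {
    my ( $html, $sep ) = @_;
    my $p = HTML::TokeParser->new( \$html ) or die "Can't parse input";
    my @rows;
    while ( $p->get_tag('td') ) {
        my @pieces;
        while ( my $token = $p->get_token ) {
            # stop at the closing tag of the current <td>
            last if $token->[0] eq 'E' && $token->[1] eq 'td';
            next unless $token->[0] eq 'T';     # skip tags, comments, etc.
            next unless $token->[1] =~ /\S/;    # skip whitespace-only text
            push @pieces, $token->[1];
        }
        push @rows, join( $sep, @pieces );
    }
    return @rows;
}

print "$_\n" for td_text_with_separator(
    '<table><tr><td><b>foo</b><i>bar</i></td><td>baz</td></tr></table>', '|' );
```

Every tag boundary between two pieces of text becomes one separator, so several tags in a row (for example </b><i>) collapse into a single '|'.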