Perl WWW::Mechanize Parse Content issue? (Answers)

【Question title】: Perl WWW::Mechanize Parse Content issue?
【Posted】: 2012-08-15 06:21:39
【Question】:

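I am using Perl's WWW::Mechanize library to scrape content from a website. However, I noticed that the raw HTML source of the page and the content returned by WWW::Mechanize are different. As a result, some of the functionality in my script is broken.

Here is the script (a subset, just enough to demonstrate the problem):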

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);    # do not keep a history of visited pages

my $url = "http://www.example.com";

$mech->get($url);

print $mech->content;

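This short script connects to the website and retrieves the entire HTML page.

I run the script and redirect its output to a text file so that I can analyze it: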

perl test.pl >> source_code.txt

Now, when I compare source_code.txt with the actual source of the site as shown by the browser (Firefox), there are differences.

For example:

<tr>
<td nowrap="nowrap">This is Some Text</td>
<td align="right"><a href="http://example.com?value=key">Some more Text</a></td>
</tr><tr>

The source above is what the browser shows (View Page Source).

However, the text file source_code.txt (generated by WWW::Mechanize)

shows:

<tr>
<td nowrap="nowrap">This is some text</td>
<td align="right">This is some more text</td>
</tr><tr>

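As you can see, the anchor tag nested inside the second pair of <td> tags has been removed.

Is this a known issue, or do I need to use something other than $mech->content to view the source?

Thanks.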

【Comments】:

I would first check whether the user agent affects what the server returns. What happens if you add agent => 'Windows IE 6' to the new() call?

Tags: perl www-mechanize

【Solution 1】:

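This is a common behaviour called "user agent sniffing": the server returns different content depending on the User-Agent header the client sends. For example, pages may be displayed differently for blind users. You can change your user agent string in the browser with various plugins or, as @LHMathies said, in WWW::Mechanize; see UserAgent.pm and Mechanize->new.

Example: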

my $mech = WWW::Mechanize->new( agent => 
     'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)' 
);

See also a list of common user agent strings.
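
To check whether the user agent really changes what the server sends, one option is to fetch the same page with two differently configured objects and compare the results. The following is only a minimal sketch: `count_links` is a hypothetical helper written for this illustration, the browser-like string is the one from the example above, and the URL is taken from the command line. Without an argument the script only prints the two agent strings and makes no request.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Browser-like User-Agent string, copied from the example above.
my $browser_agent = 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)';

# One object with the default agent, one with the browser-like agent.
my $default_mech = WWW::Mechanize->new();
my $browser_mech = WWW::Mechanize->new( agent => $browser_agent );

print "default agent: ", $default_mech->agent, "\n";
print "custom agent:  ", $browser_mech->agent, "\n";

# Hypothetical helper (not part of WWW::Mechanize): fetch $url with the
# given object and return how many links Mechanize finds in the page.
sub count_links {
    my ( $mech, $url ) = @_;
    $mech->get($url);            # dies on HTTP errors while autocheck is on
    my @links = $mech->links;
    return scalar @links;
}

# Only touch the network when a URL is passed on the command line.
if ( my $url = shift @ARGV ) {
    printf "links with default agent: %d\n", count_links( $default_mech, $url );
    printf "links with custom agent:  %d\n", count_links( $browser_mech, $url );
}
```

If the two counts differ for the same URL, the server is most likely varying its output by User-Agent, which would explain the missing anchor tag in the question.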

【Discussion】:

There is also the ->agent_alias() method, although those aliases have not been updated in a long time. search.cpan.org/dist/WWW-Mechanize/lib/WWW/…
Thanks. I will try changing the user agent.
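
The agent_alias() approach mentioned in the discussion could look roughly like the sketch below. It assumes the installed WWW::Mechanize version provides known_agent_aliases() and that the list it returns is not empty. The alias names and the strings they expand to depend on the installed version, so the script prints them instead of hard-coding one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# List the aliases that this installed version knows about.
my @aliases = $mech->known_agent_aliases();
print "known alias: $_\n" for @aliases;

# Pick the first alias purely for illustration; agent_alias() replaces the
# User-Agent string with the full string that the alias stands for.
my $alias = $aliases[0];
$mech->agent_alias($alias);

print "alias '$alias' expands to: ", $mech->agent, "\n";
```

Unlike agent => '...', which sends the given text verbatim, agent_alias() looks the short name up and sends the full browser string associated with it.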