【Question title】: Find links containing bold text using WWW::Mechanize
【Posted】: 2014-10-14 05:34:20
【Question】:

Suppose the content of an HTML page is

<a href="abc.com"><b>ABC</b>industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies with<b>ABC</b></a>

I want to extract only the links that contain bold text. How can I do this with WWW::Mechanize?

Desired output:

ABC industry
Movies with ABC

I tried:

my @arr = $m->links();
foreach (@arr) { print $_->text; }

But this returns every link on the page.

【Comments】:

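  • There doesn't seem to be any way to retrieve the original content after calling ->links(), so depending on your implementation, could you use another module to parse the HTML, such as HTML::Parser?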

Tags: perl mechanize


【Solution 1】:

Without an additional module that can parse the page content, this is hard to achieve with WWW::Mechanize alone. However, there are other modules that make it easy.

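Here is an example using Mojo::DOM, which lets you select elements with CSS selectors. The Mojolicious distribution also includes Mojo::UserAgent, so if you are not heavily dependent on WWW::Mechanize, you can migrate your code to Mojo fairly easily.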

use feature 'say';
use Mojo::DOM;

# $html is the content of the page
my $dom = Mojo::DOM->new($html);

# extract all <b> elements that are under <a> elements (at any depth beneath the <a>)
# and get the <a> ancestors of those elements
# creates a Mojo::Collection object
my $collection = $dom->find('a b')->map(sub{ return $_->ancestors('a') } )->flatten;

$collection->each( sub {
    say "LINK: " . $_->all_text;
} );

# Use a sub to perform an action on each of the retrieved <a> elements:
$dom->find('a b')->each( sub {
    $_->ancestors('a')->each( sub {
        say "All in one: " . $_->all_text
    } )
} );

Here is a demo with a sample list of links:

<html>
<ul><li><a href="abc.com"><b>ABC</b> industry</a></li>
<li><a href="google.com">ABC Search</a></li>
<li>Here is <a href="#">a link 
    <span>with a span 
        <b>and a "b" tag</b> 
          even though
    </span> "b" tags are deprecated.</a> Yay!</li>
<li><a href="abc.com">Movies with <b>ABC</b></a></li></ul></html>

Output:

LINK: ABC industry
LINK: a link with a span and a "b" tag even though "b" tags are deprecated.
LINK: Movies with ABC
All in one: ABC industry
All in one: a link with a span and a "b" tag even though "b" tags are deprecated.
All in one: Movies with ABC
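
If you would rather keep WWW::Mechanize for fetching, one possible variation is to hand the page content to Mojo::DOM yourself. The sketch below is not the code from this answer: it filters the `<a>` elements with `grep` and `at('b')` instead of walking up from each `<b>`, so a link with several `<b>` tags is listed only once. The helper name `bold_link_texts` is made up for illustration, and the whitespace normalization is my own addition, since whether `all_text` trims whitespace depends on the Mojolicious version.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper: returns the text of every <a> that has a <b>
# somewhere beneath it, each link once, in document order.
sub bold_link_texts {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    return $dom->find('a')
        ->grep(sub { defined $_->at('b') })   # keep links with a <b> descendant
        ->map(sub {
            my $text = $_->all_text;
            $text =~ s/\s+/ /g;               # collapse runs of whitespace
            $text =~ s/^ | $//g;              # trim both ends
            $text;
        })
        ->each;                               # no callback: returns the list
}

# Sample markup based on the question
my $html = <<'HTML';
<a href="abc.com"><b>ABC</b> industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies with <b>ABC</b></a>
HTML

my @texts = bold_link_texts($html);
say for @texts;
```

With WWW::Mechanize you would pass `$m->content` to the helper after a successful `$m->get($url)`, in place of the sample `$html` string.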

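Your search is easier if you use Mojo::UserAgent instead of WWW::Mechanize. Mojo::UserAgent can get a page (just as WWW::Mechanize does), and the DOM of the returned page is available via $ua->get($url)->res->dom. You can then chain your query onto it, giving the following: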

use feature 'say';
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new();
# get the page and find the links with a <b> element in them:
$ua->get('http://my-url-here.com')
   ->res->dom('a b')->each( sub { $_->ancestors('a')->each( sub { say $_->all_text } ) } );

# example using this page:
# print the contents of divs with class 'spacer' that contain a link with a div in it:
$ua->get('http://stackoverflow.com/questions/26353298/find-links-containing-bold-text-using-wwwmechanize')
->res->dom('a div')->each( sub { 
    $_->ancestors('div.spacer')->each( sub {
        say $_->all_text
    } )
} );

Output:

1 How to use WWW::Mechanize to submit a form which isn't there in HTML?
0 How to process a simple loop in Perl's WWW::Mechanize?
0 Perl WWW::Mechanize cookie problem
1 Getting error in accessing a link using WWW::Mechanize
0 How to use output from WWW::Mechanize?
-2 Use WWW::Mechanize to login in webpage without form login but javascript using perl
3 Perl WWW::Mechanize Web Spider. How to find all links
0 Howto use WWW::Mechanize to access pages split by drop-down list
0 What is the best way to extract unique URLs and related link text via perl mechanize?
0 Perl WWW::Mechanize doesn't print results when reading input data from a data file

The Mojolicious documentation has plenty of examples in case this isn't immediately clear!

For an 8-minute introductory video on Mojo::DOM and Mojo::UserAgent, see Mojocast Episode 5.

【Comments】:

  • Note that $dom->find('a b')->ancestors('a')->each( sub { say "All in one: " . $_->all_text } ) is incorrect, because find('a b') returns a Mojo::Collection, which has no ancestors method.