How do I parse an HTML website using Perl? [closed]

【Question title】: How do I parse an HTML website using Perl? [closed]
【Posted】: 2011-02-14 10:42:31
【Question description】:

Could you give me some advice on how to parse HTML in Perl? I plan to extract keywords (including URL links) and save them to a MySQL database. I am using Windows XP.

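Also, do I first need to download some of the website's pages to my local hard drive with an offline explorer tool? If so, can you point me to a good download tool?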

【Question comments】:

Possible duplicate of stackoverflow.com/questions/2220442/…

Tags: perl html-parsing

【Solution 1】:

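You can use LWP to retrieve the pages you need to parse. There are many ways to parse HTML. You can use regular expressions to find links and keywords (although this is generally not good practice), or modules such as HTML::TokeParser or HTML::TreeBuilder.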
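
As a rough illustration of the approach above, here is a minimal sketch that pulls links and their text out of a page with HTML::TokeParser. The helper name extract_links and the sample HTML are made up for this example. When a URL is passed on the command line the page is fetched with LWP::UserAgent, otherwise the built-in sample string is parsed, so the script also runs offline.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return a list of [href, link text] pairs
# found in an HTML string.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};          # attribute hash of the <a> tag
        next unless defined $href;           # skip anchors without href
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

my $html;
if (@ARGV) {
    # Fetch the page given on the command line (needs network access).
    require LWP::UserAgent;
    my $response = LWP::UserAgent->new(timeout => 10)->get($ARGV[0]);
    die $response->status_line unless $response->is_success;
    $html = $response->decoded_content;
}
else {
    # Sample document, assumed for illustration only.
    $html = '<html><body><a href="/docs">Docs</a> '
          . '<a href="http://example.com/">Example</a></body></html>';
}

printf "%s\t%s\n", @$_ for extract_links($html);
```

The pairs returned by extract_links could then be written to a database; that step is left out here.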

【Comments】:

I'll try LWP and the Perl HTML modules.

【Solution 2】:

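You can use one of the many HTML parser modules. If you are familiar with jQuery, the pQuery module would be a good choice, since it ports most of jQuery's easy-to-use features to Perl for HTML parsing and scraping.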
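
For a feel of the jQuery-like style, here is a small sketch modeled on the find/each/text chain from the pQuery synopsis. The sample HTML string and the @headings variable are assumptions for illustration, and pQuery implements only part of jQuery's selector syntax, so anything more complex than a tag name should be checked against its documentation.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use pQuery;

# Sample document, assumed for illustration only.
my $html = '<div><h2>First</h2><p>text</p><h2>Second</h2></div>';

# Collect the text of every <h2> element, jQuery style.
my @headings;
pQuery($html)->find('h2')->each(sub {
    my $index = shift;                  # position of the element in the set
    push @headings, pQuery($_)->text;   # $_ is the current element
});

print "$_\n" for @headings;
```

pQuery also accepts a URL in place of the HTML string, in which case it fetches the page first.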

【Comments】:

@MiffTheFox, +1, thanks for pQuery. I had never heard of it before, and it may be a good starting point for me.

【Solution 3】:

The HTTrack website copier/downloader has more features than any available Perl library.

【Comments】:

【Solution 4】:

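To traverse an entire website and save it locally, you can use wget -r -np http://localhost/manual/ (wget is available on Windows, either standalone or as part of Cygwin/MinGW). However, if you want to traverse the site and extract data at the same time, Mojolicious can be used to build a simple parallel web crawler with very few dependencies: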

#!/usr/bin/env perl
use feature qw(say);
use strict;
use utf8;
use warnings qw(all);

use Mojo::IOLoop;
use Mojo::URL;
use Mojo::UserAgent;

# FIFO queue
my @urls = (Mojo::URL->new('http://localhost/manual/'));

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent->new(max_redirects => 5);

# Track accessed URLs
my %uniq;

my $active = 0;
Mojo::IOLoop->recurring(
    0 => sub {

        # Keep up to 4 parallel crawlers sharing the same user agent
        for ($active .. 4 - 1) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop) unless my $url = shift @urls;

            # Fetch non-blocking just by adding a callback and marking as active
            ++$active;
            $ua->get(
                $url => sub {
                    my (undef, $tx) = @_;

                    say "\n$url";
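                    # Pages without a <title> (or failed requests) yield undef here
                    my $title = $tx->res->dom->at('html title');
                    say $title ? $title->text : '(no title)';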

                    # Extract and enqueue URLs
                    for my $e ($tx->res->dom('a[href]')->each) {
                        # Validate href attribute
                        my $link = Mojo::URL->new($e->{href});
                        next if 'Mojo::URL' ne ref $link;

                        # "normalize" link
                        $link = $link->to_abs($tx->req->url)->fragment(undef);
                        next unless $link->protocol =~ /^https?$/x;

                        # Access every link once
                        next if ++$uniq{$link->to_string} > 1;

                        # Don't visit other hosts
                        next if $link->host ne $url->host;

                        push @urls, $link;
                        say " -> $link";
                    }

                    # Deactivate
                    --$active;
                }
            );
        }
    }
);

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

【Comments】: