Parse domains from an HTML page using Perl

【Question title】: parse domains from html page using perl
【Posted】: 2013-08-20 03:47:38
【Question】:

I have an HTML page that contains the following URLs:

<h3><a href="http://site.com/path/index.php" h="blablabla">
<h3><a href="https://www.site.org/index.php?option=com_content" h="vlavlavla">

I want to extract:

site.com/path
www.site.org

That is, the part between <h3><a href=" and /index.php.

I tried this code:

#!/usr/local/bin/perl
use strict;
use warnings;

open (MYFILE, 'MyFileName.txt');
while (<MYFILE>) 
{
  my $values1 = split('http://', $_); #VALUE WILL BE: www.site.org/path/index2.php
  my @values2 = split('index.php', $values1); #VALUE WILL BE: www.site.org/path/ ?option=com_content

    print $values2[0]; # here it must print www.site.org/path/ but it don't
    print "\n";
}
close (MYFILE);

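But this gives the wrong output, and it doesn't parse https sites. I hope you understand. Regards.

【Comments】:

You split $_ on the my $values1 = ... line, but that variable has no defined value unless you passed something on the command line. You should split something you can positively identify, so that you know what the result means.
$_ is set by the while (<MYFILE>) line; this is a common Perl idiom.

Tags: html perl parsing url dns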

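【Answer 1】:

The main problem with your code is that you call split in scalar context, as in this line:

my $values1 = split('http://', $_);

In scalar context, split returns the size of the list it created, not the pieces themselves. See the documentation for split.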
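To make the difference concrete, here is a minimal sketch that splits one of the question's lines in both contexts. The helper name split_both_ways is made up for illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): split the same line
# in scalar context and in list context, and return both results.
sub split_both_ways {
    my ($line) = @_;
    my $count  = split m{http://}, $line;   # scalar context: number of fields
    my @fields = split m{http://}, $line;   # list context: the fields themselves
    return ($count, \@fields);
}

my ($count, $fields) = split_both_ways(
    '<h3><a href="http://site.com/path/index.php" h="blablabla">'
);
print "scalar context: $count\n";          # the field count, not a string
print "list context:   $fields->[1]\n";    # everything after 'http://'
```

The scalar assignment yields only the field count (2 for this line), which is why the second split in the question had nothing useful to work on.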

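But I don't think split is the right tool for this task anyway. If you know the value you're looking for always sits between 'http[s]://' and '/index.php', all you need is a regex substitution inside the loop (you should also be more careful when opening the file):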

open(my $myfile_fh, '<', 'MyFileName.txt') or die "Couldn't open $!";
while(<$myfile_fh>) {
    s{.*http[s]?://(.*)/index\.php.*}{$1} && print;
}

close($myfile_fh);

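You may need a more general regex than this, but I think it will work given your description of the problem.

【Comments】:

Hi dms, but I can't save the output to a file. I tried: open(my $sort, 'tt8-4.txt') or die "Couldn't open $!"; while(<$sort>) { (s{.*http[s]?://(.*)/index\.php.*}{$1}); open (save, "1.txt") or die "$!" ; print save "$1\n"; close (save); } close($sort); but it doesn't work.
When you open save, you open it for reading. To open it for writing you need to use '>', or to append (which is what you want here) use '>>'. Like this: open ($save, '>>', '1.txt'). You should also move the open outside the loop.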
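Putting that reply together, a minimal sketch might look like the following. The helper name extract_to_file is hypothetical, and it uses a plain match instead of the substitution so that $1 is only read after a successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: append the part between the scheme and /index.php
# of every matching line in $in_path to $out_path.
sub extract_to_file {
    my ($in_path, $out_path) = @_;

    open(my $in_fh, '<', $in_path) or die "Couldn't open $in_path: $!";
    # '>>' appends; use '>' to overwrite. Opened once, before the loop.
    open(my $out_fh, '>>', $out_path) or die "Couldn't open $out_path: $!";

    while (my $line = <$in_fh>) {
        # Print only when the match succeeded, so $1 is never stale.
        if ($line =~ m{https?://(.*)/index\.php}) {
            print {$out_fh} "$1\n";
        }
    }

    close($out_fh) or die "Couldn't close $out_path: $!";
    close($in_fh);
    return;
}

# Runs only when an input and an output file name are given on the command line.
extract_to_file(@ARGV) if @ARGV == 2;
```

Saved under an assumed name such as extract.pl, it could be run with the file names from the comment: perl extract.pl tt8-4.txt 1.txt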

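【Answer 2】:

This looks like a job for a module.

Parsing HTML with regular expressions is generally risky.

【Comments】:

【Answer 3】:

dms explained in his answer why using split isn't the best solution here:

It returns the number of items in scalar context.
A plain regex is better suited to this task.

However, I don't think line-based input processing works well for HTML, nor that using a substitution makes sense (it doesn't, especially when the pattern looks like .*Pattern.*).

Given a URL, we can extract the required information like this: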

if ($url =~ m{^https?://(.+?)/index\.php}s) {  # domain+path now in $1
  say $1;
}

But how do we extract the URLs? I'd recommend the wonderful Mojolicious suite.

use strict; use warnings;
use feature 'say';
use File::Slurp 'slurp';  # makes it easy to read files.
use Mojo::DOM;  # load the DOM parser explicitly

my $html_file = shift @ARGV;  # take file name from command line

my $dom = Mojo::DOM->new(scalar slurp $html_file);

for my $link ($dom->find('a[href]')->each) {
  say $1 if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
}

The find method takes a CSS selector (here: all a elements that have an href attribute). each flattens the result set into a list we can loop over.
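Because the selector is ordinary CSS, it can be narrowed to the links the question actually cares about, namely those directly inside an h3. The following is a sketch only: the helper name h3_link_targets and the sample markup are made up for illustration, and it assumes Mojolicious is installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the domain+path of every link that is a
# direct child of an <h3> element.
sub h3_link_targets {
    my ($html) = @_;
    my @targets;
    for my $link (Mojo::DOM->new($html)->find('h3 > a[href]')->each) {
        push @targets, $1
            if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
    }
    return @targets;
}

# Made-up sample markup with the same shape as the links in the question.
my $html = <<'HTML';
<h3><a href="http://site.com/path/index.php">one</a></h3>
<p><a href="http://other.example/index.php">not in a heading</a></p>
<h3><a href="https://www.site.org/index.php?option=com_content">two</a></h3>
HTML

print "$_\n" for h3_link_targets($html);
```

The link inside the p element is skipped because it does not match the h3 > a[href] selector.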

Since I print to STDOUT, we can use shell redirection to put the output into the desired file, e.g.

$ perl the-script.pl html-with-links.html >only-links.txt

The whole script as a one-liner:

$ perl -Mojo -E'$_->attr("href") =~ m{^https?://(.+?)/index\.php}s and say $1 for x(b("test.html")->slurp)->find("a[href]")->each'

【Comments】: