HTML content regular expression - perl answers

【Question title】: HTML content regular expression - perl
【Posted】: 2014-08-17 03:07:42
【Question description】:

My HTML content looks like this:

html code ... </div>content1</div> html code ... 
html code ... </div>content2</div> html code ...

I want to extract content1, content2, content3, ... from the HTML and print them one per line (content1, newline, content2, newline, content3). Any ideas? Thanks in advance.

【Question discussion】:

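If you had researched your problem at all, you would have found dozens, if not hundreds, of posts telling you not to parse HTML with regular expressions. There are several very good Perl modules that will solve the problem for you, whereas a regex solution will most likely break sooner or later.
Thanks for the reminder.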

Tags: regex perl

【Solution 1】:

Here is an example using Mojo::DOM, inspired by this StackOverflow answer:

#!/usr/bin/env perl

use strict ;
use warnings ;

use Mojo::DOM ;

my $html = <<EOHTML;
<!DOCTYPE html>
<html>
<head>
<title>Sample HTML with 2 divs</title>
</head>
<body>
     <div>
        Four score and seven years ago our fathers brought forth on this
        continent a new nation, conceived in liberty, and dedicated to the
        proposition that all men are created equal.
     </div>
     <div>
        Lorem ipsum dolor sit amet, consectetur adipisicing elit,
        sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
     </div>
</body>
</html>
EOHTML

my $dom = Mojo::DOM->new ;

$dom->parse( $html ) ;

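# Print the text of every <div>, one per line.
for my $div ( $dom->find( 'div' )->each ) {

    # Depending on the Mojolicious version, all_text may or may not collapse
    # whitespace, so normalize it explicitly to get one line per <div>.
    ( my $text = $div->all_text ) =~ s/\s+/ /g ;
    $text =~ s/^ | $//g ;

    print $text . "\n" ;

}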

The output is:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
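
For comparison, here is a minimal sketch of why the regex approach warned about in the question discussion tends to break. The pattern and the nested sample markup are assumptions made up for illustration; they are not taken from the question. A non-greedy match stops at the first closing tag, so a nested div cuts the captured content short:

```perl
#!/usr/bin/env perl

use strict ;
use warnings ;

# Hypothetical input: one <div> nested inside another.
my $html = '<div>outer <div>inner</div> tail</div>' ;

# Naive pattern: capture everything between <div> and the next </div>.
my @matches = $html =~ m{<div>(.*?)</div>}gs ;

# The outer div's content is truncated at the inner closing tag.
print "$_\n" for @matches ;   # prints: outer <div>inner
```

A DOM parser such as Mojo::DOM tracks the nesting, which is why it copes with input like this where the pattern above does not.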

【Discussion】: