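How to parse HTML in PHP? Answers

【Question title】: How to parse HTML in PHP?
【Posted】: 2013-08-23 08:02:51
【Question description】:

I know we can use PHP DOM to parse HTML in PHP, and I have found plenty of questions about it on Stack Overflow. But I have a specific requirement. I have HTML content like the following: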

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

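I want to parse the HTML above and save the contents into two separate arrays:

$heading and $content

$heading = array('Chapter 1','Chapter 2','Chapter 3');
$content = array('This is chapter 1','This is chapter 2','This is chapter 3');

I could achieve this easily with jQuery, but I am not sure whether that is the right approach. It would be great if someone could point me in the right direction. Thanks in advance.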

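【Question discussion】:

Use jQuery; the structure is simple.
@Susheel: The HTML content will be larger, because it is the output of parsing a docx file.
If you don't like PHP DOM, you could use regular expressions.
@LorenzMeyer Do not use regular expressions to parse HTML.
@blessed For a larger DOM, use PHP Simple HTML DOM Parser.

Tags: php html parsing dom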

【Solution 1】:

Here is another way to parse the HTML, using DiDOM, which offers significantly better performance in terms of speed and memory footprint.

composer require imangazaliev/didom

<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$document = new Document($html);

// find chapter headings
$elements = $document->find('.Heading1-H');

$headings = [];

foreach ($elements as $element) {
    $headings[] = $element->text();
}

// find chapter texts
$elements = $document->find('.Normal-H');

$chapters = [];

foreach ($elements as $element) {
    $chapters[] = $element->text();
}

echo("Headings\n");

foreach ($headings as $heading) {
    echo("- {$heading}\n");
}

echo("Chapter texts\n");

foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}

【Discussion】:

【Solution 2】:

Try PHP Simple HTML DOM Parser.

It has a nice jQuery-like syntax, so you can easily select any element you need by ID or class.

// Load the Simple HTML DOM parser (adjust the path to wherever simple_html_dom.php lives)
require_once 'simple_html_dom.php';

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialise the result arrays before appending to them
$heading = [];
$content = [];

foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}

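【Discussion】:

!! Warning !! Not using "->innertext" will cause a memory leak.
This is a simpler option than using DOMDocument, and the resulting code is more readable.
Is there an option to install it with Composer?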
Installing with Composer is now possible: composer require simplehtmldom/simplehtmldom dev-master and use simplehtmldom\HtmlWeb;

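【Solution 3】:

One option is to use DOMDocument and DOMXPath. They do have a bit of a learning curve, but once you are past it you will be very happy with what you can achieve.

Read the following pages on php.net: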

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php
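
As a rough idea of how these two classes fit together, here is a minimal sketch that assumes the markup from the question. The helper name textsByClass is made up for this illustration, and it assumes the class name passed in contains no quotes or spaces.

```php
<?php
// Hypothetical helper: collect the trimmed text of every <span> carrying the given class.
function textsByClass(DOMXPath $xpath, string $class): array
{
    // Match the class as a whole token, so it also works when an element has several classes.
    $query = sprintf(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$html = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>'
      . '<p class="Heading1-P"><span class="Heading1-H">Chapter 2</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 2</span></p>'
      . '<p class="Heading1-P"><span class="Heading1-H">Chapter 3</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 3</span></p>';

// Collect parser warnings internally instead of printing them, then restore the old setting.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$heading = textsByClass($xpath, 'Heading1-H');
$content = textsByClass($xpath, 'Normal-H');

print_r($heading);
print_r($content);
```

Matching the class as a space-delimited token is a design choice rather than a requirement; a plain @class='Heading1-H' comparison is enough when every element has exactly one class.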

Hope this helps.

【Discussion】:

This has problems with broken HTML.
Do. Not. Use. PHP's. DOM. This answer is old. PHP's DOM does not cope with 2020+ HTML.

【Solution 4】:

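// Create a DOM from a URL or file
// (file_get_html() is provided by PHP Simple HTML DOM Parser, not by PHP itself)

$html = file_get_html('http://www.google.com/');

// Find all images

foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links

foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}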
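
For comparison, the same kind of extraction can be sketched with only PHP's built-in DOMDocument, without an extra library. The function name extractAttributes and the inline sample markup are assumptions for illustration; for a real page the string could come from file_get_contents().

```php
<?php
// Hypothetical helper: return the given attribute of every element with the given tag name.
function extractAttributes(string $html, string $tag, string $attribute): array
{
    // Collect warnings about imperfect markup internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($dom->getElementsByTagName($tag) as $element) {
        // Skip elements that lack the attribute, such as an <a> without href.
        if ($element->hasAttribute($attribute)) {
            $values[] = $element->getAttribute($attribute);
        }
    }
    return $values;
}

// Sample markup standing in for a downloaded page.
$html = '<body><img src="logo.png"><a href="/about">About</a>'
      . '<a href="/contact">Contact</a><a>no link</a></body>';

$images = extractAttributes($html, 'img', 'src');
$links  = extractAttributes($html, 'a', 'href');

print_r($images);
print_r($links);
```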

【Discussion】:

file_get_html?? Is that a PHP function?
file_get_content is the correct one. He copied this from the PHP Simple HTML DOM website.

【Solution 5】:

I used DOMDocument and DOMXPath to get a solution; you can find it below:

<?php
$dom = new DomDocument();
$test='<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>';

$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading = parseToArray($xpath, 'Heading1-H');
$content = parseToArray($xpath, 'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

// Return the text of every <span> whose class attribute equals $class
function parseToArray($xpath, $class)
{
    $xpathquery = "//span[@class='" . $class . "']";
    $elements = $xpath->query($xpathquery);

    $resultarray = array();
    // DOMXPath::query() returns false (not null) when the expression is invalid
    if ($elements === false) {
        return $resultarray;
    }

    foreach ($elements as $element) {
        foreach ($element->childNodes as $node) {
            $resultarray[] = $node->nodeValue;
        }
    }
    return $resultarray;
}

Live result: http://saji89.codepad.org/2TyOAibZ

【Discussion】:

I found this link very useful for learning the XPath query syntax: w3schools.com/xml/xpath_syntax.asp
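
To make a few pieces of that syntax concrete, here is a small sketch applied to the markup from the question. It pairs each heading with the paragraph that follows it, which goes slightly beyond the two separate arrays the question asks for; the variable names are illustrative only.

```php
<?php
$html = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>'
      . '<p class="Heading1-P"><span class="Heading1-H">Chapter 2</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 2</span></p>'
      . '<p class="Heading1-P"><span class="Heading1-H">Chapter 3</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 3</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$chapters = [];

// "//p[@class=...]" selects every matching <p> anywhere in the document.
foreach ($xpath->query('//p[@class="Heading1-P"]') as $headingP) {
    // A relative path is evaluated against the context node passed as second argument;
    // string(...) returns the text value of the first node that matches.
    $title = trim($xpath->evaluate('string(span[@class="Heading1-H"])', $headingP));

    // following-sibling::p[1] is the nearest <p> after the heading;
    // the extra predicate keeps it only if it is a body paragraph.
    $body = trim($xpath->evaluate(
        'string(following-sibling::p[1][@class="Normal-P"]/span[@class="Normal-H"])',
        $headingP
    ));

    $chapters[$title] = $body;
}

print_r($chapters);
```

If the two flat arrays are still wanted, array_keys($chapters) and array_values($chapters) give the headings and the contents respectively.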