Parse HTML in PHP and extract value [duplicate] (answers)

【Question title】: Parse HTML in PHP and extract value [duplicate]
【Posted】: 2016-12-12 02:07:53
【Question description】:

I'm trying to extract some information from a website.

There is a section that looks like this:

<th>Some text here</th><td>text to extract</td>

I want to find (using a regex or some other solution) the part that starts with some text here and extract text to extract from it.

I tried the following regex solution:

$reg = '/<th>Some text here<\/th><td>(.*)<\/td>/'; 
preg_match_all($reg, $content, $result, PREG_PATTERN_ORDER);

print_r($result);

But it only gives me an empty array:

Array ( [0] => Array ( ) [1] => Array ( ) )

How should I construct my regex to extract the wanted value? Or what other solution could I use to extract it?

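【Question discussion】:

This works fine... I can't reproduce your problem...
Can confirm, @Bob0t, it works fine. At least the regex is correct.
@mmm: That explanation has nothing to do with modern regex engines (in particular the one PHP uses); it is about "regular expressions" in the computer science sense. In short, the current question is not a duplicate of that one, because it is about something different (the explanation is wrong if you try to apply it to the regex engines used in PHP, Perl, Ruby, .net...).
@CasimiretHippolyte You still shouldn't use regex to parse HTML. PHP has its own DOM parser.
Well, as I said in the question, I don't insist on a regex solution. I just need to extract the value, whether I use regex, a DOM crawler, etc.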

Tags: php regex

【Solution 1】:

Using XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

$content = $xp->evaluate('string(//th[.="Some text here"]/following-sibling::*[1][name()="td"])');

echo $content;

XPath query details:

string(  # return a string instead of a node list
    //   # anywhere in the DOM tree
    th   # a th node
    [.="Some text here"] # predicate: its content is "Some text here"
    /following-sibling::*[1] # first following sibling
    [name()="td"] # predicate: must be a td node
)
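
To see the query at work end to end, here is a minimal, self-contained sketch. The sample markup and the helper name `extractCellAfterHeader` are assumptions made for illustration and are not part of the original answer. As a small variation, it compares `normalize-space(.)` instead of `.`, so stray whitespace around the header text does not break the match.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of the
// <td> that immediately follows the <th> whose text equals $label.
// Assumes $label contains no double quote, since it is embedded in the XPath string.
function extractCellAfterHeader(string $html, string $label): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // normalize-space() trims and collapses whitespace before comparing
    $query = 'string(//th[normalize-space(.)="' . $label . '"]'
           . '/following-sibling::*[1][name()="td"])';

    return trim($xp->evaluate($query));
}

// Made-up sample markup with a line break inside the <td>
$html = "<table><tr><th> Some text here </th><td>text to\nextract</td></tr></table>";

echo extractCellAfterHeader($html, 'Some text here');
```

When no matching `th` exists, `string()` is applied to an empty node set, so the helper returns an empty string rather than failing.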

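The reason your pattern doesn't work is probably that the td content contains a newline character, which the dot (.) does not match by default.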
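
If a regex is still wanted, the newline issue can be illustrated with a short sketch. The sample `$content` below is made up, and the approach assumes the markup is exactly as regular as in the question. The `s` modifier lets `.` match newlines, and the lazy `.*?` stops at the first `</td>` instead of the last one.

```php
<?php
// Made-up sample input: the <td> content spans two lines
$content = "<th>Some text here</th><td>text to\nextract</td><td>other</td>";

// Pattern from the question: "." does not match "\n", so nothing is found
$without = preg_match_all('/<th>Some text here<\/th><td>(.*)<\/td>/', $content, $r1);

// Same pattern with the "s" (PCRE_DOTALL) modifier and a lazy quantifier
$with = preg_match_all('/<th>Some text here<\/th><td>(.*?)<\/td>/s', $content, $r2);

var_dump($without);   // number of matches found by the question's pattern
print_r($r2[1]);      // captured cell contents
```

This only patches the pattern for this one shape of input; attributes on the tags or whitespace between `</th>` and `<td>` would still defeat it, which is why the DOM-based approach is the more robust choice.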

【Discussion】:

Great solution, thanks!

【Solution 2】:

You can use DOMDocument for this.

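$domd = new DOMDocument();
// loadHTML() must be called on an instance (a static call throws an Error in PHP 8);
// @ suppresses warnings about malformed markup.
@$domd->loadHTML($content);
$extractedText = null;
foreach ($domd->getElementsByTagName("th") as $ele) {
    if ($ele->textContent !== 'Some text here') {
        continue;
    }
    // nextSibling is null when the <th> is the last node in its parent
    if ($ele->nextSibling !== null) {
        $extractedText = $ele->nextSibling->textContent;
    }
    break;
}
if ($extractedText === null) {
    // extraction failed
} else {
    // extracted text is in $extractedText
}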

(Regex is generally a bad tool for parsing HTML, as someone already pointed out in the comments.)
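
One caveat with `nextSibling`: it returns the next node of any type, so if whitespace or a line break between `</th>` and `<td>` is kept as a text node, it will not be the `<td>`. Below is a minimal sketch that skips forward to the next element node instead. The helper name `nextElementSiblingOf` and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: walks forward from $node until an element node is found.
function nextElementSiblingOf(DOMNode $node): ?DOMElement
{
    for ($sib = $node->nextSibling; $sib !== null; $sib = $sib->nextSibling) {
        if ($sib instanceof DOMElement) {
            return $sib;
        }
    }
    return null;
}

// Made-up sample markup with a line break between </th> and <td>
$content = "<table><tr><th>Some text here</th>\n<td>text to extract</td></tr></table>";

$domd = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$domd->loadHTML($content);
libxml_clear_errors();

$extractedText = null;
foreach ($domd->getElementsByTagName('th') as $ele) {
    if (trim($ele->textContent) !== 'Some text here') {
        continue;
    }
    // Only accept the sibling if it really is a <td>
    $td = nextElementSiblingOf($ele);
    if ($td !== null && $td->tagName === 'td') {
        $extractedText = $td->textContent;
    }
    break;
}

var_dump($extractedText);
```

Checking the tag name mirrors the `[name()="td"]` predicate in the XPath answer, so a header followed by some other element leaves `$extractedText` as `null`.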

【Discussion】: