How to get the string after a certain HTML DOM element

【Question title】: How to get the string after a certain HTML DOM element
【Posted】: 2018-08-10 10:30:59
【Question description】:

Here is the HTML:

<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br> 
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>

The output I want is:

+88 01756567676

frank.wade@email.com

NAC739

I am using simple_html_dom to parse the data.

Here is the code I wrote. It works if the contact information is wrapped in a paragraph tag (<p></p>).

$contact = $facultyData->find('strong[plaintext^=Phone]');
$contact = $contact[0]->parent();
$element = explode("\n", strip_tags($contact->plaintext));

$regex = '/Phone:(.*)/';
if (preg_match($regex, $element[0], $match)) 
    $phone = $match[1];

$regex = '/Email:(.*)/';
if (preg_match($regex, $element[1], $match)) 
    $email = $match[1];

$regex = '/Office:(.*)/';
if (preg_match($regex, $element[2], $match)) 
    $office = $match[1];

Is there a way to get these 3 lines by matching on the tags?

【Question comments】:

You may want to use DOMDocument instead.

Tags: php dom web-crawler simple-html-dom

【Solution 1】:

Maybe you could use an XPath query like this:

$xml = new SimpleXMLElement($DomAsString);
$theText = $xml->xpath('//strong[. ="Phone"]/following-sibling::text()');

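You will still need some code to strip the ':' and, of course, to fix the DOM structure first, since SimpleXMLElement only accepts well-formed XML.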
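
As a sketch of the same XPath idea without repairing the markup by hand, PHP's built-in DOMDocument can load lenient HTML and DOMXPath can then pick the text node that follows each label. The variable names ($html, $fields) and the wrapping <table><tr> are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Assumption: $html holds the fragment from the question.
$html = <<<'HTML'
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
HTML;

// Collect parser warnings instead of printing them; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Wrap the lone <td> in a table row so the parser sees it in a valid context.
$doc->loadHTML('<table><tr>' . $html . '</tr></table>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$fields = [];
foreach (['Phone', 'Email', 'Office'] as $label) {
    // First text node that directly follows <strong>LABEL</strong>.
    $query = '//strong[normalize-space(.)="' . $label . '"]/following-sibling::text()[1]';
    $nodes = $xpath->query($query);
    if ($nodes !== false && $nodes->length > 0) {
        // Drop the leading ": " and any surrounding whitespace.
        $fields[$label] = trim($nodes->item(0)->nodeValue, ": \t\r\n");
    }
}

print_r($fields);
```

With the fragment from the question, $fields should end up holding the phone number, e-mail address and office keyed by label. A label that is missing from the page is simply left out of the array.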

【Comments】:

【Solution 2】:

Or simply use a regular expression:

// The parentheses capture the text between "Phone</strong>: " and the next tag into $m[1]
preg_match('|Phone</strong>: ([^<]+)|', $str, $m) or die('no phone');
$phone = $m[1];
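
The same pattern can be extended to all three fields in one pass. The following is a minimal sketch, assuming the labels are always wrapped in <strong> and followed by a colon as in the question. The names $str and $contact and the shortened input are illustrative only.

```php
<?php
// Assumption: $str holds the raw HTML; only the relevant lines are reproduced here.
$str = '<strong>Phone</strong>: +88 01756567676<br>' . "\n"
     . '<strong>Email</strong>: frank.wade@email.com<br>' . "\n"
     . '<strong>Office</strong>: NAC739<br>';

$contact = [];
// Group 1 is the label, group 2 is everything up to the next tag.
$pattern = '~<strong>(Phone|Email|Office)</strong>:\s*([^<]+)~';
if (preg_match_all($pattern, $str, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $contact[$m[1]] = trim($m[2]);
    }
}

print_r($contact);
```

Keying the result by label avoids relying on the order of the lines, which the positional $element[0], $element[1], $element[2] approach in the question depends on.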

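【Comments】:

【Solution 3】:

You don't really need to parse this as HTML or walk a DOM tree. You can split the HTML string into pieces and then strip the unwanted parts from each piece to get what you want: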

<?php 

$str = <<<str
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
str;

// We explode $str and use '</strong>' as delimiter and get only the part of result that we need
$lines = array_slice(explode('</strong>', $str), 3, 3);
// Define a function to remove extra text from left and right of our so called lines
function stripLine($line) {
    // ltrim the leading ' ' and ':' characters, then remove everything from the first '<br>' onward
    return preg_replace('/<br>.*/is', '', ltrim($line, ' :'));
}
$lines = array_map('stripLine', $lines);

print_r($lines);

See the code output here.

【Comments】: