PHP XPath to extract tab content from Joomla / HikaShop descriptions

【Question title】: php xpath to extract tab content from Joomla / hikashop descriptions
【Posted】: 2021-10-18 09:31:15
【Question description】:

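[TL;DR] I need to parse HTML with PHP to extract tabs and their content.

I am migrating data from a Joomla / HikaShop site that was exported as CSV files. The tabs are defined by content inside P tags, like this: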

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract each tab name followed by its content.

I can pull out the tab markers easily:

$crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {

But getting the content between the tabs is what has me stuck.

Description =

<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features =

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could use a regex and loop through the lines, but that is very error-prone.

Thanks

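【Comments】:

This is a sample product description; I want to populate a MySQL database with the field name and the content of each tab.
I really don't see what is confusing: the content to return is the HTML between the tabs, and the tab name is tab=XXX.
Are the "tab" markers always at the top level of the document? Or could they be nested at lower levels?
I think you will need to add some extra logic, but this looks like the way forward: stackoverflow.com/q/23860883/2943403 and stackoverflow.com/q/10859703/2943403
Thanks, one of the links was helpful and almost does what I need. The last element is a problem, but if I manipulate the HTML before passing it in, it should work fine. I will write some code later and see how it copes with real-world data.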

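Tags: php symfony xpath joomla text-extraction

【Solution 1】:

Thanks to mickmackusa for the links, which helped put the puzzle together.

Using the links, I was able to get the content between each tab opening marker: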

<p>{tab=newtab}</p>

My process was to clean the HTML with tidy, then load it into a new DOMDocument.

use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');
$config = array(
    'indent'         => true,
    'output-xhtml'   => true,
    'show-body-only' => true,
    'drop-empty-paras' => true,
    'wrap'           => 1200
);

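// Clean up the exported HTML with tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Load the repaired markup into a DOMDocument and hand it to the crawler
$doc = new DOMDocument;
$doc->loadHTML($tidy->value);

$crawler = new Crawler($doc);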

The closing marker for the tabs is

<p>{/tabs}</p>

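This does not match the "tab=" pattern my code looks for, so it needs some extra handling. As this is a one-off project, I went for a quick fix.

I crawl the page and add a new paragraph element just before the closing tabs marker. The code looks for /tabs in a paragraph and then effectively adds a new tab section with no content.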

$crawler
    ->filterXpath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $crawler) use ($doc) {
        foreach ($crawler as $node) {
            $span = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($span, $node);
        }
    });

This generates the HTML

<p>{tab=end}</p>
<p>{/tabs}</p>
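
For anyone without the Symfony crawler, here is a minimal sketch of the same sentinel step using only PHP's DOM extension. The sample $html string and the wrapping <div id="root"> are assumptions made for illustration; they are not part of the original export.

```php
<?php
// Sketch: add a "{tab=end}" sentinel paragraph before the "{/tabs}" paragraph.
// $html is made-up sample input, not real export data.
$html = '<p>{tab=Features}</p><ul><li>I like it</li></ul><p>{/tabs}</p>';

$doc = new DOMDocument();
// Wrapping the fragment in a <div> gives all markers one known parent element.
$doc->loadHTML('<div id="root">' . $html . '</div>');

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//p[contains(., "{/tabs}")]') as $closing) {
    $sentinel = $doc->createElement('p', '{tab=end}');
    $closing->parentNode->insertBefore($sentinel, $closing);
}

// Print the edited fragment so the new paragraph can be inspected.
$root = $xpath->query('//div[@id="root"]')->item(0);
echo $doc->saveHTML($root), PHP_EOL;
```

The sentinel means every real tab, including the last one, is followed by another "tab=" paragraph, so the same "between two markers" query works for all of them.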

Now I use the edited HTML provided by $crawler->html() and look for each tab section (starting with <p>{tab=TABNAME}</p> and ending with <p>{tab=NEXTTABNAME}</p>).

I get the headings first:

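$tab_headings = $crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    // Capture the name inside {tab=NAME}; [^}]* stops at the first closing brace
    $pattern = '/\{tab=([^}]*)\}/';

    if (preg_match($pattern, $node->text(), $matches)) {
        return $matches[1];
    }

    // The XPath matched "tab=" but the paragraph is not a well-formed marker
    return null;
});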

I remove the last one (the dummy one I added):

array_pop($tab_headings);
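
The heading step can be tried in isolation on plain strings. This is a small sketch under assumptions: tabName() is a hypothetical helper name, and the $paragraphs array stands in for the text of each paragraph the crawler would return.

```php
<?php
// Hypothetical helper: returns the tab name found in a paragraph's text, or null.
function tabName(string $text): ?string
{
    // [^}]* stops at the first closing brace, so surrounding text is ignored.
    if (preg_match('/\{tab=([^}]*)\}/', $text, $matches)) {
        return trim($matches[1]);
    }
    return null;
}

// Made-up paragraph texts, including stray whitespace and a non-marker line.
$paragraphs = [
    ' {tab=Description}',
    'This is a default description',
    '{tab=Features} ',
    '{tab=end}',
    '{/tabs}',
];

// Keep only paragraphs that are tab markers, then reindex from zero.
$headings = array_values(array_filter(
    array_map('tabName', $paragraphs),
    fn ($name) => $name !== null
));

array_pop($headings); // drop the "end" sentinel added before {/tabs}
print_r($headings);
```

Returning null for non-matching text keeps stray paragraphs that merely contain "tab=" from producing undefined headings.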

I can now loop over the headings and extract the HTML. I am using Laravel, hence the dump() calls.

$tab_count = 0;
foreach ($tab_headings as $tab) {
    dump($tab_headings[$tab_count]);
    $first = $tab_count + 1;
    $next = $tab_count + 2;
    /**
     * Get content between tabs
     */
    $tab_content = $crawler
        ->filterXpath('//p[text()[contains(.,"tab=")]][' . $first . ']/following-sibling::*
        [
        count(.|//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        =
        count(//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    $tab_count++;

    dump($tab_content);
}
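
If all the markers are siblings at the same level, a single pass over those siblings is a possible alternative that needs neither the sentinel paragraph nor the count() intersection. This is only a sketch under that assumption, using the DOM extension directly; the sample markup and the <div id="root"> wrapper are made up for illustration.

```php
<?php
// Sketch: group sibling elements under the most recent {tab=...} marker.
$html = <<<'HTML'
<p>{tab=Description}</p>
<p>This is a default description</p>
<ul><li>It has</li><li>mixed content</li></ul>
<p>{tab=Features}</p>
<ul><li>I like it</li></ul>
<p>It does what I want</p>
<p>{/tabs}</p>
HTML;

$doc = new DOMDocument();
$doc->loadHTML('<div id="root">' . $html . '</div>');
$root = (new DOMXPath($doc))->query('//div[@id="root"]')->item(0);

$tabs = [];      // tab name => concatenated HTML of its content
$current = null; // name of the tab currently being filled

foreach ($root->childNodes as $node) {
    if (!$node instanceof DOMElement) {
        continue; // skip whitespace text nodes between elements
    }
    $text = trim($node->textContent);

    if ($node->nodeName === 'p' && preg_match('/\{tab=([^}]*)\}/', $text, $m)) {
        $current = trim($m[1]); // a new tab starts here
        $tabs[$current] = '';
        continue;
    }
    if ($node->nodeName === 'p' && strpos($text, '{/tabs}') !== false) {
        $current = null; // closing marker: stop collecting
        continue;
    }
    if ($current !== null) {
        $tabs[$current] .= $doc->saveHTML($node);
    }
}

print_r($tabs);
```

The resulting array maps each tab name to its HTML, which is the shape needed for a database insert. It would not find markers nested inside other elements.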

From here I insert the results into the database, and so on.

The links that helped most:

XPath select all elements between two specific elements

XPath: how to select following siblings until a certain sibling

【Comments】: