Remove starting and ending spaces from XML elements

【Question title】: Remove starting and ending spaces from XML elements
【Posted】: 2011-11-12 08:56:07
【Question】:

How can I remove all whitespace characters before and after the values of XML fields?

<data version="2.0">

  <field> 

     1 

  </field>        

  <field something=" some attribute here... "> 

     2  

  </field>

</data>

Note the spacing around 1 and 2 and around " some attribute here... ". I want to remove it with PHP.

if(($xml = simplexml_load_file($file)) === false) die();

print_r($xml);

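Also, the data does not seem to be strings: I need to prepend (string) to every variable. Why?

【Question comments】:

See my answer at stackoverflow.com/questions/8200582/… for a possible solution.

Tags: php xml string simplexml

【Solution 1】:

Since simplexml_load_file() reads the data into an array, you can do this: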

function TrimArray($input){

    if (!is_array($input))
        return trim($input);

    return array_map('TrimArray', $input);
}

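【Comments】:

No, it does not read the data into an array; it creates a SimpleXMLElement from it. That object can be cast to a string (which is what happens when you call trim on it).

【Solution 2】:

You might want to use something like this: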

$str = file_get_contents($file);
$str = preg_replace('~\s*(<([^>]*)>[^<]*</\2>|<[^>]*>)\s*~','$1',$str);
$xml = simplexml_load_string($str, 'SimpleXMLElement', LIBXML_NOCDATA); // parse the cleaned string, not the undefined $xml

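I have not tried it, but you can find more information at http://www.lonhosford.com/lonblog/2011/01/07/php-simplexml-load-xml-file-preserve-cdata-remove-whitespace-between-nodes-and-return-json/.

Note that whitespace between an opening and a closing tag (<x> _space_ </x>) and inside attribute values (<x attr=" _space_ ">) is actually part of the XML document's data (as opposed to the whitespace between <x> _space_ <y>), so I suggest that the source you are using should be less sloppy with whitespace.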
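
To illustrate that distinction, here is a minimal sketch using DOMDocument with the LIBXML_NOBLANKS parser option. The sample document and the variable names are assumptions made for illustration, not part of the question. The parser may drop whitespace-only text between elements, but it keeps the padding inside an element or an attribute value, because that padding is data.

```php
<?php
// Hypothetical sample: indentation between elements, padding inside <x> and attribute a.
$source = "<data>\n  <x> 1 </x>\n  <y a=\" b \"/>\n</data>";

$doc = new DOMDocument();
// LIBXML_NOBLANKS asks libxml to discard whitespace-only text nodes between elements.
$doc->loadXML($source, LIBXML_NOBLANKS);

// Only <x> and <y> should remain as children of <data>.
$childCount = $doc->documentElement->childNodes->length;
// The padding inside the element and the attribute is still there.
$xValue = $doc->getElementsByTagName('x')->item(0)->nodeValue;
$aValue = $doc->getElementsByTagName('y')->item(0)->getAttribute('a');

var_dump($childCount, $xValue, $aValue);
```

Removing the remaining padding therefore changes the document's data, which is why it has to be done explicitly.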

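【Comments】:

【Solution 3】:

To do this in PHP, you first have to convert the document into a DOMDocument so that you can properly address, via DOMXPath, the nodes in which you want to normalize whitespace. (The xpath in) SimpleXMLElement is too limited to access text nodes as precisely as this operation requires.

An XPath query that reaches all text nodes inside leaf elements, plus all attributes, is: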

//*[not(*)]/text() | //@*
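
As a rough check of what this query selects, the following sketch runs it against a small document and records the class and raw value of each match. The sample XML and the $selected array are assumptions for illustration only. Text directly inside <group> is not matched, because <group> has a child element and is therefore not a leaf.

```php
<?php
// Hypothetical sample document for illustration only.
$source = '<data version="2.0"><field> 1 </field>'
        . '<group> <field a=" x "> 2 </field> </group></data>';

$doc = new DOMDocument();
$doc->loadXML($source);
$xpath = new DOMXPath($doc);

$selected = [];
foreach ($xpath->query('//*[not(*)]/text() | //@*') as $node) {
    // Record the node class and its raw, untrimmed value.
    $selected[] = get_class($node) . ': [' . $node->nodeValue . ']';
}

print_r($selected);
```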

Given that $xml is a SimpleXMLElement, you can normalize the whitespace as in the following example:

$doc   = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr */
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

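You could perhaps extend this to all text nodes (as suggested in related Q&A), but that might require document normalization in specific cases. Because text() in XPath does not distinguish between text nodes and CDATA sections, you might want to skip nodes of that type (DOMCdataSection), or expand them into text nodes when loading the document (using the LIBXML_NOCDATA option for that), to get more useful results.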
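
The effect of LIBXML_NOCDATA can be sketched as follows. The sample element and the variable names are assumptions for illustration. The same source is loaded twice, and the last child of the root is a DOMCdataSection only when the option is not set.

```php
<?php
// Hypothetical sample mixing plain text and a CDATA section.
$source = '<field>2 <![CDATA[ 34 ]]></field>';

$plain = new DOMDocument();
$plain->loadXML($source);

$merged = new DOMDocument();
// LIBXML_NOCDATA makes the parser deliver CDATA content as ordinary text.
$merged->loadXML($source, LIBXML_NOCDATA);

// DOMCdataSection extends DOMText, so test for the subclass explicitly.
$plainHasCdata  = $plain->documentElement->lastChild instanceof DOMCdataSection;
$mergedHasCdata = $merged->documentElement->lastChild instanceof DOMCdataSection;
$mergedText     = $merged->documentElement->textContent;

var_dump($plainHasCdata, $mergedHasCdata, $mergedText);
```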

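Also, the data does not seem to be strings: I need to prepend (string) to every variable. Why?

Because it is an object of type SimpleXMLElement. If you want the string value of such an object (an element), you need to cast it to string. See also the following reference question: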

Forcing a SimpleXML Object to a string, regardless of context
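
A minimal sketch of that cast, using a hypothetical one-field document rather than the file from the question:

```php
<?php
// Hypothetical sample document, not the one from the question.
$xml = simplexml_load_string('<data><field>  1  </field></data>');

$field    = $xml->field;                        // still a SimpleXMLElement, not a string
$isObject = $field instanceof SimpleXMLElement;
$value    = trim((string) $field);              // cast to string first, then trim

var_dump($isObject, $value);
```

The explicit cast matters wherever PHP does not convert the object for you, for example in strict comparisons or when using the value as an array key.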

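Last but not least: do not trust print_r or var_dump when you use them on a SimpleXMLElement, because they do not show the truth. For example, you can override __toString(), which could solve your issue, too: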

class TrimXMLElement extends SimpleXMLElement
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }
}

$xml = simplexml_load_string($buffer, 'TrimXMLElement');

print_r($xml);

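Even though the cast to string is normally applied (e.g. with echo), the output of print_r still does not reflect these changes. So it is better not to rely on it; it can never show the whole picture.

Full example code for this answer (Online Demo):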

<?php
/**
 * Remove starting and ending spaces from XML elements
 *
 * @link https://stackoverflow.com/a/31793566/367456
 */

$buffer = <<<XML
<data version="2.0">

  <field>

     1

  </field>

  <field something=" some attribute here... ">

     2 <![CDATA[ 34 ]]>

  </field>

</data>
XML;

class TrimXMLElement extends SimpleXMLElement implements JsonSerializable
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }

    function jsonSerialize()
    {
        $array = (array) $this;

        array_walk_recursive($array, function(&$value) {
            if (is_string($value)) {
                $value  = trim(preg_replace('~\s+~u', ' ', $value), ' ');
            }
        });

        return $array;
    }
}

$xml = simplexml_load_string($buffer, 'TrimXMLElement', LIBXML_NOCDATA);

print_r($xml);
echo json_encode($xml);

$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA);

$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->normalizeDocument();
$doc->normalize();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr|DOMCdataSection */
    if ($node instanceof DOMCdataSection) {
        continue;
    }
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

echo $xml->asXML();

【Comments】: