【Question title】: Unlink images, remove unclosed p's and remove all styles
【Posted】: 2014-11-25 08:06:40
【Question description】:

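I have some problems with my WordPress posts, and I am trying to fix them with DOMDocument.

The first problem is that my images (<img>) are wrapped in <a> tags, and I want to remove the <a> tags.

I also want to remove every unclosed <p> tag (one with no matching </p>), and I want to remove the style attribute from all elements.

I could post some of the code I have already tried, but I don't think it would help, because I am getting nowhere. So far I have only tried to remove the links from the images, but nothing seems to work. I don't really understand how to work with DOMDocument child elements.

Here is an example of the HTML that needs fixing: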

<img width="750" height="500" src="http://fancycribs.com/wp-content/uploads/2013/05/Modern-Riverside-Apartment-–-A-Stylish-and-Elegant-Residence-6.jpg" class="attachment-large wp-post-image" alt="Modern Riverside Apartment – A Stylish and Elegant Residence (6)" />        <p>This modern seventh floor riverside apartment is placed in the luxurious and modern Montevetro Building, which is close to Battersea Square with access to Chelsea, Fulham and Kings Road by crossing Battersea Bridge, London. This residence has become one of the iconic buildings in the Battersea area.</p>
<p>It offers spectacular views over the serene tranquility of the river. This apartment offers comfort and luxury throughout its double reception room, three bedrooms, three bathrooms and large decked balcony. The design details are astonishing: mahogany wood floors, original hand painted walls, large floor to ceiling windows offering a spectacular view over the river. The apartment is spacious, the space between living room and dining room is fluid, having continuity. The hall is large and has a lot of storage spaces, having the quality to link rooms one to another. The kitchen space is large and has plenty of storage capacity. It is dressed up in mahogany wood, offering personality and contrast and access to the large balcony.</p>
<p>The master bedroom is a masterpiece of style and elegance, with nice and simple furniture, a bathroom and accompanied by two further double bedrooms, a family bathroom and a shower room. The residence overwhelms you through its luxury and the splendid view.</p>
<p style="text-align: center"><a href="http://fancycribs.com/37216-modern-riverside-apartment-a-stylish-and-elegant-residence.html/modern-riverside-apartment-a-stylish-and-elegant-residence-7" rel="attachment wp-att-39033" class="local-link"><img class="aligncenter size-medium wp-image-39033" alt="Modern Riverside Apartment – A Stylish and Elegant Residence" src="http://fancycribs.com/wp-content/uploads/2013/05/Modern-Riverside-Apartment-–-A-Stylish-and-Elegant-Residence-7-670x446.jpg" width="670" height="446" title="Modern Riverside Apartment – A Stylish and Elegant Residence" /></a></p>
<p style="text-align: center">

Later edit:

This is what I have tried. It does seem to unlink images, but only images number 1, 3, 5 and 7, while 2, 4 and 6 stay linked.

$html = new DOMDocument;
$html->preserveWhiteSpace = false;
$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$content);
foreach($html->getElementsByTagName('a') as $a) {
    if($a->hasChildNodes()) {
        $img = $a->getElementsByTagName('img')->item(0);
        $a->parentNode->replaceChild($img,$a);
    }
}
$text = $html->saveHTML();
echo $text;

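Thanks

【Comments】:

  • If your HTML is broken, use htmlpurifier to try to clean it up. PHP's DOM is EXTREMELY picky and will, at best, spit out or barf all over your HTML, or worse. Garbage in, garbage out.
  • I didn't know about that, thanks. I'll try it for the unclosed <p> tags. But I still have to unlink the images.

Tags: php html html-parsing domdocument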


【Solution 1】:

I managed to do it with DOMDocument and HTML Purifier.

Here is the code:

require_once 'library/HTMLPurifier.auto.php';

// Let HTML Purifier tidy the markup first, so DOMDocument receives well-formed HTML.
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.TidyLevel', 'heavy');
$config->set('AutoFormat.RemoveEmpty', true);
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', true);
$purifier = new HTMLPurifier($config);

$clean_html = $purifier->purify($content);

$html = new DOMDocument;
$html->preserveWhiteSpace = false;
// The meta tag makes DOMDocument read the fragment as UTF-8.
$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$clean_html);

// Walk the links backwards: the node list is live, so replacing a link
// shifts the indexes of the links that follow it.
$links = $html->getElementsByTagName('a');
for ($i = $links->length - 1; $i >= 0; --$i) {
    $a = $links->item($i);
    $img = $a->getElementsByTagName('img')->item(0);
    if ($img !== null) {
        // Replace the whole <a> element with the image it wraps.
        $a->parentNode->replaceChild($img, $a);
    }
}

// Drop the inline style from every paragraph.
foreach ($html->getElementsByTagName('p') as $p) {
    $p->removeAttribute('style');
}
$text = $html->saveHTML();
echo $text;
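
The backward loop matters because getElementsByTagName() returns a live node list: each time an <a> is replaced, the list shrinks, and a forward foreach skips the next link. That is consistent with only images 1, 3, 5 and 7 being unlinked in the question. As a rough sketch of an alternative (not the answerer's code), the list can be copied into a plain array before the tree is modified, and an XPath query can strip style from every element rather than only from <p>. The function name unlinkImagesAndStripStyles is made up for illustration, and the sketch assumes the input has already been cleaned up (for example by HTML Purifier) or is tolerable to libxml as-is.

```php
<?php
// Hypothetical helper, not part of the original answer.
function unlinkImagesAndStripStyles(string $content): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array so that replacing nodes
    // cannot disturb the iteration.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $a) {
        $img = $a->getElementsByTagName('img')->item(0);
        if ($img !== null) {
            // Swap the whole <a> element for the image it wraps.
            $a->parentNode->replaceChild($img, $a);
        }
    }

    // Remove the style attribute from every element that carries one.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@style]') as $element) {
        $element->removeAttribute('style');
    }

    return $doc->saveHTML();
}
```

Calling unlinkImagesAndStripStyles($clean_html) would return the full document that saveHTML() produces, including the html, head and body wrappers that loadHTML() adds. Links that contain no image are left untouched.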

【Comments】:

    【Solution 2】:

    Could you try running this code and see whether it does what you want? It finds <a ...><img ... and replaces it with just <img ...

    $p = "/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*<img.*)<\/a>/siU";
    // The fourth argument of preg_replace() is a replacement limit, not a flag, so it is left out here.
    $newHtml = preg_replace($p, '$3', $html);
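
To make the trade-off concrete, here is a small sketch that runs the same pattern on two hand-written inputs. The wrapper function stripImageLinks and both sample strings are invented for illustration. The first input is the simple case the pattern targets. The second puts a text link before an image link, a case where the lazy .* can run from the first <a> all the way to the later <img> and leave a stray </a> behind.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function stripImageLinks(string $html): string
{
    $pattern = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*<img.*)<\/a>/siU';
    // No limit argument: replace every match.
    return preg_replace($pattern, '$3', $html);
}

// Simple case: a link that wraps a single image.
$simple = stripImageLinks('<p><a href="x"><img src="a.jpg"></a></p>');

// Mixed case: a text link followed by an image link.
$mixed = stripImageLinks('<a href="1">text</a> <a href="2"><img src="b.jpg"></a>');

echo $simple, "\n", $mixed, "\n";
```

If the mixed case matters for your posts, a DOM-based approach avoids it, because each <a> element is inspected on its own.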
    

    【Comments】:

    • I don't want to use regular expressions on HTML.