Issues with links while trying to convert HTML to XML

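【Question title】: Issues with links while trying to convert HTML to XML
【Posted】: 2009-10-24 04:10:13
【Question】:

I am trying to convert an HTML file to XML. It works for the most part. The problem I am having is with links: right now it seems to ignore the link in my test file completely.

Here is the conversion code: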

<?php
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL);

function convertToXML()
{

    $titleLength = 35;
    $output = "";
    $date = date("D, j M Y G:i:s T");
    $fi = fopen( "../newsTEST.htm", "r" );
    $fo = fopen( "../newsfeed.xml", "w" );

    //This is the first parts of the XML
    $output .= "<?xml version=\"1.0\"?>\n";
    $output .= "<rss version=\"2.0\">\n";
    $output .= "<channel>\n";
    $output .= "\t<title>Wiggle 100 News</title>\n";
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n";
    $output .= "\t<description>Wiggle 100 Daily News</description>\n";
    $output .= "\t<language>en-us</language>\n";
    $output .= "\t<pubDate>". $date ."</pubDate>\n";
    $output .= "\t<managingEditor>wiggle100@gmail.com</managingEditor>\n";
    $output .= "\t<webMaster>josh@jacurren.com</webMaster>\n";

    $article = "";
    $skip = true; //if false will continue to put lines into output until </p>
    $newArticle = false;

    while( !feof($fi) )
    {
        $line = fgets($fi);
        $link = "";

        if( strpos( $line, "<p" ) !== false)
        {
            $pos = strpos( $line, "<p" );
            $line = substr( $line, $pos );

            $pos = strpos( $line, ">" );
            $line = substr( $line, $pos + 1 );

            $skip = false;          
        }

        if( strpos( $line, "</p>" ) !== false )
        {
            $pos = strpos( $line, "</p>" );
            $line = substr( $line, 0, $pos - 1 );

            $newArticle = true;
        }

        //This adds the line to the article
        if( !$skip )
        {
            $article .= $line;
        }

        //This mixes the article, title, link, and date with 
        // XML and puts it into the output
        if( $newArticle )
        {
            //This if is to get rid of stuff like <p>&nbsp;</p>
            if( (strlen($article) > 10) )
            {
                $link = findLink( $article );
                //$article = strip_tags($article);
                $title = substr( $article, 0, $titleLength ) . "...";

                $output .= "\t<item>\n";
                $output .= "\t\t<title>". $title ."</title>\n";
                $output .= "\t\t<link>". $link ."</link>\n";
                $output .= "\t\t<description>". $article . "</description>\n";
                $output .= "\t\t<pubDate>". $date . "</pubDate>\n";
                $output .= "\t</item>\n\n";
            }

            $article = "";
            $line = "";
            $skip = true;
        }
    }

    $output .= "</channel>\n";
    $output .= "</rss>\n";

    fwrite( $fo, $output );

    fclose($fi);
    fclose($fo);

    echo "<br /><br /> News converted to XML";
}

    //*****************************************************************************
    //*****************************************************************************

    //Find and return a link in the input.
    //Otherwise return the default
    function findLink( $input )
    {   
        $link = "http://www.wiggle100.com/news.php";

        if( strpos( $input, "<a" ) !== false )
        {
            $startpos = strpos( $input, "href" );
            $link = substr( $input, $startpos + 5 );
            $endpos = strpos( $link, ">" );
            $link = substr( $link, 0, $endpos - 2 );
        }
        return $link;
    }


?>

Here is the HTML test file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> 
<a href="http://www.thedailyreview.com/news/"> 
http://www.thedailyreview.com/news/</a></p> 
</body> 
</html>

Here is the XML output:

<rss version="2.0"> 
<channel> 
    <title>Wiggle 100 News</title> 
    <link>http://www.wiggle100.com/news.php</link> 
    <description>Wiggle 100 Daily News</description> 
    <language>en-us</language> 
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    <managingEditor>wiggle100@gmail.com</managingEditor> 
    <webMaster>josh@jacurren.com</webMaster> 
    <item> 
        <title>This is an article. Blah. Blah. Bla...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title>This is another article. Blah. Blah...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title>This is the 3rd article. Blah. Blah...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title><font size="6">This is the news for...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description><font size="6">This is the news for today. Blah Blah Blah!</font> 
</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

</channel> 
</rss>

When I uncomment the strip_tags() call, the font tags disappear.

【Comments】:

You could use an HTML parser in PHP instead of parsing the HTML as a string. onderstekop.nl/articles/114

Tags: php html xml hyperlink
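
As a rough illustration of the parser approach suggested in this comment, the sketch below uses PHP's built-in DOMDocument to pull each non-empty paragraph and its first link out of the markup. The function name extractArticles() and the $defaultLink parameter are hypothetical; the article linked in the comment may well use a different parser.

```php
<?php
// Sketch only: extractArticles() and $defaultLink are hypothetical names.
function extractArticles(string $html, string $defaultLink): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so old markup such as <font> stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $articles = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // Treat non-breaking spaces as blanks so <p>&nbsp;</p> is skipped.
        $text = trim(str_replace("\u{00A0}", ' ', $p->textContent));
        if ($text === '') {
            continue;
        }

        // Use the first <a href="..."> in the paragraph, else the default link.
        $link = $defaultLink;
        $anchors = $p->getElementsByTagName('a');
        if ($anchors->length > 0 && $anchors->item(0)->getAttribute('href') !== '') {
            $link = $anchors->item(0)->getAttribute('href');
        }

        // Collapse whitespace so a paragraph split over several lines reads as one.
        $articles[] = [
            'text' => preg_replace('/\s+/u', ' ', $text),
            'link' => $link,
        ];
    }
    return $articles;
}
```

Because the parser works on the whole document, it does not matter whether a paragraph and its link sit on one line or several. The text and link would still need escaping (for example with htmlspecialchars()) before being written into the RSS output.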

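【Solution 1】:

I ran some tests and found that it works fine when each paragraph sits on a single line in the input file, as in the example below. (Except that it reads the opening quote as part of the URL, but that is easy to fix.)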

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p> 
</body> 
</html>
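
One possible way to fix the quote problem mentioned above is to match the href value with a regular expression rather than with fixed offsets. This is only a sketch: findLinkClean() is a hypothetical replacement for findLink(), and the second parameter with its default value is an assumption added for illustration.

```php
<?php
// Sketch only: findLinkClean() is a hypothetical variant of findLink().
function findLinkClean(string $input, string $default = "http://www.wiggle100.com/news.php"): string
{
    // Match href="..." or href='...' inside an <a> tag.
    // Group 1 is the quote character, group 2 is the URL between the quotes.
    if (preg_match('/<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is', $input, $m)) {
        return $m[2];
    }
    // No quoted href found: fall back to the default link.
    return $default;
}
```

For the test paragraph above this would return http://www.thedailyreview.com/news/ without the surrounding quotes. Unquoted href values are not handled by this pattern.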

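【Comments】:

Thanks. That helped me find the problem.

【Solution 2】:

The problem turned out to be that I never reset $newArticle to false after writing the XML output. So once $newArticle had been set to true (which happens when a </p> is found), no more than one line was ever read before an article was output. Setting $newArticle back to false after writing the output makes the program correctly keep appending lines to the article until it reaches a </p>.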
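
The fix can be sketched with a reduced version of the read loop. This is not the full script: collectArticles() is a hypothetical helper that takes the input lines as an array and returns the collected article strings, so the effect of the reset is easy to see in isolation.

```php
<?php
// Sketch only: collectArticles() is a hypothetical, reduced form of the loop.
function collectArticles(array $lines): array
{
    $articles = [];
    $article = "";
    $skip = true;         // true while we are outside a <p>...</p> block
    $newArticle = false;  // true only for the line that contains </p>

    foreach ($lines as $line) {
        $pos = strpos($line, "<p");
        if ($pos !== false) {
            // Drop everything up to and including the end of the opening tag.
            $line = substr($line, $pos);
            $line = substr($line, strpos($line, ">") + 1);
            $skip = false;
        }

        $pos = strpos($line, "</p>");
        if ($pos !== false) {
            // Keep the text before </p>. The original used $pos - 1,
            // which also drops the last character of the paragraph.
            $line = substr($line, 0, $pos);
            $newArticle = true;
        }

        if (!$skip) {
            $article .= $line;
        }

        if ($newArticle) {
            // Ignore near-empty paragraphs such as <p>&nbsp;</p>.
            if (strlen($article) > 10) {
                $articles[] = $article;
            }
            $article = "";
            $skip = true;
            $newArticle = false; // the reset that was missing
        }
    }
    return $articles;
}
```

Without the last assignment, $newArticle stays true after the first </p>, so every later line is flushed as its own article and the line holding the <a href=...> never reaches findLink() together with the rest of its paragraph.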
