【问题标题】:Issues with links while trying to converting HTML to XML尝试将 HTML 转换为 XML 时的链接问题
【发布时间】:2009-10-24 04:10:13
【问题描述】:

我正在尝试将 html 文件转换为 xml。它在大多数情况下都在工作。我遇到的问题是链接。现在它似乎完全忽略了我的测试文件中的链接。

这里是转换代码:

<?php
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL);

function convertToXML()
{

    $titleLength = 35;
    $output = "";
    $date = date("D, j M Y G:i:s T");
    $fi = fopen( "../newsTEST.htm", "r" );
    $fo = fopen( "../newsfeed.xml", "w" );

    //This is the first parts of the XML
    $output .= "<?xml version=\"1.0\"?>\n";
    $output .= "<rss version=\"2.0\">\n";
    $output .= "<channel>\n";
    $output .= "\t<title>Wiggle 100 News</title>\n";
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n";
    $output .= "\t<description>Wiggle 100 Daily News</description>\n";
    $output .= "\t<language>en-us</language>\n";
    $output .= "\t<pubDate>". $date ."</pubDate>\n";
    $output .= "\t<managingEditor>wiggle100@gmail.com</managingEditor>\n";
    $output .= "\t<webMaster>josh@jacurren.com</webMaster>\n";

    $article = "";
    $skip = true; //if false will continue to put lines into output until </p>
    $newArticle = false;

    while( !feof($fi) )
    {
        $line = fgets($fi);
        $link = "";

        if( strpos( $line, "<p" ) !== false)
        {
            $pos = strpos( $line, "<p" );
            $line = substr( $line, $pos );

            $pos = strpos( $line, ">" );
            $line = substr( $line, $pos + 1 );

            $skip = false;          
        }

        if( strpos( $line, "</p>" ) !== false )
        {
            $pos = strpos( $line, "</p>" );
            $line = substr( $line, 0, $pos - 1 );

            $newArticle = true;
        }

        //This adds the line to the article
        if( !$skip )
        {
            $article .= $line;
        }

        //This mixes the article, title, link, and date with 
        // XML and puts it into the output
        if( $newArticle )
        {
            //This if is to get rid of stuff like <p>&nbsp;</p>
            if( (strlen($article) > 10) )
            {
                $link = findLink( $article );
                //$article = strip_tags($article);
                $title = substr( $article, 0, $titleLength ) . "...";

                $output .= "\t<item>\n";
                $output .= "\t\t<title>". $title ."</title>\n";
                $output .= "\t\t<link>". $link ."</link>\n";
                $output .= "\t\t<description>". $article . "</description>\n";
                $output .= "\t\t<pubDate>". $date . "</pubDate>\n";
                $output .= "\t</item>\n\n";
            }

            $article = "";
            $line = "";
            $skip = true;
        }
    }

    $output .= "</channel>\n";
    $output .= "</rss>\n";

    fwrite( $fo, $output );

    fclose($fi);
    fclose($fo);

    echo "<br /><br /> News converted to XML";
}

    //*****************************************************************************
    //*****************************************************************************

    //Find and return a link in the input.
    //Else use the a default
    function findLink( $input )
    {   
        $link = "http://www.wiggle100.com/news.php";

        if( strpos( $input, "<a" ) !== false )
        {
            $startpos = strpos( $input, "href" );
            $link = substr( $input, $startpos + 5 );
            $endpos = strpos( $link, ">" );
            $link = substr( $link, 0, $endpos - 2 );
        }
        return $link;
    }


?>

这里是html测试代码:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> 
<a href="http://www.thedailyreview.com/news/"> 
http://www.thedailyreview.com/news/</a></p> 
</body> 
</html> 

这是 XML 输出:

<rss version="2.0"> 
<channel> 
    <title>Wiggle 100 News</title> 
    <link>http://www.wiggle100.com/news.php</link> 
    <description>Wiggle 100 Daily News</description> 
    <language>en-us</language> 
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    <managingEditor>wiggle100@gmail.com</managingEditor> 
    <webMaster>josh@jacurren.com</webMaster> 
    <item> 
        <title>This is an article. Blah. Blah. Bla...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title>This is another article. Blah. Blah...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title>This is the 3rd article. Blah. Blah...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title><font size="6">This is the news for...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description><font size="6">This is the news for today. Blah Blah Blah!</font> 
</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

</channel> 
</rss> 

当我取消对 strip_tags() 的注释时,字体标签会消失。

【问题讨论】:

标签: php html xml hyperlink


【解决方案1】:

我做了一些测试,发现它在输入文件中的单行段落上运行良好,如下例所示。 (除了它会将左引号作为 URL 的一部分读取,但这很容易解决。)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p> 
</body> 
</html>

【讨论】:

  • 谢谢。这帮助我找到了问题。
【解决方案2】:

问题最终是我在写入 xml 输出后从未将 $newArticle 重置为 false。因此,在 $newArticle 设置为 true 之后(这是在找到 &lt;/p&gt; 时),在文章输出之前读取的行数永远不会超过一行。通过在写入输出后将 $newArticle 设置为 false,程序会正确地将行添加到文章中,直到遇到 &lt;/p&gt;

【讨论】:

    猜你喜欢
    • 2021-09-01
    • 1970-01-01
    • 2015-08-15
    • 1970-01-01
    • 1970-01-01
    • 2015-04-22
    • 1970-01-01
    • 2022-01-08
    • 2020-07-10
    相关资源
    最近更新 更多