【Question title】: How to parse Wikipedia XML with PHP?
【Posted】: 2011-01-29 22:48:07
【Question】:

How can I parse Wikipedia XML with PHP? I tried SimplePie, but got nothing back. Here is the link whose data I want to fetch:

http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml

Edit: here is my code:

<?php
    define("EMAIL_ADDRESS", "youlichika@hotmail.com"); 
    $ch = curl_init(); 
    $cv = curl_version(); 
    $user_agent = "curl ${cv['version']} (${cv['host']}) libcurl/${cv['version']} ${cv['ssl_version']} zlib/${cv['libz_version']} <" . EMAIL_ADDRESS . ">"; 
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent); 
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt"); 
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt"); 
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity"); 
    curl_setopt($ch, CURLOPT_HEADER, FALSE); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); 
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE); 
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml"); 
    $xml = curl_exec($ch); 
    $xml_reader = new XMLReader(); 
    $xml_reader->xml($xml, "UTF-8"); 
    echo $xml->api->query->pages->page->rev;
?>

【Comments on the question】:

    Tags: php xml mediawiki wikipedia-api mediawiki-api


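    【Answer 1】:

    I usually use a combination of cURL and XMLReader to parse the XML generated by the MediaWiki API.

    Note that you must include your email address in the User-Agent header; otherwise the API script responds with HTTP 403 Forbidden.

    This is how I initialize the cURL handle: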

    define("EMAIL_ADDRESS", "my@email.com");
    $ch = curl_init();
    $cv = curl_version();
    // Uses the {$...} interpolation syntax; the ${...} form is deprecated as of PHP 8.2.
    $user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    

    You can then use this code to fetch the XML and construct a new XMLReader object in $xml_reader:

    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    // Wikipedia now redirects plain HTTP to HTTPS, and this handle does not follow redirects.
    curl_setopt($ch, CURLOPT_URL, "https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
    $xml = curl_exec($ch);
    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");
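
    To make the reading step concrete without a network call, here is a minimal sketch that walks an XMLReader over a hard-coded string. The sample XML is an assumption that only mimics the nesting of an action=query response (the real output carries more attributes), and collect_revisions() is a hypothetical helper name. It uses readString(), which returns the element text with entities already decoded.

```php
<?php
// Hypothetical sample that mimics the nesting of an action=query response.
$sample = <<<'XML'
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="RE">
        <revisions>
          <rev xml:space="preserve">'''RE''' may mean: &lt;b&gt;x&lt;/b&gt; &amp; more</rev>
        </revisions>
      </page>
      <page pageid="2" ns="0" title="Re">
        <revisions>
          <rev xml:space="preserve">A second page</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>
XML;

/**
 * Collect the first <rev> text of every <page>, keyed by page title.
 * (Hypothetical helper, not part of the original answer.)
 */
function collect_revisions(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $revisions = [];
    $title = null;
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue; // only element start tags are of interest here
        }
        if ($reader->name === 'page') {
            $title = $reader->getAttribute('title');
        } elseif ($reader->name === 'rev' && $title !== null && !isset($revisions[$title])) {
            // readString() yields the text content with entities decoded.
            $revisions[$title] = $reader->readString();
        }
    }
    $reader->close();

    return $revisions;
}

$revisions = collect_revisions($sample);
foreach ($revisions as $title => $text) {
    echo $title, ': ', $text, "\n";
}
```

    With a live response, the string returned by curl_exec($ch) would take the place of $sample.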
    

    Edit: here is a working example:

    <?php
    define("EMAIL_ADDRESS", "youlichika@hotmail.com");
    $ch = curl_init();
    $cv = curl_version();
    // Uses the {$...} interpolation syntax; the ${...} form is deprecated as of PHP 8.2.
    $user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    // Wikipedia now redirects plain HTTP to HTTPS, and this handle does not follow redirects.
    curl_setopt($ch, CURLOPT_URL, "https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
    $xml = curl_exec($ch);
    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");
    
    function extract_first_rev(XMLReader $xml_reader)
    {
        while ($xml_reader->read()) {
            if ($xml_reader->nodeType == XMLReader::ELEMENT) {
                if ($xml_reader->name == "rev") {
                    $content = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                    return $content;
                }
            } else if ($xml_reader->nodeType == XMLReader::END_ELEMENT) {
                if ($xml_reader->name == "page") {
                    throw new Exception("Unexpectedly found `</page>`");
                }
            }
        }
    
        throw new Exception("Reached the end of the XML document without finding revision content");
    }
    
    $latest_rev = array();
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "page") {
                $latest_rev[$xml_reader->getAttribute("title")] = extract_first_rev($xml_reader);
            }
        }
    }
    
    function parse($rev)
    {
        global $ch;
    
        curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
        // HTTPS for the same reason as above: plain HTTP requests are redirected.
        curl_setopt($ch, CURLOPT_URL, "https://en.wikipedia.org/w/api.php?action=parse&text=" . rawurlencode($rev) . "&prop=text&format=xml");
        sleep(3);
        $xml = curl_exec($ch);
        $xml_reader = new XMLReader();
        $xml_reader->xml($xml, "UTF-8");
    
        while ($xml_reader->read()) {
            if ($xml_reader->nodeType == XMLReader::ELEMENT) {
                if ($xml_reader->name == "text") {
                    $html = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                    return $html;
                }
            }
        }
    
        throw new Exception("Failed to parse");
    }
    
    // Use a separate loop variable so the $latest_rev array is not overwritten.
    foreach ($latest_rev as $title => $rev) {
        echo parse($rev) . "\n";
    }
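
    The parse() function above places the whole revision text in the query string, which can become very long for large pages. One possible variation, sketched here on the assumption that the API accepts the same parameters in a POST body, is to build the body with http_build_query(). build_parse_body() is a hypothetical helper, not part of the original answer.

```php
<?php
/**
 * Build a URL-encoded request body for action=parse.
 * (Hypothetical helper; the parameter names are the ones used in parse() above.)
 */
function build_parse_body(string $wikitext): string
{
    return http_build_query([
        'action' => 'parse',
        'text'   => $wikitext,
        'prop'   => 'text',
        'format' => 'xml',
    ]);
}

$body = build_parse_body("'''Bold''' & [[Link]]");
echo $body, "\n";
```

    To send it, the handle would be pointed at the bare api.php URL and switched to POST with curl_setopt($ch, CURLOPT_POST, TRUE) and curl_setopt($ch, CURLOPT_POSTFIELDS, $body) in place of CURLOPT_HTTPGET.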
    

    【Comments】:

    • Thanks @Daniel Trebbien. Would you mind if I use curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8"); instead of $user_agent?
    • It returns data like {{Wiktionary|re|re-}} '''RE''' may mean: * RE ([[:nl:EDP-Auditing|Register EDP auditor]]), Electronic Data Processing auditor, also IT auditor * [[RE (complexity)]], the set of recursively enumerable languages *... which is not in HTML format...
    • @yuli: That might work, but the system administrators don't like it. You need to include your email address. See: secure.wikimedia.org/wikipedia/meta/wiki/User-Agent_policy
    • @yuli: When you query revisions, the content comes back as MediaWiki markup. If you need to convert MediaWiki content to HTML, use the API's parse action. For example: https://secure.wikimedia.org/wikipedia/en/w/api.php?action=parse&text={{Project:Sandbox}}&format=xml
    • @Daniel Trebbien, OK, I have replaced the web browser string with my email address. But this time nothing is returned. I pasted my code into my post. Also, the returned data is not in HTML format; do you know how to convert it? Thanks again.
    【Answer 2】:

    You can use SimpleXML:

    $xml = simplexml_load_file($url);
    

    See the examples here: http://php.net/manual/en/simplexml.examples-basic.php

    Or DOM:

    $xml = new DomDocument;
    $xml->load($url);
    

    XMLReader is meant for large XML documents that you do not want to read entirely into memory.
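
    As an illustration of the SimpleXML route, here is a minimal sketch that parses a hard-coded string instead of a live URL. The sample XML is an assumption that only mimics the nesting of an action=query response. Note that the object returned by simplexml_load_string() already represents the root <api> element, so paths start at query rather than api; the expression $xml->api->query->pages->page->rev in the question cannot work, both for that reason and because $xml there is a plain string.

```php
<?php
// Hypothetical sample that mimics the nesting of an action=query response.
$sample = <<<'XML'
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="RE">
        <revisions>
          <rev xml:space="preserve">'''RE''' may mean: something</rev>
        </revisions>
      </page>
      <page pageid="2" ns="0" title="Re">
        <revisions>
          <rev xml:space="preserve">A second page</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>
XML;

$api = simplexml_load_string($sample);
if ($api === false) {
    throw new RuntimeException('Could not parse the XML');
}

// $api is the root <api> element, so navigation begins at its children.
$revisions = [];
foreach ($api->query->pages->page as $page) {
    $title = (string) $page['title'];          // attribute access
    $revisions[$title] = (string) $page->revisions->rev; // element text
}

foreach ($revisions as $title => $text) {
    echo $title, ' => ', $text, "\n";
}
```

    With a live response, the XML string fetched by a request that sends a proper User-Agent header would take the place of $sample.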

    【Comments】:

    • When using the MediaWiki API, you cannot simply call simplexml_load_file to retrieve the XML, because the response will be HTTP 403 Forbidden. The API script blocks requests that do not include contact information in the User-Agent header.
    • @user576875, @LadaRaider, Warning: DOMDocument::load(http://en.wikipedia.org/w/api.php?action=query&amp;generator=allpages&amp;gaplimit=2&amp;gapfilterredir=nonredirects&amp;gapfrom=Re&amp;prop=revisions&amp;rvprop=content&amp;format=xml) [domdocument.load]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden. How do I add $user_agent the way @Daniel Trebbien does?
    • @yuli: Add ini_set("user_agent", EMAIL_ADDRESS); before calling simplexml_load_file.
    【Answer 3】:

    You should take a look at PHP's XMLReader class.

    【Comments】:
