Retrieve URLs from an XML file and gather data from the URLs into my database - PHP/cURL/XML

【Question Title】: Retrieve URLs from XML file and gather data from the URLs to my database - PHP/cURL/XML
【Posted】: 2014-08-04 23:24:45
【Question】:

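The XML contains about 50,000 different URLs. I am trying to gather data from each of them and then insert or update my database.

Currently I'm using the code below, which sort of works but times out because of the large amount of data being processed. How can I improve its performance?

URLs.xml (up to 50,000 locs)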

    <?xml version="1.0" encoding="utf-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://url.com/122122-rob-jones?</loc>
        <lastmod>2014-05-05T07:12:41+08:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.9</priority>
    </url>
    </urlset>

index.php

    <?php
    include 'config.php';
    include 'custom.class.php';
    require_once('SimpleLargeXMLParser.class.php');
    $custom = new custom();

    $xml = dirname(__FILE__)."/URLs.xml";

    // create a new object
    $parser = new SimpleLargeXMLParser();
    // load the XML
    $parser->loadXML($xml);

    $parser->registerNamespace("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
    $array = $parser->parseXML("//urlset:url/urlset:loc");

    for ($i=0, $n=count($array); $i<$n; $i++){
        $FirstURL=$array[$i];

        $URL = substr($FirstURL, 0, strpos($FirstURL,'?')) . "/";
        $custom->infoc($URL);
    }

custom.class.php (relevant parts)

    <?php
        public function load($url, $postData='')
        {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
            curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
            curl_setopt($ch, CURLOPT_TIMEOUT, 60);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
            curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
            curl_setopt($ch, CURLOPT_AUTOREFERER, true);
            if($postData != '') {
                curl_setopt($ch, CURLOPT_POST, true);
                curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
            }
            curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));
            $result = curl_exec($ch);
            curl_close($ch);
            return $result;
        }

        public function infoc($url) {
            $get_tag = $this->load($url);

            // Player ID
            $playeridTAG = '/<input type="text" id="player-(.+?)" name="playerid" value="(.+?)" \/>/';
            preg_match($playeridTAG, $get_tag, $playerID);
            // End Player ID

            // Full Name
            preg_match("/(.+?)-(.+?)\//",$url, $title);
            $fullName = ucwords(preg_replace ("/-/", " ", $title[2]));
            // End Full Name

            // Total
            $totalTAG = '/<li>
                    <span>(.+?)<\/span><span class="none"><\/span>              <label>Total<\/label>
                <\/li>/';
            preg_match($totalTAG, $get_tag, $total);
            // End Total

            $query = $db->query('SELECT * FROM playerblank WHERE playerID = '.$playerID[1].'');
            if($query->num_rows > 0) {
                $db->query('UPDATE playerblank SET name = "'.$fullName.'", total = "'.$total[1].'" WHERE playerID = '.$playerID[1].'') or die(mysqli_error($db));

                echo "UPDATED ".$playerID[1]."";
            }
            else {
                $db->query('INSERT INTO playerblank SET playerID = '.$playerID[1].', name = "'.$fullName.'", total = "'.$total[1].'"') or die(mysqli_error($db));

                echo "Inserted ".$playerID[1]."";
            }
        }

    ?>

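Collecting each URL (loc) from the XML file is no problem; it is when I try to gather the data for each URL with cURL that I have to wait a very long time.

【Comments】:

Have you tried curl_multi_init()?
No, I'm not sure how to use it inside a (for) loop. I'll look into it further, though.
So you want to extract XML from 50,000 URLs without waiting a long time? Please define "a long time".
If a single request takes 250 ms, 50,000 of them take about 3.5 hours.
curl_multi_init() is a good option, or you could use something non-blocking like NodeJS. I did something very similar with nodejs using the request module; it's really cool and fast.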
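
The estimate in the comments can be checked with a quick calculation. The sketch below is only a back-of-the-envelope model: the helper name `estimate_seconds` and the concurrency of 20 are illustrative assumptions, and the model ignores DNS lookups, server throttling, and database time.

```php
<?php
// Hypothetical helper: idealized wall-clock time for fetching $urls pages
// when $concurrency requests run at once and each takes $seconds_per_request.
function estimate_seconds($urls, $seconds_per_request, $concurrency = 1)
{
    return ceil($urls / $concurrency) * $seconds_per_request;
}

$sequential = estimate_seconds(50000, 0.25);      // one request at a time
$parallel   = estimate_seconds(50000, 0.25, 20);  // assumed: 20 requests in flight

printf("sequential: %.2f hours\n", $sequential / 3600);
printf("parallel:   %.1f minutes\n", $parallel / 60);
```

Under these assumptions the sequential run comes to 12,500 seconds (about 3.47 hours), while 20 concurrent requests would bring the idealized figure down to 625 seconds (about 10.4 minutes).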

Tags: php xml curl

【Solution 1】:

Try using curl_multi. There is a good example in the PHP documentation:

    // create both cURL resources
    $ch1 = curl_init();
    $ch2 = curl_init();

    // set URL and other appropriate options
    curl_setopt($ch1, CURLOPT_URL, "http://lxr.php.net/");
    curl_setopt($ch1, CURLOPT_HEADER, 0);
    curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
    curl_setopt($ch2, CURLOPT_HEADER, 0);

    // create the multiple cURL handle
    $mh = curl_multi_init();

    // add the two handles
    curl_multi_add_handle($mh,$ch1);
    curl_multi_add_handle($mh,$ch2);

    $active = null;
    // execute the handles
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    while ($active && $mrc == CURLM_OK) {
        if (curl_multi_select($mh) != -1) {
            do {
                $mrc = curl_multi_exec($mh, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // close the handles
    curl_multi_remove_handle($mh, $ch1);
    curl_multi_remove_handle($mh, $ch2);
    curl_multi_close($mh);
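
To apply the documentation example to a whole list rather than two fixed handles, the same calls can be wrapped in a function that takes an array of URLs and returns the page bodies. This is a minimal sketch, not code from the documentation: `fetch_batch`, the batch size of 20, and the option values are assumptions, and `$array` and `$custom` in the trailing comment refer to the variables from the question.

```php
<?php
// Hypothetical helper: fetch one batch of URLs concurrently.
// Returns an array of url => body, with false for a failed transfer.
function fetch_batch(array $urls, $timeout = 60)
{
    $mh = curl_multi_init();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_PRIVATE, $url);        // remember which URL this handle serves
        curl_multi_add_handle($mh, $ch);
    }

    // Drive all transfers until none is active any more.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh, 1.0); // wait for socket activity (up to 1 second)
        }
    } while ($active && $status === CURLM_OK);

    // Collect the outcome of every finished transfer.
    $results = array();
    while ($info = curl_multi_info_read($mh)) {
        $ch = $info['handle'];
        $url = curl_getinfo($ch, CURLINFO_PRIVATE);
        $results[$url] = ($info['result'] === CURLE_OK) ? curl_multi_getcontent($ch) : false;
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Possible use with the question's variables (assumed batch size of 20):
// foreach (array_chunk($array, 20) as $batch) {
//     foreach (fetch_batch($batch) as $url => $html) { /* parse $html, write to DB */ }
// }
```

Batching is the simplest way to bound the number of simultaneous connections, at the cost that each batch waits for its slowest URL before the next one starts.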


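【Solution 2】:

Try working from an offline copy of the XML file: remove each URL once it has been updated or inserted, and restart the script until no URLs remain in the offline file. Then fetch a fresh copy of the XML file when you need one.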
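
One way to picture this resumable approach is a pending list that shrinks on every run. The sketch below is an assumption-laden illustration, not the answerer's code: it stores the pending URLs in a plain text file (one per line) instead of rewriting the XML, and the names `load_pending`, `save_pending`, and `run_once` are hypothetical.

```php
<?php
// Read the pending URLs, one per line; a missing file means nothing is pending.
function load_pending($file)
{
    if (!is_file($file)) {
        return array();
    }
    return file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
}

// Overwrite the pending file with the URLs that still need processing.
function save_pending($file, array $urls)
{
    file_put_contents($file, $urls ? implode("\n", $urls) . "\n" : '', LOCK_EX);
}

// Process at most $limit URLs, then persist what is left.
// $process($url) should return true on success; failures are re-queued.
// Returns the number of URLs still pending.
function run_once($file, $limit, $process)
{
    $pending = load_pending($file);
    $batch = array_splice($pending, 0, $limit);
    foreach ($batch as $url) {
        if (!call_user_func($process, $url)) {
            $pending[] = $url; // retry on a later run
        }
    }
    save_pending($file, $pending);
    return count($pending);
}
```

Each invocation (for example from cron) handles a slice small enough to finish before the timeout. Because the list is saved only at the end of a run, a crash mid-run repeats that slice, which is harmless here since the database step is an update-or-insert.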

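【Comments】:

Getting the URLs from the XML works fine; the problem is gathering the data from each URL.

【Solution 3】:

The problem is in the "load" function: it blocks execution until a single URL is ready, while you could easily load several URLs at the same time. Here is an explanation of how to do it. The best way to improve performance is to load several (10-20) URLs in parallel and add new ones "on the fly" as soon as a previous one finishes. ParallelCurl can do the trick, for example: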

    require_once('parallelcurl.php');

    // 10 or more; try to pick the best value manually
    $max_requests = 10;
    // example cURL options (not from the original answer)
    $curl_options = array(CURLOPT_FOLLOWLOCATION => true, CURLOPT_TIMEOUT => 60);
    $parallel_curl = new ParallelCurl($max_requests, $curl_options);

    // $array - 50000 urls; start the first batch
    $in_urls = array_splice($array, 0, $max_requests);
    foreach ($in_urls as $url) {
        $parallel_curl->startRequest($url, 'on_request_done');
    }

    function on_request_done($content, $url, $ch, $search) {
        // the URL queue and the ParallelCurl instance live in the global scope
        global $array, $parallel_curl;

        // here you can parse $content and save data to DB

        // and add next url for loading
        $next_url = array_shift($array);
        if ($next_url) {
            $parallel_curl->startRequest($next_url, 'on_request_done');
        }
    }

    // This should be called when you need to wait for the requests to finish.
    $parallel_curl->finishAllRequests();
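
The same "refill as each request finishes" idea can also be written with plain curl_multi, without an extra library. The following is a sketch under stated assumptions rather than a drop-in replacement: `fetch_rolling`, its callback signature, and the option values are hypothetical, and error handling is reduced to passing `false` as the body.

```php
<?php
// Hypothetical helper: keep up to $window requests in flight and call
// $on_done($url, $body) as each one finishes ($body is false on failure).
function fetch_rolling(array $urls, $window, $on_done)
{
    $mh = curl_multi_init();
    $queue = array_values($urls);
    $inflight = 0;

    // Take the next URL off the queue and attach it to the multi handle.
    $add_next = function () use (&$queue, &$inflight, $mh) {
        if (!$queue) {
            return false;
        }
        $url = array_shift($queue);
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_PRIVATE, $url); // tag the handle with its URL
        curl_multi_add_handle($mh, $ch);
        $inflight++;
        return true;
    };

    while ($inflight < $window && $add_next()) {
        // fill the initial window
    }

    while ($inflight > 0) {
        curl_multi_exec($mh, $active);
        $refilled = false;
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $body = ($info['result'] === CURLE_OK) ? curl_multi_getcontent($ch) : false;
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            curl_multi_remove_handle($mh, $ch);
            $inflight--;
            call_user_func($on_done, $url, $body);
            $refilled = $add_next() || $refilled; // a free slot: start the next URL
        }
        // Wait for socket activity unless new work was just queued.
        if ($inflight > 0 && !$refilled && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }
    }
    curl_multi_close($mh);
}
```

With the question's variables, a call might look like `fetch_rolling($array, 10, function ($url, $html) { /* parse $html, write to DB */ });`. Compared with fixed batches, a rolling window keeps every slot busy even when some URLs respond slowly.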
