[Title]: Crawl a website, get the links, crawl the links with PHP and XPATH
[Posted]: 2012-04-23 22:00:41
[Description]:

I want to crawl an entire website. I have read several threads, but I cannot manage to get data at the second level.

That is, I can return the links from the starting page, but I cannot find a way to parse those links and get the content of each one...

The code I am using is:

<?php

    //  SELECT STARTING PAGE
      $url = 'http://mydomain.com/';
      $html= file_get_contents($url);

     // GET ALL THE LINKS OF EACH PAGE

         // create a dom object

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

         // run xpath for the dom

            $xPath = new DOMXPath($dom);


         // get links from starting page

            $elements = $xPath->query("//a/@href");
            foreach ($elements as $e) {
                echo $e->nodeValue . "<br />";
            }

     // Parse each page using the extracted links?

 ?>

Could someone help me with an example?

I would really appreciate it!


Well, thank you for the answers! I tried a few things, but I haven't gotten any results yet; I'm new to programming...

Below you can find my two attempts: the first tries to parse the links, and the second tries to replace file_get_contents with cURL:

 1) 

<?php 
  //  GET STARTING PAGE
  $url = 'http://www.capoeira.com.gr/';
  $html= file_get_contents($url);

  //GET ALL THE LINKS FROM STARTING PAGE

  // create a dom object

    $dom = new DOMDocument();
    @$dom->loadHTML($html);


    // run xpath for the dom

    $xPath = new DOMXPath($dom);

        // get specific elements from the sites

        $elements = $xPath->query("//a/@href");
//PARSE EACH LINK

    foreach($elements as $e) {
          $URLS= file_get_contents($e);
          $dom = new DOMDocument();
          @$dom->loadHTML($html);
          $xPath = new DOMXPath($dom);
          $output = $xPath->query("//div[@class='content-entry clearfix']");
         echo $output ->nodeValue;
        }                           
         ?>

For the code above I get: Warning: file_get_contents() expects parameter 1 to be string, object given in ../example.php on line 26
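The warning means file_get_contents() is receiving the DOMAttr object $e rather than its string value, and the inner loadHTML() call re-parses the original $html instead of the fetched page. A minimal sketch of the corrected inner loop, using an inline HTML string as a stand-in for the network fetch so it runs offline:

```php
<?php
// $e from the //a/@href query is a DOMAttr object; its string value is $e->nodeValue.
// Also: load the *fetched* page into the second DOMDocument, not the original $html.
$html = '<html><body><a href="http://example.com/page1">one</a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

foreach ($xPath->query("//a/@href") as $e) {
    $link = $e->nodeValue;                       // plain string URL, safe for file_get_contents()
    // $page = file_get_contents($link);         // real fetch (needs network access)
    $page = '<div class="content-entry clearfix">hello</div>'; // stand-in for the fetched page
    $sub = new DOMDocument();
    @$sub->loadHTML($page);                      // parse the fetched page, not $html
    $subPath = new DOMXPath($sub);
    // query() returns a DOMNodeList, so iterate it instead of echoing it directly
    foreach ($subPath->query("//div[@class='content-entry clearfix']") as $div) {
        echo $div->nodeValue . "\n";
    }
}
```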

2)

    <?php
          $curl = curl_init();
          curl_setopt($curl, CURLOPT_POST, 1);
          curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
          curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
          $content= curl_exec($curl);
          curl_close($curl);    



          $dom = new DOMDocument();
          @$dom->loadHTML($content);

           $xPath = new DOMXPath($dom);
           $elements = $xPath->query("//a/@href");
            foreach ($elements as $e) {
            echo $e->nodeValue. "<br />";
            }

   ?>

I am not getting any results. When I try to echo $content, I get:

You don't have permission to access / on this server.

Additionally, a 413 Request Entity Too Large error was encountered while trying to use an ErrorDocument to handle the request...

Any ideas? :)
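A likely cause of the 403/413 responses above: CURLOPT_POST makes cURL send an empty POST request, which many servers refuse; a crawler normally wants a plain GET, often with a User-Agent header set. A sketch of the option setup (the actual fetch is left commented out since it needs network access, and the user agent string is just an example):

```php
<?php
// Build a GET request instead of an empty POST; some hosts also
// refuse requests that carry no User-Agent header at all.
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
curl_setopt($curl, CURLOPT_HTTPGET, true);         // GET, not POST
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MyCrawler/1.0)");
// $content = curl_exec($curl);                    // uncomment to actually fetch
curl_close($curl);
```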

[Comments]:

  • You can wrap everything in a function and make a recursive call for each link found, but remember to keep track of visited pages to avoid running in an endless loop
  • Show the content or layout of one of the links, where you started, and what you have tried.
  • You might also want to use curl over file_get_contents, since it's roughly twice as fast with curl multi
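A sketch of the curl-multi idea mentioned in the last comment: fetch_all() is a hypothetical helper that downloads a batch of URLs in parallel and returns their bodies keyed by URL:

```php
<?php
// Fetch several URLs in parallel with the curl_multi API.
// fetch_all() is a hypothetical helper name, not a built-in.
function fetch_all(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    // Collect the downloaded bodies and release the handles.
    $pages = array();
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $pages;
}
```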

Tags: php xpath hyperlink web-crawler


[Solution 1]:

You can try the following. See this thread for details:

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5)
{
    static $seen = array();    // static, so the visited list survives the recursive calls
    if (($depth == 0) or (in_array($url, $seen))) {
        return;
    }
    $seen[] = $url;            // remember this page so we never crawl it twice
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    if ($result) {
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                // resolve a relative href against the page currently being crawled
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);   // parse the base URL, not the relative href
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}\n";   // echo $url: $href is undefined when the page had no links
}
crawl_page("http://www.sitename.com/", 3);
?>

[Discussion]:

    [Solution 2]:
    $doc = new DOMDocument;
    $doc->load('file.htm');    // parse a local HTML file

    $items = $doc->getElementsByTagName('a');

    foreach ($items as $value) {
        echo $value->nodeValue . "\n";                          // the link text
        $attrs = $value->attributes;
        echo $attrs->getNamedItem('href')->nodeValue . "\n";    // the link target
    }
    

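The same getElementsByTagName() loop works on markup parsed from a string with loadHTML(), which is handy for trying it out without a local file.htm; a quick sketch with inline markup:

```php
<?php
// Same idea as above, but parsing an inline HTML string instead of file.htm.
$doc = new DOMDocument;
$doc->loadHTML('<p><a href="/a">first</a> <a href="/b">second</a></p>');

foreach ($doc->getElementsByTagName('a') as $value) {
    echo $value->nodeValue . " -> "
       . $value->attributes->getNamedItem('href')->nodeValue . "\n";
}
```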
    [Discussion]:

      [Solution 3]:

      Find the links from a website recursively:

      <?php

      $depth = 1;

      print_r(getList($depth));

      function getList($depth)
      {
          return getDepth($depth);
      }

      function getUrl($request_url)
      {
          // collect the working and broken links found on $request_url
          $UrlLists = array("clean" => array(), "broken" => array());
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_URL, $request_url);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // we want to get the response
          $result = curl_exec($ch);
          curl_close($ch);
          $regex = '|<a.*?href="(.*?)"|';
          preg_match_all($regex, $result, $parts);
          foreach ($parts[1] as $link)
          {
              $url = htmlentities($link);
              if (getFlag($url))
              {
                  $UrlLists["clean"][] = $url;
              }
              else
              {
                  $UrlLists["broken"][] = "broken->" . $url;
              }
          }
          return $UrlLists;
      }

      function ZeroDepth($list)
      {
          $lists["0"]["0"] = getUrl($list);
          $lists["0"]["0"]["clean"] = array_unique($lists["0"]["0"]["clean"]);
          $lists["0"]["0"]["broken"] = array_unique($lists["0"]["0"]["broken"]);
          return $lists;
      }

      function getDepth($depth)
      {
          // $list = OW_URL_HOME;
          $list = "https://example.com"; // enter the url of the website
          $lists = ZeroDepth($list);
          for ($i = 1; $i <= $depth; $i++)
          {
              $l = $i - 1;
              $depthArray = 1;
              foreach ($lists[$l][$l]["clean"] as $depthUrl)
              {
                  $lists[$i][$depthArray] = getUrl($depthUrl);
                  $lists[$i][$depthArray]["request_url"] = $depthUrl;
                  $depthArray++; // advance the index so earlier results are not overwritten
              }
          }
          return $lists;
      }

      function getFlag($url)
      {
          $curl = curl_init();
          $curl_options = array();
          $curl_options[CURLOPT_RETURNTRANSFER] = true;
          $curl_options[CURLOPT_URL] = $url;
          $curl_options[CURLOPT_NOBODY] = true;  // HEAD-style request: we only need the status
          $curl_options[CURLOPT_TIMEOUT] = 60;
          curl_setopt_array($curl, $curl_options);
          curl_exec($curl);
          $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
          curl_close($curl); // close before returning, or the handle leaks
          return $status == 200;
      }
      ?>
      

      [Discussion]:

        [Solution 4]:

        Please have a look at the code below; I hope it helps.

        <?php
        $html = new DOMDocument();
        @$html->loadHtmlFile('http://www.yourdomain.com');
        $xpath = new DOMXPath( $html );
        $nodelist = $xpath->query( "//div[@class='A-CLASS-Name']/h3/a/@href" );
        foreach ($nodelist as $n){
            echo $n->nodeValue."\n<br>";
        }
        ?>
        

        Thanks, Roger

        [Discussion]:

          [Solution 5]:
          <?php
          $path='http://www.hscripts.com/';
          $html = file_get_contents($path);
          $dom = new DOMDocument();
          @$dom->loadHTML($html);
          // grab all the links on the page
          $xpath = new DOMXPath($dom);
          $hrefs = $xpath->evaluate("/html/body//a");
          for ($i = 0; $i < $hrefs->length; $i++ ) {
          $href = $hrefs->item($i);
          $url = $href->getAttribute('href');
          echo $url.'<br />';
          }
          ?>
          

          You can use the code above to get all the possible links.

          [Discussion]:
