【Question title】: Extract XML from XML embedded in HTML
【Posted】: 2013-06-24 08:06:14
【Question】:

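I am trying to get the XML shown at http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml, but it is a bit tricky because the site does not offer any support for this. The goal is to get the XML into PHP so that I can process it.

Can anyone give me a hint?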

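【Comments】:

  • Have you tried fetching the XML from the page, for example with file_get_contents()? Or could you copy and paste it yourself (only the XML code, without the rest of the page)?
  • I am still fairly new to this kind of thing, but what I want is to fetch the XML automatically (I am building a web service), preferably without the HTML tags.
  • It is odd to want to scrape HTML when the site appears to have plenty of data APIs. In the worst case you can use the PHP simplehtmldom library and convert the HTML tags into XML tags/attributes. Setting that up takes more time than finding the right REST API.
  • Just wanted to thank everyone for the answers; they were a great help.

Tags: php html xml ncbi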


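【Solution 1】:

XML that is rendered through HTML is not really XML any more.

What you are looking for is the textContent property in DOMDocument. It gives you only the text of that HTML, just as the browser displays it as "text".

So all you need to do is load the HTML document into a DOMDocument. Because the page contains markup errors, switch on libxml's internal error handling while loading, so that the parser errors are collected instead of being emitted as warnings: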

$url = 'http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml';

$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);
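
If you want to see what the parser complained about instead of discarding it silently, the collected errors can be read back before switching internal error handling off again. The following is only a minimal sketch: the malformed HTML string is a made-up stand-in for the fetched page, and which errors are reported depends on the libxml2 version in use.

```php
<?php
// Hypothetical sample: deliberately sloppy HTML standing in for the fetched page.
$html = '<div id="ResultView"><p>unclosed paragraph<b>bold</div><unknowntag>x</unknowntag>';

$doc = new DOMDocument();

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$loaded   = $doc->loadHTML($html);

// Each entry is a LibXMLError object (level, code, line, message).
$errors = libxml_get_errors();
libxml_clear_errors();                  // empty the buffer for the next parse
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Restoring the previous setting, rather than passing FALSE unconditionally, keeps the snippet from changing the behaviour of surrounding code that may already rely on internal errors.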

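The next part requires specific knowledge about the page being scraped. In your case, the XML is spread over all the <div> elements with the class attribute "xml-tag" that *follow* the element with the ID "ResultView".

These elements are easy to fetch with an XPath query; their text content is then stored in an array: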

$xpath  = new DOMXPath($doc);
$nodes  = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');
$buffer = array();
foreach ($nodes as $node) {
    $buffer[] = $node->textContent;
}
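
To see why textContent is enough here, the same query can be tried on a small inline document. This is a sketch under assumptions: the HTML below is a made-up, heavily simplified imitation of the page (entity-encoded XML text inside div.xml-tag elements, with a <b> used for highlighting), not a copy of the real markup.

```php
<?php
// Hypothetical stand-in for the scraped page.
$html = <<<'HTML'
<html><body>
  <div id="ResultView">Result</div>
  <div class="xml-tag">&lt;EXPERIMENT accession="<b>ERX086768</b>"&gt;</div>
  <div class="xml-tag">&lt;TITLE/&gt;</div>
  <div class="xml-tag">&lt;/EXPERIMENT&gt;</div>
  <div class="other">not part of the XML</div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
// Only siblings that come after #ResultView and carry exactly class="xml-tag".
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');

$buffer = array();
foreach ($nodes as $node) {
    // textContent drops the <b> markup and decodes &lt; / &gt; into < and >.
    $buffer[] = $node->textContent;
}

$xml = implode('', $buffer);
echo $xml, "\n";
```

The highlighting markup disappears and the entities are decoded in one step, which is why no separate tag stripping or entity decoding is needed with this approach.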

All that is left now is to create a new DOMDocument, load the XML buffer into it, apply some pretty-printing, and output the result:

$new = new DOMDocument();
$new->preserveWhiteSpace = FALSE;
$new->formatOutput = TRUE;
$new->loadXML(implode('', $buffer));
$new->save('php://output');

These roughly 20 lines of code produce the following output:

<?xml version="1.0"?>
<EXPERIMENT_PACKAGE>
  <EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
    <IDENTIFIERS>
      <PRIMARY_ID>ERX086768</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
    </IDENTIFIERS>
    <TITLE/>
    <STUDY_REF accession="ERP000913" refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" refcenter="SC">
      <IDENTIFIERS>
        <PRIMARY_ID>ERP000913</PRIMARY_ID>
        <SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
      </IDENTIFIERS>
    </STUDY_REF>
    <DESIGN>
      <DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="ERS074283" refname="MR223754-sc-2011-11-18T11:31:44Z-1306470" refcenter="SC">
        <IDENTIFIERS>
          <PRIMARY_ID>ERS074283</PRIMARY_ID>
          <SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
        </IDENTIFIERS>
      </SAMPLE_DESCRIPTOR>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>4008297</LIBRARY_NAME>
        <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <PAIRED NOMINAL_LENGTH="250"/>
        </LIBRARY_LAYOUT>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
          </READ_SPEC>
          <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Reverse</READ_TYPE>
            <RELATIVE_ORDER follows_read_index="0"/>
          </READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <ILLUMINA>
        <INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
      </ILLUMINA>
    </PLATFORM>
    <PROCESSING/>
  </EXPERIMENT>
  <SUBMISSION accession="ERA119046" center_name="SC" submission_date="2012-04-17T09:29:50Z" alias="ERP000913-sc-20120417-2" lab_name="">
    <IDENTIFIERS>
      <PRIMARY_ID>ERA119046</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID>
    </IDENTIFIERS>
  </SUBMISSION>
  <STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" center_name="SC" accession="ERP000913">
    <IDENTIFIERS>
      <PRIMARY_ID>ERP000913</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
    </IDENTIFIERS>
    <DESCRIPTOR>
      <STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
      <STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT>
      <CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME>
      <STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION>
    </DESCRIPTOR>
  </STUDY>
  <SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470" center_name="SC" accession="ERS074283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS074283</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME>
      <TAXON_ID>119602</TAXON_ID>
      <SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME>
    </SAMPLE_NAME>
    <SAMPLE_LINKS>
      <SAMPLE_LINK>
        <ENTREZ_LINK>
          <DB>biosample</DB>
          <ID>859730</ID>
        </ENTREZ_LINK>
      </SAMPLE_LINK>
    </SAMPLE_LINKS>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Strain</TAG>
        <VALUE>MR223754</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample Description</TAG>
        <VALUE/>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-StrainOrLine</TAG>
        <VALUE>MR223754</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-Sex</TAG>
        <VALUE>not applicable</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-Species</TAG>
        <VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
  <RUN_SET>
    <RUN alias="SC_RUN_7229_8#56" center_name="SC" accession="ERR109334" total_spots="2708543" total_bases="406281450" size="334475592" load_done="true" published="2012-04-27 20:11:35" is_public="true" cluster_name="public" static_data_available="1">
      <IDENTIFIERS>
        <PRIMARY_ID>ERR109334</PRIMARY_ID>
        <SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID>
      </IDENTIFIERS>
      <EXPERIMENT_REF refname="SC_EXP_7229_8#56" refcenter="SC" accession="ERX086768">
        <IDENTIFIERS>
          <PRIMARY_ID>ERX086768</PRIMARY_ID>
          <SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
        </IDENTIFIERS>
      </EXPERIMENT_REF>
      <Pool>
        <Member member_name="" accession="ERS074283" sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470" spots="2708543" bases="406281450"/>
      </Pool>
    </RUN>
  </RUN_SET>
</EXPERIMENT_PACKAGE>
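
Since the stated goal is to process the XML in PHP, here is a minimal sketch of reading a few values from such a document with DOMXPath. The XML is a shortened excerpt of the output above, inlined so that the example runs without network access; the variable names are only illustrative.

```php
<?php
// Shortened excerpt of the output shown above.
$xml = <<<'XML'
<EXPERIMENT_PACKAGE>
  <EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
    <PLATFORM>
      <ILLUMINA>
        <INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
      </ILLUMINA>
    </PLATFORM>
  </EXPERIMENT>
  <RUN_SET>
    <RUN accession="ERR109334" total_spots="2708543" total_bases="406281450"/>
  </RUN_SET>
</EXPERIMENT_PACKAGE>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// evaluate() with string() returns a plain PHP string instead of a node list.
$experiment = $xpath->evaluate('string(/EXPERIMENT_PACKAGE/EXPERIMENT/@accession)');
$model      = $xpath->evaluate('string(//INSTRUMENT_MODEL)');

// Map each run accession to its spot count.
$runs = array();
foreach ($xpath->query('//RUN_SET/RUN') as $run) {
    $runs[$run->getAttribute('accession')] = (int) $run->getAttribute('total_spots');
}

printf("%s on %s, %d run(s)\n", $experiment, $model, count($runs));
```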

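So don't reinvent the wheel; just get to know the existing tools. Sometimes it is easier than it looks at first sight.

【Comments】:

  • Thanks a lot! I will try it. Sorry if this question is a repost; I did try to find out how to solve it, but I was getting a bit desperate... Anyway, thanks again :)
【Solution 2】: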

http://php.net/manual/en/class.simplexmlelement.php

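It gives you a simple interface for working with XML as objects. You will probably need to set some options so that CDATA values and attributes are parsed the way you want, I think. To fetch the XML from the web server, use something like curl or file_get_contents; curl is the recommended option.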
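
As a rough illustration of that interface, the sketch below parses a small made-up document with SimpleXML. The sample XML, including its CDATA section, is an assumption added only to show the LIBXML_NOCDATA option; it is not taken from the NCBI page.

```php
<?php
// Hypothetical sample document.
$xml = '<SAMPLE accession="ERS074283">'
     . '<TAXON_ID>119602</TAXON_ID>'
     . '<NOTE><![CDATA[pre-publication <data>]]></NOTE>'
     . '</SAMPLE>';

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$sample = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

$accession = (string) $sample['accession'];  // attributes use array syntax
$taxonId   = (int) $sample->TAXON_ID;        // child elements use property syntax
$note      = (string) $sample->NOTE;

printf("%s / %d / %s\n", $accession, $taxonId, $note);
```

Note that this only helps once you hold a string of real XML; it does not by itself separate the XML from the surrounding HTML page.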

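【Comments】:

  • The problem is that I cannot get the XML on its own... what I get with curl is the complete HTML page :/
  • Oh, sorry. Maybe, after fetching the HTML of the page with curl or file_get_contents, you could try DOM (php.net/manual/en/book.dom.php). However, they put a lot of HTML in there to prettify the XML code, so extracting the information will be somewhat difficult. Perhaps, once you have found the element that contains the XML, you should replace the b elements with something else, then use strip_tags to get rid of the HTML tags and re-parse what used to be the b elements. Or I suppose you could also walk the DOM.
【Solution 3】: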

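Clicking "Send data" > "Get" takes you to another page with options to download in different formats. This URL: http://trace.ncbi.nlm.nih.gov/Traces/sra/?cmd=dload&run_list=ERR109334&format=fasta appears to serve the data in gzip format. Maybe you could issue a GET against that source instead of trying to parse the XML out of the HTML?

【Comments】:

  • That file is huge (13 MB and still downloading when I stopped it). The XML cannot be that big.
  • Maybe that is the FASTA file format? I did not look at the other options on that page, but it appears to be a direct download link...
  • Ah, sorry, those are sequence files; in any case it would only stop at about 1/2 GB (I think xD). Thanks!
【Solution 4】: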

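You would have to list all valid HTML tags and strip them from the web page. For example:

$taglist = ['div', 'b', 'input']; // List the HTML tags here.
$xml = file_get_contents('http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml'); // Read in the web page.
foreach ($taglist as $tag) {
    // Remove the opening tag together with any attributes.
    // The "~" delimiters are required: without them PCRE treats "<" and ">" as delimiters.
    // "\b" keeps "b" from also matching "<body>".
    $regex = '~<' . preg_quote($tag, '~') . '\b[^>]*>~i';
    $xml = preg_replace($regex, '', $xml);

    // Repeat for the closing tag
    $regex = '~</' . preg_quote($tag, '~') . '\s*>~i';
    $xml = preg_replace($regex, '', $xml);
}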

Once that is done, $xml will contain the XML as a string, and PHP should be able to process it.
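
Whether the stripped string really parses as XML depends on how the page encodes it. The sketch below assumes, as the other answers suggest, that the XML sits in the page as entity-encoded text (&lt;TAG&gt;) wrapped in HTML markup; under that assumption the entities have to be decoded after the tags are removed. The fragment is made up for illustration, and strip_tags is used in place of a hand-written tag list.

```php
<?php
// Hypothetical fragment in the assumed shape of the page.
$fragment = '<div class="xml-tag">&lt;TAG&gt;<b>Strain</b>&lt;/TAG&gt;</div>';

// Real HTML tags (<div>, <b>) are removed; the entity-encoded text is left alone.
$text = strip_tags($fragment);

// Turn &lt; and &gt; back into < and > so the text becomes actual XML.
$xml = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Parsing succeeds only if the result is well-formed.
$element = simplexml_load_string($xml);
if ($element === false) {
    echo "not well-formed\n";
} else {
    echo $element->getName(), ': ', (string) $element, "\n";
}
```

On a full page, everything outside the XML container (navigation, scripts, headings) would also survive as text, so the container still has to be isolated first.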

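【Comments】:

  • I have tried strip_tags, but something still seems to get through... If I clean it all up, will I be able to process it as XML?
【Solution 5】:

This class, XmlRead, can do it. I have also set up a cURL class to go with it.

cURL: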

 function HeaderProc($response,$Run="",$String=1/*[Is 1 IF Use for String Mode ]*/){
          if($String==1){
             $response=explode("\r\n",$response);  
          }
          $PartHeader=0;
          $out[$PartHeader]=array();
          foreach($response as $key=>$val){ // each() was removed in PHP 8
              $name='';
              $value='';
              $flag=false;
              for($i=0;$i<strlen($val);$i++){
                  if($val[$i]==":"){
                      $flag=true;
                      for($j=$i+1;$j<strlen($val);$j++){
                        if($val[$i]=="\r" and $val[$i+1]=="\n"){    
                            break;
                        }
                        $value.=$val[$j];
                      }
                      break;
                  }
                  $name.=$val[$i]; 
              }
              if($flag){
                if($name=='' and $value==''){
                    $PartHeader++;  
                }else{
                  if(isset($out[$PartHeader][$name])){
                    if(is_array($out[$PartHeader][$name])){   
                        $out[$PartHeader][$name][]=$value;
                    }else{
                        $T=$out[$PartHeader][$name];
                        $out[$PartHeader][$name]=array();
                        $out[$PartHeader][$name][0]=$T;  
                        $out[$PartHeader][$name][1]=$value;  
                    }
                  }else{
                    $out[$PartHeader][$name]=$value;
                  }
                }
              }else{
                if($name==''){
                    $PartHeader++;  
                }else{
                    if(isset($out[$PartHeader][$name])){ 
                      if(is_array($out[$PartHeader][$name])){   
                        $out[$PartHeader][$name][]=$value;
                      }else{
                        $T=$out[$PartHeader][$name];
                        $out[$PartHeader][$name]=array();
                        $out[$PartHeader][$name][0]=$T;  
                        $out[$PartHeader][$name][1]=$name;  
                      }
                    }else{
                        $out[$PartHeader][$name]=$name; 
                    }
                } 
              }
              if($Run!=""){
                $Run($name,$value);  
              }
          }
          return $out;
}

class cURL { 
    var $headers; 
    var $user_agent; 
    var $compression; 
    var $cookie_file; 
    var $proxy; 
    var $Cookie; 
    function CookieAnalysis($Cookie){//convert str cookie to array cookie 
       //echo $Cookie;
       $this->Cookie=array();
       preg_match("~(.*?)=(.*?);~si",' '.$Cookie.'; ',$M);
       $this->Cookie[trim($M[1])]=trim($M[2]);
       return $this->Cookie;
    }
    function __construct($cookies=false,$cookie='cookies.txt',$compression='gzip',$proxy='') { // PHP 4-style constructors named after the class no longer run in PHP 8
         $this->headers[] = 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
         $this->headers[] = 'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3'; 
         $this->headers[] = 'Accept-Encoding:gzip,deflate,sdch';
         $this->headers[] = 'Accept-Language:en-US,en;q=0.8';
         $this->headers[] = 'Cache-Control:max-age=0';
         $this->headers[] = 'Connection:keep-alive';
         $this->user_agent = 'User-Agent:Mozilla/5.0 (SepidarSoft [Organic Search Engine Crawler] Linux Edition) AppleWebKit/536.5 (KHTML, like Gecko) SepidarBrowser/1.0.100.52 Safari/536.5';
         $this->compression=$compression; 
         $this->proxy=$proxy; 
         $this->cookies=$cookies; 
         if ($this->cookies == TRUE) $this->cookie($cookie); 
    } 
    function cookie($cookie_file) { 
         if (file_exists($cookie_file)) { 
            $this->cookie_file=$cookie_file; 
         } else { 
            fopen($cookie_file,'w') or $this->error('The cookie file could not be opened. Make sure this directory has the correct permissions');
            $this->cookie_file=$cookie_file; 
            @fclose($this->cookie_file); 
         } 
    }
    function GET($url) { 
         $process = curl_init($url); 
         curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers); 
         curl_setopt($process, CURLOPT_HEADER, 1); 
         curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent); 
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
         curl_setopt($process,CURLOPT_ENCODING , $this->compression); 
         curl_setopt($process, CURLOPT_TIMEOUT, 30); 
         if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy); 
         curl_setopt($process, CURLOPT_RETURNTRANSFER, 1); 
         curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1); 
         $response = curl_exec($process);
         $header_size = curl_getinfo($process,CURLINFO_HEADER_SIZE);
         $result['Header'] = HeaderProc(substr($response, 0, $header_size),'',1);
         foreach($result['Header'] as $HeaderK=>$HeaderP){
           if(!is_array($HeaderP['Set-Cookie']))continue;
           foreach($HeaderP['Set-Cookie'] as $key=>$val){
             $result['Header'][$HeaderK]['Set-Cookie'][$key]=$this->CookieAnalysis($val);
           }
         }
         $result['Body'] = substr( $response, $header_size );
         $result['HTTP_State'] = curl_getinfo($process,CURLINFO_HTTP_CODE);
         $result['URL'] = curl_getinfo($process,CURLINFO_EFFECTIVE_URL); 
         curl_close($process); 
         return $result; 
    }
    function POST($url,$data) { 
         $process = curl_init($url); 
         curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers); 
         curl_setopt($process, CURLOPT_HEADER, 1); 
         curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent); 
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
         curl_setopt($process, CURLOPT_ENCODING , $this->compression); 
         curl_setopt($process, CURLOPT_TIMEOUT, 30); 
         if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy); 
         curl_setopt($process, CURLOPT_POSTFIELDS, $data); 
         curl_setopt($process, CURLOPT_RETURNTRANSFER, 1); 
         curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1); 
         curl_setopt($process, CURLOPT_POST, 1);
         $response = curl_exec($process); 
         $header_size = curl_getinfo($process,CURLINFO_HEADER_SIZE);
         $result['Header'] = HeaderProc(substr($response, 0, $header_size),'',1);
         foreach($result['Header'] as $HeaderK=>$HeaderP){
            if(!is_array($HeaderP['Set-Cookie']))continue;
           foreach($HeaderP['Set-Cookie'] as $key=>$val){
             $result['Header'][$HeaderK]['Set-Cookie'][$key]=$this->CookieAnalysis($val);
           }
         }
         $result['Body'] = substr( $response, $header_size );
         $result['HTTP_State'] = curl_getinfo($process,CURLINFO_HTTP_CODE);
         $result['URL'] = curl_getinfo($process,CURLINFO_EFFECTIVE_URL);
         curl_close($process); 
         return $result; 
    }
    function error($error) { 
         echo "<center><div style='width:500px;border: 3px solid #FFEEFF; padding: 3px; background-color: #FFDDFF;font-family: verdana; font-size: 10px'><b>cURL Error</b><br>$error</div></center>";
         die; 
    } 
 } 

XmlRead

 class XmlRead{    
    static function Clean($html){
   $html=preg_replace_callback("~<script(.*?)>(.*?)</script>~si",function($m){
      //print_r($m);
     // $m[2]=preg_replace("/\/\*(.*?)\*\/|[\t\r\n]/s"," ", " ".$m[2]." ");
      $m[2]=preg_replace("~//(.*?)\n~si"," ", " ".$m[2]." ");
      //echo $m[2];
      return "<script ".$m[1].">".$m[2]."</script>";
      }, $html);
  $search = array(
        "/ +/" => " ",
        "/<!--\{(.*?)\}-->|<!--(.*?)-->|[\t\r\n]|<!--|-->|\/\/ <!--|\/\/ -->|<!\[CDATA\[|\/\/ \]\]>|\]\]>|\/\/\]\]>|\/\/<!\[CDATA\[/" => "");
  //$html = preg_replace(array_keys($search), array_values($search), $html);   
  $search = array(
       "/\/\*(.*?)\*\/|[\t\r\n]/s" => "",
       "/ +\{ +|\{ +| +\{/" => "{",
       "/ +\} +|\} +| +\}/" => "}",
       "/ +: +|: +| +:/" => ":",
       "/ +; +|; +| +;/" => ";",
       "/ +, +|, +| +,/" => ","
       );
       $html = preg_replace(array_keys($search), array_values($search), $html);
       preg_match_all('!(<(?:code|pre|script).*>[^<]+</(?:code|pre|script)>)!',$html,$pre);
$html = preg_replace('!<(?:code|pre).*>[^<]+</(?:code|pre)>!', '#pre#', $html);
$html = preg_replace('#<!--[^\[].+-->#', '', $html);
$html = preg_replace('/[\r\n\t]+/', ' ', $html);
$html = preg_replace('/>[\s]+</', '><', $html);
$html = preg_replace('/\s+/', ' ', $html);
if (!empty($pre[0])) {
    foreach ($pre[0] as $tag) {
        $html = preg_replace('!#pre#!', $tag, $html,1);
    }
}
return($html);
}
function loadNprepare($content,$encod='') {
   $content=self::Clean($content);
   //$content=html_entity_decode(html_entity_decode($content));
  // $content=htmlspecialchars_decode($content,ENT_HTML5);
   $this->DataPage='';
   preg_match('~<body(.*?)>(.*?)</body>~si',$content,$M);
   $this->DataPage=$M[2];
   $HTML=$this->DataPage;
   $HTML="<!doctype html><html><head><meta charset=\"utf-8\"><title>Untitled Document</title></head><body>".$HTML."</body></html>";
   $dom= new DOMDocument; 
   $HTML = str_replace("&", "&amp;", $HTML);  // disguise &s going IN to loadXML() 
  // $dom->substituteEntities = true;  // collapse &s going OUT to transformToXML() 
   $dom->recover = TRUE;
   @$dom->loadHTML('<?xml encoding="UTF-8">' .$HTML); 
   // dirty fix
   foreach ($dom->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
      $dom->removeChild($item); // remove hack
   $dom->encoding = 'UTF-8'; // insert proper
    return $dom;
}
function GetBYClass($Doc,$ClassName){
    $finder = new DomXPath($Doc);
    return($finder->query("//*[contains(@class, '$ClassName')]"));
}
function extractText($node) {
     if($node==NULL)return false;    
     if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
         return $node->nodeValue;
     } else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
       if ('script' === $node->nodeName) return '';

     $text = '';
     foreach($node->childNodes as $childNode) {
        $text .= $this->extractText($childNode);
     }
     return $text;
     }
}
function DOMRemove(DOMNode $from) {

    $from->parentNode->removeChild($from);    
 }

}

Calling the classes, configured for your page:

 $cc = new cURL(); //
 $XmlRead=new XmlRead();
 $Data=$cc->get('http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml');
     //get page 
 $doc=$XmlRead->loadNprepare($Data['Body']);//load as html
     //remove two part of page related to your page .
 $productspec=$XmlRead->DOMRemove($XmlRead->GetBYClass($doc,'title')->item(0));
 $productspec=$XmlRead->DOMRemove($XmlRead->GetBYClass($doc,'aux')->item(0));
     //select xml part
 $productspec=$XmlRead->GetBYClass($doc,'rprt');
 foreach($productspec as $data)
 {
    $content=html_entity_decode(html_entity_decode($XmlRead->extractText($data)));//decode as entity html 
    print_r($content);  
 }

Output:

 <EXPERIMENT_PACKAGE><EXPERIMENT alias="SC_EXP_7229_8#56"center_name="SC"accession="ERX086768"><IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID></IDENTIFIERS><TITLE></TITLE><STUDY_REF accession="ERP000913"refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977"refcenter="SC"><IDENTIFIERS><PRIMARY_ID>ERP000913</PRIMARY_ID><SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID></IDENTIFIERS></STUDY_REF><DESIGN><DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION><SAMPLE_DESCRIPTOR accession="ERS074283"refname="MR223754-sc-2011-11-18T11:31:44Z-1306470"refcenter="SC"><IDENTIFIERS><PRIMARY_ID>ERS074283</PRIMARY_ID><SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID></IDENTIFIERS></SAMPLE_DESCRIPTOR><LIBRARY_DESCRIPTOR><LIBRARY_NAME>4008297</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT><PAIRED NOMINAL_LENGTH="250"></PAIRED></LIBRARY_LAYOUT></LIBRARY_DESCRIPTOR><SPOT_DESCRIPTOR><SPOT_DECODE_SPEC><READ_SPEC><READ_INDEX>0</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Forward</READ_TYPE><BASE_COORD>1</BASE_COORD></READ_SPEC><READ_SPEC><READ_INDEX>1</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Reverse</READ_TYPE><RELATIVE_ORDER follows_read_index="0"></RELATIVE_ORDER></READ_SPEC></SPOT_DECODE_SPEC></SPOT_DESCRIPTOR></DESIGN><PLATFORM><ILLUMINA><INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL></ILLUMINA></PLATFORM><PROCESSING></PROCESSING></EXPERIMENT><SUBMISSION accession="ERA119046"center_name="SC"submission_date="2012-04-17T09:29:50Z"alias="ERP000913-sc-20120417-2"lab_name=""><IDENTIFIERS><PRIMARY_ID>ERA119046</PRIMARY_ID><SUBMITTER_ID 
namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID></IDENTIFIERS></SUBMISSION><STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977"center_name="SC"accession="ERP000913"><IDENTIFIERS><PRIMARY_ID>ERP000913</PRIMARY_ID><SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID></IDENTIFIERS><DESCRIPTOR><STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE><STUDY_TYPE existing_study_type="Whole Genome Sequencing"></STUDY_TYPE><STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT><CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME><STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria),please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION></DESCRIPTOR></STUDY><SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470"center_name="SC"accession="ERS074283"><IDENTIFIERS><PRIMARY_ID>ERS074283</PRIMARY_ID><SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID></IDENTIFIERS><SAMPLE_NAME><COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME><TAXON_ID>119602</TAXON_ID><SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. 
equisimilis</SCIENTIFIC_NAME></SAMPLE_NAME><SAMPLE_LINKS><SAMPLE_LINK><ENTREZ_LINK><DB>biosample</DB><ID>859730</ID></ENTREZ_LINK></SAMPLE_LINK></SAMPLE_LINKS><SAMPLE_ATTRIBUTES><SAMPLE_ATTRIBUTE><TAG>Strain</TAG><VALUE>MR223754</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>Sample Description</TAG><VALUE></VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-StrainOrLine</TAG><VALUE>MR223754</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-Sex</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-Species</TAG><VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE></SAMPLE_ATTRIBUTE></SAMPLE_ATTRIBUTES></SAMPLE><RUN_SET><RUN alias="SC_RUN_7229_8#56"center_name="SC"accession="ERR109334"total_spots="2708543"total_bases="406281450"size="334475592"load_done="true"published="2012-04-27 20:11:35"is_public="true"cluster_name="public"static_data_available="1"><IDENTIFIERS><PRIMARY_ID>ERR109334</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID></IDENTIFIERS><EXPERIMENT_REF refname="SC_EXP_7229_8#56"refcenter="SC"accession="ERX086768"><IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID></IDENTIFIERS></EXPERIMENT_REF><Pool><Member member_name=""accession="ERS074283"sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470"spots="2708543"bases="406281450"></Member></Pool></RUN></RUN_SET></EXPERIMENT_PACKAGE>

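【Comments】:

  • It says "Using $this when not in object context"; I get this error on the preg_replace_callback at line 205 (function loadNprepare).
  • That is because your PHP version is too old. No problem, I have updated that part of the code.
  • IMHO you should not use this class. It does far too much without explaining anything, and the code quality is poor. Above all, it is not necessary. Plain DOMDocument and DOMXPath work fine, see my answer: stackoverflow.com/a/15863656/367456 - it needs only a fraction of the code to do the job, and it even formats the output properly.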