【Question title】:Extract meta-description content using DOMDocument() & XPath in PHP
【Posted】:2016-04-25 14:09:17
【Question description】:

I am trying to extract the meta-description content from a page and show it in the search results. However, it is displayed like this:

content="Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus."

whereas I only want:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Any idea what is wrong with my code?

Code:

  $doc = new DOMDocument();
  @$doc->loadHTMLFile($page_path);
  $xpath = new DOMXPath($doc);

  $body = $xpath->query('//meta[@name="description"]/@content');
  $page_title = @$doc->getElementsByTagName('title')->item(0)->textContent;
  $page_title = $page_title ? $page_title : $page_path;
  $page_body = html2text($doc->saveXml($body->item(0)));// this is meta-description, which i want

  Functions :

  function html2text($html)
  {
  $text = $html;
  static $search = array(
  '@<script.+?</script>@usi',  // Strip out javascript content
  '@<style.+?</style>@usi',    // Strip style content
  '@<!--.+?-->@us',            // Strip multi-line comments including CDATA
  '@</?[a-z].*?\>@usi',         // Strip out HTML tags
  );
  $text = preg_replace($search, ' ', $text);
  /*
  * normalize common entities
  */
  $text = normalizeEntities($text);
  /*
  * decode other entities
  */
  $text = html_entity_decode($text, ENT_QUOTES, 'utf-8');
  /*
  * normalize possibly repeated newlines, tabs, spaces to spaces
  */
  $text = preg_replace('/\s+/u', ' ', $text);
  $text = trim($text);
  return $text;
  }


  /**
  * Replace encoded and double encoded entities to equivalent unicode character
  * @param string $text
  * @return string - the same as $text but without encoded entries
  * @access public
  */
  function normalizeEntities($text)
  {
  static $find = array();
  static $repl = array();
  if (!count($find)) {
  /*
  * build $find and $replace from map one time
  */
  $map = array(
  array('\'', 'apos', 39, 'x27'), // Apostrophe
  array('\'', '‘', 'lsquo', 8216, 'x2018'), // Open single quote
  array('\'', '’', 'rsquo', 8217, 'x2019'), // Close single quote
  array('"', '“', 'ldquo', 8220, 'x201C'), // Open double quotes
  array('"', '”', 'rdquo', 8221, 'x201D'), // Close double quotes
  array('\'', '‚', 'sbquo', 8218, 'x201A'), // Single low-9 quote
  array('"', '„', 'bdquo', 8222, 'x201E'), // Double low-9 quote
  array('\'', '′', 'prime', 8242, 'x2032'), // Prime/minutes/feet
  array('"', '″', 'Prime', 8243, 'x2033'), // Double prime/seconds/inches
  array(' ', 'nbsp', 160, 'xA0'), // Non-breaking space
  array('-', '‐', 8208, 'x2010'), // Hyphen
  array('-', '–', 'ndash', 8211, 150, 'x2013'), // En dash
  array('--', '—', 'mdash', 8212, 151, 'x2014'), // Em dash
  array(' ', ' ', 'ensp', 8194, 'x2002'), // En space
  array(' ', ' ', 'emsp', 8195, 'x2003'), // Em space
  array(' ', ' ', 'thinsp', 8201, 'x2009'), // Thin space
  array('*', '•', 'bull', 8226, 'x2022'), // Bullet
  array('*', '‣', 8227, 'x2023'), // Triangular bullet
  array('...', '…', 'hellip', 8230, 'x2026'), // Horizontal ellipsis
  array('°', 'deg', 176, 'xB0'), // Degree
  array('€', 'euro', 8364, 'x20AC'), // Euro
  array('¥', 'yen', 165, 'xA5'), // Yen
  array('£', 'pound', 163, 'xA3'), // British Pound
  array('©', 'copy', 169, 'xA9'), // Copyright Sign
  array('®', 'reg', 174, 'xAE'), // Registered Sign
  array('™', 'trade', 8482, 'x2122') // TM Sign
  );
  foreach ($map as $e) {
  for ($i = 1; $i < count($e); ++$i) {
  $code = $e[$i];
  if (is_int($code)) {
  // numeric entity
  $regex = "/&(amp;)?#0*$code;/";
  } elseif (preg_match('/^.$/u', $code)/* one unicode char*/) {
  // single character
  $regex = "/$code/u";
  } elseif (preg_match('/^x([0-9A-F]{2}){1,2}$/i', $code)) {
  // hex entity
  $regex = "/&(amp;)?#x0*" . substr($code, 1) . ";/i";
  } else {
  // named entity
  $regex = "/&(amp;)?$code;/";
  }
  $find[] = $regex;
  $repl[] = $e[0];
  }
  }
  }
  return preg_replace($find, $repl, $text);
  }

【Question discussion】:

    Tags: php xml xpath xml-parsing domdocument


    【Solution 1】:

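    You are saving the attribute node as XML. Don't! Just read its value.

    Attribute nodes (DOMAttr) have a property, value, that returns the attribute value. The attribute value is a plain text value.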

    $html = <<<'HTML'
    <meta name="description" content="Some description">
    HTML;
    
    $document = new DOMDocument();
    $document->loadHTML($html);
    $xpath = new DOMXPath($document);
    
    $description = $xpath->evaluate('//meta[@name="description"]/@content')->item(0);
    var_dump($description->value);
    

    Output:

    string(16) "Some description"
    

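    However, XPath can return the value directly as a string. Just cast the result (inside the XPath expression). This only works with DOMXPath::evaluate(); DOMXPath::query() can only return node lists.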

    $description = $xpath->evaluate('string(//meta[@name="description"]/@content)');
    var_dump($description);
    

    Output:

    string(16) "Some description"
    

    This works for other nodes as well. If you cast an element node (like body), only its text content is returned, and entities like &copy; are decoded.

    $html = <<<'HTML'
    <html>
      <head>
        <title>The Title</title>
        <meta name="description" content="Some description">
      </head>
      <body>
        <p>Some content &amp; entities &copy;</p>
      </body>
    </html>
    HTML;
    
    $document = new DOMDocument();
    $document->loadHTML($html);
    $xpath = new DOMXPath($document);
    
    $title = $xpath->evaluate('string(//head/title)');
    $description = $xpath->evaluate('string(//meta[@name="description"]/@content)');
    $content = $xpath->evaluate('string(//body)');
    
    var_dump($title, $description, $content);
    

    Output:

    string(9) "The Title"
    string(16) "Some description"
    string(36) "
        Some content & entities ©
      "
    
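Applied to the code in the question, the same idea might look like the sketch below. The helper name extract_page_meta() and the fallback parameter are hypothetical, not part of the original post, and the sketch assumes the HTML is already available as a string. It reads the title and the description through string() casts, so neither saveXml() nor html2text() is needed, and it uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not from the original post): returns the page title
// and the meta description as plain strings.
function extract_page_meta(string $html, string $fallbackTitle = ''): array
{
    // Collect libxml parser warnings internally instead of suppressing with @.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // string() makes evaluate() return a PHP string; a missing node yields ''.
    $title = trim($xpath->evaluate('string(//head/title)'));
    $description = trim($xpath->evaluate('string(//meta[@name="description"]/@content)'));

    return [
        'title' => $title !== '' ? $title : $fallbackTitle,
        'description' => $description,
    ];
}

$html = <<<'HTML'
<html>
  <head>
    <title>The Title</title>
    <meta name="description" content="Some description">
  </head>
  <body><p>Some content</p></body>
</html>
HTML;

$meta = extract_page_meta($html, 'fallback.html');
var_dump($meta['title'], $meta['description']);
```

Because string() yields an empty string when nothing matches, the fallback to the page path from the question can be expressed as a simple comparison against ''.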

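    【Discussion】:

    • I used evaluate(). It still gives me content="Some description", whereas I only need Some description, without the content="" wrapper.
    • I think the original source you copied saved the body element as XML and then used string functions to try to deserialize it. That is not necessary; just read the value as a string.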
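The difference discussed in these comments can be seen by comparing the two ways of reading the same attribute node. This is a minimal sketch that reuses the meta tag from the answer; the variable names are illustrative only. Passing the attribute node to saveXML() serializes it as markup, name and quotes included, while the value property holds only the text.

```php
<?php
$document = new DOMDocument();
$document->loadHTML('<meta name="description" content="Some description">');
$xpath = new DOMXPath($document);

// query() returns a node list; item(0) is the DOMAttr node for @content.
$attribute = $xpath->query('//meta[@name="description"]/@content')->item(0);

// Serializing the attribute node produces markup: content="Some description"
$serialized = trim($document->saveXML($attribute));

// Reading the value property produces only the text: Some description
$plain = $attribute->value;

var_dump($serialized, $plain);
```

So switching from query() to evaluate() alone changes nothing as long as the result is still passed through saveXml(); the fix is to drop the serialization step.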