【Question title】:How to read JSON from a website URL using DOMDocument
【Posted】:2018-10-13 16:37:16
【Question】:

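In the code below, I am trying to read the application/ld+json JSON and get the ratingValue.

For the current URL (https://www.facebook.com/Dermaks), the result in $rate should be: 5

If you visit this URL (in view-source mode, around line 4), you will be able to see the JSON I want to read: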

<script type="application/ld+json"> {
    "\u0040context":"http:\/\/schema.org",
    "\u0040type":"LocalBusiness",
    "name":"Kosmetyka Profesjonalna Dermaks",
    "address": {
        "\u0040type": "PostalAddress", "streetAddress": "DERMAKS, ul. Hempla 4\/34a", "addressLocality": "Lublin, Poland", "addressRegion": "Lublin Voivodeship", "postalCode": "20-008"
    }
    ,
    "aggregateRating": {
        "\u0040type": "AggregateRating", "ratingValue": 5, "ratingCount": 2
    }
}

</script>

<script type="application/ld+json"> {
    "\u0040context":"http:\/\/schema.org",
    "\u0040type":"Review",
    "name":"",
    "reviewBody":"Profesjonalna  i przy tym bardzo,bardzo mi\u0142a obs\u0142uga. Zabiegi na bardzo wysokim poziomie. POLECAM next dw\u00f3ch zda\u0144!!!!!!!",
    "itemReviewed": {
        "\u0040type": "LocalBusiness", "name": "Kosmetyka Profesjonalna Dermaks", "sameAs": "https:\/\/www.facebook.com\/Dermaks\/"
    }
    ,
    "reviewRating": {
        "\u0040type": "Rating", "ratingValue": 5
    }
    ,
    "author": {
        "\u0040type": "Person", "name": "Malgorzata Mordo\u0144"
    }
}

</script>

How can I fix the code below?

$url = 'https://www.facebook.com/Dermaks';

function get_data($url, $timeout = 15, $header = array(), $options = array()) {
    if (!function_exists('curl_init')) {
        return file_get_contents($url);
    } elseif (!function_exists('file_get_contents')) {
        return '';
    }
    if (empty($options)) {
        $options = array(
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_SSL_VERIFYPEER => false,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V4,
            CURLOPT_TIMEOUT => $timeout
        );
    }
    if (empty($header)) {
        $header = array(
            "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*\/*;q=0.5",
            "Accept-Language: en-us,en;q=0.5",
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Cache-Control: must-revalidate, max-age=0",
            "Connection: keep-alive",
            "Keep-Alive: 300",
            "Pragma: public"
        );
    }
    if ($header != 'NO_HEADER') {
        $options[CURLOPT_HTTPHEADER] = $header;
    }
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$html = get_data($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$scripts = $doc->getElementsByTagName('script');
for ($i = 0; $i < $scripts->length; ++$i) {
  $script = $scripts->item($i);
  if ($script->getAttribute('type') == 'application/ld+json') {
    $rate = $script->getAttribute('ratingValue');
  }
}

echo $rate;
// result should be: 5

【Question discussion】:

    Tags: php json domdocument


    【Solution 1】:

    You have to run json_decode on the content of each $script.

    I assume you only need the aggregateRating value, since ratingValue appears in several elements.

    if ($script->getAttribute('type') == 'application/ld+json') {
      // Load as an array
      $entity = json_decode($script->nodeValue, true);
      // The JSON-LD key is "@type" ("\u0040type" in the page source); the value has no "@"
      if (is_array($entity) && isset($entity['@type'], $entity['aggregateRating'])
          && $entity['@type'] == 'LocalBusiness') {
        $rate = $entity['aggregateRating']['ratingValue'];
        break;
      }
    }
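
For reference, here is a minimal self-contained sketch of the same idea that runs offline. The sample HTML is a trimmed copy of the JSON-LD from the question, and the helper name find_aggregate_rating is hypothetical, not part of any library. It decodes every application/ld+json script and returns the rating of the first LocalBusiness entry it finds.

```php
<?php
// Sample page body: trimmed from the JSON-LD shown in the question (assumption for the demo).
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">{"\u0040context":"http:\/\/schema.org","\u0040type":"LocalBusiness","name":"Kosmetyka Profesjonalna Dermaks","aggregateRating":{"\u0040type":"AggregateRating","ratingValue":5,"ratingCount":2}}</script>
<script type="application/ld+json">{"\u0040context":"http:\/\/schema.org","\u0040type":"Review","reviewRating":{"\u0040type":"Rating","ratingValue":5}}</script>
</head><body></body></html>
HTML;

/**
 * Hypothetical helper: return the aggregate ratingValue of the first
 * LocalBusiness JSON-LD block in $html, or null if none is found.
 */
function find_aggregate_rating(string $html)
{
    if ($html === '') {
        return null; // nothing to parse (e.g. the fetch failed)
    }

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('script') as $script) {
        if ($script->getAttribute('type') !== 'application/ld+json') {
            continue;
        }
        // The JSON lives in the element's text, not in an attribute
        $entity = json_decode($script->textContent, true);
        if (!is_array($entity)) {
            continue; // body was not valid JSON
        }
        if (($entity['@type'] ?? null) === 'LocalBusiness'
            && isset($entity['aggregateRating']['ratingValue'])) {
            return $entity['aggregateRating']['ratingValue'];
        }
    }
    return null;
}

$rate = find_aggregate_rating($html);
var_dump($rate);
```

Returning null when nothing matches makes it easy to tell "no rating on the page" apart from a real value, which matters when the fetched HTML differs from what the browser shows.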
    

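    By the way, DOMDocument's loadHTMLFile method should be able to fetch the URL itself, provided php.ini is configured appropriately (in particular, allow_url_fopen enabled):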

    $doc->loadHTMLFile($url);
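
A rough sketch of how loadHTMLFile might be used with basic error handling. To keep it runnable offline, a temporary local file stands in for the remote page (that substitution is an assumption of the demo). A URL string would be passed the same way, but only works when the allow_url_fopen setting is on, which the sketch reports.

```php
<?php
// Whether URL-aware stream wrappers are enabled; remote loading depends on this setting.
$urlWrappersEnabled = (bool) ini_get('allow_url_fopen');

// A local temporary file stands in for the remote page (assumption for the demo).
$path = tempnam(sys_get_temp_dir(), 'ld');
file_put_contents($path, '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>');

$doc = new DOMDocument();
// Keep libxml parse warnings out of the output without using @
$previous = libxml_use_internal_errors(true);
$loaded = $doc->loadHTMLFile($path); // a URL string is passed in the same position
libxml_clear_errors();
libxml_use_internal_errors($previous);
unlink($path);

// Read something back to confirm the document was parsed
$title = $loaded ? $doc->getElementsByTagName('title')->item(0)->textContent : null;

echo $urlWrappersEnabled ? "URL loading available\n" : "allow_url_fopen is off\n";
echo $title, "\n";
```

Note that loadHTMLFile sends none of the custom headers from get_data, so a site may respond differently to it than to the cURL request.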
    

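    【Discussion】:

    • With your code, I can see that echo $i runs through elements 0 to 13, but I still cannot get anything from the application/ld+json type.
    • @HubertFurka Yes, you are right. Maybe it is because that information is inserted by JavaScript after the page loads. Or because they want you to use their API. The problem is that the $html you fetch does not contain those script tags.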
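
One way to check the point made in the last comment is to count the JSON-LD scripts in whatever HTML the server actually returned. The sketch below is only illustrative: count_json_ld_scripts is a hypothetical helper, and the two sample strings stand in for "what the browser shows" and "what the script received".

```php
<?php
/**
 * Hypothetical helper: count <script type="application/ld+json"> elements
 * in an HTML string.
 */
function count_json_ld_scripts(string $html): int
{
    if ($html === '') {
        return 0; // empty response, nothing to inspect
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath selects only the script elements with the JSON-LD type
    $xpath = new DOMXPath($doc);
    return $xpath->query('//script[@type="application/ld+json"]')->length;
}

// Stand-ins for the two cases discussed in the comments (assumed sample data)
$withJsonLd = '<html><head><script type="application/ld+json">{"a":1}</script></head><body></body></html>';
$withoutJsonLd = '<html><head><script>var x = 1;</script></head><body></body></html>';

echo count_json_ld_scripts($withJsonLd), "\n";
echo count_json_ld_scripts($withoutJsonLd), "\n";
```

Running the helper on the string returned by get_data($url) would show whether the tags are missing from the response itself, in which case no amount of DOM parsing on the PHP side can recover them.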