Scraping a page in PHP

【Question title】: Scraping a page in PHP
【Posted】: 2019-01-08 10:14:30
【Question description】:

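I want to scrape some data from footballstats.com using the PHP Simple HTML DOM Parser, but I can't, because a cookie page always appears before the normal page loads. How can I get past the cookie page? My code looks like this: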

<?php
    include_once('../scrapper/scrapper.php');
    $url = 'https://www.soccerstats.com/matches.asp';
    $html = file_get_html($url);

    $stats = array();
    foreach($html->find('table') as $table) {
        $stats[] = $table->outertext;
    }
    $results = implode(",", $stats);    

    echo $results; 
?>

【Question comments】:

You should rename your "scrapper" with just one p: "scraper".

Tags: php parsing dom

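【Solution 1】:

A quick look at the page https://www.soccerstats.com/matches.asp shows that all the "cookie page" really does is require the user to click a button. When clicked, that button simply sets a cookie named cookiesok with the value yes, as the page's source shows: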

<button class="button button3" onclick=" setCookielocal('cookiesok', 'yes', 365)"><font size='4'>I agree. Continue to website.</font></button>

So what we need to do is somehow make PHP fetch the page with this cookie set.

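Since you are using the https://sourceforge.net/projects/simplehtmldom/ library and its file_get_html() function, I looked at that function's source and found that it does use the file_get_contents() function under the hood. It also lets us pass our own "context", which we can create with the stream_context_create() function.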

In short, stream_context_create() lets us create a context that carries the required cookie, and file_get_html() then uses that context for its request.
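
To see what such a context actually holds before making any request, here is a minimal sketch. The helper name buildCookieContext() is hypothetical (it is not part of simple_html_dom), and it assumes the cookie names and values are already safe to send as-is.

```php
<?php

/**
 * Hypothetical helper (not part of simple_html_dom): builds a stream context
 * whose HTTP requests carry the given cookies.
 * Assumes names and values need no extra encoding.
 */
function buildCookieContext(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        // A Cookie header lists name=value pairs separated by "; ".
        $pairs[] = $name . '=' . $value;
    }

    $options = [
        'http' => [
            'header' => 'Cookie: ' . implode('; ', $pairs) . "\r\n",
        ],
    ];

    return stream_context_create($options);
}

$context = buildCookieContext(['cookiesok' => 'yes']);

// Read the options back from the context; no network request is made here.
$sent = stream_context_get_options($context);
echo $sent['http']['header'];
```

The resulting $context can be passed as the third argument of file_get_html(), exactly like a context built by hand.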

Final code:

<?php

    include_once '../scrapper/scrapper.php';

    // Options for the context we're about to create.
    $options = [
        "http" => [
            "header" => "Cookie: cookiesok=yes\r\n",
        ],
    ];

    // Context we're going to pass to the file_get_html() function.
    $context = stream_context_create($options);

    $url = 'https://www.soccerstats.com/matches.asp';
    $html = file_get_html($url, false, $context);

    $stats = array();
    foreach($html->find('table') as $table) {
        $stats[] = $table->outertext;
    }
    $results = implode(",", $stats);

    echo $results;
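
If you want to confirm that the cookie was honoured, one option is to check whether the fetched markup still contains the consent button. The sketch below is only an illustration: looksLikeConsentPage() is a hypothetical helper, and it assumes the real page does not itself contain the setCookielocal('cookiesok' call, which I have not verified.

```php
<?php

/**
 * Hypothetical check: returns true when the markup still looks like the
 * consent page, i.e. it contains the button's setCookielocal('cookiesok' call.
 * Assumption: the real content page does not include that call.
 */
function looksLikeConsentPage($markup)
{
    return strpos($markup, "setCookielocal('cookiesok'") !== false;
}

// Sample inputs for illustration only; the second one is made up.
$consentPage = "<button class=\"button button3\" onclick=\" setCookielocal('cookiesok', 'yes', 365)\">I agree. Continue to website.</button>";
$contentPage = '<table><tr><td>Match</td></tr></table>';

var_dump(looksLikeConsentPage($consentPage));
var_dump(looksLikeConsentPage($contentPage));
```

In the final code you could call it on the string form of $html before looping over the tables, and stop early if it returns true.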

【Discussion】: