Simple HTML DOM Caching: Answers

【Question title】: Simple HTML DOM Caching
【Posted】: 2011-12-15 09:01:03
【Question】:

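I'm using Simple HTML DOM to scrape (with permission) some websites. I scrape about 50 different websites that publish statistics, and the data is updated about four times a day.

As you can imagine, scraping takes time, so I need some caching to speed the process up.

My vision is:

DATA-PRESENTATION.php // where all the results are displayed

SCRAPING.php // the code that does the work

I want to set up a cron job on SCRAPING.php so that it runs four times a day and saves all the data in a cache. DATA-PRESENTATION.php would then read from that cache, which makes the page much faster for users.

My question is: how do I implement this cache? I'm new to PHP and have been reading tutorials, but they aren't very helpful and there are only a few of them, so I can't really learn how to do it.

I know another solution would be a database, but I don't want to do that. I've also read about high-end solutions such as memcached, but the site is very simple and for personal use only, so I don't need that sort of thing.

Thanks!!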

SCRAPING.PHP

<?php
include("simple_html_dom.php");

// Labour stats
$html7 = file_get_html('http://www.website1.html');
$web_title = $html7->find(".title h1");
$web_figure = $html7->find(".figures h2");

?>

DATA-PRESENTATION.PHP

 <div class="news-pitch">
 <h1>Website: <?php echo utf8_encode($web_title[0]->plaintext); ?></h1>
 <p>Unemployment rate: <?php echo utf8_encode($web_figure[0]->plaintext); ?></p>
 </div>

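Final code! Many thanks to @jerjer and @PaulD.Waite; I really couldn't have got this working without your help!

Files:

1- DataPresentation.php // displays the data read from Cache.html

2- Scraping.php // scrapes the website, then saves the result to Cache.html

3- Cache.html // holds the scraped result (cache/test.html in the code below)

I set up a cron job on Scraping.php that overwrites Cache.html on every run.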

1- DataPresentation.php

<?php
include("simple_html_dom.php");

$html = file_get_html("cache/test.html");
$title = $html->find("h1");
echo $title[0]->plaintext;
?>

2- Scraping.php

<?php
include("simple_html_dom.php");

// The whole page is still downloaded, but keeping only the <h1> elements makes
// the cache file small, so the presentation page has very little to parse.
$filename = "cache/test.html";
$content = file_get_html('http://www.website.com/')->find("h1");
file_put_contents($filename, $content);
?>

3- Cache.html

<h1>Current unemployment 7,2%</h1>

It loads instantly, and with this setup I can be sure there is always a cache file to load.
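One caveat with overwriting the cache file from a cron job is that a page request arriving mid-write could read a half-written file. A minimal sketch of one way to avoid that, assuming a POSIX filesystem, is to write to a temporary file and rename it into place. The helper name `write_cache_atomically` is hypothetical and not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper (an assumption, not part of Simple HTML DOM):
// replace a cache file in one step so readers never see a partial file.
function write_cache_atomically(string $path, string $content): bool
{
    $dir = dirname($path);

    // Create the cache directory on first use.
    if (!is_dir($dir) && !mkdir($dir, 0775, true)) {
        return false;
    }

    // Write to a temporary file in the same directory, so that rename()
    // stays on one filesystem.
    $tmp = tempnam($dir, 'tmp');
    if ($tmp === false) {
        return false;
    }
    if (file_put_contents($tmp, $content) === false) {
        unlink($tmp);
        return false;
    }

    // tempnam() creates the file readable by its owner only; relax that so a
    // web server running as a different user can read the cache.
    chmod($tmp, 0644);

    // On POSIX filesystems rename() swaps the new file in as a single step.
    if (!rename($tmp, $path)) {
        unlink($tmp);
        return false;
    }
    return true;
}
```

In Scraping.php this could stand in for the direct `file_put_contents($filename, ...)` call, provided the scraped nodes are first joined into a single string, for example with `implode('', $content)`.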

【Comments】:

You can use files instead of a database for caching.

Tags: php caching web-scraping

【Answer 1】:

Here is an example of file-based caching:

<?php
    // Labour stats
    $filename = "cache/website1.html";
    if(!file_exists($filename)){
        $content = file_get_contents('http://www.website1.html');
        file_put_contents($filename, $content);
    }

    $html7 = file_get_html($filename);
    $web_title = $html7->find(".title h1");
    $web_figure = $html7->find(".figures h2");

?>
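
Since the question mentions roughly 50 sites, the single-file example above could be extended so that each URL gets its own cache file. The following is only a sketch under assumptions: the helper names `cache_path_for_url` and `refresh_cache` are hypothetical, and the optional `$fetch` parameter exists purely so the download step can be swapped out (it defaults to `file_get_contents`, as in the answer).

```php
<?php
// Hypothetical helper: map a URL to a stable, filesystem-safe cache file name.
function cache_path_for_url(string $url, string $cacheDir = 'cache'): string
{
    return rtrim($cacheDir, '/') . '/' . md5($url) . '.html';
}

// Hypothetical helper: download every URL and store each page in its own
// cache file. Returns the list of URLs that could not be fetched.
function refresh_cache(array $urls, string $cacheDir = 'cache', ?callable $fetch = null): array
{
    // Default to file_get_contents(), as used in the answer above.
    $fetch = $fetch ?? 'file_get_contents';

    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0775, true);
    }

    $failed = array();
    foreach ($urls as $url) {
        $content = $fetch($url);
        if ($content === false) {
            // Leave any previous cache file untouched when a download fails.
            $failed[] = $url;
            continue;
        }
        file_put_contents(cache_path_for_url($url, $cacheDir), $content);
    }
    return $failed;
}
```

A cron-driven script could call `refresh_cache($urls)` with the real site list, and the presentation page could then pass `cache_path_for_url($url)` to `file_get_html()` instead of the live address.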

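【Comments】:

You need to add some code so that it refreshes four times a day. If I understand that code correctly, it scrapes the site once and then loads the cached file forever. The cron job could, for example, delete the cache file each time it runs.
Agreed, Paul. Invalidating the cache file is the cron job's task.
Thank you all so much! I really couldn't have done this without your help! I've posted the final code!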
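
The refresh concern raised in these comments can also be handled without deleting files from cron, by treating the cache file as stale once it passes a certain age. This is a sketch of that alternative, not what the answer's author wrote; `cache_is_fresh` is a hypothetical helper, and the six-hour figure in the usage note is simply 24 hours divided by the four daily updates mentioned in the question.

```php
<?php
// Hypothetical helper: true when the cache file exists and was last written
// less than $maxAge seconds ago.
function cache_is_fresh(string $filename, int $maxAge): bool
{
    if (!is_file($filename)) {
        return false;
    }
    $mtime = filemtime($filename);
    return $mtime !== false && (time() - $mtime) < $maxAge;
}
```

In Answer 1, the `if(!file_exists($filename))` test could then become `if(!cache_is_fresh($filename, 6 * 3600))`, so the page is fetched again once the cached copy is more than six hours old.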

【Answer 2】:

Try the Zend_Cache library from Zend Framework. It is very simple to use:

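// Assumes Zend Framework 1 and simple_html_dom.php are on the include path
require_once 'Zend/Cache.php';
require_once 'simple_html_dom.php';

function loadHtmlWithCache($webAddress){

    $frontendOptions = array(
       'lifetime' => 7200, // cache lifetime of 2 hours
       'automatic_serialization' => true
    );

    $backendOptions = array(
        'cache_dir' => './tmp/' // Directory where to put the cache files
    );

    // getting a Zend_Cache_Core object
    $cache = Zend_Cache::factory('Core',
                                 'File',
                                 $frontendOptions,
                                 $backendOptions);

    // Zend_Cache ids may only contain [a-zA-Z0-9_], so hash the URL
    $cacheId = md5($webAddress);

    if( ($result = $cache->load($cacheId)) === false ) {

       // cache miss: scrape the page and keep only the extracted text,
       // which serializes far more cleanly than the DOM node objects
       $html7 = file_get_html($webAddress);
       $web_title = $html7->find(".title h1");
       $web_figure = $html7->find(".figures h2");
       $result = array(
           'title'  => $web_title[0]->plaintext,
           'figure' => $web_figure[0]->plaintext
       );

       // save() takes the data first, then the id
       $cache->save($result, $cacheId);

    }

    // on a cache hit, $result already holds the stored array
    return $result;

}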

【Comments】: