Extracting content from Wikipedia in PHP and MySQL: answers

【Question title】: Extracting content from Wikipedia in PHP and MySQL
【Posted】: 2012-11-17 07:06:14
【Question description】:

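I have a web page that links to all of Wikipedia's featured articles, and I want to extract the title, description and keywords of each of those articles. The problem is that when the web crawler extracts the article content, the description and keywords fields in my database stay empty.

How can I extract the description and keywords of a Wikipedia article?

The web crawler is written in PHP and MySQL. Here is the actual code: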

<?php
error_reporting(E_ALL | E_STRICT);
set_time_limit(0);
$server_link = mysql_connect("localhost", "root", "");
if (!$server_link) {
    die("Fall&oacute; la Conexi&oacute;n " . mysql_error());
}
$db_selected = mysql_select_db("test", $server_link);
if (!$db_selected) {
    die("No se pudo seleccionar la Base de Datos " . mysql_error());
}
@mysql_query("SET NAMES 'utf8'");
function storeLink($titulo, $descripcion, $url, $keywords) {
    $query = "INSERT INTO webs (webTitulo, webDescripcion, weburl, webkeywords) VALUES ('$titulo', '$descripcion', '$url', '$keywords')";
    mysql_query($query) or die('Error, falló la inserción de datos');
}
function extraer($url, $prof, $patron) {
    $userAgent = 'Interredu';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(("Accept-Language: es-es,en")));
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 2);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    saveUrl($url, $prof, $patron, $html);
    if (!$html) {
        echo "<br />cURL error number:" . curl_errno($ch);
        echo "<br />cURL error:" . curl_error($ch);
    }
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0;$i < $hrefs->length;++$i) {
        $href = $hrefs->item($i);
        $url2 = $href->getAttribute('href');
        $var = strstr($url2, '#', true);
        if ($var !== false) {
            $url2 = $var;
        }

        if (strpos($url2, $patron) === false) {
            continue;
        }

        if ($url2 != $url && $url2 != '') {
            $busqueda = mysql_query("SELECT weburl FROM webs WHERE weburl='$url2'");
            $cantidad = mysql_num_rows($busqueda);
            if (1500 >= $prof && 0 == $cantidad) {
                extraer($url2, ++$prof, $patron);
            }
        }
    }
}
function saveUrl($url, $prof, $patron, $html) {
    $retorno = false;
    $pos = strpos($url, $patron);
    if ($prof >= 1) {
        preg_match_all("(<title>(.*)<\/title>)siU", $html, $title);
        $metas = get_meta_tags($url, 1);
        $titulo = html_entity_decode($title[1][0], ENT_QUOTES, 'UTF-8');
        $descripcion = isset($metas["description"])?$metas["description"] : '';
        $keywords = isset($metas["keywords"])?$metas["keywords"] : '';
        if (empty($descripcion)) {
            obtenerMetaDescription($html);
        }
        if (empty($keywords)) {
            preg_match_all("#<\s*h1[^>]*>[^<]+</h1>#is", $html, $encabezado);
            preg_match_all("#<\s*b[^>]*>[^<]+</b>#is", $html, $negrita);
            preg_match_all("#<\s*i[^>]*>[^<]+</i>#is", $html, $italica);
            foreach ($encabezado[0] as $encabezado) {
                $h1 = $encabezado;
            }
            foreach ($negrita[0] as $negrita) {
                $bold = $negrita;
            }
            foreach ($italica[0] as $italica) {
                $italic = $italica;
            }
            $keys = $bold . " " . $h1 . " " . $italic . " ";
            $keywords = substr(strip_tags($keys), 0, 200);
        }
        storeLink($titulo, $descripcion, $url, $keywords, $prof);
        $retorno = true;
    }
    return $retorno;
}
function obtenerMetaDescription($text) {
    preg_match_all('#<p>(.*)</p>#Us', $html, $parraf);
    foreach ($parraf[1] as $parraf) {
        $descripcion = substr(strip_tags($parraf), 0, 200);
    }
}
$url = "http://www.mywebsite.com/wikiarticles";
$patron = "http://es.wikipedia.org/wiki/";
$prof = 1500;
libxml_use_internal_errors(true);
extraer($url, 1, $patron);
$errores = libxml_get_errors();
libxml_clear_errors();
mysql_close();
?>

Thanks, everyone, and regards.

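【Question comments】:

Tags: php mysql database web-crawler extract

【Solution 1】:

The general approach in cases like this

The first thing to do is to locate the error.

Inspect the contents of the variables ($descripcion, $metas, $parraf) at different points in the code, using a few known Wikipedia URLs whose pages you can check by hand.
This tells you where the variables still hold the correct values and where they go wrong.

You can then draw one of the following conclusions:

Every variable in the code is correct: the problem is in your MySQL insert method.
Some variable is not set even though it should be: the bug is at that specific point in the code.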
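
The inspection step described above could look roughly like the following. This is only a sketch: describeVar() is a hypothetical helper that is not part of the original crawler, and the sample HTML and the empty $metas array are stand-ins so that the snippet runs without a network connection.

```php
<?php
// Hypothetical helper (not in the original crawler): returns a label
// followed by a readable dump of the value.
function describeVar($label, $value) {
    return $label . ': ' . var_export($value, true);
}

// Stand-in for a fetched page, so the sketch runs offline.
$html = '<html><head><title>Example</title></head>'
      . '<body><p>First paragraph.</p></body></html>';

// Stand-in for the result of get_meta_tags() on a page that has
// no description meta tag.
$metas = array();

$descripcion = isset($metas['description']) ? $metas['description'] : '';
preg_match_all('#<p>(.*)</p>#Us', $html, $parraf);

// Dump each variable at the point where it should already be filled in,
// then compare the output with what the real page contains.
echo describeVar('metas', $metas), "\n";
echo describeVar('descripcion', $descripcion), "\n";
echo describeVar('parraf', $parraf[1]), "\n";
```

If a dump is already wrong at a given point, the bug lies before that point; if every dump is correct, the insert into the database is the next thing to check.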

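How this approach applies to your case

Meta descriptions do not seem to be used on Wikipedia (at least not in the articles I looked at).
Therefore obtenerMetaDescription() should be called.
So I tried that function with a small example like this:

Code: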

function obtenerMetaDescription($text) {
    preg_match_all('#<p>(.*)</p>#Us', $html, $parraf);
    foreach($parraf[1] as $parraf){
        $descripcion = substr(strip_tags($parraf), 0, 200);
        var_dump($descripcion);
    }
}

$html = file_get_contents('https://de.wikipedia.org/wiki/Ehrenmal_Marienfeld');
obtenerMetaDescription($html);

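The PHP output is: PHP Notice: Undefined variable: html in test.php on line 4

The solution for your case

You used $html inside the function even though the value is passed in as $text. It is a simple variable mix-up.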
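
A minimal sketch of the corrected function follows. Besides reading $text, it also returns the result: in the code from the question the function has no return statement and saveUrl() calls it without assigning anything, so $descripcion in saveUrl() would stay empty even after the rename. The sample HTML is an assumption for the demo, and the inner loop variable is renamed to $p so that it no longer overwrites the $parraf array.

```php
<?php
// Sketch of the corrected function: it reads the parameter it actually
// receives ($text) and hands the result back to the caller.
function obtenerMetaDescription($text) {
    $descripcion = '';
    preg_match_all('#<p>(.*)</p>#Us', $text, $parraf);
    foreach ($parraf[1] as $p) {
        // As in the original, each iteration replaces the previous value,
        // so the last paragraph is the one that is kept.
        $descripcion = substr(strip_tags($p), 0, 200);
    }
    return $descripcion;
}

// Sample HTML standing in for a fetched article (an assumption for the demo).
$html = '<p>First <b>paragraph</b>.</p><p>Second paragraph.</p>';

// The caller has to store the return value.
$descripcion = obtenerMetaDescription($html);
echo $descripcion, "\n"; // prints "Second paragraph."
```

In saveUrl() the matching change would be `$descripcion = obtenerMetaDescription($html);` instead of the bare call.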

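Other possible problems

Double-check the assignment to $descripcion in the same function. Inside the foreach loop you assign the content of each <p> to $descripcion, so every iteration overwrites the previous value. I can't imagine that this is the intended behavior. I guess you want one of these two things:

Take only the first paragraph: if it is !empty(), use only $parraf[1][0]
Concatenate all the text into one long string: use the .= string concatenation operator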
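
The two options could be sketched as below. The function names firstParagraph() and allParagraphs(), the $max parameter and the sample HTML are hypothetical and only serve the illustration; the regular expression is the one from the question.

```php
<?php
// Option 1: use only the first paragraph, if there is one.
function firstParagraph($text, $max = 200) {
    preg_match_all('#<p>(.*)</p>#Us', $text, $parraf);
    if (!empty($parraf[1])) {
        return substr(strip_tags($parraf[1][0]), 0, $max);
    }
    return '';
}

// Option 2: append every paragraph with .= and cut the result at the end.
function allParagraphs($text, $max = 200) {
    preg_match_all('#<p>(.*)</p>#Us', $text, $parraf);
    $descripcion = '';
    foreach ($parraf[1] as $p) {
        $descripcion .= strip_tags($p) . ' ';
    }
    return substr(trim($descripcion), 0, $max);
}

// Sample HTML standing in for a fetched article (an assumption for the demo).
$html = '<p>First <b>paragraph</b>.</p><p>Second paragraph.</p>';

echo firstParagraph($html), "\n"; // prints "First paragraph."
echo allParagraphs($html), "\n";  // prints "First paragraph. Second paragraph."
```

Note that substr() counts bytes, so on UTF-8 text it can cut a multibyte character in half; mb_substr() is an alternative when the mbstring extension is available.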

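【Comments】:

Hi! The keywords field now shows "Array" and the description field is blank. I made the following changes to the code: img707.imageshack.us/img707/6672/codeerror.jpg Regards and thanks.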