Using Simple Dom parser to get the content of a wiki infobox: answers

[Question title]: Using Simple Dom parser to get content of wiki infobox
[Posted]: 2014-03-08 03:17:06
[Question description]:

I am trying to display the contents of a Wikipedia infobox using Simple Dom Parser, but it is giving me trouble. Here is the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<?php
//The folder where you uploaded simple_html_dom.php
require_once('simple_html_dom.php');

//Wikipedia page to parse
$html = file_get_html('https://en.wikipedia.org/wiki/Burger_King');

foreach ( $html->find ( 'table[class=infobox vcard]' ) as $element ) {

    $cells = $element->find('td');

    $i = 0;

    foreach($cells as $cell) {


        $left[$i] = $cell->plaintext;

        if (!(empty($left[$i]))) {

            $i = $i + 1;

        }

    }


    $cells = $element->find('th');

    $i = 0;

    foreach($cells as $cell) {

        $right[$i] = $cell->plaintext;

        if (!(empty($right[$i]))) {

            $i = $i + 1;

        }

    }


print_r ($right);

echo "<br><br><br>";

print_r ($left);

//If you want to know what kind of industry burger king is
//echo "Burger king is $right[2], $left[2]

}


?>

</body>
</html>

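The code does not work for any other page, such as https://en.wikipedia.org/wiki/United_Kingdom; it only works with https://en.wikipedia.org/wiki/Burger_King. This is the error message I receive: Fatal error: Call to a member function find() on a non-object in C:\wamp\www\MyApps\Inbox.php on line 16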

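[Comments]:

Is simple_html_dom.php included in your PHP file?
Yes, it is in my local folder.
Is the HTTPS wrapper enabled?
No! But I have now found out how to enable it using stackoverflow.com/questions/2305954/….
Thanks. It works, but only for en.wikipedia.org/wiki/Burger_King. Any idea why it does not work for other pages, such as en.wikipedia.org/wiki/India?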
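
The wrapper question raised in the comments can be checked from PHP itself. This is only a minimal sketch: file_get_html() reads the page through PHP's stream wrappers, so an https:// URL can only be opened when the https wrapper is registered, which normally requires the openssl extension. The helper name httpsWrapperEnabled is made up for illustration.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): reports whether PHP
// can open https:// URLs through its stream wrappers.
function httpsWrapperEnabled(): bool
{
    return in_array('https', stream_get_wrappers(), true);
}

if (!httpsWrapperEnabled()) {
    echo "The https wrapper is not registered; enable the openssl extension in php.ini.\n";
}
```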

Tags: php

[Solution 1]:

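1: This code does not work for you because you are searching for class="infobox vcard", whereas on country pages the infobox has class="infobox geography vcard".

2: That is not the only problem, though, because you will almost certainly run out of memory.

Replace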

$html = file_get_html('https://en.wikipedia.org/wiki/United_Kingdom');

with:

$url = 'https://en.wikipedia.org/wiki/United_Kingdom';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

You should then get something like

Fatal error: Out of memory (allocated XXX) (tried to allocate 40 bytes) 
in /simple_html_dom.php on line 1544

3: If you manage to solve the previous problem, you will also have to update your code, because it may not work as it stands.
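
The cURL snippet above does not check whether the download succeeded. As a minimal sketch, the hypothetical helper fetchPage below returns null when the request fails or the server answers with anything other than 200. Its name, the timeouts, and the User-Agent string are assumptions and not part of the original answer.

```php
<?php
// Hypothetical helper: fetch a page with cURL and return its body,
// or null when the request fails or the server does not answer 200.
function fetchPage(string $url, int $timeoutSeconds = 15): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        // Assumed placeholder; use a string that identifies your own script.
        CURLOPT_USERAGENT      => 'InfoboxExample/0.1 (example script)',
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}
```

If it returns a string, that string can be passed to $html->load(...) exactly as in the snippet above; if it returns null, the script can stop with a clear message instead of parsing an empty document.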

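Edit 1:

My favorite way to avoid this problem is to use the Google cache, which offers a "text-only" version. This usually avoids the need to hold a large amount of data in memory, which is one of the reasons your code cannot work. The main drawback is that the Google cache does not know what to do with th elements, so their content disappears.

I will look for an alternative; in the meantime, here is the code XD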

<?php

require_once('simple_html_dom.php');
//$html = file_get_html('https://en.wikipedia.org/wiki/United_Kingdom');

    //q = website to fetch, leave "cache:"
    $url = 'http://webcache.googleusercontent.com/search?strip=1&q=cache:en.wikipedia.org/wiki/United_Kingdom';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $curl_scraped_page = curl_exec($ch);

    $html = new simple_html_dom();
    $html->load($curl_scraped_page, true, false);


//echo $html;


// Collect the text of every non-empty td cell in the infobox
foreach ($html->find('table[class=infobox geography vcard]') as $element) {

    $left = array();

    foreach ($element->find('td') as $cell) {

        $text = trim($cell->plaintext);

        // Skip cells without text; compare against '' so that "0" is kept
        if ($text !== '') {
            $left[] = $text;
        }

    }

    print_r($left);

}


?>
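
When the full page HTML is available (not the text-only cache, which loses the th cells), collecting th and td into two separate arrays, as the question does, can drift out of step as soon as a row contains only one of the two. Reading the table row by row avoids that. The following is a sketch using PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom; the function name infoboxRows and the sample HTML are made up for illustration, and real Wikipedia markup may need extra handling.

```php
<?php
// Hypothetical helper: given the HTML of one infobox table, map each row's
// header cell (th) to its value cell (td).
function infoboxRows(string $tableHtml): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
    // The XML declaration prefix makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $tableHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows  = [];

    foreach ($xpath->query('//tr') as $tr) {
        $th = $xpath->query('./th', $tr)->item(0);
        $td = $xpath->query('./td', $tr)->item(0);
        if ($th === null || $td === null) {
            continue; // skip rows that hold only a heading or only a value
        }
        $rows[trim($th->textContent)] = trim($td->textContent);
    }
    return $rows;
}

// Made-up sample data standing in for a real infobox.
$sample = '<table class="infobox vcard">'
        . '<tr><th colspan="2">Burger King</th></tr>'
        . '<tr><th>Industry</th><td>Restaurants</td></tr>'
        . '<tr><th>Founded</th><td>1954</td></tr>'
        . '</table>';

print_r(infoboxRows($sample));
```

With simple_html_dom, the table HTML could presumably come from $element->outertext inside the loop above.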

If I helped you (and I am sure I did), please mark this as the best answer and give it a thumbs up :P

[Comments]:

[Solution 2]:

I found that the error came from table[class=infobox vcard]; this only retrieved the content of tables with class=infobox
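
To illustrate why the full class string is fragile, here is a sketch that matches on the single class token "infobox" instead, so that both "infobox vcard" and "infobox geography vcard" are found. It uses PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom; the function name matchingTableClasses and the sample HTML are made up for illustration. With simple_html_dom itself, a selector such as table.infobox is the likely equivalent, but that is an assumption to verify against the library version in use.

```php
<?php
// Hypothetical helper: return the class attribute of every table whose
// class list contains $token as a whole word.
// $token is assumed to be a plain class name without quotes.
function matchingTableClasses(string $html, string $token): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Pad the class list with spaces so that "infobox" matches as a whole
    // token and does not match, for example, "infobox-title".
    $query = sprintf(
        '//table[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $token
    );

    $classes = [];
    foreach ($xpath->query($query) as $table) {
        $classes[] = $table->getAttribute('class');
    }
    return $classes;
}

// Made-up sample data: two infobox variants and one unrelated table.
$sample = '<table class="infobox vcard"><tr><td>a</td></tr></table>'
        . '<table class="infobox geography vcard"><tr><td>b</td></tr></table>'
        . '<table class="wikitable"><tr><td>c</td></tr></table>';

print_r(matchingTableClasses($sample, 'infobox'));
```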

[Comments]: