PHP web scraping: answers

【Question title】: PHP web scraping
【Posted】: 2012-08-06 11:27:35
【Question】:

I am doing web scraping with PHP, and I want to get the price for Sunday (3.65) from the HTML below:

     <tr class="odd">
       <td >
           <b>Sunday</b> Info
           <div class="test">test</div>
       </td>
       <td>
       &euro; 3.65 *

       </td>
    </tr>

But I cannot work out the right regular expression for this. I am using this PHP code:

    <?php
        $data = file_get_contents('http://www.test.com/');

        preg_match('/<tr class="odd"><td ><b>Sunday</b> Info<div class="test">test<\/div><\/td><td>&euro; (.*) *<\/td><\/tr>/i', $data, $matches);
        $result = $matches[1];
    ?>

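But there is no result. What is wrong with the regular expression? (I think it is because of the newlines/spaces?)

【Comments】:

Use a regex on "€ ([0-9.]*) " to get the price. If it is one of several, you can split() the text first. Also watch out for special regex characters, such as the obvious * after the price!
But I also need to use "Sunday", because there are other days as well...
/Sunday(.*)€ ([0-9.]*)/s gives me the longest match. Is there a way to get the shortest match? If so, that might work...
If you do not have permission to scrape the site, do not do it. If you do have permission, ask for a price list feed in XML format, which is designed for data extraction.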
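
The "shortest match" asked for in the comments is what a lazy quantifier gives. The following is a minimal sketch, not the commenters' code: the Saturday and Monday rows and their prices are made up for illustration, and the pattern matches the `&euro;` entity as it appears in the raw markup.

```php
<?php
// The Sunday row is copied from the question; the Saturday and Monday rows
// are hypothetical, added only to show which price gets picked.
$data = <<<'HTML'
<tr class="even">
   <td >
       <b>Saturday</b> Info
   </td>
   <td>
   &euro; 2.10 *
   </td>
</tr>
<tr class="odd">
   <td >
       <b>Sunday</b> Info
       <div class="test">test</div>
   </td>
   <td>
   &euro; 3.65 *

   </td>
</tr>
<tr class="even">
   <td >
       <b>Monday</b> Info
   </td>
   <td>
   &euro; 4.20 *
   </td>
</tr>
HTML;

// .*? is lazy, so it stops at the first &euro; that follows "Sunday".
// The s modifier lets . match newlines; \s* absorbs the space before the number.
$price = null;
if (preg_match('/<b>Sunday<\/b>.*?&euro;\s*([0-9]+(?:\.[0-9]+)?)/s', $data, $matches)) {
    $price = $matches[1];
}

echo $price, PHP_EOL;
```

With a greedy `.*` in place of `.*?`, the same pattern would run on to the last `&euro;` in the input and capture the Monday price instead.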

Tags: php regex web-scraping

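【Solution 1】:

Do not use regular expressions; HTML is not a regular language.

Use a DOM tree parser instead, such as DOMDocument. The documentation can help you.

The /s modifier should help with your original regex, although I have not tried it.

【Discussion】:

【Solution 2】: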

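The problem is the whitespace between the tags. There are newlines, tabs and/or spaces.

Your regular expression does not match them.

You also need to set up preg_match to work across multiple lines!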
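
One way to make the question's pattern tolerate that whitespace is to put `\s*` wherever the markup may break across lines. This is a sketch under the assumption that the row looks exactly like the sample in the question; the layout with the `x` modifier is my own choice, not something the answer prescribes.

```php
<?php
// Sample row copied from the question.
$data = <<<'HTML'
<tr class="odd">
   <td >
       <b>Sunday</b> Info
       <div class="test">test</div>
   </td>
   <td>
   &euro; 3.65 *

   </td>
</tr>
HTML;

// x modifier: whitespace inside the pattern is ignored, so it can be laid out
// one tag per line. Every \s* allows newlines, tabs or spaces at that point.
$pattern = '~
    <tr\s+class="odd">\s*
    <td\s*>\s*
    <b>Sunday</b>\s*Info\s*
    <div\s+class="test">test</div>\s*
    </td>\s*
    <td>\s*
    &euro;\s*([0-9.]+)\s*\*\s*   # price, then a literal asterisk
    </td>\s*
    </tr>
~xi';

$result = preg_match($pattern, $data, $matches) ? $matches[1] : null;
echo $result, PHP_EOL;
```

Because this pattern contains no `.`, `^` or `$`, neither the `s` modifier (which only changes what `.` matches) nor the `m` modifier (which only changes `^` and `$`) is needed here; the `\s*` tokens do the work. The unescaped `*` in the question's pattern was also a quantifier, so it is written as `\*` above.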

I think scraping is easier with XPath.
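
A minimal sketch of the XPath route, using PHP's built-in DOMDocument and DOMXPath. The surrounding `<table>` element is an assumption added so the fragment parses as a table row, and the XPath expression is one possible way to select the cell, not the answerer's own.

```php
<?php
// Row copied from the question, wrapped in a <table> (assumed) so the
// parser treats it as a normal table row.
$data = <<<'HTML'
<table>
<tr class="odd">
   <td >
       <b>Sunday</b> Info
       <div class="test">test</div>
   </td>
   <td>
   &euro; 3.65 *

   </td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet
$dom = new DOMDocument();
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Second td of the "odd" row whose b element reads exactly "Sunday".
$cell = $xpath->query('//tr[@class="odd"][td/b[normalize-space(.)="Sunday"]]/td[2]')->item(0);

// The cell text still carries the currency sign and the asterisk,
// so pull out just the number.
$price = null;
if ($cell !== null && preg_match('/[0-9]+(?:\.[0-9]+)?/', $cell->textContent, $m)) {
    $price = $m[0];
}

echo $price, PHP_EOL;
```

The whitespace between tags does not matter here, because the parser has already turned the markup into a tree before the query runs.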

【Discussion】:

【Solution 3】:

Try replacing the newlines with '' and then run the regular expression again.
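
Removing newlines alone would still leave the indentation between the tags, so the sketch below goes one step further and strips all whitespace that touches a tag. The two preg_replace steps and the adjusted pattern are my own illustration of the idea, assuming the row looks like the sample in the question.

```php
<?php
// Sample row copied from the question.
$data = <<<'HTML'
<tr class="odd">
   <td >
       <b>Sunday</b> Info
       <div class="test">test</div>
   </td>
   <td>
   &euro; 3.65 *

   </td>
</tr>
HTML;

// Step 1: collapse every run of newlines, tabs and spaces into one space.
$flat = preg_replace('/\s+/', ' ', $data);
// Step 2: drop the whitespace that touches a tag on either side.
$flat = preg_replace('/\s*(<[^>]+>)\s*/', '$1', $flat);

// The question's pattern, adjusted: no space before "Info" (step 2 removed it),
// a lazy capture group, and the asterisk escaped as \*.
$pattern = '/<tr class="odd"><td ><b>Sunday<\/b>Info<div class="test">test<\/div><\/td>'
         . '<td>&euro; (.*?) \*<\/td><\/tr>/i';

$result = preg_match($pattern, $flat, $matches) ? $matches[1] : null;
echo $result, PHP_EOL;
```

This stays brittle: any change in the page's attributes or tag order breaks the literal pattern, which is the main argument for the DOM-based answers.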

【Discussion】:

【Solution 4】:

Try it like this:

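    $uri = 'http://www.test.com/';
    $get = file_get_contents($uri);

    // Anchor on short markers rather than one long literal: the real markup has
    // newlines and indentation between the tags, so a long literal never matches.
    $pos1 = strpos($get, '<b>Sunday</b>');
    $pos1 = strpos($get, '&euro;', $pos1) + strlen('&euro;'); // just past the euro entity
    $pos2 = strpos($get, '*', $pos1);                         // asterisk after the price
    $text = substr($get, $pos1, $pos2 - $pos1);
    $text1 = trim(strip_tags($text));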

【Discussion】:

【Solution 5】:

Use the PHP DOMDocument object. We will parse the HTML DOM data from the web page:

    libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($data);

    $trs = $dom->getElementsByTagName('tr'); // this gives us all the tr elements on the webpage

    // loop through all the tr tags
    foreach ($trs as $tr) {
        $b = $tr->getElementsByTagName('b')->item(0); // null when the row has no b tag
        // until we get one with the class 'odd' whose b tag value is 'Sunday'
        if ($tr->getAttribute('class') == 'odd' && $b !== null && trim($b->nodeValue) == 'Sunday') {
            // now set the price to the text of the second td tag, e.g. "€ 3.65 *"
            $price = trim($tr->getElementsByTagName('td')->item(1)->nodeValue);
            break;
        }
    }

If you would rather not use DOMDocument for web scraping because it is a bit tedious, you can use SimpleHtmlDomParser, which is open source.

【Discussion】: