How to extract the headline and content from a crawled web page / article? Answers

【Question title】: How to extract the headline and content from a crawled web page / article?
【Posted】: 2010-05-08 11:06:18
【Question description】:

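I need some guidelines on how to detect the headline and the content of crawled pages. Ever since I started working on this crawler, I have been running into some very strange front-end code.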

【Question discussion】:

Tags: parsing web-crawler

【Solution 1】:

You could try the Simple HTML DOM Parser. It finds specific elements using a selector syntax similar to jQuery's.

They have an example of how to scrape Slashdot:

// Load the library (path assumed; adjust to where simple_html_dom.php lives)
require_once 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Collect one entry per article block
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
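
If adding a third-party library is not an option, roughly the same per-article extraction can be sketched with PHP's bundled DOM extension. This is only an illustration, not part of the Simple HTML DOM Parser: the helper names (`hasClass`, `firstText`, `extractArticles`) and the sample markup are made up, and the class names are simply taken from the Slashdot example. Unlike the example above, a missing block yields `null` rather than an error.

```php
<?php
// Sketch: per-article extraction with PHP's bundled DOM extension.
// The sample markup is invented; the class names mirror the Slashdot example.

// Build an XPath test that matches an element carrying the given CSS class.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Trimmed text of the first descendant div with the given class, or null if absent.
function firstText(DOMXPath $xpath, DOMNode $context, string $class): ?string
{
    $nodes = $xpath->query('.//div[' . hasClass($class) . ']', $context);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

function extractArticles(string $html): array
{
    $doc = new DOMDocument();
    // Crawled pages are rarely valid HTML, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $articles = [];
    foreach ($xpath->query('//div[' . hasClass('article') . ']') as $article) {
        $articles[] = [
            'title'   => firstText($xpath, $article, 'title'),
            'intro'   => firstText($xpath, $article, 'intro'),
            'details' => firstText($xpath, $article, 'details'),
        ];
    }
    return $articles;
}

$sample = '<html><body><div class="article"><div class="title">Hello</div>'
        . '<div class="intro">Short intro</div></div></body></html>';
print_r(extractArticles($sample));
```

The XPath class test is more verbose than a CSS selector, but it matches elements that carry several classes, which a plain `@class="article"` comparison would miss.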

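【Discussion】:

On second thought, this is bound to be a duplicate. Hmm.
Well, I couldn't find any related posts. I am not looking for an HTML parser, but for a way to tell the headline and the text apart from the rest of the junk.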
<td><table cellSpacing=0 cellPadding=0 width="100%" border=0><tbody><tr><td align=right width="95%" style="border-color:#3333DD; font-family:Times New Roman, Times, serif; font-weight:bold;color:#003399; font-size:22px; text-align:center; overflow:hidden;"><b> --- this is the opening tag of the headline on our target site.
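@gAMBOO Oh my. This is going to be really tough, especially since the structure may change from day to day. In that case, I would suggest talking to the target site to see whether there is a better way to get the data (e.g. in XML or RSS format).
RSS is not reliable. Many sites do not support RSS at all, and among those that do, many truncate the text.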
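
When a site offers no class names or ids to select on, as with the inline-styled table cell quoted in the discussion, one possible heuristic is to treat the element with the largest inline `font-size` as the headline candidate. The sketch below is not something proposed in the thread; `guessHeadline` is a hypothetical helper, the sample markup is invented, and it assumes the site declares sizes inline in pixels (the quoted cell uses `font-size:22px`).

```php
<?php
// Sketch: guess the headline as the element with the largest inline font-size.
// Assumption: the target site styles its headline inline, in px units.

function guessHeadline(string $html): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings about malformed markup, which is common on crawled pages.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $best = null;
    $bestSize = 0.0;
    foreach ($xpath->query('//*[@style]') as $node) {
        // Only pixel sizes are handled; em, pt and % values are skipped in this sketch.
        if (!preg_match('/font-size\s*:\s*(\d+(?:\.\d+)?)px/i', $node->getAttribute('style'), $m)) {
            continue;
        }
        // Collapse runs of whitespace so the candidate reads as a single line.
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '' && (float) $m[1] > $bestSize) {
            $bestSize = (float) $m[1];
            $best = $text;
        }
    }
    return $best;
}

$sample = '<table><tr><td style="font-size:12px">Menu</td>'
        . '<td style="font-weight:bold; font-size:22px; text-align:center;"><b>Big news today</b></td></tr>'
        . '<tr><td style="font-size:14px">Body text of the article.</td></tr></table>';
echo guessHeadline($sample), "\n";
```

This is fragile by design: it ignores external stylesheets, and a large-font wrapper around the whole page would swallow everything inside it, so the result is best treated as a candidate to check against other signals such as the page's `<title>`.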