Is there anything for Python that is like readability.js? Answers

【Question title】: Is there anything for Python that is like readability.js?
【Posted】: 2011-02-24 15:51:57
【Question description】:

I'm looking for a Python package, module, or function that is roughly equivalent to Arc90's readability.js:

http://lab.arc90.com/experiments/readability

http://lab.arc90.com/experiments/readability/js/readability.js

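The idea is that I can give it some input.html and get back a cleaned-up version of that HTML page's "main text". I want this so that I can use it on the server side (unlike the JS version, which runs only in the browser).

Any ideas?

PS: I have already tried Rhino + env.js. That combination works, but the performance is unacceptable: it takes minutes to clean up most HTML content :( (I still haven't found out why the performance difference is so large).

【Question comments】:

Tags: javascript python html-content-extraction heuristics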

【Solution 1】:

Why not try Google V8/Node.js instead of Rhino? It should be acceptably fast.

【Comments】:

Does env.js run on V8/Node.js, so that I would have a browser-like environment?

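【Solution 2】:

I think BeautifulSoup is the best HTML parser for Python. But you still need to figure out what the "main" part of the site is.

If you are only parsing a single domain, this is fairly straightforward, but finding a pattern that works for any site is not so easy.

Maybe you could port the readability.js approach to Python?

【Comments】: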

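【Solution 3】:

I did some research on this in the past and ended up implementing this approach [pdf] in Python. The final version I implemented also did some cleanup before applying the algorithm, such as removing head/script/iframe elements, hidden elements, and so on, but this is the core of it.

Here is a function with a (very) naive implementation of the "link list" discriminator, which attempts to remove elements with a high link-to-text ratio (i.e. navigation bars, menus, ads, etc.):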

def link_list_discriminator(html, min_links=2, ratio=0.5):
    """Remove blocks with a high link to text ratio.

    These are typically navigation elements.

    Based on an algorithm described in:
        http://www.psl.cs.columbia.edu/crunch/WWWJ.pdf

    :param html: lxml.html element tree (drop_tree() is an lxml.html method).
    :param min_links: Minimum number of links inside an element
                      before considering a block for deletion.
    :param ratio: Ratio of link text to all text before an element is considered
                  for deletion.
    """
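    def collapse(strings):
        # Strip each text node and join the non-empty ones.
        return u''.join(filter(None, (text.strip() for text in strings)))

    # FIXME: This doesn't account for top-level text...
    for el in html.xpath('//*'):
        anchor_text = el.xpath('.//a//text()')
        # Number of text nodes inside <a> descendants (a proxy for link count).
        anchor_count = len(anchor_text)
        anchor_chars = float(len(collapse(anchor_text)))
        total_chars = float(len(collapse(el.xpath('.//text()'))))
        # Skip elements without a parent: drop_tree() cannot remove the root.
        if el.getparent() is None:
            continue
        if (anchor_count > min_links and total_chars
                and anchor_chars / total_chars > ratio):
            el.drop_tree()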

It actually worked quite well on the test corpus I used, but achieving high reliability would require a lot of tweaking.
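To look at the link-to-text ratio in isolation, here is a minimal sketch that uses only the standard library's `html.parser`. It is not the lxml-based implementation above: it scores a single HTML fragment instead of walking a tree and dropping elements. The names `LinkDensityParser`, `link_density`, and `looks_like_navigation` are hypothetical, and the thresholds simply mirror the `min_links` and `ratio` defaults of the function above.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count text characters overall and those that sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while inside one or more <a> elements
        self.link_count = 0    # number of <a> start tags seen
        self.link_chars = 0    # stripped text characters inside <a>
        self.total_chars = 0   # stripped text characters overall

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Note: script/style contents also arrive here, so strip those
        # elements beforehand if they should not count as text.
        length = len(data.strip())
        self.total_chars += length
        if self.anchor_depth > 0:
            self.link_chars += length


def link_density(fragment):
    """Return (number of links, share of text that is link text)."""
    parser = LinkDensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return parser.link_count, 0.0
    return parser.link_count, parser.link_chars / parser.total_chars


def looks_like_navigation(fragment, min_links=2, ratio=0.5):
    """Apply the same thresholds as the discriminator above to one fragment."""
    count, density = link_density(fragment)
    return count > min_links and density > ratio
```

A menu made up entirely of links has a density close to 1.0 and is flagged, while a paragraph containing a single inline link stays well below the default ratio.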

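【Comments】:

【Solution 4】:

We just launched a new natural language processing API over at repustate.com. Using the REST API, you can clean any HTML or PDF and get back just the text parts. Our API is free, so feel free to use it. It is implemented in Python. Check it out and compare the results to readability.js; I think you'll find they are almost 100% the same.

【Comments】:

Hmm, looks promising! ;-) I'll give it a try. Are there any hard limits? How many pages can I process per day, etc.?
Wow, I just tried a few URLs on your site and it extracted the articles perfectly.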

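【Solution 5】:

hn.py, via Readability's blog. Readable Feeds, an App Engine app, makes use of it.

I have bundled it as a pip-installable module: http://github.com/srid/readability

【Comments】:

This seems to be a very old version of readability compared to what is available now: 0.4 vs. 1.7.1. Any chance of an update?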

【Solution 6】:

Please try my fork https://github.com/buriy/python-readability, which is fast and has all the features of the latest JavaScript version.
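As a rough illustration of how such a port is typically called, here is a small sketch. It assumes the fork is installed from PyPI as `readability-lxml` and exposes a `Document` class with `title()` and `summary()` methods, as the project's README describes; check the current README before relying on this. The wrapper name `extract_main_content` is hypothetical.

```python
def extract_main_content(html):
    """Return (title, cleaned_html) for a page given its raw HTML string."""
    # Imported lazily so that a missing package only fails when the
    # function is actually called. Assumes `pip install readability-lxml`.
    from readability import Document

    doc = Document(html)
    # title() gives the page title; summary() gives the main content
    # as cleaned-up HTML.
    return doc.title(), doc.summary()
```

For the use case in the question, you would read input.html into a string on the server, pass it to `extract_main_content`, and store or render the returned HTML.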

【Comments】: