Python, parsing HTML: answers

【Question title】: Python, parsing html
【Posted】: 2012-11-25 09:35:14
【Question】:

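Thanks to the kind users of this site, I have some idea of how to use re as an alternative to non-standard Python modules, so that my script can run with minimal overhead. Today I have been experimenting with parsing modules and came across BeautifulSoup. It all looks great, but I don't understand it.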

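For educational purposes, I would like to scrape the following information from http://yify-torrents.com/browse-movie (please don't tell me to use a web crawler; I'm not trying to crawl the whole site, just to extract information from this one page to learn how parsing modules work!)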

Movie title, quality, torrent link

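There are 22 of these items, and I want them stored in lists in order, i.e. item_1, item_2, and so on. Each list needs to contain those three fields. For example: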

item_1 = ["James Bond: Casino Royale (2006)", "720p", "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"]
item_2 = ["Pitch Perfect (2012)", "720p", "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"]

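Then, to keep things simple, I just want to print each item to the console. To make things harder, however, these items have no identifiers on the page, so the information needs to be kept strictly in order. That is all fine, but all I get is either list items that each contain the entire page source, or empty items! An example item block looks like this: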

<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
        <p><b>Genre:</b> Action | Crime</p>
        <p><b>IMDB Rating:</b> 7.9/10</p>
            <span>
                <p class="peers"><b>Peers:</b> 698</p>
                <p class="peers"><b>Seeds:</b> 356</p>
            </span>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl" data-movieID="2620" data-torrentID="2812">Download<span></span></a>
    </span> 
</div>

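Any ideas? Could someone give me an example of how to do this? I'm not sure BeautifulSoup can meet all of my requirements! PS. Sorry for the poor English, it's not my first language.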

【Comments】:

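Whenever you start thinking about code with variables named x_1, x_2, etc., that is usually a sign that you should really be using a Python list, in this case named x. A list will make your script more robust, because you can add new elements and the list grows as needed. If you design your program for one page with 4 elements and then parse a different page with 6 elements, your x_1 scheme would require you to change the input loop, and probably the print loop as well. But if you have coded it to use a list, it will adapt to other pages without any changes.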
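The point about lists can be sketched as follows. This is a minimal illustration, assuming Python 3; the `rows` data simply reuses the two example items from the question and stands in for whatever the parser extracts, and the names `rows` and `items` are hypothetical.

```python
# Stand-in for parsed data: the two example rows given in the question.
rows = [
    ("James Bond: Casino Royale (2006)", "720p",
     "http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent"),
    ("Pitch Perfect (2012)", "720p",
     "http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent"),
]

# One list that grows with the page, instead of item_1, item_2, ...
items = []
for name, quality, link in rows:
    items.append([name, quality, link])

# The same loop works whether the page has 2 entries or 22.
for number, item in enumerate(items, start=1):
    print("item_%d = %r" % (number, item))
```

The numbering is produced only at print time, so nothing in the code needs to change when the number of entries on the page changes.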

Tags: python parsing html-parsing beautifulsoup

【Solution 1】:

from bs4 import BeautifulSoup
import urllib2

f=urllib2.urlopen('http://yify-torrents.com/browse-movie')
html=f.read()
soup=BeautifulSoup(html)


In [25]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.text
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
    ...:     
[u'James Bond: Casino Royale (2006)', u'Quality: 720p', 'http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent']
[u'Pitch Perfect (2012)', u'Quality: 720p', 'http://yify-torrents.com/download/start/Pitch_Perfect_2012.torrent']
...

Or, to get exactly the output you wanted:

In [26]: for i in soup.findAll("div",{"class":"browse-info"}):
    ...:     name=i.find('a').text
    ...:     for x in i.findAll('b'):
    ...:         if x.text=="Quality:":
    ...:             quality=x.parent.find(text=True, recursive=False).strip()
    ...:     link=i.find('a',{"class":"std-btn-small mleft torrentDwl"})['href']
    ...:     print [name,quality,link]
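Since the question asks for as few non-standard modules as possible, the same extraction can also be sketched with only the standard library. This is a rough sketch, not a replacement for the BeautifulSoup code above: it assumes Python 3, it assumes the markup looks like the sample block in the question (no nested div inside a browse-info block), and the names `BrowseInfoParser`, `parse_items`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser

# Trimmed copy of the sample block from the question.
SAMPLE = """
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Size:</b> 1018.26 MB</p>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006" class="std-btn-small mright">View Info<span></span></a>
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
"""


class BrowseInfoParser(HTMLParser):
    """Collects [name, quality, link] for each div with class browse-info."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished [name, quality, link] lists
        self._current = None   # entry being built, or None outside a block
        self._in_h3 = False
        self._in_title = False
        self._in_p = False
        self._p_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "browse-info" in classes:
            self._current = {"name": "", "quality": None, "link": None}
        elif self._current is None:
            return
        elif tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._in_title = True          # the title link inside <h3>
        elif tag == "a" and "torrentDwl" in classes:
            self._current["link"] = attrs.get("href")
        elif tag == "p":
            self._in_p = True
            self._p_text = []

    def handle_data(self, data):
        if self._current is None:
            return
        if self._in_title:
            self._current["name"] += data
        elif self._in_p:
            self._p_text.append(data)

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "h3":
            self._in_h3 = False
        elif tag == "a":
            self._in_title = False
        elif tag == "p":
            text = "".join(self._p_text).strip()
            if text.startswith("Quality:"):
                self._current["quality"] = text[len("Quality:"):].strip()
            self._in_p = False
        elif tag == "div":
            # Assumes no nested <div> inside a browse-info block.
            entry = self._current
            self.items.append([entry["name"].strip(), entry["quality"], entry["link"]])
            self._current = None


def parse_items(html):
    parser = BrowseInfoParser()
    parser.feed(html)
    return parser.items


for item in parse_items(SAMPLE):
    print(item)
```

The event-driven parser is noticeably longer than the BeautifulSoup loop, which is the usual trade-off for avoiding a third-party dependency.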

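【Comments】:

Thanks @root! This code is exactly what I was after, and your coding style makes it easy to understand and interpret. To my surprise, however, I get no output. The soup variable contains the expected markup, but the for loop doesn't print anything. I'm using bs3 (Python 2.5, so no choice). Would that make a difference? Thanks again! :)
This most likely won't work with bs3. I'll try to take a look later, since I don't have bs3 at the moment. 2.5 is also quite old; if possible you should upgrade to 2.7, or at least 2.6.
Thanks for your help! Don't worry about checking it; I have another computer with 2.7 installed. I've been SSH'd into my WD TV all day and completely forgot that I have a modern Linux OS installed, GUI and all. I'll switch over now. I can see that once it hits 'soup' there is hardly any code left. Thanks very much for your help! :)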

【Solution 2】:

As you requested, I have pasted a simple example of a parser. As you can see, it uses lxml. With lxml you have two ways to work with the DOM tree: XPath and CSS selectors. I prefer XPath.

# Python 2 code (urllib.urlopen and the print statement).
import lxml.html
import urllib

def parse():
    url = 'https://sometotosite.com'
    doc = lxml.html.fromstring(urllib.urlopen(url).read())
    main_div = doc.xpath("//div[@id='line']")[0]
    main = {}
    tr = []
    category = None  # set once the first category heading is seen
    for el in main_div.getchildren():
        # A child containing an <a> whose name attribute includes 'tn' starts a new category.
        names = el.xpath("descendant::a[contains(@name,'tn')]/text()")
        if names:
            category = names[0]
            main[category] = ''
            tr = []
        else:
            # Otherwise collect child elements whose serialized markup contains '&#8212'.
            for element in el.getchildren():
                if '&#8212' in lxml.html.tostring(element):
                    tr.append(element)
                    print category, tr
parse()

LXML official site
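To show what the XPath style looks like on the markup from the question, here is a small sketch. It deliberately swaps lxml for the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and requires well-formed markup; that works for the trimmed sample below, but a real page would normally need lxml.html. Python 3 is assumed, and the names `extract_items` and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample block from the question, wrapped in a single
# root element so that ElementTree can parse it.
SAMPLE = """<body>
<div class="browse-info">
    <span class="info">
        <h3><a href="http://yify-torrents.com/movie/James_Bond_Casino_Royale_2006">James Bond: Casino Royale (2006)</a></h3>
        <p><b>Quality:</b> 720p</p>
    </span>
    <span class="links">
        <a href="http://yify-torrents.com/download/start/James_Bond_Casino_Royale_2006.torrent" class="std-btn-small mleft torrentDwl">Download<span></span></a>
    </span>
</div>
</body>"""


def extract_items(markup):
    root = ET.fromstring(markup)
    items = []
    # One div with class browse-info per movie, visited in document order.
    for div in root.findall(".//div[@class='browse-info']"):
        name = div.findtext(".//h3/a")
        quality = None
        for bold in div.findall(".//p/b"):
            if bold.text == "Quality:":
                # The value is the text that follows the <b> label.
                quality = (bold.tail or "").strip()
        link_el = div.find(".//a[@class='std-btn-small mleft torrentDwl']")
        link = link_el.get("href") if link_el is not None else None
        items.append([name, quality, link])
    return items


for item in extract_items(SAMPLE):
    print(item)
```

The attribute predicate here is an exact string match on the whole class value, so it would miss the link if the page reordered or changed its class names.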

【Comments】: