【Question Title】: Python iterate through section using lxml
【Posted】: 2013-02-01 20:26:32
【Question】:

I have a web page that I am currently parsing with BeautifulSoup, but it is slow, so I decided to try lxml since I've read it is very fast.

Anyway, I am struggling to get my code to iterate over just the section I want; I don't know how to do this with lxml, and I can't find clear documentation.

Anyway, here is my code:

import urllib, urllib2
from lxml import etree

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://www.tv3.ie/3player'

data = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree   = etree.fromstring(data, parser)

for elem in tree.iter("div"):
    print elem.tag, elem.attrib, elem.text

This returns all the DIVs, but how do I specify that I only want to iterate over the div with id='slider1'?

div {'style': 'position: relative;', 'id': 'slider1'} None

This doesn't work:

for elem in tree.iter("slider1"):

I know this is probably a silly question, but I can't figure it out..

Thanks!
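
(Side note, for completeness: iter() filters by element tag name, not by attribute, which is why iterating over "slider1" finds nothing. A minimal sketch of both ways to filter on the id attribute, using a stand-in HTML snippet rather than the live page:)

```python
from lxml import etree

# stand-in snippet; the real 3player page is not fetched here
html = ("<html><body>"
        "<div id='slider1' style='position: relative;'><p>x</p></div>"
        "<div id='other'></div>"
        "</body></html>")
tree = etree.fromstring(html, etree.HTMLParser())

# iter() matches tag names only, so check the id attribute by hand...
matches = [e for e in tree.iter("div") if e.get("id") == "slider1"]
print(matches[0].attrib)

# ...or let an XPath predicate do the filtering in one step
matches = tree.xpath("//div[@id='slider1']")
print(matches[0].attrib)
```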

** EDIT **

With the help of the code you added, I now get the following output:

for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    print elem[0].tag, elem[0].attrib, elem[0].text
    print elem[1].tag, elem[1].attrib, elem[1].text
    print elem[2].tag, elem[2].attrib, elem[2].text
    print elem[3].tag, elem[3].attrib, elem[3].text
    print elem[4].tag, elem[4].attrib, elem[4].text

Output:

a {'href': '/3player/show/392/57922/1/Tallafornia', 'title': '3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension'} None
h3 {} None
span {'id': 'gridcaption'} The Tallafornia crew are back, living in a beachside vill...
span {'id': 'griddate'} 11/01/2013
span {'id': 'gridduration'} 00:27:52

This is great, but I am missing part of the <a> tag above. Could the parser be mishandling the markup?

I am not getting the following:

<img alt="3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension" src='http://content.tv3.ie/content/videos/0378/tallaforniaep2_fri11jan2013_3player_1_57922_180x102.jpg' class='shadow smallroundcorner'></img>

Any idea why it isn't pulling this in?

Thanks again, very helpful post..

【Comments】:

    Tags: python parsing iteration lxml


    【Solution 1】:

    You can use an XPath expression as follows:

    for elem in tree.xpath("//div[@id='slider1']"):
    

    Example:

    >>> import urllib2
    >>> import lxml.etree
    >>> url = 'http://www.tv3.ie/3player'
    >>> data = urllib2.urlopen(url)
    >>> parser = lxml.etree.HTMLParser()
    >>> tree = lxml.etree.parse(data,parser)
    >>> elem = tree.xpath("//div[@id='slider1']")
    >>> elem[0].attrib
    {'style': 'position: relative;', 'id': 'slider1'}
    

    You need to analyse the content of the page you are processing more carefully (a good approach is to use Firefox with the Firebug add-on).

    The <img> tag you are trying to get is actually a child of the <a> tag:

    >>> for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    ...    for elem_a in elem.xpath("./a"):
    ...       for elem_img in elem_a.xpath("./img"):
    ...          print '<A> HREF=%s'%(elem_a.attrib['href'])
    ...          print '<IMG> ALT="%s"'%(elem_img.attrib['alt'])
    <A> HREF=/3player/show/392/58784/1/Tallafornia
    <IMG> ALT="3player | Tallafornia, 01/02/2013. A fresh romance blossoms in the Tallafornia house. Marc challenges Cormac to a 'bench off' in the gym"
    <A> HREF=/3player/show/46/58765/1/Coronation-Street
    <IMG> ALT="3player | Coronation Street, 01/02/2013. Tyrone bumps into Kirsty in the street and tries to take Ruby from her pram"
    ../..
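
    The nested loops above can also be collapsed: an XPath expression can address attributes directly with @name and return their values as plain strings. A sketch against a stand-in snippet that mirrors the structure described above (the real page is assumed, not fetched):

```python
from lxml import etree

# minimal markup mirroring the slider1/gridshow structure
# (an assumption for illustration, not the live page)
html = """<div id='slider1'><div id='gridshow'>
<a href='/3player/show/392/58784/1/Tallafornia'>
<img alt='3player | Tallafornia, 01/02/2013.' src='http://content.tv3.ie/x.jpg'/>
</a></div></div>"""
tree = etree.fromstring(html, etree.HTMLParser())

# @href / @alt select the attribute values themselves
hrefs = tree.xpath("//div[@id='slider1']//div[@id='gridshow']/a/@href")
alts = tree.xpath("//div[@id='slider1']//div[@id='gridshow']/a/img/@alt")
for href, alt in zip(hrefs, alts):
    print('<A> HREF=%s' % href)
    print('<IMG> ALT="%s"' % alt)
```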
    

    【Discussion】:

      【Solution 2】:

      This is how I got it working for myself; I'm not sure it's the best approach, so comments are welcome:

      import urllib2, re
      from lxml import etree
      from datetime import datetime
      
      def wgetUrl(target):
          try:
              req = urllib2.Request(target)
              req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
              response = urllib2.urlopen(req)
              outtxt = response.read()
              response.close()
          except:
              return ''
          return outtxt
      
      start = datetime.now()
      
      newUrl = 'http://www.tv3.ie/3player' # homepage
      
      data = wgetUrl(newUrl)
      parser = etree.HTMLParser()
      tree   = etree.fromstring(data, parser)
      
      for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow'] | //div[@id='slider1']//div[@id='gridshow']//img[@class='shadow smallroundcorner']"):
          if elem.tag == 'img':
              img = elem.attrib.get('src')
              print 'img: ', img
      
          if elem.tag == 'div':
              show = elem[0].attrib.get('href')
              print 'show: ', show
              titleData = elem[0].attrib.get('title')
      
              # "3player | <title>, <dd/mm/yyyy>. <description>" -> three groups
              match=re.search("3player\s+\|\s+(.+),\s+(\d\d/\d\d/\d\d\d\d)\.\s*(.*)", titleData)
              title=match.group(1)
              print 'title: ', title
      
              description = match.group(3)
              print 'description: ', description
      
              date = elem[3].text
              duration = elem[4].text
              print 'date: ', date
              print 'duration: ', duration
      
      end = datetime.now()
      print 'time took was ', (end-start)
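
      The title-parsing regex above can be sanity-checked on its own; a small sketch using a title string taken from the question's output:

```python
import re

# sample 'title' attribute value quoted in the question
titleData = ("3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, "
             "living in a beachside villa in Santa Ponsa, Majorca.")

# group 1 = show title, group 2 = dd/mm/yyyy date, group 3 = description
match = re.search(r"3player\s+\|\s+(.+),\s+(\d\d/\d\d/\d\d\d\d)\.\s*(.*)", titleData)
title = match.group(1)       # 'Tallafornia'
date = match.group(2)        # '11/01/2013'
description = match.group(3)
print(title, date, description)
```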
      

      The timing is pretty good, although compared with BeautifulSoup the difference is not as big as I expected..

      【Discussion】:
