【Question title】: lxml strip_tags results in AttributeError
【Posted】: 2014-11-21 17:53:21
【Question】:

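I need to clean up an HTML file, e.g. remove redundant "span" tags. A "span" is considered redundant if its font-weight and font-style are the same as those of its parent node according to the CSS file (which I converted to a dictionary for faster lookup).

The HTML file looks like this: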

<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>

The CSS styles I have already stored in a dictionary:

{'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique', 
 'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic', 
 'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic', 
 'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal', 
 'Title': 'font-style: oblique; text-align: center; font-weight: bold', 
 'norm': 'font-style: normal; text-align: center; font-weight: normal'}

So, given that <p Title><span id xxxxx> and <p norm><span bbbbbb> have the same font-weight and font-style in the CSS dictionary, I want to get the following result:

<p class= "Title">blablabla bla prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss aa </p>

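In addition, I can delete some spans just by looking at their id: if it contains "af", I delete the span without consulting the dictionary.

So, in my script I have: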

from lxml import etree
from asteval import Interpreter

tree = etree.parse("filename.html")

aeval = Interpreter()
filedic = open('dic_file', 'rb')
fileread = filedic.read()
new_dic = aeval(fileread)

def no_af(tree):
    for badspan in tree.xpath("//span[contains(@id, 'af')]"):
        badspan.getparent().remove(badspan)
    return tree

def no_normal():
    no_af(tree)

    for span in tree.xpath('.//span'):
        span_id = span.xpath('@id')

        for x in span_id:
            if x in new_dic:
                get_style = x
                parent = span.getparent()
                par_span = parent.xpath('@class')
                if par_span:
                    for ID in par_span:
                        if ID in new_dic:
                            get_par_style = ID
                            if 'font-weight' in new_dic[get_par_style] and 'font-style' in new_dic[get_par_style]:
                                if 'font-weight' in new_dic[get_style] and 'font-style' in new_dic[get_style]:
                                    if new_dic[get_par_style]['font-weight'] == new_dic[get_style]['font-weight'] and new_dic[get_par_style]['font-style'] == new_dic[get_style]['font-style']:
                                        etree.strip_tags(parent, 'span')

    print etree.tostring(tree, pretty_print=True, method="html", encoding="utf-8")

This results in:

AttributeError: 'NoneType' object has no attribute 'xpath'

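And I know that it is precisely the line "etree.strip_tags(parent, 'span')" that triggers the error, because when I comment it out and print something after any other line, everything works.

Also, I am not sure that etree.strip_tags(parent, 'span') does what I need. What if the parent contains several spans with different formatting? Will this call strip all of them anyway? I actually need to strip only one span, the current one obtained at the top of the loop in "for span in tree.xpath('.//span'):".

I have been staring at this error all day and I feel I am overlooking something... I badly need your help!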

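【Question comments】:

  • This is a misuse of the id attribute in your span elements. You may find it "in the wild", but unless every span is unique, class is the proper specifier, not id
  • Every span's id is unique, though the classes in p are not.
  • You're right, it should be bbbbb! Sorry ((

Tags: python xpath lxml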


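【Solution 1】:

lxml is great, but it provides a fairly low-level "etree" data structure and does not have the broadest set of editing operations built in. What you need is an "unwrap" operation that you can apply to an individual element to keep its text, any child elements, and its "tail" in the tree, but not the element itself. Here is such an operation (plus a helper function it needs):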

def noneCat(*args):
    """
    Concatenate arguments. Treats None as the empty string, though it returns
    the None object if all the args are None. That might not seem sensible, but
    it works well for managing lxml text components.
    """
    for ritem in args:
        if ritem is not None:
            break
    else:
        # Executed only if loop terminates through normal exhaustion, not via break
        return None

    # Otherwise, grab their string representations (empty string for None)
    return ''.join((unicode(v) if v is not None else "") for v in args)


def unwrap(e):
    """
    Unwrap the element. The element is deleted and all of its children
    are pasted in its place.
    """
    parent = e.getparent()
    prev = e.getprevious()

    kids = list(e)
    siblings = list(parent)

    # parent inherits children, if any
    sibnum = siblings.index(e)
    if kids:
        parent[sibnum:sibnum+1] = kids
    else:
        parent.remove(e)

    # prev node or parent inherits text
    if prev is not None:
        prev.tail = noneCat(prev.tail, e.text)
    else:
        parent.text = noneCat(parent.text, e.text)

    # last child, prev node, or parent inherits tail
    if kids:
        last_child = kids[-1]
        last_child.tail = noneCat(last_child.tail, e.tail)
    elif prev is not None:
        prev.tail = noneCat(prev.tail, e.tail)
    else:
        parent.text = noneCat(parent.text, e.tail)
    return e

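Now, you have already done part of the work of decomposing the CSS and of deciding whether one CSS selector (span#id) should be considered a redundant specification with respect to another selector (p.class). Let's extend that and wrap it into a function: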

cssdict = { 'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique',
            'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
            'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
            'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
            'Title': 'font-style: oblique; text-align: center; font-weight: bold',
            'norm': 'font-style: normal; text-align: center; font-weight: normal'
          }

RELEVANT = ['font-weight', 'font-style']

def parse_css_spec(s):
    """
    Decompose CSS style spec into a dictionary of its components.
    """
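    # Skip empty parts (e.g. after a trailing ';') and split on the first
    # ':' only, so values that contain a colon stay intact.
    parts = [ p.strip() for p in s.split(';') if p.strip() ]
    attpairs = [ p.split(':', 1) for p in parts ]
    attpairs = [ (k.strip(), v.strip()) for k,v in attpairs ]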
    return dict(attpairs)

cssparts = { k: parse_css_spec(v) for k,v in cssdict.items() }
# pprint(cssparts)

def redundant_span(span_css_name, parent_css_name, consider=RELEVANT):
    """
    Determine if a given span is redundant with respect to its parent,
    considering specific attribute names. If the span's attribute
    values are the same as the parent's, consider it redundant.
    """
    span_spec = cssparts[span_css_name]
    parent_spec = cssparts[parent_css_name]
    for k in consider:
        # Any differences => not redundant
        if span_spec[k] != parent_spec[k]:
            return False
    # Everything matches => is redundant
    return True
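To sanity-check the comparison against the sample dictionary from the question, here is a small standalone Python 3 sketch that uses only the standard library. The function name is_redundant and its tolerant behaviour (unknown names or missing properties count as "not redundant" instead of raising KeyError) are assumptions added for illustration; they are not part of the answer's redundant_span.

```python
# Sample style dictionary taken from the question.
cssdict = {
    'xxxxx': 'font-weight: bold; font-size: 8.0pt; font-style: oblique',
    'yyyyy': 'font-weight: normal; font-size: 9.0pt; font-style: italic',
    'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
    'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
    'Title': 'font-style: oblique; text-align: center; font-weight: bold',
    'norm': 'font-style: normal; text-align: center; font-weight: normal',
}


def parse_css_spec(s):
    """Split 'name: value; name: value' into a dict, skipping parts without ':'."""
    pairs = (part.split(':', 1) for part in s.split(';') if ':' in part)
    return {k.strip(): v.strip() for k, v in pairs}


cssparts = {k: parse_css_spec(v) for k, v in cssdict.items()}


def is_redundant(span_name, parent_name, consider=('font-weight', 'font-style')):
    """Hypothetical tolerant variant of the redundancy check."""
    span_spec = cssparts.get(span_name)
    parent_spec = cssparts.get(parent_name)
    if span_spec is None or parent_spec is None:
        return False  # unknown id or class: keep the span
    return all(
        k in span_spec and span_spec[k] == parent_spec.get(k)
        for k in consider
    )


for span_name, parent_name in [('xxxxx', 'Title'), ('yyyyy', 'Title'),
                               ('aaaa', 'norm'), ('bbbbbb', 'norm')]:
    print(span_name, parent_name, is_redundant(span_name, parent_name))
```

With the sample data, only the pairs xxxxx/Title and bbbbbb/norm agree on both properties, which matches the spans the question wants removed.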

OK, preparation done, time for the main show:

import lxml.html
from lxml.html import tostring

source = """
<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>
"""

h = lxml.html.document_fromstring(source)

print "<!-- before -->"
print tostring(h, pretty_print=True)
print

for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        unwrap(span)

print "<!-- after -->"
print tostring(h, pretty_print=True)

Yields:

<!-- before -->
<html><body>
<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss <span id="bbbbbb"> aa </span> </p>
</body></html>


<!-- after -->
<html><body>
<p class="Title">blablabla bla prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss  aa  </p>
</body></html>

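Update

On second thought, you don't need unwrap. I used it because it was handy in my toolbox. You can get the same result with a mark-and-sweep approach and etree.strip_tags, like so: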

from lxml import etree  # strip_tags lives in lxml.etree

for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        span.tag = "JUNK"
etree.strip_tags(h, "JUNK")
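The reason for marking first and stripping once at the end can be seen in a small experiment. The following Python 3 sketch (with made-up sample markup) suggests what probably happened in the question: etree.strip_tags(parent, 'span') unwraps every span below the parent, and the span objects already collected by the XPath query are left detached, so a later getparent() on them returns None.

```python
import lxml.html
from lxml import etree

# Hypothetical sample with two spans under the same parent.
p = lxml.html.fromstring(
    '<p class="norm">a <span id="s1">b</span> c <span id="s2">d</span> e</p>'
)
spans = p.xpath('.//span')  # references collected before any stripping

# Stripping by tag name on the parent unwraps every <span> below it,
# not just the one currently being examined.
etree.strip_tags(p, 'span')

remaining = p.xpath('.//span')        # spans still present in the tree
orphan_parent = spans[1].getparent()  # parent of the second, now detached, span
text_after = p.text_content()         # the text itself is preserved

print(remaining, orphan_parent, repr(text_after))
```

Renaming only the redundant spans to a marker tag such as "JUNK" and calling strip_tags once after the loop avoids both problems: other spans keep their tag, and no element is detached while the loop is still running.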

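【Comments】:

  • Wow, that's great! I'll try it tomorrow. For now I have one question: in a line like "span_id = span.attrib.get('id', None)", what does None refer to?
  • It guards against the possibility that there is no id attribute. In this case, because the XPath expression selects only nodes that have an id attribute, the guard is not terribly important. The similar guard for the parent, however, is important, because there is no guarantee that it has a class attribute at that point. That does not happen in this example, but the more general the HTML you parse, the fewer guarantees you inherently have and the more such guards matter.
  • Thanks for the explanation! So "span_id = span.attrib.get('id', None)" can be read as "get the id attribute of the span element, if it has an id attribute"?
  • Exactly. You could also write span.attrib['id']; just be prepared to catch a KeyError if the id attribute does not exist.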
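The difference between the two lookups discussed in the comments above can be sketched as follows (Python 3, with a made-up element that deliberately has no id attribute):

```python
import lxml.html

# Hypothetical fragment: the span has a class but no id attribute.
p = lxml.html.fromstring('<p><span class="note">text</span></p>')
span = p.find('span')

# .get() with a default never raises; it returns the default when the
# attribute is missing.
missing = span.attrib.get('id', None)
same = span.get('id')  # Element.get() behaves the same way

# Indexing the attribute mapping raises KeyError for a missing attribute.
try:
    span.attrib['id']
    raised = False
except KeyError:
    raised = True

print(missing, same, raised)
```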