【Question title】: Remove all data attributes from all elements with etree
【Posted】: 2019-08-01 13:26:58
【Question description】:

I'm trying to clean up some HTML. I have the following function:

def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')

    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')

    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')

    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')

    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')

    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')

I'd also like to remove all data attributes, for example:

<li data-direction="ltr"
    data-listposition="center" data-data-id="dataItem-ifz7cqbs"
    data-state="menu idle link notMobile">sky</li>

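However, I don't know these attribute names in advance (the above is just an example).

So I'd like to convert the above into <li>sky</li>, and to run this on every element on the page.

In the code above I can remove simple things like id and class, but I'm not sure how to handle the dynamic data-* attributes. Maybe a regular expression?

Edit

I should clarify the input. My example above shows a <li> tag, but the actual input is the entire HTML of a page, so it will look something like this: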

<html>
  <ul>
    <li data-i="sdfdsf">something</li>
    <li data-i="dsfd">something</li>
  </ul>
  <p data-para="cvcv">content</p>
 <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank"  > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank"  > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>

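【Question comments】:

  • Will your final output always look like <li>some text</li>, or will the tags vary?
  • The tags will vary, @JackFleeting; they can also be div or span.
  • I've edited the question to clarify the input further.

Tags: python python-3.x scrapy lxml elementtree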


【Solution 1】:

Assuming the names of the "data attributes" always start with "data-", you can remove them like this:

for el in tree.xpath("//*"):
    for attr in el.attrib:
        if attr.startswith("data-"):
            el.attrib.pop(attr)
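
For reference, a self-contained version of the loop above could look like the sketch below. The function name `strip_data_attributes` and the sample markup are illustrative and not part of the question's code. Copying the attribute names with `list()` before deleting is a defensive choice, so the attribute mapping is never changed while it is being iterated.

```python
from lxml import etree


def strip_data_attributes(tree):
    """Remove every attribute whose name starts with 'data-', in place."""
    for el in tree.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        # Copy the names first so nothing is deleted while iterating.
        for name in list(el.attrib):
            if name.startswith("data-"):
                del el.attrib[name]
    return tree


# Hypothetical sample input, loosely based on the question's <li> example.
html = '<ul><li data-direction="ltr" data-state="menu idle" id="x">sky</li></ul>'
tree = etree.HTML(html)
strip_data_attributes(tree)
result = etree.tostring(tree, encoding="unicode", method="html")
print(result)
```

Note that `etree.HTML` wraps fragments in `<html><body>`, so the serialized result contains those wrapper elements as well.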

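【Comments】:

  • +1. You could also change the first XPath to //*[@*[starts-with(name(),'data-')]] so that only elements with an attribute starting with data- are processed.
  • Thanks a lot, this is exactly what I wanted.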
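
The XPath suggested in the comment can be combined with the attribute loop roughly as follows. This is a sketch on made-up sample markup (the `title` attribute and the plain `<li>` are added only to show what is left alone); the inner loop is still needed because the XPath selects elements, not the attributes themselves.

```python
from lxml import etree

# Hypothetical sample input, adapted from the question's edited example.
html = """
<html><body>
  <ul>
    <li data-i="sdfdsf">something</li>
    <li>plain</li>
  </ul>
  <p data-para="cvcv" title="kept">content</p>
</body></html>
"""
tree = etree.HTML(html)

# Select only elements that carry at least one attribute named data-*.
matches = tree.xpath("//*[@*[starts-with(name(), 'data-')]]")
for el in matches:
    # Copy the names first so nothing is deleted while iterating.
    for name in list(el.attrib):
        if name.startswith("data-"):
            del el.attrib[name]

result = etree.tostring(tree, encoding="unicode", method="html")
print(len(matches))
print(result)
```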
【Solution 2】:

You can strip the attributes like this:

import re

def strip_attribute(data):
    p = re.compile('data-[^=]*="[^"]*"')
    print(p)
    return p.sub('', data)

print(strip_attribute('with attribute'))

【Comments】:

  • Thanks, @kubarik, but it doesn't work on the updated HTML example I added to the question.
  • @kurupt_89, not sure, but maybe something like: (data-[^=]*=(').*?(')+)|(data-[^=]*=(").*?(")+)
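
One way to handle both quote styles with a regular expression is sketched below. The pattern and the names `DATA_ATTR` and `strip_data_attributes` are assumptions for illustration, not the commenter's exact expression. It only covers quoted values, it would also strip matching text that appears outside a tag, and regular expressions are generally fragile on real-world HTML, so a parser-based approach is usually safer.

```python
import re

# Whitespace, a data-* attribute name, then a value in double or single quotes.
DATA_ATTR = re.compile(r"""\s+data-[\w-]+\s*=\s*(?:"[^"]*"|'[^']*')""")


def strip_data_attributes(markup):
    """Remove quoted data-* attributes from an HTML string."""
    return DATA_ATTR.sub("", markup)


# Hypothetical sample: a single-quoted JSON value and a double-quoted value.
sample = """<div data-image-info='{"type":"Image","width":200}' class="x"><li data-i="dsfd">something</li></div>"""
cleaned = strip_data_attributes(sample)
print(cleaned)
```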
【Solution 3】:

Maybe this is what you're looking for:

from lxml import etree

code = """
 <html>
   <ul>
    <li data-i="sdfdsf">something</li>
    <li data-i="dsfd">something</li>
  </ul>
    <p data-para="cvcv">content</p> 
</html>

"""

xml = etree.XML(code)
elements = list(xml.iter())
for element in elements:
   # element.text is None for elements without text, so check it before strip()
   if element.text and element.text.strip():
      print('<'+element.tag+'>'+element.text+'</'+element.tag+'>')

Output:

<li>something</li>
<li>something</li>
<p>content</p>

【Comments】: