Split HTML by Regex

【Question Title】: Split HTML by Regex
【Posted】: 2018-12-02 12:07:51
【Question Description】:

So I have this HTML:

div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"

I'm trying to split it into a list like this:

[class="price", itemprop="offers", itemscope, itemtype="http://schema.org Offer"]

But I'm not sure how to split off the itemscope part.

My current regex looks like this: (\s.*?\"\s*.*?\s*\"). The problem is that when I split the tag into a list, itemscope and itemtype="http://schema.org Offer" end up as a single element, so my list looks like this:

[class="price", itemprop="offers", itemscope itemtype="http://schema.org Offer"]

Any idea how to fix this?

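【Comments】:

I wouldn't recommend regex. Use BeautifulSoup instead.
I'm already using BS for other things. What I'm trying to do here is convert an HTML tag like this one into an XPath in order to automate something. To do that, I need to split that HTML tag.
You can get the list of attributes in BeautifulSoup, see answer.
See this question for why regex isn't the best tool for this: stackoverflow.com/questions/6751105/…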
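
For the narrow case in the question (a single tag whose attribute values are all double-quoted), one option is to match each attribute instead of splitting on whitespace, so that a bare attribute such as itemscope becomes its own element. This is only a minimal sketch: `ATTRIBUTE` and `split_attributes` are hypothetical names, and the pattern assumes double-quoted values with no embedded quotes. It does not handle single-quoted or unquoted values, which is one reason the parser-based answers are the more robust route.

```python
import re

# An attribute name, optionally followed by ="...". Assumes double-quoted
# values that contain no double quotes themselves.
ATTRIBUTE = re.compile(r'[\w:-]+(?:="[^"]*")?')


def split_attributes(tag_text):
    """Return the attributes of one tag as strings, dropping the tag name."""
    tokens = ATTRIBUTE.findall(tag_text)
    return tokens[1:]  # tokens[0] is the tag name itself, e.g. 'div'


tag = 'div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"'
print(split_attributes(tag))
```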

Tags: python html regex

【Solution 1】:

The lxml package provides some nice ways to work with XPaths and attributes on HTML elements.

Here is an example:

from io import StringIO
from lxml import etree

html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'

tree = etree.parse(StringIO(html), etree.HTMLParser())
doc = tree.getroot()

xpaths = [tree.getpath(element) for element in doc.iter()]

print(xpaths)

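# For every node, build a list of (XPath attribute name, value) pairs.
attributes_ = ([(f'@{att}', node.attrib[att]) for att in node.attrib]
               for node in doc.iter())
# Drop the empty lists produced by nodes without attributes (html, body).
attributes = [item for item in attributes_ if item]
print(attributes)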

Output:

['/html', '/html/body', '/html/body/div']

[[('@class', 'price'), ('@itemprop', 'offers'), ('@itemscope', ''), ('@itemtype', 'http://schema.org Offer')]]

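【Comments】:

【Solution 2】:

If you'd rather not use Beautiful Soup, Python ships with the html.parser module, which includes an HTML parser. Here is an example of how to use it.

(I changed the example HTML to a properly defined div.)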

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    data = dict()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for attr_name, value in attrs:
            print(f'{attr_name}: {value}')
            self.data[attr_name] = value

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'
parser.feed(html)
print(parser.data)

Output:

Encountered a start tag: div
class: price
itemprop: offers
itemscope: None
itemtype: http://schema.org Offer
Encountered an end tag : div
{'class': 'price', 'itemprop': 'offers', 'itemscope': None, 'itemtype': 'http://schema.org Offer'}
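
Since the goal stated in the question comments is to turn the tag into an XPath, the (name, value) pairs that handle_starttag receives could be assembled into an XPath expression directly. The following is a minimal sketch rather than a complete solution: `StartTagCollector` and `tag_to_xpath` are hypothetical names, only the first start tag is used, and values containing double quotes are not escaped.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects (tag, attrs) for every start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # stored on the instance, not on the class

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def tag_to_xpath(html):
    """Hypothetical helper: build an XPath for the first start tag in html."""
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    tag, attrs = collector.start_tags[0]
    predicates = []
    for name, value in attrs:
        if value is None:
            # Bare attribute such as itemscope: test for presence only.
            predicates.append(f'[@{name}]')
        else:
            # Assumes the value contains no double quotes.
            predicates.append(f'[@{name}="{value}"]')
    return f'//{tag}' + ''.join(predicates)


html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'
xpath = tag_to_xpath(html)
print(xpath)
```

For the example div, this prints //div[@class="price"][@itemprop="offers"][@itemscope][@itemtype="http://schema.org Offer"].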

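【Comments】:

Yes... except that it takes more work and isn't as robust... You might want to consider storing data on the object instance rather than on the class, since it could otherwise cause surprises if the parser is reused later...
Please define "not as robust". Are you referring to the handling of broken HTML?
Yes... that's what I meant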
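
The point about storing data on the instance rather than on the class can be illustrated with a short sketch. The class names `SharedDataParser` and `InstanceDataParser` are hypothetical and exist only for this comparison. A dict defined in the class body is shared by every parser instance, whereas one created in `__init__` belongs to a single parser.

```python
from html.parser import HTMLParser


class SharedDataParser(HTMLParser):
    data = dict()  # class attribute: one dict shared by all instances

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.data[name] = value


class InstanceDataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = {}  # instance attribute: a fresh dict per parser

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.data[name] = value


first = SharedDataParser()
first.feed('<div class="price"></div>')
second = SharedDataParser()
second.feed('<span id="total"></span>')
print(second.data)  # also contains 'class', left over from the first parser

a = InstanceDataParser()
a.feed('<div class="price"></div>')
b = InstanceDataParser()
b.feed('<span id="total"></span>')
print(b.data)  # only the attributes that b itself parsed
```

With the class-level dict, the second parser reports attributes it never saw. With the instance-level dict, each parser starts empty.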