How to parse HTML inside CDATA using Python? Answer

[Question title]: How to parse HTML inside CDATA using Python?
[Posted]: 2021-03-01 21:41:32
[Question]:

I receive an XML object from a website that looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>

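I need to parse the table row inside the CDATA. I tried passing the response to lxml.html.fromstring(), but the resulting output ignores the CDATA content. Is there a way to get everything inside the CDATA using lxml or another Python library?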

[Question comments]:

Tags: python-3.x xml lxml

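[Solution 1]:

Use BeautifulSoup. With the html.parser backend it keeps each CDATA section as a CData object, which is a subclass of NavigableString, so you can pick the CDATA sections out of the document's strings by type.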

import bs4

data = """<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>"""

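# html.parser keeps each CDATA section as a bs4.CData string
soup = bs4.BeautifulSoup(data, 'html.parser')

# find_all(string=True) yields every string node (findAll/text= are the older spellings)
for cd in soup.find_all(string=True):
    # keep only the strings that came from a CDATA section
    if isinstance(cd, bs4.CData):
        print('CData contents: %r' % cd)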

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
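
The answer above stops at printing the raw CDATA string, while the question asks for the table row to be parsed. One possible way to go the rest of the way, using only the standard library, is to parse the response as XML first (an XML parser hands the CDATA back as the element's plain text) and then run an HTML parser over that text. An HTML parser does not treat a CDATA section as ordinary text, which would explain why the lxml.html.fromstring() attempt lost it. This is a minimal sketch, not part of the original answer: the sample is a shortened stand-in for the response in the question, and SpanCollector and parse_update are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened stand-in for the response in the question (assumption: same shape).
XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" role="row"><td role="gridcell"><span title="XPT">08454.8100</span></td><td role="gridcell"><span title="D45el">18/02/2021 04:35:30</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
    </changes>
</partial-response>"""


class SpanCollector(HTMLParser):
    """Hypothetical helper: collect (title, text) for every <span> in a fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._title = None  # title of the <span> currently open, if any
        self._parts = []    # text pieces seen inside that <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._title = dict(attrs).get("title", "")
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <span>.
        if self._title is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            self.cells.append((self._title, "".join(self._parts).strip()))
            self._title = None


def parse_update(xml_bytes, update_id):
    """Return the (title, text) cells of the <update> with the given id."""
    root = ET.fromstring(xml_bytes)  # step 1: XML parse; CDATA becomes .text
    for update in root.iter("update"):
        if update.get("id") == update_id:
            collector = SpanCollector()
            collector.feed(update.text or "")  # step 2: HTML parse of the fragment
            collector.close()
            return collector.cells
    return []


cells = parse_update(XML, "loginForm:tabelaProcessos")
print(cells)
```

The same two-step idea should carry over to lxml (lxml.etree for the XML envelope, then lxml.html for the fragment) or to a second BeautifulSoup pass over the CData string, though neither variant is shown here.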

[Comments]: