【Question title】: How to parse HTML inside CDATA using Python?
【Posted】: 2021-03-01 21:41:32
【Question description】:

I am getting an XML object from a website that looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>

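I need to parse the table row inside the CDATA section. I tried passing the response to lxml.html.fromstring(), but the resulting output ignores the CDATA content. Is there a way to get everything inside the CDATA section using lxml or another Python library?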

【Tags】: python-3.x xml lxml


    【Solution 1】:

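    Use BeautifulSoup. It exposes each CDATA section as a CData object, which is a subclass of NavigableString, so you can walk the text nodes of the tree and pick out the CDATA sections by type.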

    import bs4
    
    data = """<?xml version='1.0' encoding='UTF-8'?>
    <partial-response id="j_id1">
        <changes>
            <update id="loginForm:tabelaProcessos">
                <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
            </update>
            <update id="j_id1:javax.faces.ViewState:0">
                <![CDATA[-8530455S7417:3382887371AS10732]]>
            </update>
            <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
        </changes>
    </partial-response>"""
    
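    # Parse the response; the parser keeps CDATA sections as bs4.CData nodes.
    soup = bs4.BeautifulSoup(data, 'html.parser')
    
    # Walk every text node and keep only the CDATA sections.
    # find_all(string=True) is the current spelling of findAll(text=True).
    for cd in soup.find_all(string=True):
        if isinstance(cd, bs4.CData):
            print(f'CData contents: {cd!r}')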
    

    Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
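
The loop above prints each CDATA section as a raw string, so the table row still has to be parsed as HTML in a second step. As a rough sketch of that two-step idea using only the standard library (not the method from the answer), the XML can be parsed first, after which an XML parser hands back the CDATA content as the ordinary text of the update element, and that text can then be fed to an HTML parser. The sample data is a shortened stand-in for the response in the question, and the names XML_DATA, CellTextCollector and extract_cells are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened, hypothetical stand-in for the response shown in the question.
XML_DATA = """<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr role="row"><td role="gridcell"><span title="XPT">08454.8100</span></td><td role="gridcell"><span title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
    </changes>
</partial-response>"""


class CellTextCollector(HTMLParser):
    """Collect the text found inside each <td> of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text nested in <span> inside a <td> still arrives here.
        if self._in_cell:
            self.cells[-1] += data


def extract_cells(xml_bytes, update_id):
    """Return the cell texts of the HTML row stored in the given <update>."""
    root = ET.fromstring(xml_bytes)
    for update in root.iter("update"):
        if update.get("id") == update_id:
            # Step 1: the CDATA content is simply the element's text.
            fragment = update.text or ""
            # Step 2: parse that text as HTML.
            collector = CellTextCollector()
            collector.feed(fragment)
            collector.close()
            return [cell.strip() for cell in collector.cells]
    return []


cells = extract_cells(XML_DATA.encode("utf-8"), "loginForm:tabelaProcessos")
print(cells)
```

The same split should carry over to lxml: parsing the response with lxml.etree.fromstring() (as bytes, since lxml rejects a str that carries an encoding declaration) and passing the update element's .text to lxml.html.fromstring() would be the likely equivalent, though that variant is not tested here.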
