How can I exclude a specific tag (<br>) from Response.xpath?

【Question title】: How can I exclude a specific tag (<br>) from Response.xpath?
【Posted】: 2021-09-03 00:34:20
【Question description】:

Below is some sample source HTML. I want to get the text out of it as a single string (or a list).

<font class="news">
    <table border="0" cellspacing="0" cellpadding="0" align="right">
        <tr>
            <td style="padding-left:10px; padding-bottom:5px;">
                <a href="../1.jpg" target="_blank" onfocus='this.blur()'>
                    <img src="../pic1/small_16239927831.jpg" width="300" >
                </a>
            </td>
        </tr>
    </table>
    AAA<br><br>
    BBB<br><br>
    CCC<br>
</font>

I can get some results with

response.xpath('//font[@class="news"]/text()')

or

response.xpath('//font[@class="news"]/text()').extract()

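However, the result contains many \n or \n\t entries, and I only want "AAA BBB CCC" or ['AAA','BBB','CCC'].

I also tried normalize-space(), but it did not work. How can I exclude these newlines and tabs? This is what I currently get: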

['AAA', '\n\t\t', '\n\n\t\t', 'BBB', '\n\t\t', 'CCC', '\n\t' ]

【Comments on the question】:

Your question is not formatted properly. normalize-space should do the job. Can you share the source code?

Tags: python dom xpath extract

【Solution 1】:

This XPath:

normalize-space(//font[@class='news'])

gives this result:

AAA BBB CCC

【Discussion】:

Does this answer your question?