Python scraping - XPath syntax with multiple conditions

【Question title】: Python scraping - XPath syntax with multiple conditions
【Posted】: 2020-03-28 17:59:26
【Question】:

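I'm writing a simple scraper to pull flight prices from Kayak. I'm using XPath to scrape several data items (duration, airline, price, etc.) and storing each one in a list of 15 values (the number of results on a Kayak page).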

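My problem is that the scrape for the "price" variable returns more than 15 values, because in addition to the "best" result it also picks up the additional prices displayed for each result (see screenshot: the large font on the RHS vs. the two quotes at the bottom LHS).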

I have narrowed the problem down to the following:

1) The overall (working) XPath that extracts both kinds of value is:

'//a[@class="booking-link "]/span[@class="price option-text"]/span[@class = "price-text"]'

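2) The key to telling main prices from additional prices is the @id string. For both kinds of price the @id

(i) is partly randomly generated,
(ii) contains "-price-text" in both cases, and
(iii) contains "extra-info" only for additional prices,

for example:
- Main price: //*[@id="pck6-mb-aE-1d84916e1b2-price-text"]
- Additional price: //*[@id="NB5A-extra-info-hmb-tE-15ae5bd2e33-price-text"]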

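How do I write an XPath that extracts only the main prices, i.e. filters out anything whose @id contains the string "extra-info"? I have tried several approaches (examples below) but can't seem to get the syntax right. Any help appreciated, thanks!

Examples tried: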

'//a[@class="booking-link "]/span[@class="price option-text"]/span[@class = "price-text" and not[contains(@id,"extra-info")]]'

'//a[@class="booking-link "]//span[@class="price option-text"]//span[[not[contains(@id,"extra-info")]//span[contains(@id,"-price-text")]]'

'//a[@class="booking-link "]/span[@class="price option-text"]/span[len(@id==33)]'


【Comments】:

Tags: python-3.x selenium xpath web-scraping

【Solution 1】:

Try something like:

//a[@class="booking-link "]/span[@class="price option-text"]/span[@class="price-text"][not(contains(@id,"extra-info"))]
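
The attempts in the question fail mainly on syntax: `not` is an XPath function and is called with parentheses, `not(contains(...))`, rather than wrapped in square brackets, and `len(...)` is not an XPath function at all (the XPath 1.0 equivalent is `string-length()`). Below is a minimal sketch for checking the filter logic offline. The HTML fragment is a hypothetical stand-in for the Kayak markup, built only from the class names and ids quoted in the question, and the prices are made up. The standard-library `xml.etree.ElementTree` supports only a subset of XPath (no `not()` or `contains()`), so this sketch selects with the question's working path and applies the "extra-info" test in Python; a full XPath 1.0 engine such as the one behind Selenium or lxml should accept the complete expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the result markup (prices are made up).
SAMPLE = """
<div>
  <a class="booking-link ">
    <span class="price option-text">
      <span class="price-text" id="pck6-mb-aE-1d84916e1b2-price-text">$512</span>
    </span>
  </a>
  <a class="booking-link ">
    <span class="price option-text">
      <span class="price-text" id="NB5A-extra-info-hmb-tE-15ae5bd2e33-price-text">$498</span>
    </span>
  </a>
</div>
"""

# The working path from the question; it matches main and additional prices alike.
ALL_PRICES = (
    './/a[@class="booking-link "]'
    '/span[@class="price option-text"]'
    '/span[@class="price-text"]'
)


def main_prices(root):
    """Return price texts whose id does not contain 'extra-info'."""
    return [
        span.text.strip()
        for span in root.findall(ALL_PRICES)
        if "extra-info" not in span.get("id", "")
    ]


root = ET.fromstring(SAMPLE)
print(main_prices(root))
```

Filtering on what the id must not contain avoids depending on the randomly generated parts of the id or on its length.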

【Comments】:

【Solution 2】:

You can also use the ancestor axis to get the list of prices. Try the solution below:

//span[@class='custom-text'][contains(text(),'View Deal')]/ancestor::div[@class="multibook-dropdown"]//span[@class = "price-text"]
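
To make the `ancestor::` step concrete, here is a rough sketch of what that expression does: find the "View Deal" span, climb to any enclosing `div` with class `multibook-dropdown`, then collect every `price-text` span underneath it. The markup is hypothetical (the real Kayak structure is not shown in the question), the `other-block` class and the prices are invented for illustration, and because the standard-library `xml.etree.ElementTree` has no `ancestor` axis, the climb is emulated with a parent map.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one block with a 'View Deal' label, one unrelated block.
SAMPLE = """
<div>
  <div class="multibook-dropdown">
    <a class="booking-link ">
      <span class="price option-text">
        <span class="price-text" id="pck6-mb-aE-1d84916e1b2-price-text">$512</span>
      </span>
    </a>
    <span class="custom-text">View Deal</span>
  </div>
  <div class="other-block">
    <span class="price-text" id="NB5A-extra-info-hmb-tE-15ae5bd2e33-price-text">$498</span>
  </div>
</div>
"""


def prices_near_view_deal(root):
    """Emulate the ancestor-axis lookup: View Deal span -> dropdown div -> prices."""
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = []  # matched price elements, without duplicates
    for span in root.iter("span"):
        if span.get("class") != "custom-text" or "View Deal" not in (span.text or ""):
            continue
        # Walk up through every ancestor, as the ancestor:: axis does.
        node = parent_of.get(span)
        while node is not None:
            if node.tag == "div" and node.get("class") == "multibook-dropdown":
                for price in node.findall('.//span[@class="price-text"]'):
                    if price not in found:
                        found.append(price)
            node = parent_of.get(node)
    return [price.text.strip() for price in found]


print(prices_near_view_deal(ET.fromstring(SAMPLE)))
```

Whether this anchor isolates the main prices depends on where the "View Deal" label sits in the real page, which would need checking against the live markup.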

【Comments】: