How to get the next siblings of a selection with -soup-contains in BeautifulSoup

【Question title】: How to get the next two siblings of a selection with -soup-contains in BeautifulSoup
【Posted】: 2021-11-13 19:47:29
【Question】:

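Intent: I am extracting data about all countries from Wikipedia. I want my parser to be generic enough to work for every country.

Say I am currently extracting the GDP (PPP) of every country. On Wikipedia it sits in an infobox table. The problem is that GDP (PPP) is split across 3 different rows of that table.

Here is the structure: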

   <th scope="row" class="infobox-label">
                              <a href="/wiki/Gross_domestic_product" title="Gross domestic product">GDP</a>&#160;
                              <style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style>
                              <span class="nobold">(<a href="/wiki/Purchasing_power_parity" title="Purchasing power parity">PPP</a>)</span>
                           </th>
                           <td class="infobox-data">2020&#160;estimate</td>
                        </tr>
                        <tr class="mergedrow">
                           <th scope="row" class="infobox-label">
                              <div class="ib-country-fake-li">•&#160;Total</div>
                           </th>
                           <td class="infobox-data"><img alt="Increase" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" decoding="async" title="Increase" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" data-file-width="300" data-file-height="300" /> $1.391 trillion<sup id="cite_ref-IMFWEOEG_10-0" class="reference"><a href="#cite_note-IMFWEOEG-10">&#91;10&#93;</a></sup>&#32;(<a href="/wiki/List_of_countries_by_GDP_(PPP)" title="List of countries by GDP (PPP)">20th</a>)</td>
                        </tr>
                        <tr class="mergedbottomrow">
                           <th scope="row" class="infobox-label">
                              <div class="ib-country-fake-li">•&#160;Per capita</div>
                           </th>
                           <td class="infobox-data"><img alt="Increase" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png" decoding="async" title="Increase" width="11" height="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/17px-Increase2.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/22px-Increase2.svg.png 2x" data-file-width="300" data-file-height="300" /> $14,023<sup id="cite_ref-IMFWEOEG_10-1" class="reference"><a href="#cite_note-IMFWEOEG-10">&#91;10&#93;</a></sup>&#32;(<a href="/wiki/List_of_countries_by_GDP_(PPP)_per_capita" title="List of countries by GDP (PPP) per capita">92nd</a>)</td>
                        </tr>

Here is what I have tried so far:

import requests
from bs4 import BeautifulSoup

site = "http://en.wikipedia.org/wiki/Brazil"
country = requests.get(site)
countryPage = BeautifulSoup(country.content, "html.parser")
infoBox = countryPage.find("table", class_="infobox ib-country vcard")
# find GDP PPP
tds = infoBox.select('th:-soup-contains("PPP") + tr')
print(tds)

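Problem: despite the '+ tr' in the CSS selector, the code prints the GDP PPP row itself, not the rows after it.

Can anyone tell me what I am doing wrong? How do I select the table rows that follow the row found with the CSS selector?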

【Comments】:

Tags: python web-scraping beautifulsoup css-selectors wikipedia

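【Solution 1】:

I don't think you can achieve what you want with a CSS selector. One way or another, you have to store the rows or get the index of the matching row. If you turn the select result into a generator, you can use next: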

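# `soup` is the parsed page (countryPage in the question)
trs = iter(soup.select('tr'))  # an iterator, so next() can advance past the match
for tr in trs:
    if 'PPP' in tr.text:
        # the two rows that follow the GDP (PPP) header row
        print(next(trs).text)
        print(next(trs).text)
        break  # stop after the first match
# Output:
# • Total $3.328 trillion[8] (8th)
# • Per capita $15,642[8] (84th)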
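
The same idea can also be written with BeautifulSoup's sibling navigation instead of a generator. This is only a minimal sketch, not the answerer's code: the HTML is a trimmed, hypothetical stand-in for the infobox rows in the question (the last row is a made-up placeholder), and `find_next_siblings` with `limit=2` collects the two rows after the matched one.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the infobox rows shown in the question.
html = """
<table>
  <tr><th>GDP (PPP)</th><td>2020 estimate</td></tr>
  <tr><th>Total</th><td>$1.391 trillion</td></tr>
  <tr><th>Per capita</th><td>$14,023</td></tr>
  <tr><th>Other label</th><td>other value</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the first row whose text mentions PPP.
ppp_row = next(tr for tr in soup.find_all("tr") if "PPP" in tr.get_text())

# find_next_siblings returns only the following <tr> tags (whitespace
# text nodes between rows are skipped); limit=2 keeps the first two.
following = ppp_row.find_next_siblings("tr", limit=2)
rows = [tr.get_text(" ", strip=True) for tr in following]
print(rows)
```

On a real page, `next(...)` raises `StopIteration` when no row mentions PPP, so a default value or a guard would be needed there.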

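【Comments】:

This was really helpful. How would you suggest generalizing it to extract all of the country's attributes from the infobox? Do I have to write them one by one, or is there a smarter way?
The smart way is to use Wikidata, but that is hard (there is a Stack Exchange site dedicated to these questions: opendata.stackexchange.com). Scraping the infobox is hit-or-miss but easier (use a dictionary if you already know which data you want to retrieve). It's your call.
Great, thanks.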
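
One possible reading of the dictionary suggestion, as a rough sketch: collect every label/value row of the infobox into a dict, then look up the fields you care about. The function name `infobox_to_dict`, the `wanted` mapping, the "parent / sub-label" key format, and the simplified HTML are all assumptions made for illustration. Only the class names `infobox-label` and `infobox-data` come from the markup in the question.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for a country infobox.
html = """
<table class="infobox">
  <tr><th class="infobox-label">GDP (PPP)</th><td class="infobox-data">2020 estimate</td></tr>
  <tr><th class="infobox-label">• Total</th><td class="infobox-data">$1.391 trillion</td></tr>
  <tr><th class="infobox-label">• Per capita</th><td class="infobox-data">$14,023</td></tr>
</table>
"""

def infobox_to_dict(table):
    """Map each row label to its value; bulleted sub-rows are keyed as 'parent / label'."""
    data = {}
    parent = None
    for tr in table.find_all("tr"):
        th = tr.find("th", class_="infobox-label")
        td = tr.find("td", class_="infobox-data")
        if th is None or td is None:
            continue  # skip rows that are not label/value pairs
        label = th.get_text(" ", strip=True)
        value = td.get_text(" ", strip=True)
        if label.startswith("•") and parent is not None:
            # Drop the bullet and any (non-breaking) spaces after it.
            sub_label = label.lstrip("•\u00a0 ")
            key = parent + " / " + sub_label
        else:
            parent = label
            key = label
        data[key] = value
    return data

soup = BeautifulSoup(html, "html.parser")
info = infobox_to_dict(soup.find("table", class_="infobox"))

# Hypothetical mapping from your own field names to infobox labels.
wanted = {
    "gdp_ppp_total": "GDP (PPP) / Total",
    "gdp_ppp_per_capita": "GDP (PPP) / Per capita",
}
result = {name: info.get(label) for name, label in wanted.items()}
print(result)
```

Using `info.get(label)` means a label missing from a given country's infobox yields `None` instead of an exception, which matters because infobox layouts differ between pages.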

【Solution 2】:

With beautifulsoup4 4.9.3, to select the next sibling <tr> you can use:

soup.select_one('tr:has(th:-soup-contains("PPP"))~tr')

Or, if you want both of them:

soup.select('tr:has(th:-soup-contains("PPP"))~tr')[:2]

To get the text:

[x.text for x in soup.select('tr:has(th:-soup-contains("PPP"))~tr')[:2]]
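
To see why the selectors above work where 'th + tr' does not, here is a small self-contained sketch. The HTML is a simplified, hypothetical version of the infobox rows (the last row is a made-up placeholder). In this markup a <tr> is never a sibling of a <th>, so the match has to start from the row: `:has()` picks the <tr> that contains the PPP header, `~` then matches every later sibling row, and `+` would match only the immediately following one. It assumes a Soup Sieve version that understands `:-soup-contains()`.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the infobox rows in the question.
html = """
<table>
  <tr><th>GDP (PPP)</th><td>2020 estimate</td></tr>
  <tr><th>Total</th><td>$1.391 trillion</td></tr>
  <tr><th>Per capita</th><td>$14,023</td></tr>
  <tr><th>Other label</th><td>other value</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The <th> has a <td> as its next sibling, never a <tr>: nothing matches.
no_match = soup.select('th:-soup-contains("PPP") + tr')

# Every row after the row that contains the PPP header.
all_following = soup.select('tr:has(th:-soup-contains("PPP")) ~ tr')

# Only the row directly after it.
adjacent = soup.select('tr:has(th:-soup-contains("PPP")) + tr')

# Slice to keep just the two rows of interest.
first_two = [tr.get_text(" ", strip=True) for tr in all_following[:2]]
print(first_two)
```

Because `~` returns all later siblings, the `[:2]` slice is what limits the result to the Total and Per capita rows.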

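【Comments】:

This returned an error.
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
Are your libraries up to date? Maybe you should update them.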
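
A quick way to check what is installed, as a sketch: as far as I know, that NotImplementedError comes from Beautiful Soup releases older than 4.7.0, which predate the Soup Sieve based select(), and `:-soup-contains()` additionally needs Soup Sieve 2.1 or newer. The variable name `selector_supported` is just for illustration.

```python
import bs4
import soupsieve

# Report the installed versions of both libraries.
print("beautifulsoup4:", bs4.__version__)
print("soupsieve:", soupsieve.__version__)

# Try to compile the selector from the answer; an outdated Soup Sieve
# rejects it with a selector syntax error.
try:
    soupsieve.compile('tr:has(th:-soup-contains("PPP")) ~ tr')
    selector_supported = True
except Exception:
    selector_supported = False

print("selector supported:", selector_supported)
```

If the selector is not supported, upgrading with `pip install --upgrade beautifulsoup4 soupsieve` is the usual fix.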