BeautifulSoup: scrape the content of a cell beside another one (answers)

【Question title】: Beautifulsoup scrape content of a cell beside another one
【Posted】: 2017-06-12 16:38:13
【Question】:

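I am trying to scrape the content of a cell that sits beside another cell whose label I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl" and so on. In the picture, the content I want is always in the right-hand cell.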

The basic code is below, but I am stuck:

import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
stastaform = soup.find(text="Staatsform:")...???

Thanks a lot in advance!

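【Comments】:

Please include a snippet of the HTML that describes the two cells of interest.
Do you need only the text in the cell, or something else as well?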

Tags: python web-scraping beautifulsoup wikipedia

【Solution 1】:

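I wanted to be careful and restrict the search to what the English Wikipedia calls the "infobox". So I first searched for the heading "Basisdaten", requiring it to be inside a th element. That is perhaps not fully definitive, but it makes a false match less likely. Having found it, I walk through the tr elements that follow "Basisdaten" until I reach another tr that contains a (presumably different) heading. In this case I look for "Postleitzahlen:", but the same approach can find any or all of the items between "Basisdaten" and the next heading.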

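PS: I should also explain the reason for if not current.name. I noticed that some siblings consist of nothing but a newline, and BeautifulSoup treats these as strings. They have no tag name, so the code has to handle them specially.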

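import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')

def getInfoBoxBasisDaten(s):
    # Match the string "Basisdaten" only when it sits directly inside a th element.
    return str(s) == 'Basisdaten' and s.parent.name == 'th'

basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
# basisdaten.parent is the th and its parent is the tr; start with what follows that row.
current = basisdaten.parent.parent.next_sibling
while current is not None:  # stop cleanly if the siblings run out before another heading
    if not current.name:  # newline-only strings between rows have no tag name
        current = current.next_sibling
        continue
    if wanted in current.text:
        items = current.find_all('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):  # a row containing a th marks the next heading
        break
    current = current.next_sibling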

The result looks like this: two separate td elements, as requested.

<td><a href="/wiki/Postleitzahl_(Deutschland)" title="Postleitzahl (Deutschland)">Postleitzahlen</a>:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
<a href="/wiki/Neuwerk_(Insel)" title="Neuwerk (Insel)">27499</a></td>
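
The newline-only strings mentioned in the PS can be reproduced on a small inline table. The sketch below is only an illustration: the HTML is an invented miniature (not the real Wikipedia markup, and the cell values are placeholders), it uses the standard-library "html.parser" instead of lxml, and rows_until_next_header is a hypothetical helper name. It also shows that find_next_siblings('tr') yields tags only, which may be a tidier alternative to checking current.name by hand.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Invented miniature of an info table; the real page markup differs.
HTML = """<table>
<tr><th colspan="2">Basisdaten</th></tr>
<tr><td>Staatsform:</td><td>example value</td></tr>
<tr><td>Postleitzahlen:</td><td>20095–21149</td></tr>
<tr><th colspan="2">Politik</th></tr>
<tr><td>Sitze:</td><td>another example value</td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
header_row = soup.find("th", string="Basisdaten").parent

# The sibling right after the header row is the newline between two <tr> tags.
first_sibling = header_row.next_sibling
print(isinstance(first_sibling, NavigableString), isinstance(first_sibling, Tag))

def rows_until_next_header(row):
    """Map label text to value text for the rows after `row`, up to the next <th> row."""
    found = {}
    for sibling in row.find_next_siblings("tr"):  # tags only, strings are skipped
        if sibling.find("th") is not None:
            break  # the next heading ends this block
        cells = sibling.find_all("td")
        if len(cells) >= 2:
            found[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return found

basisdaten = rows_until_next_header(header_row)
print(basisdaten)
```

Returning a dictionary rather than printing inside the loop makes it easy to pick out several labels from one pass over the rows.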

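【Comments】:

This seems to work for me if I use BeautifulSoup.get_text() to strip the HTML markup and so on. Unfortunately, I get an error on this page: https://de.wikipedia.org/wiki/Bremen. Do you know what the cause is?
I just looked at the wiki code for both pages (in the Bearbeiten view). They take entirely different approaches to formatting the page, so the HTML is different. Apart from what I learned in high school, I have no German. I now see that the Bremen page has an "infobox" and the Hamburg page does not. The same thing happens on the English Wikipedia. If you want to scrape both, you need to be able to recognise which kind of formatting you are dealing with and process it accordingly.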
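
One possible way to cope with differing layouts is to stop depending on a particular heading and scan every table row for the label instead. The following is a minimal sketch under stated assumptions: the label is taken to be the first th or td of a row and the value the cell after it, find_value is a hypothetical helper name, and both HTML snippets are invented stand-ins with placeholder values rather than the actual Hamburg or Bremen markup.

```python
from bs4 import BeautifulSoup

def find_value(soup, label):
    """Return the text of the cell after the first row cell whose text is `label`.

    A trailing colon is ignored on both sides; None is returned when nothing matches.
    """
    target = label.strip().rstrip(":")
    for row in soup.find_all("tr"):
        # Only the row's own cells, not cells of tables nested inside it.
        cells = row.find_all(["th", "td"], recursive=False)
        if len(cells) < 2:
            continue
        if cells[0].get_text(strip=True).rstrip(":") == target:
            return cells[1].get_text(" ", strip=True)
    return None

# Assumed layout 1: label in a td, wrapped in a link, colon outside the link.
PLAIN_TABLE = (
    "<table><tr><td><a href='#'>Postleitzahlen</a>:</td>"
    "<td>first value,<br/>second value</td></tr></table>"
)
# Assumed layout 2: label in a th without a colon.
HEADER_CELL_TABLE = (
    "<table><tr><th>Postleitzahlen</th><td>third value</td></tr></table>"
)

plain_result = find_value(BeautifulSoup(PLAIN_TABLE, "html.parser"), "Postleitzahlen:")
header_result = find_value(BeautifulSoup(HEADER_CELL_TABLE, "html.parser"), "Postleitzahlen:")
missing_result = find_value(BeautifulSoup(PLAIN_TABLE, "html.parser"), "Staatsform")
print(plain_result, header_result, missing_result)
```

Whether this is enough for a given pair of pages depends on their real markup, so it is worth inspecting the HTML of each page before relying on it.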

【Solution 2】:

This works in most cases:

def get_content_from_right_column_for_left_column_containing(text):
    """Return the text contents of the cell adjoining a cell that contains `text`."""

    # Relies on a module-level `soup` (a BeautifulSoup object) created beforehand.
    navigable_strings = soup.find_all(string=text)

    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')

    if len(navigable_strings) == 0:

        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')

        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"

        navigable_strings = soup.find_all(string=altered_text)

    try:
        # Climb to the enclosing td, then take the next td in document order.
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
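
The colon handling in the function above is easier to follow on a tiny example. The sketch below uses an invented two-row table with placeholder values (not the real page) to show why a label written as plain text matches with its colon, while a label wrapped in a link matches only without it: an exact string match compares against each individual string node.

```python
from bs4 import BeautifulSoup

# Invented two-row table; only the cell structure matters here.
HTML = (
    "<table>"
    "<tr><td>Staatsform:</td><td>value A</td></tr>"
    "<tr><td><a href='#'>Postleitzahlen</a>:</td><td>value B</td></tr>"
    "</table>"
)
soup = BeautifulSoup(HTML, "html.parser")

# Plain-text label: the colon is part of the single string inside the td.
plain = soup.find_all(string="Staatsform:")
# Linked label: the link holds "Postleitzahlen" and the colon is a separate string.
linked_with_colon = soup.find_all(string="Postleitzahlen:")
linked_without_colon = soup.find_all(string="Postleitzahlen")
print(len(plain), len(linked_with_colon), len(linked_without_colon))

# In both cases, climbing to the enclosing td and moving to the next td gives the value.
value_a = plain[0].find_parent("td").find_next("td").get_text(strip=True)
value_b = linked_without_colon[0].find_parent("td").find_next("td").get_text(strip=True)
print(value_a, value_b)
```

Note that find_next('td') follows document order, so it reaches the adjacent cell only when the label cell contains no nested td of its own.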

【Comments】: