Get the text of a span in a span using BeautifulSoup

【Question title】: Get the text of a span in a span using BeautifulSoup
【Posted】: 2020-09-09 23:51:27
【Question】:

I'm trying to use Beautiful Soup to pull the City, Country, and Region from this site: https://www.geodatatool.com/en/?ip=82.47.160.231 (don't worry, that's not my IP; it's a dummy one).

This is what I'm trying:

    import requests
    from bs4 import BeautifulSoup

    ip = "82.47.160.231"  # the dummy IP from the link above
    url = "https://www.geodatatool.com/en/?ip=" + ip

    # Fetch the page as plain text.
    sourceCode = requests.get(url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText, "html.parser")  # name the parser explicitly

    tags = soup('span')
    # Parsing data.
    data_item = soup.body.findAll('div', 'data-item')

    #bold_item = data_item.findAll('span')
    for tag in tags:
        print(tag.contents)

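All I get is a list with the contents of every span on the page. I've been trying to narrow it down to just the fields I need, but I'm not getting anywhere.

Can anyone help me with this?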

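【Comments】:

Tags: web-scraping beautifulsoup

【Solution 1】:

This should work. Basically, we find every div with the class 'data-item'. Each of those divs holds two spans: the first is the label (City:, Country:, and so on) and the second contains the data.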

    data_items = soup.find_all('div', {'class': 'data-item'})

    # Country
    country = data_items[2].find_all('span')[1].text.strip()

    # City
    city = data_items[5].find_all('span')[1].text.strip()

    # Region
    region = data_items[4].find_all('span')[1].text.strip()
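Fixed positions like `data_items[5]` raise `IndexError` when the page returns fewer `data-item` divs than expected. The sketch below guards the lookup and returns `None` instead. The `SAMPLE_HTML` markup and the `value_at` helper are hypothetical, modeled on the label/value layout described above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup following the label-span / value-span layout
# described in this answer; not copied from the real site.
SAMPLE_HTML = """
<div class="sidebar-data">
  <div class="data-item"><span class="bold">Country:</span> <span>United Kingdom</span></div>
  <div class="data-item"><span class="bold">Region:</span> <span>England</span></div>
  <div class="data-item"><span class="bold">City:</span> <span>Plymouth</span></div>
</div>
"""


def value_at(data_items, index):
    """Return the text of the second span in data_items[index], or None."""
    if not 0 <= index < len(data_items):
        return None  # the page has fewer data-item divs than expected
    spans = data_items[index].find_all("span")
    if len(spans) < 2:
        return None  # this div does not follow the label/value layout
    return spans[1].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data_items = soup.find_all("div", {"class": "data-item"})

print(value_at(data_items, 0))  # United Kingdom
print(value_at(data_items, 5))  # None (only three items in the sample)
```

The indices still have to match the order on the real page, so this only makes the failure quieter; it does not make the lookup independent of the page layout.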

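Indexing by position generally works, but if the site shows different data or orders the fields differently from one lookup to the next, we may want to make the code more robust. We can do that by using regular expressions to locate the Country, City, and Region fields. The solution is as follows: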

    import re

    # Country
    country = soup.find(string=re.compile('country', re.IGNORECASE)).parent.parent.find_all('span')[1].text.strip()

    # City
    city = soup.find(string=re.compile('city', re.IGNORECASE)).parent.parent.find_all('span')[1].text.strip()

    # Region
    region = soup.find(string=re.compile('region', re.IGNORECASE)).parent.parent.find_all('span')[1].text.strip()

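We search the HTML for the pattern 'country', 'city', or 'region'. From the matching text we go up two parents, which lands on the same div as the corresponding data_items entry in the previous code block, and then take the second span exactly as before.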
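The regex lookup can be wrapped in a small helper so that a missing field gives `None` rather than an `AttributeError` on `.parent`. This is only a sketch: `find_field` and `SAMPLE_HTML` are hypothetical names, and the markup is an assumption based on the structure described in this answer. On a real page the pattern may also match unrelated text (a title or a script, for example), so the first match is not guaranteed to be the label.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; the real page may differ.
SAMPLE_HTML = """
<div class="sidebar-data">
  <div class="data-item"><span class="bold">Country:</span> <span>United Kingdom</span></div>
  <div class="data-item"><span class="bold">Region:</span> <span>England</span></div>
  <div class="data-item"><span class="bold">City:</span> <span>Plymouth</span></div>
</div>
"""


def find_field(soup, label):
    """Return the value beside the first text node matching `label`, or None."""
    text_node = soup.find(string=re.compile(label, re.IGNORECASE))
    if text_node is None:
        return None  # no text on the page matches the label
    # text node -> label span -> enclosing data-item div
    container = text_node.parent.parent
    spans = container.find_all("span")
    if len(spans) < 2:
        return None  # matched text is not inside a label/value pair
    return spans[1].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(find_field(soup, "country"))  # United Kingdom
print(find_field(soup, "city"))     # Plymouth
print(find_field(soup, "postal"))   # None
```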

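【Comments】:

It gives me a list index out of range error for country. Also, you have the region assigned to country, so country is set twice, lol.
@LifeOfJona Try the second approach; you need to import re.

【Solution 2】:

This is easier to do with CSS selectors: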

    data_items = soup.select('div.sidebar-data div.data-item')
    targets = ['Country:', 'City:', 'Region:']
    for item in data_items:
        if item.select('span.bold')[0].text in targets:
            print(item.select('span.bold')[0].text, item.select('span')[1].text.strip())

Output:

    Country: United Kingdom
    Region: England
    City: Plymouth
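If the values are needed later rather than just printed, the same selectors can fill a dictionary. The following is a minimal sketch under assumptions: `select_fields` and `SAMPLE_HTML` are hypothetical, and the markup (including the extra `Postal code:` row) is invented to match the selectors used in this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the selectors in this answer; the
# "Postal code:" row is made up to show that non-targets are skipped.
SAMPLE_HTML = """
<div class="sidebar-data">
  <div class="data-item"><span class="bold">Country:</span> <span>United Kingdom</span></div>
  <div class="data-item"><span class="bold">Region:</span> <span>England</span></div>
  <div class="data-item"><span class="bold">City:</span> <span>Plymouth</span></div>
  <div class="data-item"><span class="bold">Postal code:</span> <span>PL1</span></div>
</div>
"""

TARGETS = {"Country:", "City:", "Region:"}


def select_fields(soup, targets=TARGETS):
    """Collect label -> value pairs for the wanted labels."""
    results = {}
    for item in soup.select("div.sidebar-data div.data-item"):
        label = item.select_one("span.bold")
        spans = item.select("span")
        if label is None or len(spans) < 2:
            continue  # skip rows without a bold label and a value span
        key = label.get_text(strip=True)
        if key in targets:
            results[key] = spans[1].get_text(strip=True)
    return results


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = select_fields(soup)
for label, value in results.items():
    print(label, value)
```

Using `select_one` and a length check avoids the `IndexError` that `[0]` and `[1]` would raise on a row with a different layout.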

【Comments】: