Using Beautiful Soup to get the span title attribute

[Question title]: Using Beautiful Soup to get span title attribute
[Posted]: 2018-06-29 09:22:52
[Question]:

I'm new to Python and Beautiful Soup, but I'm working on a web scraper that will pull data from this site:

http://yiimp.eu/site/tx?address=DFc6oo4CAemHF4KerLG39318E1KciTs742

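The page is very simple, basically just a table, so I just want to grab every field in the table. My problem is that for the first field I want the date stored in the span's title attribute rather than the value that is displayed. I can get a list of the span titles, or I can get the information from the other two fields, but I can't get the span title and the other two fields together. Here is an example of what I'm trying to produce: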

2018-01-20 03:37:00
3.90135252
8ece3baba44382eec3d62fa76b5beba98ae398f81ad2d77556b95c3c1a739b4f

Instead, the best I've managed so far is:

{'title': '2018-01-20 03:57:00'}
2h ago
{'title': '2018-01-20 03:57:00'}
3.90135252
{'title': '2018-01-20 03:57:00'}
8ece3baba44382eec3d62fa76b5beba98ae398f81ad2d77556b95c3c1a739b4f

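This is close, but unfortunately it repeats the title time, leaves the title tag in the output, and in fact just repeats the same date and time for every record. What is the best way to get the result I'm looking for?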

Here is my code:

import requests
import time
from bs4 import BeautifulSoup

theurl = "http://yiimp.eu/site/tx?address=DFc6oo4CAemHF4KerLG39318E1KciTs742"
thepage = requests.get(theurl, headers={'User-Agent':'MyAgent'})
soup = BeautifulSoup(thepage.text, "html.parser")


for table in soup.findAll('td'):
    print(table.text)
    for time in soup.findAll('span'):
        print(time.attrs)
        count =  1
        if count == 1:
            count ==0
            break


Tags: python html web-scraping beautifulsoup

[Solution 1]:

Try this to get the values from all rows:

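# Data rows carry class "ssrow"; 'style': None skips the styled totals row.
for row in soup.find_all('tr', {'class': 'ssrow', 'style': None}):
    time = row.find('span')['title']  # full timestamp from the span's title attribute
    amount = row.find('td', {'align': 'right'}).find('b').text  # bold value in the right-aligned cell
    tx = row.find('a').text  # transaction hash shown as the link text
    # Print these values however you want.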
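
One way to handle the values is to collect each row into a tuple before printing. The following is only a minimal sketch under stated assumptions: the sample HTML is a made-up stand-in shaped after the selectors above (with shortened placeholder hashes), not the real page, and `parse_rows` is a hypothetical helper name. It skips any row where an expected element is missing, because `find` returns `None` in that case and indexing it would raise an error.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the selectors used in this answer.
SAMPLE_HTML = """
<table>
  <tr><th>Time</th><th>Amount</th><th>Tx</th></tr>
  <tr class="ssrow">
    <td><span title="2018-01-20 06:56:43">2h ago</span></td>
    <td align="right"><b>4.42507599</b></td>
    <td><a href="#">d142445f</a></td>
  </tr>
  <tr class="ssrow">
    <td><span title="2018-01-20 03:57:00">5h ago</span></td>
    <td align="right"><b>3.90135252</b></td>
    <td><a href="#">8ece3bab</a></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return a list of (timestamp, amount, tx) tuples, one per data row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr", class_="ssrow"):
        span = row.find("span", title=True)  # only a span that has a title attribute
        amount_cell = row.find("td", align="right")
        bold = amount_cell.find("b") if amount_cell else None
        link = row.find("a")
        if span is None or bold is None or link is None:
            continue  # skip rows that lack any of the three fields
        records.append(
            (span["title"], bold.get_text(strip=True), link.get_text(strip=True))
        )
    return records


records = parse_rows(SAMPLE_HTML)
for timestamp, amount, tx in records:
    print(timestamp, amount, tx)
```

Because the span is looked up inside each row rather than on the whole document, every record gets its own timestamp instead of the first one on the page.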

Code to check the first row:

row = soup.find('tr', {'class': 'ssrow', 'style': None})
time = row.find('span')['title']
amount = row.find('td', {'align': 'right'}).find('b').text
tx = row.find('a').text
print(time, amount, tx)

Output:

2018-01-20 06:56:43 4.42507599 d142445fd36e6a141a18071110faa8f6f3f9f8a42de888a149d8aa9416fe83ce

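Explanation:

All rows are contained in <tr> tags, but the first <tr> tag is the header. To filter it out, I added the attribute 'class': 'ssrow', since all the other rows have that attribute. However, if you look at the last row, its <tr> tag holds the total and carries style="border-top: 2px solid #eee;". To filter that row out, I added 'style': None.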
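
The effect of each filter can be checked on a small table. This is only an illustrative sketch: the markup below is an assumed stand-in with one header row, two data rows, and one styled totals row, not the real page. It also swaps in an explicit `has_attr('style')` check in place of `'style': None`, which is meant to select the same rows while making the "no style attribute" condition visible in the code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row, two data rows, one styled totals row.
SAMPLE_HTML = """
<table>
  <tr><th>Time</th><th>Amount</th></tr>
  <tr class="ssrow"><td><span title="2018-01-20 06:56:43">2h ago</span></td></tr>
  <tr class="ssrow"><td><span title="2018-01-20 03:57:00">5h ago</span></td></tr>
  <tr class="ssrow" style="border-top: 2px solid #eee;"><td>Total</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

all_rows = soup.find_all("tr")                # header + data rows + totals row
ssrows = soup.find_all("tr", class_="ssrow")  # the class filter drops the header row
# Keeping only rows without a style attribute drops the totals row.
data_rows = [tr for tr in ssrows if not tr.has_attr("style")]

titles = [tr.find("span")["title"] for tr in data_rows]
print(len(all_rows), len(ssrows), len(data_rows))
print(titles)
```

Each step removes exactly one kind of unwanted row, so only the data rows reach the `span` lookup.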
