BeautifulSoup: Scraping different data sets having same set of attributes in the source code

【Question title】: BeautifulSoup: Scraping different data sets having same set of attributes in the source code
【Posted】: 2015-09-22 10:00:03
【Question description】:

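I am using the BeautifulSoup module to get the total number of followers and the total number of tweets from a Twitter account. However, when I inspected the elements for these fields on the web page, I found that both fields are enclosed in the same set of HTML attributes:

Followers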

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
          <span class="ProfileNav-label">Followers</span>
          <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>

Tweet count

    <a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
                <span class="ProfileNav-label">Tweets</span>
                <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>

The scraping script I wrote:

import requests
import urllib2
from bs4 import BeautifulSoup

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
followers = ''
for e in res.findAll('span', {'data-is-compact':'true'}):
    followers = e.text

print followers

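However, since both the tweet count and the follower count are enclosed in the same set of HTML attributes, i.e. inside a span tag with class = "ProfileNav-value" and data-is-compact = "true", running the script above returns only the follower count.

How can I use BeautifulSoup to extract two pieces of information that are enclosed in the same HTML attributes?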
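
The behaviour described above can be reproduced offline. The following is a minimal sketch, not the asker's script: it uses Python 3 and a trimmed copy of the markup quoted in the question (the name `SAMPLE_HTML` is made up for illustration, and the tweets-then-followers order is an assumption taken from the answer below). It suggests that `find_all` does return both spans, and that only the last one survives because the loop reassigns a single variable.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question. The order
# (tweets first, followers second) is an assumption for this sketch.
SAMPLE_HTML = """
<a class="ProfileNav-stat" data-nav="tweets" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
<a class="ProfileNav-stat" data-nav="followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag, so both values are available.
values = [span.get_text(strip=True)
          for span in soup.find_all("span", {"data-is-compact": "true"})]
print(values)  # ['21.8K', '2.47M']

# Reassigning one variable inside the loop keeps only the last match.
last_only = ""
for span in soup.find_all("span", {"data-is-compact": "true"}):
    last_only = span.text
print(last_only)  # 2.47M
```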

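【Question comments】:

As a side note, scraping sites such as Twitter usually violates their terms of service. You would probably be better off using their API.
@Craicerjack Well, to be honest, this is a general question. What would one do in a similar situation when scraping information from a website?

Tags: python python-2.7 web-scraping beautifulsoup python-requests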

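【Solution 1】:

In this case, one way to do it is to confirm that data-is-compact="true" appears only twice, once for each piece of data you want to extract. You also know that tweets comes first and followers second, so you can list those labels in the same order and use zip to pair each label with its value and print both together, for example: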

import urllib2
from bs4 import BeautifulSoup

# Labels listed in the same order the values appear on the page
profile = ['Tweets', 'Followers']

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src, 'html.parser')  # name the parser explicitly
# Pair each label with the matching span by position
for p, d in zip(profile, res.find_all('span', {'data-is-compact': 'true'})):
    print p, d.text

It yields:

Tweets 21,8K
Followers 2,47M
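
The positional pairing above depends on the order of the spans on the page. As an alternative that is not part of this answer, the markup quoted in the question also carries a `data-nav` attribute on each parent `<a>` tag, which could be used as the key instead. The following is a minimal Python 3 sketch under that assumption: `SAMPLE_HTML` and `profile_stats` are hypothetical names, and the markup is a trimmed copy of the question's snippet, so it may not match what the live site serves today.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (assumed structure).
SAMPLE_HTML = """
<a class="ProfileNav-stat" data-nav="tweets" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
<a class="ProfileNav-stat" data-nav="followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
"""


def profile_stats(html):
    """Map each data-nav name to (compact value, full tooltip text)."""
    soup = BeautifulSoup(html, "html.parser")
    stats = {}
    # attrs={"data-nav": True} matches any <a> that has the attribute.
    for link in soup.find_all("a", attrs={"data-nav": True}):
        value = link.find("span", class_="ProfileNav-value")
        if value is None:
            continue  # skip nav links that have no value span
        stats[link["data-nav"]] = (value.get_text(strip=True),
                                   link.get("data-original-title"))
    return stats


stats = profile_stats(SAMPLE_HTML)
print(stats["followers"][0])  # 2.47M
print(stats["tweets"][1])     # 21,769 Tweets
```

Keying on the parent attribute means the result does not change if the page reorders the counters or adds another one, and the `data-original-title` text gives the unabbreviated figure.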

【Discussion】: