【Question Title】: Scraping table content with BeautifulSoup 4
【Posted】: 2020-09-06 04:45:35
【Question Description】:

I am trying to scrape the "TWITTER STATS Summary" table from this page.

Here is my code:

import urllib2
from bs4 import BeautifulSoup

rank_page = 'https://socialblade.com/twitter/user/bill%20gates'
request = urllib2.Request(rank_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'})
page = urllib2.urlopen(request)
soup = BeautifulSoup(page, 'html.parser')

channels = soup.find('div', attrs={'id': 'socialblade-user-content'}).find_all('div', recursive=False)[10:]

for row in channels:
    date = row.find('div', attrs={'style': 'width: 80px; float: left;'})
    print date

But I get None in the terminal. I only want to get the dates from the table (DATE FOLLOWERS FOLLOWING MEDIA). I know how to go on from there and save them to Excel, but I am struggling to find the right divs and text. Thanks for your help.

【Comments】:

  • Do you need Python 2?
  • You should upgrade.
  • That's not always possible, so the question stands: do you need Python 2?
  • Python 2 is fine, as long as it works.

Tags: python web-scraping beautifulsoup


【Solution 1】:

Using Python 3 and the requests library:

import requests
from bs4 import BeautifulSoup

rank_page = 'https://socialblade.com/twitter/user/bill%20gates'
r = requests.get(rank_page, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'})
soup = BeautifulSoup(r.content, 'html.parser')

# Select the div that immediately follows the header div (the one whose
# direct child div contains the text "Date"):
d = soup.select_one('div:has(>div:contains("Date")) + div')

all_data = []
for div in d.find_all('div', recursive=False):
    # Join each row's cell texts with '|' and split back into fields;
    # keep only complete 8-column rows:
    row = div.get_text(strip=True, separator='|').split('|')
    if len(row) == 8:
        all_data.append(row)

#pretty print to screen:
print(('{:<20}'*8).format('Date', 'Day', 'Followers(chng)', 'Followers', 'Following(chng)', 'Following', 'Media(chng)', 'Media'))
for row in all_data:
    print(('{:<20}'*8).format(*row))

Prints:

Date                Day                 Followers(chng)     Followers           Following(chng)     Following           Media(chng)         Media               
2020-05-06          Wed                 --                  50,310,276          --                  218                 --                  3,309               
2020-05-07          Thu                 +20,293             50,330,569          --                  218                 --                  3,309               
2020-05-08          Fri                 +17,884             50,348,453          --                  218                 +1                  3,310               
2020-05-09          Sat                 +21,294             50,369,747          --                  218                 --                  3,310               
2020-05-10          Sun                 +19,186             50,388,933          --                  218                 --                  3,310               
2020-05-11          Mon                 +19,892             50,408,825          --                  218                 --                  3,310               
2020-05-12          Tue                 +16,876             50,425,701          --                  218                 --                  3,310               
2020-05-13          Wed                 +18,973             50,444,674          --                  218                 +1                  3,311               
2020-05-14          Thu                 +16,764             50,461,438          --                  218                 --                  3,311               
2020-05-15          Fri                 +16,554             50,477,992          --                  218                 +1                  3,312               
2020-05-16          Sat                 +17,031             50,495,023          --                  218                 --                  3,312               
2020-05-17          Sun                 +14,046             50,509,069          --                  218                 --                  3,312               
2020-05-18          Mon                 +14,394             50,523,463          --                  218                 --                  3,312               
2020-05-19          Tue                 +9,208              50,532,671          --                  218                 +1                  3,313               
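
The selector trick above can be exercised against a static snippet, so the parsing logic is testable without hitting the live site. The HTML below is a simplified mock of the page's layout (my assumption, not SocialBlade's real markup):

```python
from bs4 import BeautifulSoup

# Simplified mock: a header row div followed by a sibling div of data rows.
html = """
<div id="socialblade-user-content">
  <div><div>Date</div><div>Followers</div></div>
  <div>
    <div><div>2020-05-06</div><div>Wed</div><div>--</div><div>50,310,276</div>
         <div>--</div><div>218</div><div>--</div><div>3,309</div></div>
    <div><div>2020-05-07</div><div>Thu</div><div>+20,293</div><div>50,330,569</div>
         <div>--</div><div>218</div><div>--</div><div>3,309</div></div>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'div:has(>div:contains("Date"))' matches the header div; '+ div' then
# takes its next sibling, which holds the data rows.
d = soup.select_one('div:has(>div:contains("Date")) + div')

rows = []
for div in d.find_all('div', recursive=False):
    row = div.get_text(strip=True, separator='|').split('|')
    if len(row) == 8:
        rows.append(row)

print(rows[0][0])  # first row's date
```

Note that `:contains()` is a non-standard SoupSieve extension (newer SoupSieve versions prefer the spelling `:-soup-contains()`), which is why this works in BeautifulSoup but not in a browser.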

Edit (saving to a CSV file):

#saving to csv:
import csv

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)

This produces the file output.csv (screenshot from LibreOffice):
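
Since the pretty-print already defines column names, a small variant (my addition, not part of the original answer) can write that header row before the data, so the CSV opens with labeled columns. The sample rows below are stand-ins for the scraped `all_data`:

```python
import csv

header = ['Date', 'Day', 'Followers(chng)', 'Followers',
          'Following(chng)', 'Following', 'Media(chng)', 'Media']
# Stand-in rows in the same 8-column shape as the scraped data:
all_data = [
    ['2020-05-06', 'Wed', '--', '50,310,276', '--', '218', '--', '3,309'],
    ['2020-05-07', 'Thu', '+20,293', '50,330,569', '--', '218', '--', '3,309'],
]

# Open in text mode with newline='' (Python 3). Opening the file in 'wb'
# mode is what triggers "a bytes-like object is required, not 'str'".
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerows(all_data)
```

QUOTE_MINIMAL quotes only the fields that need it, so values containing commas such as "50,310,276" come out as `"50,310,276"` and stay in one column.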

【Discussion】:

  • +1 for using CSS selectors.
  • Thanks a lot. How do I save this to a CSV file? I just tried and got `a bytes-like object is required, not 'str'`.