【Question title】: BeautifulSoup scraping wunderground data
【Posted】: 2015-12-03 14:10:14
【Question】:

I am learning how to scrape data using Nathan Yau's book "Visualize This". I am trying to scrape Wunderground data for 2009, but I get the error below. It says the list index is out of range, and I don't understand why.

 Traceback (most recent call last):
   File "get-weather-data.py", line 24, in <module>
     dayTemp = soup.findAll(attrs={"class":"nobr"})[5].span.string
 IndexError: list index out of range     


import urllib2
from bs4 import BeautifulSoup

# Create/open a file called wunder-data.txt (which will be a comma-delimited file)
f = open('wunder-data.txt', 'w')

# Iterate through months and day
for m in range(1, 13):
  for d in range(1, 32):

    # Check if already gone through month
    if (m == 2 and d > 28):
      break
    elif (m in [4, 6, 9, 11] and d > 30):
      break

    # Open wunderground.com url
    url = "http://www.wunderground.com/history/airport/KBUF/2009/" + str(m) + "/" + str(d) + "/DailyHistory.html"
    page = urllib2.urlopen(url)

    # Get temperature from page
    soup = BeautifulSoup(page)
    # dayTemp = soup.body.nobr.b.string
    dayTemp = soup.findAll(attrs={"class":"nobr"})[5].span.string

    # Format month for timestamp
    if len(str(m)) < 2:
      mStamp = '0' + str(m)
    else:
      mStamp = str(m)

    # Format day for timestamp
    if len(str(d)) < 2:
      dStamp = '0' + str(d)
    else:
      dStamp = str(d)

    # Build timestamp
    timestamp = '2009' + mStamp + dStamp

    # Write timestamp and temperature to file
    f.write(timestamp + ',' + dayTemp + '\n')

# Done getting data! Close file.
f.close()

【Comments】:

  • Where does this nobr class come from, and which temperature are you trying to extract?

Tags: python web-scraping beautifulsoup html-parsing screen-scraping


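【Solution 1】:

There are no elements with class="nobr" on the daily weather history page, so findAll(attrs={"class":"nobr"}) returns an empty list and indexing it with [5] raises the IndexError.

If you want the actual mean temperature, I would locate it like this: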
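To make the failure mode concrete, here is a minimal sketch, assuming Python 3 with bs4 installed. The HTML snippet, the wx-value class name, and the nth_match helper are all made up for illustration and are not taken from the real Wunderground page. It shows that a class lookup with no matches yields an empty list, and how a length check avoids the IndexError:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page that has no class="nobr" elements.
SAMPLE_HTML = "<html><body><span class='wx-value'>14</span></body></html>"


def nth_match(soup, css_class, index):
    """Return the index-th element with the given class, or None if absent."""
    matches = soup.find_all(attrs={"class": css_class})
    if index >= len(matches):
        # Indexing here would raise IndexError, as in the traceback above.
        return None
    return matches[index]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No element carries class="nobr", so the result list is empty.
print(len(soup.find_all(attrs={"class": "nobr"})))

# The guarded lookup returns None instead of raising.
print(nth_match(soup, "nobr", 5))
```

Returning None is only one possible choice; raising a descriptive exception that includes the URL may be more useful when scraping many pages.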

dayTemp = soup.find("span", text="Mean Temperature").parent.find_next_sibling("td").get_text(strip=True)

If you print it together with m and d, the output will be:

1 1 14°F
1 2 28°F
1 3 19°F
...
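
The lookup above can be tried offline with a small sketch, assuming Python 3 with bs4 installed. The table markup below is an invented approximation of a label cell followed by a value cell, not a copy of the real page, and mean_temperature is a hypothetical helper name. It uses string=, the newer spelling of the text= argument in bs4, and returns None when the label is missing rather than failing on .parent:

```python
from bs4 import BeautifulSoup

# Invented markup: a label cell followed by a value cell, as the lookup assumes.
SAMPLE_HTML = """
<table>
  <tr>
    <td><span>Mean Temperature</span></td>
    <td><span>14</span><span>&deg;F</span></td>
  </tr>
  <tr>
    <td><span>Max Temperature</span></td>
    <td><span>21</span><span>&deg;F</span></td>
  </tr>
</table>
"""


def mean_temperature(html):
    """Return the text of the cell next to the 'Mean Temperature' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find("span", string="Mean Temperature")
    if label is None:
        return None
    # The label's parent is its <td>; the value sits in the next <td> sibling.
    value_cell = label.parent.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


print(mean_temperature(SAMPLE_HTML))
print(mean_temperature("<p>no table here</p>"))
```

Anchoring on the visible label text rather than on a positional index such as [5] tends to survive layout changes better, although it still breaks if the site renames the label.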
