【Question title】: Looping through scraped data and outputting the result
【Posted】: 2016-02-26 13:25:28
【Question】:

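I am trying to scrape the BBC football results site for teams, shots, goals, cards and incidents. I currently pass 3 teams' data into the URL.

I am writing the script in Python with the Beautiful Soup bs4 package. When the results are output to the screen, the first team is printed, then the first and second teams, then the first, second and third teams. So the first team is effectively printed 3 times, when I am trying to get each of the 3 teams printed just once.

Once I have solved this, I will write the results to a file. I am adding the team data to dataframes and then to a list (I'm not sure this is the best way to do it). I'm sure it has something to do with the for loops, but I'm not sure how to fix it. Code: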

from bs4 import BeautifulSoup
import urllib2
import pandas as pd


out_list = []
for numb in('EFBO839787', 'EFBO839786', 'EFBO815155'):

url = 'http://www.bbc.co.uk/sport/football/result/partial/' + numb + '?teamview=false'
teams_list = []
inner_page = urllib2.urlopen(url).read()
soupb = BeautifulSoup(inner_page, 'lxml')

for report in soupb.find_all('td', 'match-details'):
            home_tag = report.find('span', class_='team-home')
            home_team = home_tag and ''.join(home_tag.stripped_strings)

            score_tag = report.find('span', class_='score')
            score = score_tag and ''.join(score_tag.stripped_strings)

            shots_tag = report.find('span', class_='shots-on-target')
            shots = shots_tag and ''.join(shots_tag.stripped_strings)

            away_tag = report.find('span', class_='team-away')
            away_team = away_tag and ''.join(away_tag.stripped_strings)

            df = pd.DataFrame({'away_team' : [away_team], 'home_team' : [home_team], 'score' : [score],  })
            out_list.append(df)

for shots in soupb.find_all('td', class_='shots'):

              home_shots_tag = shots.find('span',class_='goal-count-home')
              home_shots = home_shots_tag and ''.join(home_shots_tag.stripped_strings)

              away_shots_tag = shots.find('span',class_='goal-count-away')
              away_shots = away_shots_tag and ''.join(away_shots_tag.stripped_strings)

              dfb = pd.DataFrame({'home_shots': [home_shots], 'away_shots' : [away_shots] })
              out_list.append(dfb)

for incidents in soupb.find("table", class_="incidents-table").find("tbody").find_all("tr"):

                   home_inc_tag = incidents.find("td", class_="incident-player-home")
                   home_inc = home_inc_tag and ''.join(home_inc_tag.stripped_strings)

                   type_inc_goal_tag = incidents.find("td", "span", class_="incident-type goal")
                   type_inc_goal = type_inc_goal_tag and ''.join(type_inc_goal_tag.stripped_strings)

                   type_inc_tag = incidents.find("td", class_="incident-type")
                   type_inc = type_inc_tag and ''.join(type_inc_tag.stripped_strings)

                   time_inc_tag = incidents.find('td', class_='incident-time')
                   time_inc = time_inc_tag and ''.join(time_inc_tag.stripped_strings)

                   away_inc_tag = incidents.find('td', class_='incident-player-away')
                   away_inc = away_inc_tag and ''.join(away_inc_tag.stripped_strings)

                   df_incidents = pd.DataFrame({'home_player' : [home_inc],'event_type' : [type_inc_goal],'event_time': [time_inc],'away_player' : [away_inc]})

                   out_list.append(df_incidents)


print "end"

print out_list

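I'm new to Python and Stack Overflow, so any advice on formatting my question would also be useful.

Thanks in advance!

【Comments】:

  • Your indentation is off, so the loops don't line up properly. Please fix it. I'd also suggest reading PEP8.
  • This looks like a printing problem. At what indentation level are you printing out_list? It should be at zero indentation, all the way to the left of your code. Either that, or you want to move out_list into the topmost for loop so that it is reassigned after each iteration.
  • Thanks @ffledgling, that was the problem. I'm new to Python and didn't understand how that works. Thanks
  • I've added this as an answer; please accept it if it worked for you.

Tags: python web-scraping beautifulsoup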


【Solution 1】:

These 3 for loops should be inside your main for loop.

import urllib.request  # Python 3; the question's Python 2 code uses urllib2 instead

from bs4 import BeautifulSoup

out_list = []
for numb in ('EFBO839787', 'EFBO839786', 'EFBO815155'):
    url = 'http://www.bbc.co.uk/sport/football/result/partial/' + numb + '?teamview=false'
    teams_list = []
    inner_page = urllib.request.urlopen(url).read()
    soupb = BeautifulSoup(inner_page, 'lxml')

    # All three loops are indented under the main loop, so each runs once per page.
    for report in soupb.find_all('td', 'match-details'):
        # your code as it is

    for shots in soupb.find_all('td', class_='shots'):
        # your code as it is

    for incidents in soupb.find("table", class_="incidents-table").find("tbody").find_all("tr"):
        # your code as it is

This works fine: each team is shown only once.

Here is the output of the first for loop:

[{'score': ['1-3'], 'away_team': ['Man City'], 'home_team': ['Dynamo Kiev']}, 
{'score': ['1-0'], 'away_team': ['Zenit St P'], 'home_team': ['Benfica']}, 
{'score': ['1-2'], 'away_team': ['Boston United'], 'home_team': ['Bradford Park Avenue']}]
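
The question also asks whether wrapping every row in its own DataFrame is the best approach, and mentions writing the results to a file later. One possible alternative, offered only as a sketch, is to collect each row as a plain dict and serialise the whole list once at the end. The example below is written for Python 3 and uses only the standard library; the helper name `rows_to_csv`, the column order, and the use of plain strings instead of one-element lists are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

# Rows shaped like the output above, but with plain strings as values.
rows = [
    {'score': '1-3', 'away_team': 'Man City', 'home_team': 'Dynamo Kiev'},
    {'score': '1-0', 'away_team': 'Zenit St P', 'home_team': 'Benfica'},
    {'score': '1-2', 'away_team': 'Boston United', 'home_team': 'Bradford Park Avenue'},
]


def rows_to_csv(rows, fieldnames=('home_team', 'away_team', 'score')):
    """Serialise a list of row dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To write to disk instead, the same `csv.DictWriter` calls can target a file opened with `open(path, 'w', newline='')`. If pandas is preferred, a list of dicts like `rows` can also be passed to `pd.DataFrame` in one call rather than building one DataFrame per row.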

【Discussion】:

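【Solution 2】:

This looks like a printing problem. At what indentation level are you printing out_list?

It should be at zero indentation, all the way to the left of your code.

Either that, or you want to move out_list into the topmost for loop so that it is reassigned after each iteration.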
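
The effect of each placement can be seen without touching the network. The sketch below is written for Python 3 and replaces the real scraping with a hypothetical `scrape` function backed by a hard-coded `FAKE_PAGES` dict; the pairing of match IDs with results is assumed for illustration. It records what would be printed in three cases: printing inside the loop, printing once after the loop, and re-creating `out_list` on every iteration.

```python
# Hypothetical stand-in for the scraped pages: match id -> (home, away, score).
FAKE_PAGES = {
    'EFBO839787': ('Dynamo Kiev', 'Man City', '1-3'),
    'EFBO839786': ('Benfica', 'Zenit St P', '1-0'),
    'EFBO815155': ('Bradford Park Avenue', 'Boston United', '1-2'),
}
MATCH_IDS = ('EFBO839787', 'EFBO839786', 'EFBO815155')


def scrape(match_id):
    """Pretend to scrape one page and return a single result row."""
    home, away, score = FAKE_PAGES[match_id]
    return {'home_team': home, 'away_team': away, 'score': score}


def print_inside_loop(match_ids):
    """Reproduces the symptom: the growing list is shown on every pass."""
    out_list = []
    printed = []
    for match_id in match_ids:
        out_list.append(scrape(match_id))
        printed.append(list(out_list))  # stands in for print(out_list) in the loop
    return printed


def print_after_loop(match_ids):
    """First fix: accumulate everything, then print once at zero indentation."""
    out_list = []
    for match_id in match_ids:
        out_list.append(scrape(match_id))
    return out_list  # stands in for a single print(out_list) after the loop


def reset_each_iteration(match_ids):
    """Second fix: re-create out_list inside the loop and print per page."""
    printed = []
    for match_id in match_ids:
        out_list = []  # reassigned on every iteration
        out_list.append(scrape(match_id))
        printed.append(out_list)
    return printed


for snapshot in print_inside_loop(MATCH_IDS):
    print([row['home_team'] for row in snapshot])
```

With the print inside the loop, the snapshots contain 1, then 2, then 3 rows, which matches the repeated first team described in the question. The other two variants show each team exactly once.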

【Discussion】:
