【Question Title】: Python scraping data online, but the csv file doesn't show the correct format of data
【Posted】: 2019-01-02 04:09:34
【Question Description】:

I'm trying to do a small data-scraping job because I want to do some data analysis. The data comes from foxsports; the URL links are included in the code. The steps are explained in the comments. If possible, you can paste and run it.

For the data, I want to step through the pages for the 2013-2018 seasons and scrape all the data in the table on each page. So my code is here:

import requests
from lxml import html
import csv

# Set up the urls for Bayern Muenchen's Team Stats starting from the 2013-14 Season
# up to the 2017-18 Season
# The data is stored on the foxsports website
urls = ["https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2013&category=STANDARD", 
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2014&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2015&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2016&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2017&category=STANDARD"
]

seasons = ["2013/2014","2014/2015", "2015/2016", "2016/2017", "2017/2018"]

data = ["Season", "Team", "Name", "Games_Played", "Games_Started", "Minutes_Played", "Goals", "Assists", "Shots_On_Goal", "Shots", "Yellow_Cards", "Red_Cards"]

csvFile = "bayern_munich_team_stats_2013_18.csv"
# Having set up the dataframe and urls for various season standard stats, we
# are going to examine the xpath of the same player Lewandowski's same data feature
# for various pages (namely the different season pages)
# See if we can find some pattern

# 2017-18 Season Name xpath:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# 2016-17 Season Name xpath:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# 2015-16 Season Name xpath:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]

# tr xpath 17-18:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# tr xpath 16-17:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# tr xpath 15-16:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]

# For a single season's team stats, the tbody and tr relationship is like:
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[2]

# lewandowski
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# Wagner
#   //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[2]/td[1]/div/a/span[1]
# ********
# for each row of player names, the name lives under tr[num]; num += 1 moves
# to the name in the next row.
# ********


i = 0
for url in urls:
    print(url)
    response = requests.get(url)
    result = html.fromstring(response.content)
    j = 1
    for tr in result.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr'):
        # Except for season and team, we open foxsports webpage for the given team, here
        # Bayern Munich, and the given season, here starting from 13-14, and use F12 to
        # view page elements, look for tbody of the figure table, then copy the corresponding
        # xpath to here. Adjust the xpath as described above.

        season = seasons[i] # seasons[i] changes with i, but stays the same for each season
        data.append(season)
        team = ["FC BAYERN MUNICH"] # this doesn't change since we are extracting solely Bayern
        data.append(team)
        name =  tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[1]/div/a/span[1]' %j )
        data.append(name)
        gamep = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[2]' %j )
        data.append(gamep)
        games = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[3]' %j )
        data.append(games)
        mp =    tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[4]' %j )
        data.append(mp)
        goals = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[5]' %j )
        data.append(goals)
        assists = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[6]' %j )
        data.append(assists)
        shots_on_goal = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[7]' %j )
        data.append(shots_on_goal)
        shots = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[8]' %j )
        data.append(shots)
        yellow = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[9]' %j )
        data.append(yellow)
        red=    tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[10]' %j )
        data.append(red)
        # update j for next row of player
        j += 1
    # update i
    i += 1


with open(csvFile, "w") as file:
    writer = csv.writer(file)
    writer.writerow(data)

print("Done")

I tried using data.extend([season, name, team, ...]) but the result was still the same, so I just appended everything as shown here. The content of the csv file is not what I expected, as you can see in the picture:

I'm not really sure where it went wrong; the cells show results like "<Element span at 0x#####>", and I'm still new to programming. I would appreciate it if someone could help me with this so I can continue this small project, which is just for educational purposes. Thank you very much for your time and help!

【Question Discussion】:

  • writer.writerow(data) writes all of your data into a single row
  • You should call writer.writerow(data) inside the for loop
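The suggestion in the comments can be sketched as follows. This is a minimal sketch, not the asker's full scraper: the player rows are hard-coded stand-ins for what the tr.xpath(...) calls would produce.

```python
import csv

header = ["Season", "Name", "Goals"]
# each scraped player becomes one list; in the real scraper these
# values would come from the tr.xpath(...) calls
rows = [
    ["2013/2014", "Lewandowski", 20],
    ["2013/2014", "Wagner", 8],
]

with open("stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)        # header written once, before the loop
    for row in rows:
        writer.writerow(row)       # one writerow call per player row
```

Written this way, each player lands on its own csv line instead of everything piling into one giant row.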

Tags: python web-scraping screen-scraping


【Solution 1】:

Here is what you can do.

I have done it this way before:

import csv

with open(output_file, 'w', newline='') as csvfile:
    field_names = ['f6s_profile', 'linkedin_profile', 'Name', 'job_type', 'Status']
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    # the first writerow acts as the header row
    writer.writerow(
        {'f6s_profile': 'Profile', 'linkedin_profile': 'LinkedIn Profile',
         'Name': 'Name', 'job_type': 'Job Type', 'Status': 'Status'})

    for raw in data2:
        data = []
        # get your data using selenium
        # data.append(...)
        writer.writerow(
            {'f6s_profile': data[0], 'linkedin_profile': data[1],
             'Name': name_person, 'job_type': data[2], 'Status': status})

The first writer.writerow will be your header; field_names is used only as the set of keys that puts each value into its particular column.
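For the header specifically, csv.DictWriter also provides writeheader(), which writes the fieldnames directly, so you don't need to build the header dict by hand. A small self-contained sketch (writing to an in-memory buffer instead of a file):

```python
import csv
import io

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Name", "Goals"])
writer.writeheader()                              # writes "Name,Goals"
writer.writerow({"Name": "Lewandowski", "Goals": 20})
print(buf.getvalue())
```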

To get the value out of [<Element td at 0x151ca980638>], you can use data.append(name.text)

You can also take the text right after your xpath; note that xpath() returns a list, so index into it first:

name =  tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[1]/div/a/span[1]' %j )[0].text
data.append(name)
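The same idea on a tiny self-contained row, using the standard-library xml.etree here instead of lxml so it runs without extra installs. The .text attribute on an element gives its text content; with lxml the difference is that tr.xpath(...) returns a list, so you index it first (e.g. tr.xpath(...)[0].text) or append '/text()' to the expression.

```python
import xml.etree.ElementTree as ET

# a stand-in for one <tr> from the stats table
row = ET.fromstring(
    "<tr><td><div><a><span>Lewandowski</span></a></div></td><td>30</td></tr>"
)
name = row.find("./td[1]/div/a/span").text   # text inside the span element
games = row.find("./td[2]").text             # text inside the second cell
print(name, games)
```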

【Discussion】:

  • I can't provide the complete code, but this is how you should use writerow
  • @Nihal I know, I meant your code here is fine, don't worry. Thanks!
  • Hi Nihal, the approach worked! But one more question: I'm getting all the data cells as [] elements. Do you know why that happens? Is there something I should do when using xpath()? Thanks
  • You can use data.append(name.text)
  • OK, thanks! I'll give it a try. Sorry, my laptop went down... using my phone now..