【Question title】: How do I create new columns by combining data in existing columns?
【Posted】: 2020-01-24 16:25:13
【Question description】:

I have a dataset with 5 columns (please excuse the formatting):

id         Price  Service  Rater Name  Cleanliness
401013357  5      3        A           1
401014972  2      1        A           5
401022510  3      4        B           2
401022510  5      1        C           9
401022510  3      1        D           4
401022510  2      2        E           2

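I want only one row per ID. So I need a separate column for each combination of rater name and rating category (e.g. <rater name>_Price, <rater name>_Service, <rater name>_Cleanliness).

I have explored groupby, but I can't figure out how to turn the groups into new columns. Thanks!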

Here's the code and data I'm actually using:

import requests
from pandas import DataFrame
import pandas as pd


linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)

dflines = DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)

a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format 
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread','formattedSpread', 'overUnder'):
           game_dict[cat] = line[cat]
        buf.append(game_dict)

dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])

From here, I get

                              formattedSpread  overUnder  spread
id        provider                                                
401013357 consensus                   UMass -21       68.0   -21.0
401014972 consensus                  Rice -22.5       58.5   -22.5
401022510 Caesars          Colorado State -17.5       57.5   -17.5
          consensus          Colorado State -17       57.5   -17.0
          numberfire         Colorado State -17       58.5   -17.0
          teamrankings       Colorado State -17       58.0   -17.0
401013437 numberfire                 Wyoming -5       47.0     5.0
          teamrankings               Wyoming -5       47.0     5.0
401020671 consensus            Ball State -19.5       61.5   -19.5
401019470 Caesars                     UCF -22.5        NaN    22.5
          consensus                   UCF -22.5        NaN    22.5
          numberfire                    UCF -24       70.0    24.0
          teamrankings                  UCF -24       70.0    24.0
401013328 numberfire            Minnesota -21.5       47.0   -21.5
          teamrankings          Minnesota -21.5       49.0   -21.5

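The result I'm looking for is three columns for each of the 4 different providers, i.e. caesars_formattedSpread, caesars_overUnder, caesars_spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, and so on.

When I run unstack as suggested, I don't get the result I expect. Instead, I get: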

formattedSpread  0                  UMass -21
                 1                 Rice -22.5
                 2       Colorado State -17.5
                 3         Colorado State -17
                 4         Colorado State -17
                 5         Colorado State -17
                 6                 Wyoming -5
                 7                 Wyoming -5
                 8           Ball State -19.5
                 9                  UCF -22.5
                 10                 UCF -22.5
                 11                   UCF -24
                 12                   UCF -24

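【Comments】:

  • What is your expected output?
  • What have you tried, and what result did you expect? Could you please provide more information?
  • @WeNYoBen - see the edit.
  • @SimonFink As noted above, I made major edits. I was probably trying to oversimplify.
  • Late to the party, but what is the expected output? I only see the "incorrect" output.

Tags: python pandas group-by pandas-groupby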


【Solution 1】:

* Edited, based on the edited question *

Assuming your DataFrame is df:

df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
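
As a minimal sketch of those two lines applied to the sample data from the question (the DataFrame is rebuilt by hand here, and the variable name wide and the "{rater}_{category}" naming are illustrative choices, not something the question prescribes):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [401013357, 401014972, 401022510, 401022510, 401022510, 401022510],
    "Price": [5, 2, 3, 5, 3, 2],
    "Service": [3, 1, 4, 1, 1, 2],
    "Rater Name": ["A", "A", "B", "C", "D", "E"],
    "Cleanliness": [1, 5, 2, 9, 4, 2],
})

# One row per id; the rater names move from the index into the columns
wide = df.set_index(["id", "Rater Name"]).unstack()

# The columns are now (category, rater) pairs; flatten them to single names
wide.columns = [f"{rater}_{category}" for category, rater in wide.columns]

print(wide.shape)
print(wide.loc[401022510, ["B_Price", "C_Price", "D_Price", "E_Price"]])
```

Raters that did not rate a given id end up as NaN in that row, which is why the integer ratings are shown as floats after unstacking.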

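The problem with your edited code is that you never assign the result of dflinestable.set_index(['id', 'provider']) to anything. By default set_index returns a new DataFrame and leaves the original untouched, so when you then call dflinestable.unstack(), you are unstacking the original dflinestable, which still has its default integer index.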
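
A small sketch of the difference, using three rows typed in by hand from the table in the question instead of the live API response (the names table, unstacked_wrong, indexed and unstacked_right are made up for illustration):

```python
import pandas as pd

rows = [
    {"id": 401022510, "provider": "Caesars", "spread": -17.5, "overUnder": 57.5},
    {"id": 401022510, "provider": "consensus", "spread": -17.0, "overUnder": 57.5},
    {"id": 401013357, "provider": "consensus", "spread": -21.0, "overUnder": 68.0},
]
table = pd.DataFrame(rows)

# The returned DataFrame is thrown away, so table keeps its default index
table.set_index(["id", "provider"])
print(type(table.index).__name__)       # RangeIndex

# Unstacking a frame without a MultiIndex yields one long Series,
# which is the shape of the unexpected output in the question
unstacked_wrong = table.unstack()
print(type(unstacked_wrong).__name__)   # Series

# Keeping the result (or passing inplace=True) gives the wide DataFrame
indexed = table.set_index(["id", "provider"])
unstacked_right = indexed.unstack()
print(type(unstacked_right).__name__)   # DataFrame
```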

So your entire code should be:

import requests
import pandas as pd


linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)

dflines = pd.DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)

a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format 
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread','formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)

dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True) # Add inplace=True
dflinestable_unstacked = dflinestable.unstack() # unstack (you could also reassign to the same df)

# Flatten columns to single level, in the order as described
dflinestable_unstacked.columns = [f'{j}_{i}' for i, j in dflinestable_unstacked.columns]

This gives you a DataFrame like the following (abbreviated):

          Caesars_formattedSpread  ... teamrankings_spread
id                                 ...                    
401012246             Alabama -24  ...               -23.5
401012247            Arkansas -34  ...                 NaN
401012248               Auburn -1  ...                -1.5
401012249                     NaN  ...                 NaN
401012250             Georgia -44  ...                 NaN
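
The flattening above names each column "{provider}_{category}", but the columns themselves stay grouped by category. If you would rather have each provider's three columns sit next to each other, one option (a sketch only, on a few hand-typed rows from the question; the variable names are illustrative) is to swap the two column levels and sort before flattening:

```python
import pandas as pd

rows = [
    {"id": 401022510, "provider": "Caesars",
     "spread": -17.5, "formattedSpread": "Colorado State -17.5", "overUnder": 57.5},
    {"id": 401022510, "provider": "consensus",
     "spread": -17.0, "formattedSpread": "Colorado State -17", "overUnder": 57.5},
    {"id": 401013357, "provider": "consensus",
     "spread": -21.0, "formattedSpread": "UMass -21", "overUnder": 68.0},
]
wide = pd.DataFrame(rows).set_index(["id", "provider"]).unstack()

# Columns are (category, provider); make provider the outer level and
# sort so that all columns of one provider are adjacent
wide = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)

# Flatten to single-level names such as "Caesars_spread"
wide.columns = [f"{provider}_{category}" for provider, category in wide.columns]
print(list(wide.columns))
```

Note that the sort is case-sensitive, so a capitalized provider name such as Caesars is placed before the lowercase ones.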

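【Comments】:

  • Think I'm finally getting the hang of formatting questions, see above. Still not quite solved.
  • The problem is that you did not assign the result of `df.set_index()` back to anything. As a result, you were then unstacking df with its original index. See the edited code above.
  • Very helpful. Thank you.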