How do I create new columns by combining data in existing columns? (Answers)

【Question title】: How do I create new columns by combining data in existing columns?
【Posted】: 2020-01-24 16:25:13
【Question】:

I have a dataset with 5 columns (please excuse the formatting):

id         Price  Service  Rater Name  Cleanliness
401013357  5      3        A           1
401014972  2      1        A           5
401022510  3      4        B           2
401022510  5      1        C           9
401022510  3      1        D           4
401022510  2      2        E           2

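I want only one row per id. To get there, I need a separate column for each combination of rater name and rating category (e.g. rater-name price, rater-name service, rater-name cleanliness). Thanks.

I have explored groupby, but I can't figure out how to turn the groups into new columns. Thanks!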

Here's the code and data I'm actually using:

import requests
from pandas import DataFrame
import pandas as pd


linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)

dflines = DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)

a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format 
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread','formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)

dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])

From here, I get:

                              formattedSpread  overUnder  spread
id        provider                                                
401013357 consensus                   UMass -21       68.0   -21.0
401014972 consensus                  Rice -22.5       58.5   -22.5
401022510 Caesars          Colorado State -17.5       57.5   -17.5
          consensus          Colorado State -17       57.5   -17.0
          numberfire         Colorado State -17       58.5   -17.0
          teamrankings       Colorado State -17       58.0   -17.0
401013437 numberfire                 Wyoming -5       47.0     5.0
          teamrankings               Wyoming -5       47.0     5.0
401020671 consensus            Ball State -19.5       61.5   -19.5
401019470 Caesars                     UCF -22.5        NaN    22.5
          consensus                   UCF -22.5        NaN    22.5
          numberfire                    UCF -24       70.0    24.0
          teamrankings                  UCF -24       70.0    24.0
401013328 numberfire            Minnesota -21.5       47.0   -21.5
          teamrankings          Minnesota -21.5       49.0   -21.5

The result I'm looking for has three columns for each of the 4 providers: Caesars_formattedSpread, Caesars_overUnder, Caesars_spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, and so on.

When I run unstack as suggested, I don't get the result I expect. Instead, I get:

formattedSpread  0                  UMass -21
                 1                 Rice -22.5
                 2       Colorado State -17.5
                 3         Colorado State -17
                 4         Colorado State -17
                 5         Colorado State -17
                 6                 Wyoming -5
                 7                 Wyoming -5
                 8           Ball State -19.5
                 9                  UCF -22.5
                 10                 UCF -22.5
                 11                   UCF -24
                 12                   UCF -24

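【Comments】:

What is your expected output?
What have you tried, and what is the expected result? Could you please provide more information?
@WeNYoBen - see the edit.
@SimonFink I've made major edits, as shown above. I was probably trying to oversimplify.
Late to the party, but what is the expected output? I only see the "incorrect" output.

Tags: python pandas group-by pandas-groupby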

【Solution 1】:

* Edited, based on the edited question *

Given that your DataFrame is df:

df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
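
As a minimal sketch of those two lines, here they are applied to the sample table from the question, rebuilt by hand. The flattened `<rater>_<category>` column names are an assumed naming scheme for illustration, not something the question specifies.

```python
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "id": [401013357, 401014972, 401022510, 401022510, 401022510, 401022510],
    "Price": [5, 2, 3, 5, 3, 2],
    "Service": [3, 1, 4, 1, 1, 2],
    "Rater Name": ["A", "A", "B", "C", "D", "E"],
    "Cleanliness": [1, 5, 2, 9, 4, 2],
})

# One row per id, one column per (category, rater) pair
wide = df.set_index(["id", "Rater Name"]).unstack()

# Collapse the two column levels into single names (assumed naming scheme)
wide.columns = [f"{rater}_{category}" for category, rater in wide.columns]

print(wide)
```

Raters who did not score a given id show up as NaN in that id's row, so the integer ratings come back as floats.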

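The problem with your edited code is that you never assign the result of dflinestable.set_index(['id', 'provider']) to anything. set_index returns a new DataFrame by default and leaves the original untouched, so when you then call dflinestable.unstack(), you are unstacking the original dflinestable, which still has its default single-level index. Unstacking a frame whose index is not a MultiIndex returns one long Series, which is the output you saw.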
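
The difference can be reproduced on a tiny frame. This is only a sketch: the ids and values below are made up for illustration, and only the `id`, `provider` and `spread` column names come from the question.

```python
import pandas as pd

# Made-up rows; only the column names come from the question
df = pd.DataFrame({
    "id": [1, 1, 2],
    "provider": ["x", "y", "x"],
    "spread": [-21.0, -20.5, 3.0],
})

df.set_index(["id", "provider"])   # result is discarded, df is unchanged
series_result = df.unstack()       # default index, so this is a long Series

indexed = df.set_index(["id", "provider"])  # keep the returned frame
wide = indexed.unstack()                    # providers become columns

print(type(series_result).__name__)
print(type(wide).__name__)
```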

So your whole code should be:

import requests
import pandas as pd


linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)

dflines = pd.DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)

a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format 
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread','formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)

dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True)  # inplace=True so the new index is actually kept
dflinestable_unstacked = dflinestable.unstack()  # unstack (you could also reassign to the same df)

# Flatten the two column levels into single names of the form provider_field
dflinestable_unstacked.columns = [f'{provider}_{field}' for field, provider in dflinestable_unstacked.columns]

This gives you a DataFrame like the following (abbreviated):

          Caesars_formattedSpread  ... teamrankings_spread
id                                 ...                    
401012246             Alabama -24  ...               -23.5
401012247            Arkansas -34  ...                 NaN
401012248               Auburn -1  ...                -1.5
401012249                     NaN  ...                 NaN
401012250             Georgia -44  ...                 NaN
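
To try the same reshaping without calling the API, here is a sketch on a hand-built payload. It assumes the response is a list of games, each with an `id` and a nested `lines` list, which is the shape the loop above implies. The sample values are taken from the output shown in the question. `pd.json_normalize` is swapped in for the explicit loop as an alternative, not as part of the original answer.

```python
import pandas as pd

# Hand-built sample in the assumed response shape (values from the question)
games = [
    {"id": 401013357, "lines": [
        {"provider": "consensus", "spread": -21.0,
         "formattedSpread": "UMass -21", "overUnder": 68.0},
    ]},
    {"id": 401022510, "lines": [
        {"provider": "Caesars", "spread": -17.5,
         "formattedSpread": "Colorado State -17.5", "overUnder": 57.5},
        {"provider": "consensus", "spread": -17.0,
         "formattedSpread": "Colorado State -17", "overUnder": 57.5},
    ]},
]

# One row per (game, line); the game's id is carried along as a meta column
long_df = pd.json_normalize(games, record_path="lines", meta=["id"])
long_df["id"] = long_df["id"].astype("int64")  # meta columns come back as object

# Same reshaping as above: index by (id, provider), unstack, flatten
wide = long_df.set_index(["id", "provider"]).unstack()
wide.columns = [f"{provider}_{field}" for field, provider in wide.columns]

print(wide)
```

Games that a provider did not cover get NaN in that provider's columns, as in the abbreviated output above.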

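【Comments】:

I think I'm finally getting the hang of formatting questions; see above. It's not fully solved yet.
The problem is that you aren't assigning the result of `df.set_index()` to anything, so you end up unstacking the df with its original index. See the edited code above.
Very helpful. Thank you.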