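Find name(s) of highest-value columns in each pandas dataframe row, including tied values

【Question title】: Find name(s) of highest-value columns in each pandas dataframe row--Including tied values
【Posted】: 2020-04-17 10:43:06
【Question description】:

I have a dataframe that records the number and type of fruit that different people own. I'd like to add a column that indicates each person's top fruit. If a person has two or more top-ranked fruits (that is, a tie), I'd like a list (or tuple) of them.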

Input

For example, say my input is this dataframe:

import pandas as pd

# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
        {'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
        {'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
        {'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]

# Add people's names as an index
df = pd.DataFrame(data, index=['Shawn', 'Monica', 'Jamal', 'Tracy'])

# Print the dataframe
df

. . . which creates the input dataframe:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count
Shawn   strawberry  23              orange      4               grape   27
Monica  apple       45              mango       45              orange  12
Jamal   blueberry   30              grapefruit  32              cherry  94
Tracy   pineapple   4               grape       4               lemon   67

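Target output

What I want is a new column that gives the name of each person's top fruit. If the person has two (or more) fruits tied for first place, I want a list or tuple of those fruits: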

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count    top_fruit
Shawn   strawberry  23              orange      4               grape   27              grape
Monica  apple       45              mango       45              orange  12              (apple,mango)
Jamal   blueberry   30              grapefruit  32              cherry  94              cherry
Tracy   pineapple   4               grape       4               lemon   67              lemon

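My attempt so far

The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859.

Problems:

If there is a tie for top fruit, it only captures one of the tied fruits (Monica's top fruit comes out as apple only).
It's really complicated. That's not a real problem, but if there is a more direct route, I'd like to learn it.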

import numpy as np

# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']

# Make a new dataframe with just those columns.
only_counts_df = df[cols].copy()

# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it 
# doesn't represent tied results.
nlargest = 1 

# The next two lines are suggested from 
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . . 
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order], 
                      columns=['top{}'.format(i) for i in range(1, nlargest+1)],
                      index=only_counts_df.index)

# Join the results back to our original dataframe
result = df.join(result).copy()

# The dataframe now reports the name of the column that 
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
    if row['top1'] == 'fruit0_count':
        return row['fruit0']
    elif row['top1'] == 'fruit1_count':
        return row['fruit1']
    elif row['top1'] == 'fruit2_count':
        return row['fruit2']
    else:
        return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result

. . . which outputs:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count    top_fruit
Shawn   strawberry  23              orange      4               grape   27              grape
Monica  apple       45              mango       45              orange  12              apple
Jamal   blueberry   30              grapefruit  32              cherry  94              cherry
Tracy   pineapple   4               grape       4               lemon   67              lemon

Monica's top fruits should be apple and mango.
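A minimal sketch of why the argsort approach drops the tie, using only Monica's three counts from the table. The variable names and the `kind="stable"` argument are assumptions added here so the result is deterministic; they are not part of the original attempt.

```python
import numpy as np

# Monica's counts from the question: apple=45, mango=45, orange=12
counts = np.array([45, 45, 12])

# argsort assigns exactly one position to each rank, so slicing off the
# first rank can only ever yield a single column, even when values tie.
top_one = np.argsort(-counts, kind="stable")[:1]

# Comparing against the maximum keeps every position that shares it.
tied = np.flatnonzero(counts == counts.max())

print(top_one)  # [0]
print(tied)     # [0 1]
```

Raising `nlargest` does not help either, since the second rank is reported whether or not it equals the first.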

Any tips are welcome, thanks!

【Question discussion】:

Tags: python pandas numpy dataframe

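【Solution 1】:

The idea is to split the paired columns into df1 (fruit names, every other column starting at position 0) and df2 (counts, every other column starting at position 1). Then compare each count with its row-wise max, filter the names with DataFrame.where, and finally collect the non-missing values in apply: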

df1 = df.iloc[:, ::2]
df2 = df.iloc[:, 1::2]
mask = df2.eq(df2.max(axis=1), axis=0)

df['top'] = df1.where(mask.to_numpy()).apply(lambda x: x.dropna().tolist(), axis=1)
print(df)
            fruit0  fruit0_count      fruit1  fruit1_count  fruit2  \
Shawn   strawberry            23      orange             4   grape   
Monica       apple            45       mango            45  orange   
Jamal    blueberry            30  grapefruit            32  cherry   
Tracy    pineapple             4       grape             4   lemon   

        fruit2_count             top  
Shawn             27         [grape]  
Monica            12  [apple, mango]  
Jamal             94        [cherry]  
Tracy             67         [lemon]
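The positional slices `::2` and `1::2` rely on the name and count columns strictly alternating. The following is a hedged sketch of the same max-comparison idea that instead pairs columns by name, assuming every count column is called `<name column>_count`. It also returns a bare string when there is a single winner and a tuple when there is a tie, as the question's target output shows. The helper `top_fruits` is hypothetical, and only the first two rows of the question's data are reproduced to keep it short.

```python
import pandas as pd

data = [
    {"fruit0": "strawberry", "fruit0_count": 23, "fruit1": "orange",
     "fruit1_count": 4, "fruit2": "grape", "fruit2_count": 27},
    {"fruit0": "apple", "fruit0_count": 45, "fruit1": "mango",
     "fruit1_count": 45, "fruit2": "orange", "fruit2_count": 12},
]
df = pd.DataFrame(data, index=["Shawn", "Monica"])


def top_fruits(frame):
    """Hypothetical helper: top fruit per row, or a tuple of fruits on a tie."""
    # Assumption: each count column is named "<name column>_count".
    count_cols = [c for c in frame.columns if c.endswith("_count")]
    name_cols = [c[: -len("_count")] for c in count_cols]

    counts = frame[count_cols]
    # True wherever a count equals the maximum of its own row.
    is_top = counts.eq(counts.max(axis=1), axis=0).to_numpy()
    names = frame[name_cols].to_numpy()

    result = []
    for row_names, row_mask in zip(names, is_top):
        winners = tuple(row_names[row_mask])
        result.append(winners[0] if len(winners) == 1 else winners)
    return pd.Series(result, index=frame.index, dtype=object)


df["top_fruit"] = top_fruits(df)
print(df["top_fruit"])
```

Mixing strings and tuples in one column matches the requested output, but always returning a list, as the answer above does, is usually easier to process downstream.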

【Discussion】:

【Solution 2】:

Here is what I came up with:

maxes = df[[f"fruit{i}_count" for i in range(3)]].max(axis=1)
mask = df[[f"fruit{i}_count" for i in range(3)]].isin(maxes)
df_masked = df[[f"fruit{i}" for i in range(3)]][
    mask.rename(lambda x: x.replace("_count", ""), axis=1)
]

df["top_fruit"] = df_masked.apply(lambda x: x.dropna().tolist(), axis=1)

This returns

            fruit0  fruit0_count  ... fruit2_count       top_fruit
Shawn   strawberry            23  ...           27         [grape]
Monica       apple            45  ...           12  [apple, mango]
Jamal    blueberry            30  ...           94        [cherry]
Tracy    pineapple             4  ...           67         [lemon]
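One detail worth checking in this approach is how `DataFrame.isin` treats a Series argument. As I understand the pandas documentation, the Series is matched on its index, so each row is compared only with its own maximum and not with the whole set of maxima. The sketch below uses made-up counts (not the question's data), where person B holds a 45 that happens to equal person A's maximum, to compare `isin` against an explicit row-wise `eq`.

```python
import pandas as pd

# Made-up counts (not from the question): B holds a 45, which is A's maximum
counts = pd.DataFrame(
    {"fruit0_count": [23, 45], "fruit1_count": [45, 50]},
    index=["A", "B"],
)
maxes = counts.max(axis=1)  # A -> 45, B -> 50

by_isin = counts.isin(maxes)      # Series argument: matched per index label
by_eq = counts.eq(maxes, axis=0)  # explicit row-wise comparison

# Whether the two masks agree, and whether B's 45 is flagged as a top value
print(by_isin.equals(by_eq))
print(by_isin.loc["B", "fruit0_count"])
```

If the mask were ever built from a plain list of maxima, for example `maxes.tolist()`, membership would be tested against all values at once, so `eq(..., axis=0)` is the more explicit choice.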

【Discussion】: