[Question title]: Find name(s) of highest-value columns in each pandas dataframe row (including tied values)
[Posted]: 2020-04-17 10:43:06
[Question]:

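I have a dataframe that records the number and type of fruit owned by different people. I'd like to add a column that indicates each person's top fruit(s). If a person has two or more top-ranking fruits (that is, a tie), I want a list (or tuple) of them.

Input

For example, say my input is this dataframe: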

import pandas as pd

# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
                  {'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
                  {'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
                  {'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]

# Add people's names as an index 
df = pd.DataFrame(data, index=['Shawn', 'Monica','Jamal','Tracy'])

# Print the dataframe
df

. . . which creates the input dataframe:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count
Shawn   strawberry  23              orange      4               grape   27
Monica  apple       45              mango       45              orange  12
Jamal   blueberry   30              grapefruit  32              cherry  94
Tracy   pineapple   4               grape       4               lemon   67

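Target output

What I want is a new column that gives the name of each person's top fruit. If the person has two (or more) fruits tied for first place, I'd like a list or tuple of those fruits: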

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count    top_fruit
Shawn   strawberry  23              orange      4               grape   27              grape
Monica  apple       45              mango       45              orange  12              (apple,mango)
Jamal   blueberry   30              grapefruit  32              cherry  94              cherry
Tracy   pineapple   4               grape       4               lemon   67              lemon

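My attempt so far

The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859

Problems:

  1. If there is a tie for top fruit, it only captures one of them (Monica's top fruit comes out as apple only.)
  2. It's really complicated. Not really a problem, but if there is a more direct path, I'd like to learn it.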

import numpy as np

# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']

# Make a new dataframe with just those columns.
only_counts_df=pd.DataFrame()
only_counts_df[cols]=df[cols].copy()

# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it 
# doesn't represent tied results.
nlargest = 1 

# The next two lines are suggested from 
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . . 
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order], 
                      columns=['top{}'.format(i) for i in range(1, nlargest+1)],
                      index=only_counts_df.index)

# Join the results back to our original dataframe
result = df.join(result).copy()

# The dataframe now reports the name of the column that 
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
    if row['top1'] == 'fruit0_count':
        return row['fruit0']
    elif row['top1'] == 'fruit1_count':
        return row['fruit1']
    elif row['top1'] == 'fruit2_count':
        return row['fruit2']
    else:
        return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result

. . . Output:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count    top_fruit
Shawn   strawberry  23              orange      4               grape   27              grape
Monica  apple       45              mango       45              orange  12              apple
Jamal   blueberry   30              grapefruit  32              cherry  94              cherry
Tracy   pineapple   4               grape       4               lemon   67              lemon

Monica's top fruit should be both apple and mango.

Any tips are welcome, thanks!
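
A small sketch of why the argsort approach reports only one fruit for a tie, using just Monica's counts from the question. Slicing the sorted positions down to the first column always keeps exactly one position per row, whereas comparing against the row maximum keeps every tied position. The variable names here are illustrative only.

```python
import numpy as np

# Monica's counts from the question: apple=45, mango=45, orange=12
counts = np.array([[45, 45, 12]])

# argsort on the negated values orders column positions from largest to smallest
order = np.argsort(-counts, axis=1)

# Keeping only the first position returns exactly one column per row,
# so one of the two tied columns is always dropped
top1 = order[:, :1]

# Comparing against the row maximum keeps every tied position instead
is_top = counts == counts.max(axis=1, keepdims=True)
tied_positions = np.flatnonzero(is_top[0])
print(tied_positions.tolist())
```

Which of the two tied positions ends up in `top1` depends on the sort, so it should not be relied on.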

[Comments]:

    Tags: python pandas numpy dataframe


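    [Solution 1]:

    The idea is to split the paired columns into df1 (fruit names) and df2 (counts), compare each count with its row's max, filter the names with DataFrame.where, and finally collect the non-missing values in apply: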

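    # Fruit names sit at column positions 0, 2, 4; counts at positions 1, 3, 5
    df1 = df.iloc[:, ::2]
    df2 = df.iloc[:, 1::2]
    # True wherever a count equals its row's maximum (ties included)
    mask = df2.eq(df2.max(axis=1), axis=0)
    
    # Keep only the top fruit names, then collect the non-missing ones per row
    df['top'] = df1.where(mask.to_numpy()).apply(lambda x: x.dropna().tolist(), axis=1)
    print(df)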
                fruit0  fruit0_count      fruit1  fruit1_count  fruit2  \
    Shawn   strawberry            23      orange             4   grape   
    Monica       apple            45       mango            45  orange   
    Jamal    blueberry            30  grapefruit            32  cherry   
    Tracy    pineapple             4       grape             4   lemon   
    
            fruit2_count             top  
    Shawn             27         [grape]  
    Monica            12  [apple, mango]  
    Jamal             94        [cherry]  
    Tracy             67         [lemon]  
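
The question asked for a bare name when there is a single winner and a tuple for ties, whereas the approach above always yields a list. Below is a minimal sketch of one way to get that shape, assuming the same alternating name/count column layout as the question. `collapse` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

data = [
    {'fruit0': 'strawberry', 'fruit0_count': 23, 'fruit1': 'orange', 'fruit1_count': 4, 'fruit2': 'grape', 'fruit2_count': 27},
    {'fruit0': 'apple', 'fruit0_count': 45, 'fruit1': 'mango', 'fruit1_count': 45, 'fruit2': 'orange', 'fruit2_count': 12},
    {'fruit0': 'blueberry', 'fruit0_count': 30, 'fruit1': 'grapefruit', 'fruit1_count': 32, 'fruit2': 'cherry', 'fruit2_count': 94},
    {'fruit0': 'pineapple', 'fruit0_count': 4, 'fruit1': 'grape', 'fruit1_count': 4, 'fruit2': 'lemon', 'fruit2_count': 67},
]
df = pd.DataFrame(data, index=['Shawn', 'Monica', 'Jamal', 'Tracy'])

names = df.iloc[:, ::2]    # fruit0, fruit1, fruit2
counts = df.iloc[:, 1::2]  # fruit0_count, fruit1_count, fruit2_count

# True wherever a count equals its own row's maximum
mask = counts.eq(counts.max(axis=1), axis=0)

def collapse(row):
    """Hypothetical helper: bare name for one winner, tuple for ties."""
    winners = row.dropna().tolist()
    return winners[0] if len(winners) == 1 else tuple(winners)

df['top_fruit'] = names.where(mask.to_numpy()).apply(collapse, axis=1)
print(df['top_fruit'])
```

Mixing strings and tuples in one column makes later processing less uniform, so keeping lists everywhere may still be the more practical choice.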
    

    [Comments]:

      [Solution 2]:

      Here's what I came up with:

      maxes = df[[f"fruit{i}_count" for i in range(3)]].max(axis=1)
      mask = df[[f"fruit{i}_count" for i in range(3)]].isin(maxes)
      df_masked = df[[f"fruit{i}" for i in range(3)]][
          mask.rename(lambda x: x.replace("_count", ""), axis=1)
      ]
      
      df["top_fruit"] = df_masked.apply(lambda x: x.dropna().tolist(), axis=1)
      

      This returns

                  fruit0  fruit0_count  ... fruit2_count       top_fruit
      Shawn   strawberry            23  ...           27         [grape]
      Monica       apple            45  ...           12  [apple, mango]
      Jamal    blueberry            30  ...           94        [cherry]
      Tracy    pineapple             4  ...           67         [lemon]
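
A note on the `isin` call: as far as I understand the pandas documentation, `DataFrame.isin` aligns a Series argument on the index, so each row is compared with its own maximum rather than with the maxima of all rows. The sketch below checks that on hypothetical data (the names `Ann` and `Bob` and their counts are made up) where one person's non-top count equals another person's maximum, and compares it with the more explicit `eq(..., axis=0)`.

```python
import pandas as pd

# Hypothetical counts: Ann's 5 is not her maximum, but it equals Bob's maximum
counts = pd.DataFrame(
    {"fruit0_count": [5, 5], "fruit1_count": [9, 2]},
    index=["Ann", "Bob"],
)
maxes = counts.max(axis=1)  # Ann -> 9, Bob -> 5

# isin with a Series: matched against the Series by index label
mask_isin = counts.isin(maxes)

# eq with axis=0: explicit row-wise comparison with each row's own maximum
mask_eq = counts.eq(maxes, axis=0)

print(mask_isin)
print(mask_eq)
```

Passing `maxes.tolist()` instead of the Series would switch `isin` to plain membership testing and flag Ann's 5 as well, so the explicit `eq` form may be the less surprising one to read.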
      

      [Comments]:
