【Question title】:How to operate a function over multiple columns (Pandas/Python)?
【Posted】:2026-01-08 21:15:03
【Question】:

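Consider the IBM HR Attrition Dataset from Kaggle (https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset). How can I quickly find the variable with the highest Shapiro p-value?

In other words, I can apply shapiro() to a single column as shapiro(df['column']). I want to compute it for every numeric column.

I tried: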

import pandas as pd
from scipy.stats import shapiro

df = pd.read_csv('path')

# here I was expecting the output to be sequential prints with the name of each column and its respective p-value from shapiro()
for col in df:
    print(col, " : ", shapiro(df[col])[1])  # index 1 is the p-value; index 0 is the W statistic

Can anyone help with this?

Thanks in advance.

【Comments】:

    Tags: python pandas scipy statistics


    【Solution 1】:

    I hope this helps! I'm sure there are many better ways to do it, but it was fun to try :)

    import pandas as pd
    from scipy import stats
    
    df = pd.read_csv('path.csv')
    
    # make a new dataframe newdf with only the columns containing numeric data
    
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    
    newdf = df.select_dtypes(include=numerics)
    
    #check to see that the columns are only numeric
    print(newdf.head())
    
    # new dataframe with rows "W" and "P"
    # only `stats` is imported above, so shapiro must be called as stats.shapiro
    shapiro_wilks = newdf.apply(lambda x: pd.Series(stats.shapiro(x), index=['W', 'P']))
    
    
    print(shapiro_wilks)
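
The original question asks which variable has the highest p-value, which the table above does not report directly. A minimal sketch of that last step follows. It runs on a small synthetic DataFrame rather than the Kaggle file, so the column names (`normal_col`, `skewed_col`, `constant_col`, `label`) are hypothetical. It also drops constant columns on the assumption that a normality test is not meaningful for a column with a single value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the HR data (hypothetical columns, not the Kaggle file).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "normal_col": rng.normal(loc=50, scale=10, size=300),
    "skewed_col": rng.exponential(scale=2.0, size=300),
    "constant_col": np.ones(300),
    "label": ["a", "b", "c"] * 100,
})

# Keep numeric columns only, then drop columns that hold a single value.
numeric = df.select_dtypes(include="number")
numeric = numeric.loc[:, numeric.nunique() > 1]

# Run the test column by column and collect W and the p-value.
rows = {}
for col in numeric.columns:
    w, p = stats.shapiro(numeric[col].dropna())
    rows[col] = {"W": w, "P": p}

# One row per variable, sorted so the highest p-value comes first.
results = pd.DataFrame.from_dict(rows, orient="index")
results = results.sort_values("P", ascending=False)

best = results["P"].idxmax()
print(results)
print("Highest p-value:", best)
```

With the real data, replacing the synthetic `df` with the result of `pd.read_csv` should be the only change needed. Sorting by `P` also gives the full ranking, which may be more useful than the single top variable.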
    

    【Comments】:

    • Thank you very much!