【问题标题】:pandas crosstab and normalising indexes熊猫交叉表和标准化索引
【发布时间】:2020-09-05 10:39:27
【问题描述】:

如何规范化具有多索引的 pandas 交叉表?

假设你有这样的df:

# RANDOM DATA
np.random.seed(2)

Year = [2020,2019,2018,2017]*1000
col1 = ['A','B','C','D']*1000
col2 = np.random.randint(0,2,4000)
weight = np.random.randint(1,3,4000)

random.shuffle(Year)
random.shuffle(col1)
random.shuffle(col2)
random.shuffle(weight)

column_names = ['Year', 'weight', 'col1', 'col2']

df = pd.DataFrame(columns=column_names)

df['Year'] = Year
df['col1'] = col1
df['col2'] = col2
df['weight'] = weight

现在你做一个交叉表:

pd.crosstab(index=[df['Year']],
            columns=[df['col1'], df['col2']],
            values=df['weight'],
            aggfunc=sum)

col1         A                   B                   C                   D  \
col2         0         1         0         1         0         1         0   
Year                                                                         
2017  0.117962  0.128686  0.128016  0.130697  0.137399  0.122654  0.115282   
2018  0.116832  0.111551  0.120132  0.118152  0.138614  0.125413  0.131353   
2019  0.137584  0.126846  0.127517  0.108725  0.114765  0.138255  0.114765   
2020  0.116356  0.134309  0.113032  0.143617  0.121676  0.118351  0.121676   

col1            
col2         1  
Year            
2017  0.119303  
2018  0.137954  
2019  0.131544  
2020  0.130984  

如何在多索引中进行规范化?

我的预期输出是:

col1    A   A   B   B   C   C   D   D
col2    0   1   0   1   0   1   0   1
Year                                
2017    0.478   0.522   0.495   0.505   0.528   0.472   0.491   0.509
2018    0.512   0.488   0.504   0.496   0.525   0.475   0.488   0.512
2019    0.520   0.480   0.540   0.460   0.454   0.546   0.466   0.534
2020    0.464   0.536   0.440   0.560   0.507   0.493   0.482   0.518

【问题讨论】:

    标签: python-3.x pandas crosstab


    【解决方案1】:

    我认为您可以将groupby().value_counts 与 normalize 一起使用:

    #### Random data ####
    np.random.seed(2)
    Year = [2020,2019,2018,2017]*1000
    col1 = np.repeat(['A','B','C','D'],1000)
    col2 = np.random.randint(0,2,4000)
    
    column_names = ['Year', 'col1', 'col2']
    
    df = pd.DataFrame(columns=column_names)
    
    df['Year'] = Year
    df['col1'] = col1
    df['col2'] = col2
    #### End random data ####
    
    
    (df.groupby(['Year','col1'])['col2']
       .value_counts(normalize=True)
       .unstack(['col1','col2'])
    )
    

    输出:

    col1      A             B             C             D       
    col2      0      1      1      0      0      1      1      0
    Year                                                        
    2017  0.524  0.476  0.528  0.472  0.540  0.460  0.512  0.488
    2018  0.496  0.504  0.484  0.516  0.560  0.440  0.524  0.476
    2019  0.496  0.504  0.552  0.448  0.484  0.516  0.508  0.492
    2020  0.500  0.500  0.540  0.460  0.436  0.564  0.484  0.516
    

    更新:对于加权value_counts

    new_df = df.groupby(['Year','col1','col2'])['weight'].sum()
    new_df /= new_df.sum(level=['Year','col1'])
    new_df = new_df.unstack(['col1','col2'])
    

    输出:

    col1         A                   B                   C                   D  \
    col2         0         1         0         1         0         1         0   
    Year                                                                         
    2017  0.441176  0.558824  0.542500  0.457500  0.474286  0.525714  0.482857   
    2018  0.500000  0.500000  0.474114  0.525886  0.481268  0.518732  0.489461   
    2019  0.521622  0.478378  0.474490  0.525510  0.498801  0.501199  0.531343   
    2020  0.493671  0.506329  0.479339  0.520661  0.515789  0.484211  0.513021   
    
    col1            
    col2         1  
    Year            
    2017  0.517143  
    2018  0.510539  
    2019  0.468657  
    2020  0.486979  
    

    【讨论】:

    • 我编辑了这个问题。对于那个很抱歉。那么,如果您使用 groupby 并希望使用带有“values”和“aggfunc”参数的交叉表很简单的加权数据,该怎么办
    • 在代码中,crosstab 使用pivot_table,同样是groupby。这就是为什么如果您知道该怎么做,groupby 通常会更快。在这种情况下,您也可以将normalize=True 传递给您的crosstab。在扩展的情况下,如果您可以提供详细信息,我们可以看到如何将aggfunc 传递给groupby
    猜你喜欢
    • 2021-04-09
    • 1970-01-01
    • 2022-09-27
    • 1970-01-01
    • 2021-08-23
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多