【Question Title】: Pandas: Outer Join on Non-Unique Index
【Posted】: 2015-03-19 22:58:20
【Question Description】:

I have a data frame with a MultiIndex that looks like this:

>>> dfNew.head()
                 status  shopping        TUFNWGTP
state date                                       
6     2003-01-03    emp         0  8155462.672158
      2003-01-03    emp         0  8155462.672158
      2003-01-03    emp         0  8155462.672158
      2003-01-04    emp         0  1735322.527819
      2003-01-04    emp         0  1735322.527819

You can't see it here, but status can take three values: emp, unemp, and NaN. This is data at the state-date level. I want to join new state-date data that comes at a different frequency, and then aggregate/group over time.

>>> test['foo'].head()
state  date      
1      2004-01-01     1985886
2      2004-01-01      301172
4      2004-01-01     2614525
5      2004-01-01     1180409
6      2004-01-01    16098932

Naive approach: join = inner

Here is what I do:

dfNew = dfNew.join(test['foo'], method)
dfNew.reset_index(level=0, inplace=True)
doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}
aggASS = dfNew.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)

This should:

  • join foo from the other data set for every state-date combination, and build values on a 2-year basis.

But here is what I get:

>>> aggASS.head()
                                    foo      shopping      TUFNWGTP
state date       status                                            
1     2003-01-01 emp     2007116.941176  2.910812e+12  4.500711e+09
                 unemp              NaN  7.836728e+11  5.590089e+08
      2005-01-01 emp     2062059.100000  2.026485e+12  4.440291e+09
                 unemp   2078869.000000  7.543956e+10  2.638597e+08

Observe how foo has values for status=emp but not for status=unemp, even for the same state and date.

Joining with how=inner

join uses how=inner by default, so that seems to be the problem. However, if I try

>>> dfNew = dfNew.join(test['foo'], how='outer')
NotImplementedError: Index._join_level on non-unique index is not implemented

Yes, state-date is not unique here. But as far as I can tell, what I want still makes sense (doesn't it?). What would be a valid workaround here?
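A workaround sketch (mine, not from the original post): flatten the MultiIndex with reset_index and use merge, which supports outer joins on non-unique keys. The miniature frames below are hypothetical stand-ins for dfNew and test['foo']:

```python
import pandas as pd

# Hypothetical miniatures of dfNew (non-unique state/date index) and test['foo'].
left = pd.DataFrame(
    {"status": ["emp", "emp", "unemp"], "shopping": [0, 0, 0]},
    index=pd.MultiIndex.from_tuples(
        [(6, "2003-01-03"), (6, "2003-01-03"), (6, "2003-01-04")],
        names=["state", "date"],
    ),
)
right = pd.Series(
    [8155462.67], name="foo",
    index=pd.MultiIndex.from_tuples([(6, "2003-01-03")], names=["state", "date"]),
)

# Flatten both sides to columns; merge handles duplicate keys for any `how`.
out = left.reset_index().merge(
    right.reset_index(), on=["state", "date"], how="outer"
)
# Both duplicated left rows receive foo; the unmatched row keeps foo = NaN.
```

After the merge you can restore the index with set_index(['state', 'date']) if needed.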

Suggested solution: appending as a column

One suggested solution is to append the series as a column:

Align the data frames using sortlevel:

>>> dfNew.head()
                 status  shopping        TUFNWGTP
state date                                       
1     2003-01-01    emp         0  3227364.873298
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
      2003-01-01    NaN         0  6841114.725821
>>> test['foo'].head()
state  date      
1      2004-01-01    1985886
       2004-02-01    1990082
       2004-03-01    1999936
       2004-04-01    2009556
       2004-05-01    2009573

Then we add the second time series as a column with dfNew.append(test['foo']). Someone suggested ignore_index=True, but since the index labels are correct, I don't think we need it.

However, this crashes my Python instance. Here are the sizes of the data frames:

>>> len(test['foo'])
6864
>>> len(dfNew)
404394

【Question Comments】:

  • I think I may have a solution. A few questions: what is the method in your join statement? Does the table you are joining have status values like the first table — I see test['emp']; is that like status in the first table? Am I right to assume there are no matching state-date combinations, since you are trying to get sums and means?
  • @kennes913 No, it doesn't. What you noticed is a legacy typo: test originally had a column also called 'emp', which referred to something else. I renamed it foo here for clarity — you found the one spot I forgot to change. There are no matching state-date combinations in test, just a single time series foo. For that time series I need the mean over the time groups. I have to join before aggregating because the frequencies differ: dfNew starts in 2003, test starts in 2004, but I want to match these via the mean of test['foo'].
  • Does the solution below work?
  • @kennes913 My yourtimeseries does not contain state — but I see the approach, and I am also thinking of a way to make append work.
  • If you do the append and then categorize the time periods in another column, you will be able to group by those periods and compute on them easily.
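The append-then-categorize idea from the comment above can be sketched as follows (toy frames; pd.concat stands in for append, which it later replaced):

```python
import pandas as pd

# Toy stand-ins for the flattened survey frame and the foo time series.
a = pd.DataFrame({
    "state": [1, 1],
    "date": pd.to_datetime(["2003-06-01", "2005-06-01"]),
    "TUFNWGTP": [10.0, 20.0],
})
b = pd.DataFrame({
    "state": [1],
    "date": pd.to_datetime(["2004-06-01"]),
    "foo": [5.0],
})

# Stack the frames; columns missing on one side become NaN.
both = pd.concat([a, b], ignore_index=True)

# Categorize each row into a 2-year period, then aggregate per period:
# sums and means skip the NaN rows contributed by the other frame.
both["period"] = both["date"].dt.year // 2 * 2
agg = both.groupby(["state", "period"]).agg({"TUFNWGTP": "sum", "foo": "mean"})
```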

Tags: python pandas


【Solution 1】:

Here are the steps I took; hopefully this leads you to a solution.

I recreated the multi-indexed data frame and the time series you provided:

In [118]: newdf
Out[118]: 
                      0           1                2
state date                                          
1     2003-01-01    emp           0   3227364.873298
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-01    NaN           0   6841114.725821
      2003-01-02    NaN           0   5834127.649776
      2003-01-02    NaN           0   5834127.649776
      2003-01-04    emp  2100942000   1506051.861585
      2003-01-04    emp  2100942000   1506051.861585
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-04    emp  5412841000   1204191.605090
      2003-01-05    NaN           0   1765953.711812
      2003-01-05    NaN           0   1765953.711812
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1434858.212964
      2003-01-05    emp           0   1811326.258197
      2003-01-05    emp           0   1811326.258197
      2003-01-05    NaN           0   1908483.149300
      2003-01-05    NaN           0   1908483.149300
      2003-01-06    NaN  1298934000   4190110.086256
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-07    NaN           0   6241047.457860
      2003-01-08    emp   715231400   4614396.137509
      2003-01-08    emp   715231400   4614396.137509
      2003-01-08    emp   715231400   4614396.137509
2     2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-01    emp           0  10571046.129186
      2013-08-27    NaN  6804297000   3376822.385266
      2013-08-27    NaN  6804297000   3376822.385266
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-09-28    NaN           0   4645591.067481
      2013-10-18    emp           0  14402621.620998
      2013-10-18    emp           0  14402621.620998
      2013-11-02  unemp           0   7778017.482167
      2013-11-02  unemp           0   7778017.482167
      2013-11-02  unemp           0   7778017.482167
      2013-11-09    NaN           0   2164565.290873
      2013-11-09    NaN           0   2164565.290873
      2013-11-10    emp   527859500   1759531.507169
      2013-11-10    emp   527859500   1759531.507169
      2013-11-24    emp           0   3050339.003118
      2013-11-24    emp           0   3050339.003118
      2013-11-24    emp           0   3050339.003118
      2013-11-29    NaN           0  11224606.711441
      2013-11-29    NaN           0  11224606.711441
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606
      2013-12-12    emp           0  13804339.863606

In [120]: newfoo
Out[120]: 
                      foo
state date               
1     2004-01-01  1985886
      2004-02-01  1990082
      2004-03-01  1999936
      2004-04-01  2009556
      2004-05-01  2009573
      2004-06-01  2013057
      2004-07-01  2019963
      2004-08-01  2015320
      2004-09-01  2015103
      2004-10-01  2035705
      2004-11-01  2043152
      2004-12-01  2041339
      2005-01-01  2011219
      2005-02-01  2014928
      2005-03-01  2028597
2     2013-10-01   340483
      2013-11-01   338445
      2013-12-01   336903
      2014-01-01   334565
      2014-02-01   334667
      2014-03-01   335922
      2014-04-01   337188
      2014-05-01   343958
      2014-06-01   349122
      2014-07-01   354911
      2014-08-01   350833
      2014-09-01   344849
      2014-10-01   341434
      2014-11-01   339866
      2014-12-01   339203

I flattened the data frame and the time series:

In [147]: flattenednewdf
Out[147]: 
    state       date status    shopping         TUFNWGTP
0       1 2003-01-01    emp           0   3227364.873298
1       1 2003-01-01    NaN           0   6841114.725821
2       1 2003-01-01    NaN           0   6841114.725821
3       1 2003-01-01    NaN           0   6841114.725821
4       1 2003-01-01    NaN           0   6841114.725821
5       1 2003-01-01    NaN           0   6841114.725821
6       1 2003-01-02    NaN           0   5834127.649776
7       1 2003-01-02    NaN           0   5834127.649776
8       1 2003-01-04    emp  2100942000   1506051.861585
9       1 2003-01-04    emp  2100942000   1506051.861585
10      1 2003-01-04    emp  5412841000   1204191.605090
11      1 2003-01-04    emp  5412841000   1204191.605090
12      1 2003-01-04    emp  5412841000   1204191.605090
13      1 2003-01-05    NaN           0   1765953.711812
14      1 2003-01-05    NaN           0   1765953.711812
15      1 2003-01-05    emp           0   1434858.212964
16      1 2003-01-05    emp           0   1434858.212964
17      1 2003-01-05    emp           0   1434858.212964
18      1 2003-01-05    emp           0   1811326.258197
19      1 2003-01-05    emp           0   1811326.258197
20      1 2003-01-05    NaN           0   1908483.149300
21      1 2003-01-05    NaN           0   1908483.149300
22      1 2003-01-06    NaN  1298934000   4190110.086256
23      1 2003-01-07    NaN           0   6241047.457860
24      1 2003-01-07    NaN           0   6241047.457860
25      1 2003-01-07    NaN           0   6241047.457860
26      1 2003-01-07    NaN           0   6241047.457860
27      1 2003-01-08    emp   715231400   4614396.137509
28      1 2003-01-08    emp   715231400   4614396.137509
29      1 2003-01-08    emp   715231400   4614396.137509
30      2 2013-08-01    emp           0  10571046.129186
31      2 2013-08-01    emp           0  10571046.129186
32      2 2013-08-01    emp           0  10571046.129186
33      2 2013-08-01    emp           0  10571046.129186
34      2 2013-08-27    NaN  6804297000   3376822.385266
35      2 2013-08-27    NaN  6804297000   3376822.385266
36      2 2013-09-28    NaN           0   4645591.067481
37      2 2013-09-28    NaN           0   4645591.067481
38      2 2013-09-28    NaN           0   4645591.067481
39      2 2013-09-28    NaN           0   4645591.067481
40      2 2013-09-28    NaN           0   4645591.067481
41      2 2013-09-28    NaN           0   4645591.067481
42      2 2013-10-18    emp           0  14402621.620998
43      2 2013-10-18    emp           0  14402621.620998
44      2 2013-11-02  unemp           0   7778017.482167
45      2 2013-11-02  unemp           0   7778017.482167
46      2 2013-11-02  unemp           0   7778017.482167
47      2 2013-11-09    NaN           0   2164565.290873
48      2 2013-11-09    NaN           0   2164565.290873
49      2 2013-11-10    emp   527859500   1759531.507169
50      2 2013-11-10    emp   527859500   1759531.507169
51      2 2013-11-24    emp           0   3050339.003118
52      2 2013-11-24    emp           0   3050339.003118
53      2 2013-11-24    emp           0   3050339.003118
54      2 2013-11-29    NaN           0  11224606.711441
55      2 2013-11-29    NaN           0  11224606.711441
56      2 2013-12-12    emp           0  13804339.863606
57      2 2013-12-12    emp           0  13804339.863606
58      2 2013-12-12    emp           0  13804339.863606
59      2 2013-12-12    emp           0  13804339.863606


In [143]: flattenedfoo
Out[143]: 
    state       date      foo
0       1 2004-01-01  1985886
1       1 2004-02-01  1990082
2       1 2004-03-01  1999936
3       1 2004-04-01  2009556
4       1 2004-05-01  2009573
5       1 2004-06-01  2013057
6       1 2004-07-01  2019963
7       1 2004-08-01  2015320
8       1 2004-09-01  2015103
9       1 2004-10-01  2035705
10      1 2004-11-01  2043152
11      1 2004-12-01  2041339
12      1 2005-01-01  2011219
13      1 2005-02-01  2014928
14      1 2005-03-01  2028597
15      2 2013-10-01   340483
16      2 2013-11-01   338445
17      2 2013-12-01   336903
18      2 2014-01-01   334565
19      2 2014-02-01   334667
20      2 2014-03-01   335922
21      2 2014-04-01   337188
22      2 2014-05-01   343958
23      2 2014-06-01   349122
24      2 2014-07-01   354911
25      2 2014-08-01   350833
26      2 2014-09-01   344849
27      2 2014-10-01   341434
28      2 2014-11-01   339866
29      2 2014-12-01   339203

I appended the time series to the data frame. I left the row and column counts at the bottom so you can verify against the sample you provided that this is the right data frame size:

In [149]: final_df
Out[149]: 
          TUFNWGTP       date      foo    shopping  state status
0   3227364.873298 2003-01-01      NaN           0      1    emp
1   6841114.725821 2003-01-01      NaN           0      1    NaN
2   6841114.725821 2003-01-01      NaN           0      1    NaN
3   6841114.725821 2003-01-01      NaN           0      1    NaN
4   6841114.725821 2003-01-01      NaN           0      1    NaN
5   6841114.725821 2003-01-01      NaN           0      1    NaN
6   5834127.649776 2003-01-02      NaN           0      1    NaN
7   5834127.649776 2003-01-02      NaN           0      1    NaN
8   1506051.861585 2003-01-04      NaN  2100942000      1    emp
9   1506051.861585 2003-01-04      NaN  2100942000      1    emp
10  1204191.605090 2003-01-04      NaN  5412841000      1    emp
11  1204191.605090 2003-01-04      NaN  5412841000      1    emp
12  1204191.605090 2003-01-04      NaN  5412841000      1    emp
13  1765953.711812 2003-01-05      NaN           0      1    NaN
14  1765953.711812 2003-01-05      NaN           0      1    NaN
15  1434858.212964 2003-01-05      NaN           0      1    emp
16  1434858.212964 2003-01-05      NaN           0      1    emp
17  1434858.212964 2003-01-05      NaN           0      1    emp
18  1811326.258197 2003-01-05      NaN           0      1    emp
19  1811326.258197 2003-01-05      NaN           0      1    emp
20  1908483.149300 2003-01-05      NaN           0      1    NaN
21  1908483.149300 2003-01-05      NaN           0      1    NaN
22  4190110.086256 2003-01-06      NaN  1298934000      1    NaN
23  6241047.457860 2003-01-07      NaN           0      1    NaN
24  6241047.457860 2003-01-07      NaN           0      1    NaN
25  6241047.457860 2003-01-07      NaN           0      1    NaN
26  6241047.457860 2003-01-07      NaN           0      1    NaN
27  4614396.137509 2003-01-08      NaN   715231400      1    emp
28  4614396.137509 2003-01-08      NaN   715231400      1    emp
29  4614396.137509 2003-01-08      NaN   715231400      1    emp
..             ...        ...      ...         ...    ...    ...
0              NaN 2004-01-01  1985886         NaN      1    NaN
1              NaN 2004-02-01  1990082         NaN      1    NaN
2              NaN 2004-03-01  1999936         NaN      1    NaN
3              NaN 2004-04-01  2009556         NaN      1    NaN
4              NaN 2004-05-01  2009573         NaN      1    NaN
5              NaN 2004-06-01  2013057         NaN      1    NaN
6              NaN 2004-07-01  2019963         NaN      1    NaN
7              NaN 2004-08-01  2015320         NaN      1    NaN
8              NaN 2004-09-01  2015103         NaN      1    NaN
9              NaN 2004-10-01  2035705         NaN      1    NaN
10             NaN 2004-11-01  2043152         NaN      1    NaN
11             NaN 2004-12-01  2041339         NaN      1    NaN
12             NaN 2005-01-01  2011219         NaN      1    NaN
13             NaN 2005-02-01  2014928         NaN      1    NaN
14             NaN 2005-03-01  2028597         NaN      1    NaN
15             NaN 2013-10-01   340483         NaN      2    NaN
16             NaN 2013-11-01   338445         NaN      2    NaN
17             NaN 2013-12-01   336903         NaN      2    NaN
18             NaN 2014-01-01   334565         NaN      2    NaN
19             NaN 2014-02-01   334667         NaN      2    NaN
20             NaN 2014-03-01   335922         NaN      2    NaN
21             NaN 2014-04-01   337188         NaN      2    NaN
22             NaN 2014-05-01   343958         NaN      2    NaN
23             NaN 2014-06-01   349122         NaN      2    NaN
24             NaN 2014-07-01   354911         NaN      2    NaN
25             NaN 2014-08-01   350833         NaN      2    NaN
26             NaN 2014-09-01   344849         NaN      2    NaN
27             NaN 2014-10-01   341434         NaN      2    NaN
28             NaN 2014-11-01   339866         NaN      2    NaN
29             NaN 2014-12-01   339203         NaN      2    NaN

[90 rows x 6 columns]

Building time bins was new to me, but to use the approach you provided I had to set the index back to the date column. I created a new data frame because a lot of this process was experimental and I did not want to rebuild the old one:

final_df_2 = final_df.set_index(['date'])

From this point you should be able to run whatever computations you want. I ran a few below based on your code, but the problem is that we are grouping very selectively, so the results look strange:

In [187]: doWhat = {'shopping' : np.sum, 'TUFNWGTP': np.sum, 'foo' : np.mean}

In [188]: aggASS = final_df_2.groupby([pd.TimeGrouper("2AS", label='left')]).agg(doWhat)
In [189]: aggASS
Out[189]: 
                       foo     shopping      TUFNWGTP
date                                                 
2003-01-01  2014889.333333  23885035200  1.139995e+08
2005-01-01  2018248.000000          NaN           NaN
2013-01-01   341489.933333  14664313000  2.237165e+08

In [190]: aggASS = final_df_2.groupby(['state', pd.TimeGrouper("2AS", label='left'), 'status']).agg(doWhat)

In [191]: aggASS
Out[191]: 
                         foo     shopping      TUFNWGTP
state date       status                                
1     2003-01-01 emp     NaN  22586101200  3.162246e+07
2     2013-01-01 emp     NaN   1055719000  1.389769e+08
                 unemp   NaN            0  2.333405e+07

I read another post about bucketing with the cut method — you can read it here: Grouping data by value ranges. I think you can build the 2-year buckets using datetime object operations.
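To make the cut idea concrete, here is a minimal sketch (dates and edges made up) that builds explicit 2-year bin edges — the same effect the "2AS" frequency produces — and labels each date with its bucket start:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2003-05-01", "2004-07-01", "2005-02-01"]))

# Explicit 2-year edges; right=False makes each bin [start, next_start).
edges = pd.to_datetime(["2003-01-01", "2005-01-01", "2007-01-01"])
binned = pd.cut(dates, bins=edges, right=False, labels=edges[:-1])
# `binned` now holds the bucket start for each date and can be fed to groupby.
```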

【Comments】:

  • Could you show the exact code you used for append — the one that created your final_df? My use of append crashes Python on the full data set.
  • @FooBar final_df = flattenednewdf.append(flattenedfoo) — did you drop the MultiIndex?
  • @FooBar Did you find a solution?
【Solution 2】:

This is the relevant part of @kennes913's answer, condensed as an overview for future visitors:

from pandas.util.testing import assert_frame_equal  # pandas.testing in newer pandas

# flatten the data frames; for the overview, just select one value column each
df1flat = df.reset_index()[['state', 'date', 'TUFNWGTP']]
df2flat = df_emp.reset_index()[['state', 'date', 'foo']]
# the "merge": stack the two flat frames; disjoint columns become NaN
X = df1flat.append(df2flat)
# now recover the original data frames:
test1 = X.loc[X['foo'].notnull(), ['state', 'date', 'foo']]
# fix the dtype that was lost in the merge
test1['state'] = test1['state'].astype(int)

test2 = X.loc[X['TUFNWGTP'].notnull(), ['state', 'date', 'TUFNWGTP']]
# check that nothing was lost (assert_frame_equal raises on mismatch):
assert_frame_equal(df2flat, test1)
assert_frame_equal(df1flat, test2)

【讨论】:
