【Question title】: pandas dataframe drop rows by MultiIndex
【Posted】: 2015-06-13 16:06:08
【Question】:

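I want to drop rows from a pandas DataFrame using MultiIndex values.

I have tried many things, but below is the attempt I think comes closest. (I will explain the whole problem, since there may be an alternative solution that takes a completely different approach.) From a correlation matrix, I want to get the most strongly correlated pairs of columns. I use unstack and put the result into a DataFrame: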

In [263]: corr_df = pd.DataFrame(total.corr().unstack())

Then I select the higher correlations (strictly speaking, I should include the strongly negative ones too).

In [264]: high = corr_df[(corr_df[0] > 0.5) & (corr_df[0] < 1.0)]

In [236]: print high
                                                  0
residual sugar       density               0.552517
free sulfur dioxide  total sulfur dioxide  0.720934
total sulfur dioxide free sulfur dioxide   0.720934
                     wine                  0.700357
density              residual sugar        0.552517
wine                 total sulfur dioxide  0.700357
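
To also catch strong negative correlations, one option is to filter on the absolute value instead of the raw value. The sketch below assumes a small made-up `total` frame (the columns `a`, `b`, `c` are hypothetical stand-ins for the real data) and works on the unstacked Series directly:

```python
import pandas as pd

# Hypothetical toy data standing in for the question's `total` frame.
total = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [5.0, 4.0, 3.0, 2.0, 1.5],   # strongly negatively correlated with a
    "c": [1.0, 3.0, 2.0, 5.0, 4.0],   # positively correlated with a
})

# unstack() on the correlation matrix gives a Series with a two-level MultiIndex.
corr = total.corr().unstack()

# Filter on magnitude so that strong negative correlations are kept as well.
# As in the question, the < 1.0 bound is meant to exclude the diagonal.
strength = corr.abs()
high = corr[(strength > 0.5) & (strength < 1.0)]
print(high)
```

The selected values keep their sign, so negative pairs remain recognisable in the result.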

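Close enough, but there are duplicates, which is expected for a correlation matrix since it is symmetric. To clean them up, my idea was to iterate over the high values and remove the duplicates: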

In [267]:
for row in high.iterrows():
    print row[0][0], ",", row[0][1]
    print high.loc[row[0][1]].loc[row[0][0]].index
    high.drop(high.loc[row[0][1]].loc[row[0][0]].index)
residual sugar , density
Int64Index([0], dtype='int64')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-267-1258da2a4772> in <module>()
      2     print row[0][0], ",", row[0][1]
      3     print high.loc[row[0][1]].loc[row[0][0]].index
----> 4     high.drop(high.loc[row[0][1]].loc[row[0][0]].index)

...
[huge stack of errors]
...
KeyError: 0

The drop method works fine when the index is a regular one (see drop), but how do I build the label when I have a MultiIndex?
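
For reference, a full MultiIndex label is a tuple with one entry per level, and `drop` accepts a list of such tuples. The `KeyError: 0` above appears to come from passing `Int64Index([0])`, which is the column label of the selected row rather than a row label. The following is a minimal sketch on hand-built data mirroring the question's `high`; the `seen` and `mirrored` names are illustrative only:

```python
import pandas as pd

# Hand-built stand-in for the question's `high` frame (one column named 0).
index = pd.MultiIndex.from_tuples([
    ("residual sugar", "density"),
    ("density", "residual sugar"),
    ("total sulfur dioxide", "wine"),
    ("wine", "total sulfur dioxide"),
])
high = pd.DataFrame({0: [0.552517, 0.552517, 0.700357, 0.700357]}, index=index)

# Drop a single row by its full label: a tuple with one entry per index level.
trimmed = high.drop([("density", "residual sugar")])

# Collect the mirrored (b, a) label of every pair already seen as (a, b),
# then drop them all in one call instead of dropping inside the loop.
seen = set()
mirrored = []
for a, b in high.index:
    if (b, a) in seen:
        mirrored.append((a, b))
    else:
        seen.add((a, b))
deduped = high.drop(mirrored)
print(deduped)
```

Note that `drop` returns a new object and leaves `high` unchanged, so the result has to be assigned.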

【Comments】:

    Tags: python python-2.7 pandas


    【Solution 1】:
    import pandas as pd

    corr_df = pd.DataFrame(
        {'residual sugar': [1, 0, 0, 0.552517, 0],
         'free sulfur dioxide': [0, 1, 0.720934, 0, 0],
         'total sulfur dioxide': [0, 0.720934, 1, 0, 0.700357],
         'density': [0.552517, 0, 0, 1, 0],
         'wine': [0, 0, 0.700357, 0, 1]},
        index=['residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'wine']).unstack()

    # Note the slight change from the original: corr_df is a Series here,
    # so the comparison applies to its values directly
    high = corr_df[(corr_df > 0.5) & (corr_df < 1.0)]

    # Sort by index, then by value with a stable sort, so the two mirrored
    # entries of each pair end up next to each other.
    # Both methods return a new object, so the result must be reassigned.
    high = high.sort_index().sort_values(kind='mergesort')

    # Drop every other value (i.e. keep only the even positions).
    # This relies on each correlation value appearing exactly twice.
    result = high.iloc[::2]
    >>> result
    density               residual sugar          0.552517
    total sulfur dioxide  wine                    0.700357
    free sulfur dioxide   total sulfur dioxide    0.720934
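
A different way to avoid the duplicates, offered here only as a sketch and not as part of the answer above, is to mask everything except the upper triangle of the correlation matrix before reshaping it, so each pair appears once and the diagonal never enters the result. It assumes NumPy is available and reuses the same sample values; `mask` and `pairs` are illustrative names:

```python
import numpy as np
import pandas as pd

labels = ['residual sugar', 'free sulfur dioxide', 'total sulfur dioxide',
          'density', 'wine']
values = np.array([
    [1, 0, 0, 0.552517, 0],
    [0, 1, 0.720934, 0, 0],
    [0, 0.720934, 1, 0, 0.700357],
    [0.552517, 0, 0, 1, 0],
    [0, 0, 0.700357, 0, 1],
])
corr = pd.DataFrame(values, index=labels, columns=labels)

# True strictly above the diagonal: one cell per unordered pair of columns.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Cells outside the mask become NaN; stack() turns the frame into a Series
# indexed by (row label, column label).
pairs = corr.where(mask).stack()

# Any NaN that survives stacking fails the comparison, so the filter
# removes it along with the weak correlations.
high = pairs[pairs.abs() > 0.5]
print(high)
```

Because the mask is built from positions rather than values, this does not depend on sorting or on each correlation value being unique to one pair.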
    

    【Comments】:
