【Question title】: Creating a Mixin class for pandas DataFrame and native Python dict
【Posted】: 2017-05-22 04:17:54
【Question】:

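How can I create a mixin class for a pandas DataFrame and a native Python dict so that DataFrame columns can be accessed like a nested dict?

According to "Accessing pandas DataFrame as a nested list", the df.loc[] indexer is the way to access the desired rows, columns, or slices.

But the goal is to access the 2-D DataFrame with the same syntax as a native Python dict. For example: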

>>> import pandas as pd
>>> df = pd.DataFrame([['x', 1,2,3,4,5], ['y', 6,7,8,9,10], ['z', 11,12,13,14,15]])
>>> df.columns = ['index', 'a', 'b', 'c', 'd', 'e']
>>> df = df.set_index(['index'])
>>> df
        a   b   c   d   e
index                    
x       1   2   3   4   5
y       6   7   8   9  10
z      11  12  13  14  15

>>> df['x']
[1, 2, 3, 4, 5]

>>> df['x']['a']
1

>>> df['x']['a', 'b']
(1, 2)

>>> df['x']['a', 'd', 'c']
(1, 4, 3)
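
For comparison, the lookups wished for above can already be written with the standard .loc indexer. The following is a minimal sketch, assuming the same frame as in the question; the variable names row, cell, pair and triple are introduced here only for illustration.

```python
import pandas as pd

# Same data as in the question, with 'index' used as the row index.
df = pd.DataFrame(
    [['x', 1, 2, 3, 4, 5], ['y', 6, 7, 8, 9, 10], ['z', 11, 12, 13, 14, 15]],
    columns=['index', 'a', 'b', 'c', 'd', 'e'],
).set_index('index')

row = df.loc['x'].tolist()                   # whole row as a list
cell = df.loc['x', 'a']                      # a single cell
pair = tuple(df.loc['x', ['a', 'b']])        # selected columns, in the order asked
triple = tuple(df.loc['x', ['a', 'd', 'c']])

print(row)
print(cell, pair, triple)
```

The row label goes first and the column label(s) second, which is the reverse of plain df[...] indexing, where the key selects a column.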

I have tried creating a mixin class like this:

from pandas import DataFrame

class VegeTable(DataFrame, dict):
    def __init__(self, *args, **kwargs):
        DataFrame.__init__(self, *args, **kwargs)
    def __getitem__(self, row_key, column_key):
        if type(row_key) != list:
            row_key = [row_key]
        if type(column_key) != list:
            column_key = [column_key]
        return df.loc[row_key, column_key]

But I think something is missing: dictionary-style key access doesn't work, and dict.get returns a strange value:

>>> from pandas import DataFrame
>>> 
>>> 
>>> class VegeTable(DataFrame, dict):
...     def __init__(self, *args, **kwargs):
...         DataFrame.__init__(self, *args, **kwargs)
...     def __getitem__(self, row_key, column_key):
...         if type(row_key) != list:
...             row_key = [row_key]
...         if type(column_key) != list:
...             column_key = [column_key]
...         return df.loc[row_key, column_key]
... 
>>> 
>>> vt = VegeTable([['x', 1,2,3,4,5], ['y', 6,7,8,9,10], ['z', 11,12,13,14,15]])
>>> vt.columns = ['index', 'a', 'b', 'c', 'd', 'e']
>>> vt = vt.set_index(['index'])
>>> vt
        a   b   c   d   e
index                    
x       1   2   3   4   5
y       6   7   8   9  10
z      11  12  13  14  15
>>> vt['x']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2062, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3590, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
  File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'x'
>>> vt.get(['x'])
>>> vt.get('x')
>>> vt.get('x', 'a')
'a'
>>> vt.get('x', ['a', 'b'])
['a', 'b']
>>> vt.get('x', ['a', 'b'])
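
The values returned by vt.get above are consistent with how DataFrame.get(key, default=None) behaves: it looks up a column and returns default when no such column exists. Because DataFrame precedes dict in the base-class list, dict.get is never reached. A small sketch on a plain DataFrame, with a reduced two-column frame assumed for brevity:

```python
import pandas as pd

# Reduced version of the question's frame: rows x, y, z and columns a, b.
df = pd.DataFrame({'a': [1, 6, 11], 'b': [2, 7, 12]}, index=['x', 'y', 'z'])

# 'x' is a row label, not a column, so the default is returned unchanged.
fallback = df.get('x', 'a')
no_default = df.get('x')           # no default given, so None
column = df.get('a').tolist()      # 'a' is a real column

print(fallback, no_default, column)
```

So the second argument is never interpreted as a column key; it is simply the fallback value.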

How can I create a mixin class for a pandas DataFrame and a native Python dict so that DataFrame columns can be accessed like a nested dict? Is that possible? If so, how?

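【Comments】:

  • If you use __getitem__ for row access instead of the current column access, how do you propose to do column access?
  • The same way a nested dict such as defaultdict(dict) is accessed, i.e. vt[row_ids, column_ids], and vt[row_ids] for row access.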

Tags: python pandas dictionary get mixins


【Solution 1】:

Errors in reasoning

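  1. vt = vt.set_index(['index'])
    This rebinds vt to a plain <class 'pandas.core.frame.DataFrame'>, because set_index returns a DataFrame rather than your subclass.
    You have to override set_index, or cast the resulting DataFrame back to your class.

  2. def __getitem__(self, row_key, column_key=None):
    Only one argument is ever passed to __getitem__(...).
    Multiple keys must go inside the [...], e.g. vt['x', ['a', 'b', 'c']], where they arrive as a single tuple.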
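
The second point can be checked without pandas. The sketch below uses a hypothetical Probe class (not part of the question or of any library) whose __getitem__ simply returns whatever key it receives:

```python
class Probe:
    """Hypothetical helper: echoes the key that __getitem__ receives."""

    def __getitem__(self, key):
        return key


p = Probe()
single = p['x']              # one key arrives as-is
pair = p['x', ['a', 'b']]    # comma-separated keys arrive packed in one tuple

print(repr(single))          # 'x'
print(repr(pair))            # ('x', ['a', 'b'])
```

Writing p['x']['a'] is therefore two separate calls: __getitem__('x') on p, then ['a'] applied to whatever the first call returned.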

If you can accept this slightly different notation, the following implementation does what you want:

from pandas import DataFrame

class DataFrame2(DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getitem__(self, item):
        if isinstance(item, tuple):
            # df[row, [col, ...]]: look up the row first, then the listed columns
            row = self.loc[item[0]]
            sub_item = item[1]
            if isinstance(sub_item, list):
                r = [row.loc[key] for key in sub_item]
                if len(r) == 1:
                    return r[0]
                else:
                    return tuple(r)
            else:
                # Only the form df[row, [list of columns]] is supported
                raise NotImplementedError('expected df[row, [col, ...]]')
        else:
            # df[row]: return the whole row as a tuple
            return tuple(self.loc[item])

    def set_index(self, index):
        # Re-wrap the result so that it stays a DataFrame2
        return DataFrame2(super().set_index(index))

# Usage (same rows as in the question):
data = [['x', 1,2,3,4,5], ['y', 6,7,8,9,10], ['z', 11,12,13,14,15]]
df = DataFrame2(data)
df.columns = ['index', 'a', 'b', 'c', 'd', 'e']
df = df.set_index(['index'])

print('df[\'x\']={}\n'.format(df['x']))
print('df[\'x\'][\'a\']={}\n'.format(df['x',['a']]))
print('df[\'x\'][\'a\', \'b\']={}\n'.format(df['x', ['a', 'b']]))
print('df[\'x\'][\'a\', \'b\', \'c\']={}\n'.format(df['x', ['a', 'b', 'c']]))

Output

df['x']=(1, 2, 3, 4, 5)
df['x']['a']=1
df['x']['a', 'b']=(1, 2)
df['x']['a', 'b', 'c']=(1, 2, 3)

Tested with Python 3.4.2
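
As an alternative to overriding set_index by hand, the pandas documentation on subclassing describes a _constructor property that pandas consults when a method has to build a new frame. The following is only a sketch of that idea, not the answer's method; KeepTypeFrame is a hypothetical name, and the behaviour may vary between pandas versions.

```python
import pandas as pd


class KeepTypeFrame(pd.DataFrame):  # hypothetical name, for illustration only
    @property
    def _constructor(self):
        # Tell pandas which class to use when it creates a derived frame.
        return KeepTypeFrame


data = [['x', 1, 2, 3, 4, 5], ['y', 6, 7, 8, 9, 10], ['z', 11, 12, 13, 14, 15]]
kt = KeepTypeFrame(data, columns=['index', 'a', 'b', 'c', 'd', 'e'])

indexed = kt.set_index('index')
print(type(indexed).__name__)
```

With this in place the subclass type should survive set_index and similar methods, so each one does not have to be wrapped individually.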

【Comments】:

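    【Solution 2】:

    I don't think creating a mixin class is a good idea. When you use pandas, you should think the pandas way. I also doubt that a native Python nested dict can be accessed this way: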

    In []: df['x']['a', 'b']
    

    However, if you insist, try the following code first:

    In []: df.T.to_dict()
    Out[]:
    {'x': {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5},
     'y': {'a': 6, 'b': 7, 'c': 8, 'd': 9, 'e': 10},
     'z': {'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15}}
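
To illustrate the doubt raised in this answer, here is a small sketch using the nested dict shown above as a literal. A plain dict treats 'a', 'b' inside the brackets as one tuple key, so multi-column access needs something like operator.itemgetter. The variable names are chosen here for illustration.

```python
from operator import itemgetter

# The nested dict produced by df.T.to_dict(), written out as a literal.
nested = {
    'x': {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5},
    'y': {'a': 6, 'b': 7, 'c': 8, 'd': 9, 'e': 10},
    'z': {'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15},
}

cell = nested['x']['a']            # ordinary nested access works

missing_key = None
try:
    nested['x']['a', 'b']          # looked up as the single key ('a', 'b')
except KeyError as exc:
    missing_key = exc.args[0]

pair = itemgetter('a', 'b')(nested['x'])         # several keys at once
triple = itemgetter('a', 'd', 'c')(nested['x'])

print(cell, missing_key, pair, triple)
```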
    

    【Comments】:
