【Question title】:How to correctly read a CSV file generated from groupby results?
【Posted】:2022-01-26 13:05:36
【Question description】:

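I computed the mean of a DataFrame grouped by two sets of bins and saved the result to a CSV file.

I then tried to read it back with read_csv(), but .loc does not work on the loaded DataFrame.

Here is a code example: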

import pandas as pd
import numpy as np

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'value'])

a_bins = np.arange(-3, 4, 1)
b_bins = np.arange(-2, 4, 2)

# calculate the mean value
df['a_bins'] = pd.cut(df['a'], bins=a_bins)
df['b_bins'] = pd.cut(df['b'], bins=b_bins)
df_value_bin = df.groupby(['a_bins','b_bins']).agg({'value':'mean'})

# save to csv file
df_value_bin.to_csv('test.csv')

# read the exported file
df_test = pd.read_csv('test.csv')

When I enter:

df_value_bin.loc[(1.5, -1)]

I get this output:

value    0.254337
Name: ((1, 2], (-2, 0]), dtype: float64

However, if I look up the value the same way in the DataFrame loaded from the CSV file:

df_test.loc[(1.5, -1)]

I get this KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_33836/4042082162.py in <module>
----> 1 df_test.loc[(1.5, -1)]

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1098     def _getitem_tuple(self, tup: tuple):
   1099         with suppress(IndexingError):
-> 1100             return self._getitem_lowerdim(tup)
   1101 
   1102         # no multi-index, so validate all of the indexers

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
    836                 # We don't need to check for tuples here because those are
    837                 #  caught by the _is_nested_tuple_indexer check above.
--> 838                 section = self._getitem_axis(key, axis=i)
    839 
    840                 # We should never have a scalar section here, because

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1162         # fall thru to straight lookup
   1163         self._validate_key(key, axis)
-> 1164         return self._get_label(key, axis=axis)
   1165 
   1166     def _get_slice_axis(self, slice_obj: slice, axis: int):

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _get_label(self, label, axis)
   1111     def _get_label(self, label, axis: int):
   1112         # GH#5667 this will fail if the label is not present in the axis.
-> 1113         return self.obj.xs(label, axis=axis)
   1114 
   1115     def _handle_lowerdim_multi_index_axis0(self, tup: tuple):

~/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   3774                 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3775         else:
-> 3776             loc = index.get_loc(key)
   3777 
   3778             if isinstance(loc, np.ndarray):

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    386                 except ValueError as err:
    387                     raise KeyError(key) from err
--> 388             raise KeyError(key)
    389         return super().get_loc(key, method=method, tolerance=tolerance)
    390 

KeyError: 1.5

【Discussion】:

    Tags: python pandas dataframe


    【Solution 1】:

    You should read the index as a MultiIndex, but you also need to convert the strings back to intervals. You can use to_interval (all credit to korakot):

    def to_interval(istr):
        c_left = istr[0]=='['
        c_right = istr[-1]==']'
        closed = {(True, False): 'left',
                  (False, True): 'right',
                  (True, True): 'both',
                  (False, False): 'neither'
                  }[c_left, c_right]
        left, right = map(int, istr[1:-1].split(','))
        return pd.Interval(left, right, closed)
    
    df_test = pd.read_csv('test.csv', index_col=[0, 1], converters={0: to_interval, 1: to_interval})
    

    Test:

    df_test.loc[(1.5, -1)]
    #value    0.254337
    #Name: ((1, 2], (-2, 0]), dtype: float64
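
The to_interval above parses the bin edges with int, which fits the integer bins in the question. As a minimal sketch, assuming your bins may have fractional edges, a variant could parse with float instead. The name parse_interval is hypothetical and not part of pandas or the original answer:

```python
import pandas as pd


def parse_interval(text):
    """Turn a string such as '(0.5, 1.5]' into a pd.Interval (hypothetical helper)."""
    text = text.strip()
    # The bracket characters decide which sides of the interval are closed.
    closed = {
        (True, False): "left",
        (False, True): "right",
        (True, True): "both",
        (False, False): "neither",
    }[text[0] == "[", text[-1] == "]"]
    # float() accepts both '1' and '0.5'; the answer above uses int() instead.
    left, right = (float(part) for part in text[1:-1].split(","))
    return pd.Interval(left, right, closed=closed)


iv = parse_interval("(0.5, 1.5]")
print(iv.closed)   # right
print(1.0 in iv)   # True: 1.0 lies inside the interval
print(0.5 in iv)   # False: the left edge is open
```

Note that intervals parsed this way print with float edges, for example (1.0, 2.0] rather than (1, 2].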
    

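    【Comments】:

    • Thanks! Is it possible to apply to_interval to a list of column names? With many bin columns, that would be cleaner and save lines.
    • No, converters must be a dict whose keys are column numbers (integers) or labels, not a list of them.
    • OK. Then I came up with this: index_range = range(0, 2, 1); df_test = pd.read_csv('test.csv', index_col=list(index_range), converters={index: to_interval for index in index_range}). Since the bin columns usually start at index 0, the user can simply give the start and end indices to read the bin index.
    • Well, you can build the converters dict however you like, but in the end you need a dict with column numbers or labels as keys and converters as values, so yes, this one works too.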
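
The idea from the comments can be wrapped in a small function. This is only a sketch under stated assumptions: read_binned_csv and n_index_cols are hypothetical names, the bin columns are assumed to be the leading columns of the file, and the inline CSV text is a made-up miniature of test.csv rather than the real data.

```python
import io

import pandas as pd


def to_interval(istr):
    # Same logic as the to_interval in the answer above.
    closed = {(True, False): 'left', (False, True): 'right',
              (True, True): 'both', (False, False): 'neither'
              }[istr[0] == '[', istr[-1] == ']']
    left, right = map(int, istr[1:-1].split(','))
    return pd.Interval(left, right, closed)


def read_binned_csv(source, n_index_cols):
    """Read a CSV whose first n_index_cols columns hold interval strings."""
    index_cols = list(range(n_index_cols))
    # converters must be a dict keyed by column number (or label).
    converters = {col: to_interval for col in index_cols}
    return pd.read_csv(source, index_col=index_cols, converters=converters)


# Made-up stand-in for test.csv; to_csv quotes labels that contain commas.
csv_text = (
    'a_bins,b_bins,value\n'
    '"(1, 2]","(-2, 0]",0.25\n'
    '"(1, 2]","(0, 2]",0.5\n'
)

df_small = read_binned_csv(io.StringIO(csv_text), n_index_cols=2)
print(df_small.index.nlevels)                # 2
print(type(df_small.index[0][0]).__name__)   # Interval
```

With a real file, the call would be read_binned_csv('test.csv', n_index_cols=2).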