【Question title】: Split pandas datetime index to create categorical variable
【Posted】: 2019-07-09 13:14:42
【Question description】:

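I need to create a categorical variable from a pandas datetime index and am looking for a Pythonic way to do it.

So far I have just looped over the whole index with a bunch of if-else statements. Taking inspiration from (Adding a new pandas column with mapped value from a dictionary), I tried building a dictionary of lambda if-else functions and passing it to map to create the categories, but it does not work.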

date_series = pd.date_range(start = '2010-12-31', end = '2018-12-31', freq = 'M')

regime_splitter = {lambda x : x < '2012' : 'before 2012' , lambda x : x>= '2012' and x < '2014': '2012 - 2014', lambda x : x>= '2014' : 'after 2014'}

date_series.map(regime_splitter)

Expected result

         date              regime
0  2010-12-31         before 2012
1  2013-05-31  between 2012, 2014
2  2018-12-31          after 2014

【Question comments】:

    Tags: python pandas python-datetime


    【Solution 1】:
    import pandas as pd
    data_series = pd.date_range(start='2010-12-31', end='2018-12-31', freq='M')
    df = pd.DataFrame(data_series, columns=['Dates'])
     
    def regime_splitter(value):
        # Compare each timestamp against the regime boundaries.
        if value < pd.to_datetime('2012-01-01'):
            return 'before 2012'
        elif value > pd.to_datetime('2014-12-31'):
            return 'After 2014'
        else:
            return 'Between 2012, 2014'
     
    df['regime_splitter'] = df['Dates'].apply(regime_splitter)
     
    df.head(15)
     
             Dates     regime_splitter
    0   2010-12-31         before 2012
    1   2011-01-31         before 2012
    2   2011-02-28         before 2012
    3   2011-03-31         before 2012
    4   2011-04-30         before 2012
    5   2011-05-31         before 2012
    6   2011-06-30         before 2012
    7   2011-07-31         before 2012
    8   2011-08-31         before 2012
    9   2011-09-30         before 2012
    10  2011-10-31         before 2012
    11  2011-11-30         before 2012
    12  2011-12-31         before 2012
    13  2012-01-31  Between 2012, 2014
    14  2012-02-29  Between 2012, 2014
    

    【Comments】:

      【Solution 2】:

      If you need to add or remove groups, use a solution with cut and DatetimeIndex.year:

      import numpy as np
      import pandas as pd

      # date_series is the DatetimeIndex defined in the question
      a = pd.cut(date_series.year,
                 bins=[-np.inf, 2012, 2014, np.inf],
                 labels=['before 2012','2012 - 2014','after 2014'])
      print (a.value_counts())
      before 2012    25
      2012 - 2014    24
      after 2014     48
      dtype: int64
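
      Note that pd.cut closes each bin on the right by default, so with the bins above the year 2012 itself is counted under 'before 2012' and 2014 under '2012 - 2014'. If the boundaries should instead follow the lambdas in the question (x < 2012, 2012 <= x < 2014, x >= 2014), passing right=False is one way to get there. A minimal sketch, assuming the same date_series as in the question (built here with pd.offsets.MonthEnd(), the offset behind the 'M' alias):

```python
import numpy as np
import pandas as pd

# Same month-end range as in the question.
date_series = pd.date_range(start="2010-12-31", end="2018-12-31",
                            freq=pd.offsets.MonthEnd())

# right=False makes the bins [-inf, 2012), [2012, 2014), [2014, inf),
# so 2012 and 2014 each open a new group instead of closing the previous one.
a = pd.cut(date_series.year,
           bins=[-np.inf, 2012, 2014, np.inf],
           labels=["before 2012", "2012 - 2014", "after 2014"],
           right=False)

counts = a.value_counts()
print(counts)
```

      With this variant the counts shift towards the later groups, because the 2012 and 2014 months move up one category.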
      

      Another solution with numpy.select:

      x = date_series.year
      # x > 2014 keeps the year 2014 in the middle group, matching the pd.cut bins above
      a = np.select([x <= 2012, x > 2014], ['before 2012','after 2014'], '2012 - 2014')
      
      print (pd.Series(a).value_counts())
      after 2014     48
      before 2012    25
      2012 - 2014    24
      dtype: int64
      

      Your solution should be changed to a nested if-else, but it is likely to be slow on large data:

      regime_splitter = (lambda x: 'before 2012' if x <= 2012 else 
                                   ('2012 - 2014' if x>= 2012 and x <= 2014 else 'after 2014'))
      
      a = date_series.year.map(regime_splitter)
      print (a.value_counts())
      after 2014     48
      before 2012    25
      2012 - 2014    24
      dtype: int64
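
      The dictionary in the question cannot work as written, because map treats a dictionary as a lookup table keyed by the values themselves and never calls the lambdas. If keeping the rules as data is the goal, one possible sketch is an ordered list of (predicate, label) pairs that are checked in turn. The names regime_rules and classify are made up for this illustration, and the boundaries copy the nested lambda above. Like that lambda, it runs Python code per element, so it is likely to be slow on large data.

```python
import pandas as pd

# Same month-end range as in the question.
date_series = pd.date_range(start="2010-12-31", end="2018-12-31",
                            freq=pd.offsets.MonthEnd())

# Hypothetical rule table: ordered (predicate, label) pairs, first match wins.
regime_rules = [
    (lambda year: year <= 2012, "before 2012"),
    (lambda year: year <= 2014, "2012 - 2014"),
    (lambda year: True, "after 2014"),  # catch-all for everything later
]

def classify(year):
    """Return the label of the first rule whose predicate accepts the year."""
    for predicate, label in regime_rules:
        if predicate(year):
            return label

a = date_series.year.map(classify)
counts = a.value_counts()
print(counts)
```

      Adding or removing a regime then only means editing the list, at the cost of the same per-element overhead as the lambda version.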
      

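      【Comments】:

      • Thanks, the first one does exactly what I want. I am building an interactive app where users can send comma-separated timeseries_break_points and run the analysis under different regimes, something like may-2012,june-2013,aug-2014,sep-2016. I am using ipywidgets.Textarea to accept the input and then applying pd.cut to create the new column with bins= [date_series.min()] + list(pd.to_datetime(regime_break_points.split(','))) + [date_series.max()]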
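
      The approach described in this comment could look roughly like the sketch below. It assumes ISO-formatted break points instead of the may-2012 style input, generic "regime N" labels, and a plain string standing in for the Textarea value. All of these are illustrative choices, not the commenter's actual code. include_lowest=True is added so that the earliest date, which equals the first bin edge, is not left unassigned.

```python
import pandas as pd

# Same month-end range as in the question.
date_series = pd.date_range(start="2010-12-31", end="2018-12-31",
                            freq=pd.offsets.MonthEnd())

# Hypothetical user input; in the app this would come from the text widget.
regime_break_points = "2012-05-01,2013-06-01,2014-08-01"

# Parse each break point and frame them with the first and last date.
breaks = [pd.Timestamp(s.strip()) for s in regime_break_points.split(",")]
bins = [date_series.min()] + breaks + [date_series.max()]

# One placeholder label per interval between consecutive bin edges.
labels = [f"regime {i + 1}" for i in range(len(bins) - 1)]

# include_lowest=True keeps date_series.min() inside the first interval.
regimes = pd.cut(date_series, bins=bins, labels=labels, include_lowest=True)

df = pd.DataFrame({"date": date_series, "regime": regimes})
print(df.head())
```

      The break points must be strictly increasing and lie inside the date range, otherwise pd.cut rejects the bins.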