Create new column based on values in other column: answers

【Question title】: Create new column based on values in other column
【Posted】: 2020-09-17 11:40:49
【Question】:

This is a column in my DataFrame:

Index    Direction Output
10886    DOWN      None
10887      UP      None
10888      UP      None
10889      UP      None
10890      UP      None
10891      UP      STRONG_UP
10892      UP      STRONG_UP
10893      UP      STRONG_UP
10894      UP      STRONG_UP
10895      UP      STRONG_UP
10896      UP      STRONG_UP
10897      UP      STRONG_UP
10898      UP      STRONG_UP
10899      UP      STRONG_UP
10900    DOWN      None 
10901    DOWN      None
10902      UP      None
10903      UP      None
10904    DOWN      None
10905    DOWN      None
10906    DOWN      None

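I want to create a new column.
If the current Direction value and the 5 previous Direction values are all UP, the cell becomes "STRONG_UP".
If the current Direction value and the 5 previous Direction values are all DOWN, the cell becomes "STRONG_DOWN".
Otherwise the value is None.
How can I do this?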

【Comments】:

Can you add your expected output?

Tags: python pandas

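【Solution 1】:

Unfortunately rolling only works with numeric values, so use map to encode the strings as numbers before rolling and to decode the result afterwards. Note that this is slow on a large DataFrame: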

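import numpy as np

def f(x):
    # x holds one window of encoded values: 1 for UP, 0 for DOWN
    if np.all(x == 1):
        return 2
    elif np.all(x == 0):
        return 3
    else:
        return np.nan

# The parentheses let the method chain span several lines
df['Output'] = (df['Direction'].map({'UP': 1, 'DOWN': 0})
                               .rolling(6)
                               .apply(f)
                               .map({2: 'STRONG_UP', 3: 'STRONG_DOWN'}))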

print (df)
    Index Direction     Output
0   10887        UP        NaN
1   10888        UP        NaN
2   10889        UP        NaN
3   10890        UP        NaN
4   10891        UP        NaN
5   10892        UP  STRONG_UP
6   10893        UP  STRONG_UP
7   10894        UP  STRONG_UP
8   10895        UP  STRONG_UP
9   10896        UP  STRONG_UP
10  10897        UP  STRONG_UP
11  10898        UP  STRONG_UP
12  10899        UP  STRONG_UP
13  10900      DOWN        NaN
14  10901      DOWN        NaN
15  10902        UP        NaN
16  10903        UP        NaN
17  10904      DOWN        NaN
18  10905      DOWN        NaN
19  10906      DOWN        NaN
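
Because the encoded column only contains 0 and 1, the Python-level apply can be avoided. The following is a minimal sketch, not part of the original answer, that assumes the same 20-row sample and swaps apply(f) for a rolling sum: a window of 6 is all UP exactly when the sum of the UP flags equals 6. The names is_up, all_up and so on are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data assumed to match the 20 rows printed above
direction = ['UP'] * 13 + ['DOWN', 'DOWN', 'UP', 'UP', 'DOWN', 'DOWN', 'DOWN']
df = pd.DataFrame({'Index': np.arange(10887, 10907), 'Direction': direction})

n = 6  # current row plus the 5 previous rows

# 1/0 flags per row; the rolling sum equals n only if every row in the window is set
is_up = df['Direction'].eq('UP').astype(int)
is_down = df['Direction'].eq('DOWN').astype(int)
all_up = is_up.rolling(n).sum().eq(n)      # NaN (incomplete window) compares as False
all_down = is_down.rolling(n).sum().eq(n)

df['Output'] = None
df.loc[all_up, 'Output'] = 'STRONG_UP'
df.loc[all_down, 'Output'] = 'STRONG_DOWN'

print(df)
```

The first 5 rows stay None because their windows are incomplete, which matches the behaviour of rolling(6).apply(f) above.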

If performance matters, another idea is to use strides together with numpy.select:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

n = 6
x = np.concatenate([[None] * (n-1), df['Direction'].to_numpy()])

a = rolling_window(x, n)

print (a)
[[None None None None None 'UP']
 [None None None None 'UP' 'UP']
 [None None None 'UP' 'UP' 'UP']
 [None None 'UP' 'UP' 'UP' 'UP']
 [None 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
 ['UP' 'UP' 'UP' 'UP' 'UP' 'DOWN']
 ['UP' 'UP' 'UP' 'UP' 'DOWN' 'DOWN']
 ['UP' 'UP' 'UP' 'DOWN' 'DOWN' 'DOWN']
 ['UP' 'UP' 'DOWN' 'DOWN' 'DOWN' 'UP']
 ['UP' 'DOWN' 'DOWN' 'DOWN' 'UP' 'UP']
 ['DOWN' 'DOWN' 'DOWN' 'UP' 'UP' 'DOWN']
 ['DOWN' 'DOWN' 'UP' 'UP' 'DOWN' 'DOWN']]

m1 = np.all(a == 'UP', axis=1)
m2 = np.all(a == 'DOWN', axis=1)

df['Output'] = np.select([m1, m2], ['STRONG_UP','STRONG_DOWN'], None)

print (df)
    Index Direction     Output
0   10887        UP       None
1   10888        UP       None
2   10889        UP       None
3   10890        UP       None
4   10891        UP       None
5   10892        UP  STRONG_UP
6   10893        UP  STRONG_UP
7   10894        UP  STRONG_UP
8   10895        UP  STRONG_UP
9   10896        UP  STRONG_UP
10  10897        UP  STRONG_UP
11  10898        UP  STRONG_UP
12  10899        UP  STRONG_UP
13  10900      DOWN       None
14  10901      DOWN       None
15  10902      DOWN       None
16  10903        UP       None
17  10904        UP       None
18  10905      DOWN       None
19  10906      DOWN       None
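
On NumPy 1.20 or later, the hand-written rolling_window helper can probably be replaced by numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of windowed view and returns it read-only. This is a sketch under that assumption, not the answer's own code; the sample data and the way the result is assigned (a pre-filled object array instead of numpy.select) are illustrative choices.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Illustrative sample: 13 x UP followed by a mix of DOWN and UP
direction = ['UP'] * 13 + ['DOWN', 'DOWN', 'UP', 'UP', 'DOWN', 'DOWN', 'DOWN']
df = pd.DataFrame({'Index': np.arange(10887, 10907), 'Direction': direction})

n = 6
# Left-pad with n-1 None values so that every row gets a full window
pad = np.full(n - 1, None, dtype=object)
x = np.concatenate([pad, df['Direction'].to_numpy(dtype=object)])

windows = sliding_window_view(x, n)  # shape: (len(df), n), read-only view

m_up = (windows == 'UP').all(axis=1)
m_down = (windows == 'DOWN').all(axis=1)

# Fill an object array by mask; rows matching neither mask stay None
out = np.full(len(df), None, dtype=object)
out[m_up] = 'STRONG_UP'
out[m_down] = 'STRONG_DOWN'
df['Output'] = out

print(df)
```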

Performance: the first method is omitted because it is too slow.

print (pd.show_versions())


INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.8.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 7
Version          : 6.1.7601
machine          : AMD64
processor        : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en
LOCALE           : Slovak_Slovakia.1250

pandas           : 1.1.1
numpy            : 1.19.1

import numpy as np
import pandas as pd
import perfplot

np.random.seed(123)


def GW(df):
    df['group'] = np.r_[True, df.Direction.values[1:] != df.Direction.values[:-1]].cumsum()
    df['count'] = df.groupby('group').cumcount()+1
    df['result'] = np.where(df['count'] >= 6, 'STRONG_'+df.Direction, np.nan) 
    df = (df[['Index','Direction','result']])
    return df

def ST(df):
    
    def rolling_window(a, window):
        shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
        strides = a.strides + (a.strides[-1],)
        return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

    n = 6
    x = np.concatenate([[None] * (n-1), df['Direction'].to_numpy()])
    a = rolling_window(x, n)
    m1 = np.all(a == 'UP', axis=1)
    m2 = np.all(a == 'DOWN', axis=1)
    df['Output2'] = np.select([m1, m2], ['STRONG_UP','STRONG_DOWN'], None)
    return df

def make_df(n):
    direction = np.random.choice(['UP','DOWN'], n)
    df = pd.DataFrame({
        'Index': np.arange(len(direction)),
        'Direction': direction
    })
    return df

perfplot.show(
    setup=make_df,
    kernels=[GW, ST],
    n_range=[2**k for k in range(5, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')

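【Comments】:

I could easily make the column cells 1 and 0 (instead of strings).
I like the pure pandas solution with rolling.
This is a nice way to run a benchmark to compare functions. Mind if I copy your code for future benchmarks?
@mikksu - Sure. Some of it is mine, some comes from pir or coldspeed, link
@jezrael, wow, this is a great answer. Thanks for sharing, sir. Cheers.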

【Solution 2】:

An idea using numpy, without apply:

import numpy as np
df['group'] = np.r_[True, df.Direction.values[1:] != df.Direction.values[:-1]].cumsum()
df['count'] = df.groupby('group').cumcount()+1
df['result'] = np.where(df['count'] >= 6, 'STRONG_'+df.Direction, np.nan) 
print(df[['Index','Direction','result']])

Output

    Index Direction     result
0   10887        UP        NaN
1   10888        UP        NaN
2   10889        UP        NaN
3   10890        UP        NaN
4   10891        UP        NaN
5   10892        UP  STRONG_UP
6   10893        UP  STRONG_UP
7   10894        UP  STRONG_UP
8   10895        UP  STRONG_UP
9   10896        UP  STRONG_UP
10  10897        UP  STRONG_UP
11  10898        UP  STRONG_UP
12  10899        UP  STRONG_UP
13  10900      DOWN        NaN
14  10901      DOWN        NaN
15  10902        UP        NaN
16  10903        UP        NaN
17  10904      DOWN        NaN
18  10905      DOWN        NaN
19  10906      DOWN        NaN
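
To make the group and count steps easier to follow, here is a small walk-through on a toy column. It is a sketch rather than the answer's exact code: it assumes a threshold of 3 instead of 6 so the effect is visible on 6 rows, and it uses the pandas idiom ne(shift()).cumsum() in place of the np.r_ expression to number the runs. The names run_id and run_len are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Direction': ['UP', 'UP', 'DOWN', 'DOWN', 'DOWN', 'UP']})
n = 3  # illustrative threshold; the question uses 6

# A new run starts wherever the value differs from the previous row
run_id = df['Direction'].ne(df['Direction'].shift()).cumsum()

# Position of each row inside its run, counted from 1
run_len = df.groupby(run_id).cumcount() + 1

# Label a row only once its run has reached length n; other rows become NaN
df['result'] = ('STRONG_' + df['Direction']).where(run_len >= n)

print(run_id.tolist())   # [1, 1, 2, 2, 2, 3]
print(run_len.tolist())  # [1, 2, 1, 2, 3, 1]
print(df)
```

Only the third DOWN in a row (position 4) is labelled STRONG_DOWN here, because it is the first row whose run is at least 3 long.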

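Micro-benchmark

Out of curiosity, I ran a small benchmark on my laptop (i5-7200U, 8 GB RAM, in a Jupyter Notebook), comparing:

Pandas Rolling & Apply (RA)
Pandas GroupBy & Numpy Where (GW)
Numpy strides (NP)

The data was generated like this: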

direction = np.random.choice(['UP','DOWN'], 100000)
df = pd.DataFrame({
    'Index': np.arange(len(direction)),
    'Direction': direction
})

Results

          N=1000       |      N=10000      |     N=100000
RA   32.7 ms ± 3.05 ms |  271 ms ± 22.9 ms | 2.35 s ± 60.1 ms
GW   6.33 ms ± 230 µs  | 10.2 ms ± 51.4 µs | 63.8 ms ± 1.31 ms
NP   1.33 ms ± 32.5 µs | 8.21 ms ± 555 µs  | 74.4 ms ± 2.73 ms
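
For readers who want to repeat a comparison like this outside a notebook, the following is a rough sketch using only timeit from the standard library. It is not the setup that produced the numbers above: the function names gw and ra, the seed, the use of raw=True in rolling.apply, and the repeat counts are all assumptions made for illustration, and timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n, seed=123):
    rng = np.random.default_rng(seed)
    direction = rng.choice(['UP', 'DOWN'], n)
    return pd.DataFrame({'Index': np.arange(n), 'Direction': direction})


def gw(df, n=6):
    # GroupBy-style approach: number the runs, then count position inside each run
    d = df['Direction']
    run_id = d.ne(d.shift()).cumsum()
    run_len = d.groupby(run_id).cumcount() + 1
    return ('STRONG_' + d).where(run_len >= n)


def ra(df, n=6):
    # Rolling & apply approach on the 1/0 encoded column
    def f(x):
        if np.all(x == 1):
            return 2
        if np.all(x == 0):
            return 3
        return np.nan

    enc = df['Direction'].map({'UP': 1, 'DOWN': 0})
    return enc.rolling(n).apply(f, raw=True).map({2: 'STRONG_UP', 3: 'STRONG_DOWN'})


df = make_df(10_000)
timings = {}
for name, fn in [('GW', gw), ('RA', ra)]:
    # Best of 3 repeats, 3 calls each, reported per call
    timings[name] = min(timeit.repeat(lambda: fn(df), number=3, repeat=3)) / 3
    print(f'{name}: {timings[name] * 1000:.2f} ms per call')
```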

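【Comments】:

I really like this solution. It seems clearer and easier to understand than the one jezrael provided.
Is the Numpy Solution (NP) my second solution?
Sorry, no. It is my solution from this answer.
Then it should be called a pandas/numpy combination; my second solution is the numpy one. Only the assignment back to the pandas column uses pandas.
Well, if you don't count the .values in the first step, you are right. It is more or less a comparison between 'pd.groupby.cumcount & np.where' and 'pd.rolling & apply(func)'.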