Vectorize for loop and return x day high and low

【Question title】: Vectorize for loop and return x day high and low
【Posted】: 2021-12-02 06:16:16
【Question】:

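Overview

For each row of a dataframe, I want to calculate the x day high and low.

An x day high is a value higher than each of the previous x days. An x day low is a value lower than each of the previous x days.

The for loop is explained in more detail in a separate post.

Update:

The answer by @mozway below completes in around 20 seconds on a dataset containing 18k rows. Can this be improved with numpy broadcasting etc.?

Example

2020-03-20 has an x_day_low value of 1 because its low is lower than the previous day's.

2020-03-27 has an x_day_high value of 8 because its high is higher than those of the previous 8 days.

See the desired output and test code below, which is calculated with for loops in the findHighLow function. How would I vectorize findHighLow, given that the actual dataframe is somewhat larger?

Test data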

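import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None):
    # Build an hourly random-walk price series, then resample it to daily OHLC bars.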
    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

#rows set to 15 for minimal example but actual dataframe contains around 18000 rows.
number_of_rows = 15    

df = genMockDataFrame(number_of_rows,1.1904,'tttmmm','19/3/2020',seed=157)

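def findHighLow(df):
    # Work on plain NumPy arrays; the original chained assignment
    # (df['x_day_high'][n] = ...) is unreliable in recent pandas versions.
    high = df['High'].to_numpy()
    low = df['Low'].to_numpy()
    x_day_high = np.zeros(len(df), dtype=int)
    x_day_low = np.zeros(len(df), dtype=int)

    for n in reversed(range(len(df))):
        # Walk back from row n while the current high exceeds the earlier high.
        for i in reversed(range(n)):
            if high[n] > high[i]:
                x_day_high[n] = n - i
            else:
                break

    for n in reversed(range(len(df))):
        # Walk back from row n while the current low undercuts the earlier low.
        for i in reversed(range(n)):
            if low[n] < low[i]:
                x_day_low[n] = n - i
            else:
                break

    df['x_day_high'] = x_day_high
    df['x_day_low'] = x_day_low
    return df

df = findHighLow(df)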

The desired output should match this:

df[["High","Low","x_day_high","x_day_low"]]

              High     Low  x_day_high  x_day_low
date
2020-03-19  1.1937  1.1832           0          0
2020-03-20  1.1879  1.1769           0          1
2020-03-21  1.1767  1.1662           0          2
2020-03-22  1.1721  1.1611           0          3
2020-03-23  1.1819  1.1690           2          0
2020-03-24  1.1928  1.1807           4          0
2020-03-25  1.1939  1.1864           6          0
2020-03-26  1.2141  1.1964           7          0
2020-03-27  1.2144  1.2039           8          0
2020-03-28  1.2099  1.2018           0          1
2020-03-29  1.2033  1.1853           0          4
2020-03-30  1.1887  1.1806           0          6
2020-03-31  1.1972  1.1873           1          0
2020-04-01  1.1997  1.1914           2          0
2020-04-02  1.1924  1.1781           0          9

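【Question comments】:

Can you explain the logic?
Question updated with an explanation.
See also stackoverflow.com/questions/70138987/…

Tags: pandas

【Solution 1】:

Here are two solutions. Both produce the desired output as posted in the question.

The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.

Numba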

import numba

@numba.njit
def count_smaller(arr):
    # arr is the expanding window; its last element is the current row.
    current = arr[-1]
    count = 0
    
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] > current:
            break
        
        count += 1
        
    return count


@numba.njit
def count_greater(arr):
    current = arr[-1]
    count = 0
    
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] < current:
            break
        
        count += 1
        
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)

Pandas/Numpy

def count_consecutive_true(bool_arr):
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)

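This last solution is similar to mozway's. However, it runs faster because it does not need to perform a join and uses numpy as much as possible. It also looks arbitrarily far back rather than stopping after a fixed number of days.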
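Because both solutions above look arbitrarily far back, each row may rescan many earlier rows. The following is a minimal sketch of a different technique, a monotonic stack, which is not part of this answer. It assumes the rows are already in date order and uses the strict comparisons of the question's findHighLow loop (the functions above use <= and >=, which differ only when two prices tie). The name x_day_high_stack is made up for illustration.

```python
def x_day_high_stack(values):
    """For each position, count the immediately preceding values that are strictly smaller."""
    counts = [0] * len(values)
    stack = []  # indices of earlier values, non-increasing from bottom to top
    for i, v in enumerate(values):
        # Discard earlier values that the current one exceeds.
        while stack and values[stack[-1]] < v:
            stack.pop()
        # The top of the stack is now the nearest earlier value >= v (if any).
        counts[i] = i - stack[-1] - 1 if stack else i
        stack.append(i)
    return counts


def x_day_low_stack(values):
    """Same idea for lows: negating the values turns 'lower than' into 'higher than'."""
    return x_day_high_stack([-v for v in values])


# High and Low columns from the question's desired output.
highs = [1.1937, 1.1879, 1.1767, 1.1721, 1.1819, 1.1928, 1.1939, 1.2141,
         1.2144, 1.2099, 1.2033, 1.1887, 1.1972, 1.1997, 1.1924]
lows = [1.1832, 1.1769, 1.1662, 1.1611, 1.1690, 1.1807, 1.1864, 1.1964,
        1.2039, 1.2018, 1.1853, 1.1806, 1.1873, 1.1914, 1.1781]

print(x_day_high_stack(highs))
print(x_day_low_stack(lows))
```

Each index is pushed and popped at most once, so the whole pass is linear in the number of rows. With a dataframe, the results could be assigned from df.High.tolist() and df.Low.tolist().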

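【Comments】:

Note: I have updated the Pandas/Numpy solution to use raw=True, which makes it about 3 times faster.
Could you break down and explain how the Pandas/Numpy solution works? Much appreciated.
I understand most of it, but not the -1 at the end of count_consecutive_true(arr <= arr[-1]) - 1.
count_consecutive_true returns the number of consecutive array elements for which the condition is true, counting from the end of the array. There will always be at least one True element in arr <= arr[-1], namely the last one, arr[-1] compared with itself. Therefore the minimum of count_consecutive_true(arr <= arr[-1]) is 1, which we need to subtract. Does this answer your question?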
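To make the explanation above concrete, here is a small worked example of the reverse + cumprod + sum trick on the first five High values from the question. The variable names mask, run, and x_day_high are chosen only for illustration.

```python
import numpy as np

def count_consecutive_true(bool_arr):
    # Reverse so the newest element comes first. cumprod stays 1 until the
    # first False and is 0 from then on, so the sum is the trailing run length.
    return bool_arr[::-1].cumprod().sum()

# Highs for 2020-03-19 through 2020-03-23 from the question's sample data.
highs = np.array([1.1937, 1.1879, 1.1767, 1.1721, 1.1819])

mask = highs <= highs[-1]           # [False, False, True, True, True]
run = count_consecutive_true(mask)  # 3: the last element plus two earlier days
x_day_high = run - 1                # 2: drop the element compared with itself

print(mask, run, x_day_high)
```

The result of 2 matches the x_day_high value for 2020-03-23 in the desired output.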

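【Solution 2】:

You can use rolling to get the last N days, a comparison plus cumprod on the reversed boolean array to keep only the last run of consecutive valid values, and sum to count them. Apply this to each column with agg, then join the output after adding a prefix.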

# number of days
N = 8

df.join(df.rolling(f'{N+1}d', min_periods=1)
          .agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum()-1,
                'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum()-1,
               })
          .add_prefix(f'{N}_days_')
        )

Output:

              Open    High     Low   Close  8_days_High  8_days_Low
date                                                               
2020-03-19  1.1904  1.1937  1.1832  1.1832          0.0         0.0
2020-03-20  1.1843  1.1879  1.1769  1.1772          0.0         1.0
2020-03-21  1.1755  1.1767  1.1662  1.1672          0.0         2.0
2020-03-22  1.1686  1.1721  1.1611  1.1721          0.0         3.0
2020-03-23  1.1732  1.1819  1.1690  1.1819          2.0         0.0
2020-03-24  1.1836  1.1928  1.1807  1.1922          4.0         0.0
2020-03-25  1.1939  1.1939  1.1864  1.1936          6.0         0.0
2020-03-26  1.1967  1.2141  1.1964  1.2114          7.0         0.0
2020-03-27  1.2118  1.2144  1.2039  1.2089          7.0         0.0
2020-03-28  1.2080  1.2099  1.2018  1.2041          0.0         1.0
2020-03-29  1.2033  1.2033  1.1853  1.1880          0.0         4.0
2020-03-30  1.1876  1.1887  1.1806  1.1879          0.0         6.0
2020-03-31  1.1921  1.1972  1.1873  1.1939          1.0         0.0
2020-04-01  1.1932  1.1997  1.1914  1.1914          2.0         0.0
2020-04-02  1.1902  1.1924  1.1781  1.1862          0.0         7.0

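【Comments】:

I've marked your answer as useful; it is faster than the existing code. With n=500 and a dataframe of several thousand rows, it is still slow. I am also interested in 1000-day highs/lows and in running on large datasets.
I'm not sure you can do much better with pandas. The limitation is the rolling on timestamps, which you could avoid if you are sure that all days are consecutive. If you have a lot of memory, you could use numpy broadcasting. It might help to provide more information on your dataset, the available computing power, and the current and expected running times.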
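The numpy broadcasting idea mentioned in the last comment could look roughly like the sketch below. It is not code from either answer. It assumes that rows are consecutive days (so the look-back is positional rather than timestamp-based), that the look-back is capped at n_back rows, and that NumPy 1.20 or later is available for sliding_window_view. The function name x_day_counts and its parameters are hypothetical, and the comparisons are strict like the question's findHighLow loop.

```python
import numpy as np

def x_day_counts(values, n_back, high=True):
    """Count how many immediately preceding rows (at most n_back) each value beats."""
    values = np.asarray(values, dtype=float)
    # Pad the front so every row has a full window. The pad value always stops
    # the run: +inf can never be exceeded, -inf can never be undercut.
    pad = np.inf if high else -np.inf
    padded = np.concatenate([np.full(n_back, pad), values])
    # windows[i] holds the n_back previous values followed by values[i].
    windows = np.lib.stride_tricks.sliding_window_view(padded, n_back + 1)
    prev = windows[:, :-1][:, ::-1]   # previous values, most recent first
    cur = windows[:, -1:]             # current value as a column, for broadcasting
    beaten = prev < cur if high else prev > cur
    # cumprod keeps 1 only while the run of True values is unbroken.
    return beaten.cumprod(axis=1).sum(axis=1)

# High and Low columns from the question's desired output.
highs = np.array([1.1937, 1.1879, 1.1767, 1.1721, 1.1819, 1.1928, 1.1939, 1.2141,
                  1.2144, 1.2099, 1.2033, 1.1887, 1.1972, 1.1997, 1.1924])
lows = np.array([1.1832, 1.1769, 1.1662, 1.1611, 1.1690, 1.1807, 1.1864, 1.1964,
                 1.2039, 1.2018, 1.1853, 1.1806, 1.1873, 1.1914, 1.1781])

x_day_high = x_day_counts(highs, n_back=14, high=True)
x_day_low = x_day_counts(lows, n_back=14, high=False)
print(x_day_high)
print(x_day_low)
```

The intermediate arrays have one row per input row and n_back columns, so memory grows with rows times n_back. That is the trade-off the comment alludes to: the Python-level loop disappears, but a 1000-day look-back on a large dataset needs a correspondingly large amount of memory.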