Pandas - expanding inverse quantile function (answers)

【Question title】: Pandas - expanding inverse quantile function
【Posted】: 2016-07-01 15:54:36
【Question description】:

I have a DataFrame of values:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print(df)
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]

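I want to transform this by replacing those values with their percentile, where the percentile is taken over the distribution of all values in prior rows. That is, if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking for the expanding percentile over the entire cross-sectional history.

So the goal is something like this: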

      a   b
0    99  99
..   ..  ..
499  58  84

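(Ideally I'd like to place each value in the distribution of all values in all rows before and including its own row, so not exactly an expanding percentile; but if we can't get that, that's fine.)

I have a really ugly way of doing this: I transpose and unstack the DataFrame, generate a percentile mask, and overlay that mask on the DataFrame with a for loop to get the percentiles: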

percentile_boundaries_over_time = pd.DataFrame({integer: 
                                     pd.expanding_quantile(df.T.unstack(), integer/100.0) 
                                     for integer in range(0,101,1)})

percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)

for integer in range(0,100,1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer

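I've been trying to use scipy.stats.percentileofscore() and pd.expanding_apply() to speed things up, but it isn't giving the correct output, and I'm driving myself crazy trying to figure out why. This is what I've been playing with: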

perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))

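Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help is much appreciated!

【Question comments】:

What makes you think your expanding apply gives incorrect results? At a glance it looks fine (within each column; it doesn't seem to combine values across rows). Maybe put in an np.random.seed() call before generating the data, so others can check results against the same data?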
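To make the "within each column" point concrete, here is a minimal sketch using the newer df.expanding().apply() interface (the module-level pd.expanding_apply has since been removed from pandas). The seed value and the helper name weak_percentile_of_last are my own assumptions for illustration, and the helper only imitates what percentileofscore does with kind='weak', using plain NumPy:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed (arbitrary choice) so others can reproduce the numbers
df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=["a", "b"])


def weak_percentile_of_last(window):
    """Percentage of values in the window that are <= the newest value.

    Hypothetical helper: a NumPy stand-in for
    stats.percentileofscore(window, window[-1], kind='weak').
    """
    return 100.0 * np.mean(window <= window[-1])


# raw=True hands each expanding window to the function as a NumPy array.
# Each column is processed on its own, so 'a' never sees values from 'b'.
per_column = df.expanding().apply(weak_percentile_of_last, raw=True)
```

Because each column is expanded independently, this cannot produce a percentile over the pooled values of both columns, which may be why the output looked wrong for the stated goal.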

Tags: python pandas scipy percentile

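【Solution 1】:

As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data every time. That is probably true of any current prepackaged solution, including pd.DataFrame.rank and scipy.stats.percentileofscore. Repeated sorting is wasteful and computationally expensive, so we want a solution that minimizes it.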

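Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The catch is that our data set keeps expanding. Thankfully, some sorting algorithms are extremely fast on mostly sorted data (with a small number of unsorted elements added). So the strategy is to maintain our own array of sorted data and, on each row iteration, add the row's values to that array and query their positions in the newly expanded sorted set. The latter operation is also fast, given that the data is sorted.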
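As a small illustration of the "insertion position" idea (the numbers are made up for the example):

```python
import numpy as np

# Data we have already seen, kept in sorted order
sorted_data = np.array([0.1, 0.2, 0.4, 0.7, 0.9])
new_value = 0.5

# Index at which new_value would be inserted to keep the array sorted;
# with side="left" this equals the number of values strictly below it
position = np.searchsorted(sorted_data, new_value, side="left")

# Dividing by the number of values turns the position into an inverse quantile
inverse_quantile = position / len(sorted_data)
```

Here three of the five values lie below 0.5, so the position is 3 and the inverse quantile is 0.6.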

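I think insertion sort would be the fastest sort for this, but in pure Python it would probably be slower than any native NumPy sort. Merge sort seems to be the best of the options available in NumPy. An ideal solution would involve writing some Cython, but using the above strategy with NumPy gets us most of the way there.

Here is a hand-rolled solution: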

import numpy as np
import pandas as pd

def quantiles_by_row(df):
    """ Reconstruct a DataFrame of expanding quantiles by row """

    # Construct skeleton of DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.nan, index=df.index, columns=df.columns)

    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)

    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0

    # Iterates over ndarray rows
    for i, row_array in enumerate(df.values):

        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]

        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")

        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length

        # Insert values into quantile_df
        quantile_df.iloc[i, ~row_is_nan] = quantile_row

    return quantile_df

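On the data that bhalperin provided (offline), this solution was 10 times faster.

One last comment: np.searchsorted has the options 'left' and 'right', which determine whether the prospective insertion position is the first or the last suitable position. This matters if your data contains a lot of duplicates. A more accurate version of the above solution would take the average of 'left' and 'right':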
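For anyone who wants to check the output of quantiles_by_row, a slow brute-force reference is sketched below. The function name quantiles_by_row_bruteforce is hypothetical and the code is only meant for small inputs, since it re-pools every earlier row on each iteration:

```python
import numpy as np
import pandas as pd


def quantiles_by_row_bruteforce(df):
    """Slow reference: re-pool all rows up to and including the current one."""
    values = df.values.astype(float)
    result = np.full(values.shape, np.nan)
    for i in range(len(values)):
        # All non-NaN values from rows 0..i, across every column
        pooled = values[: i + 1].ravel()
        pooled = pooled[~np.isnan(pooled)]
        for j, value in enumerate(values[i]):
            if not np.isnan(value):
                # Fraction of pooled values strictly below the current value,
                # which is what searchsorted(..., side="left") / length computes
                result[i, j] = np.sum(pooled < value) / len(pooled)
    return pd.DataFrame(result, index=df.index, columns=df.columns)
```

On a small frame, the two results can be compared with np.allclose(quantiles_by_row(df), quantiles_by_row_bruteforce(df), equal_nan=True).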

# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
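A toy example (values made up for illustration) shows how much the choice of side can matter when there are duplicates:

```python
import numpy as np

# Sorted data with a run of duplicates
sorted_data = np.array([1.0, 2.0, 2.0, 2.0, 3.0])
value = 2.0
n = len(sorted_data)

left = np.searchsorted(sorted_data, value, side="left")    # first suitable position
right = np.searchsorted(sorted_data, value, side="right")  # last suitable position

quantile_left = left / n                  # counts only values strictly below 2.0
quantile_right = right / n                # counts values below or equal to 2.0
quantile_mid = (left + right) / (2 * n)   # average of the two
```

Here left is 1 and right is 4, so the quantile for the value 2.0 is 0.2, 0.8 or 0.5 depending on the convention.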

【Comments】:

【Solution 2】:

It's not very clear, but do you want a cumulative sum divided by the total?

norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm

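The same goes for b.

【Comments】:

【Solution 3】:

Here is an attempt to implement your "percentile over the set of all values in all rows before and including that row" requirement. stats.percentileofscore seems to misbehave when given 2D data, so squeezing the values into a 1D array seems to help in getting correct results: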

from scipy import stats

a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    #   * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) *2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined, 
        df.loc[current_index, 'a'], 
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined, 
        df.loc[current_index, 'b'], 
        kind='weak'
    )
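The loop above hard-codes two columns. A sketch that handles any number of columns is given below; it swaps stats.percentileofscore for a plain NumPy comparison that follows the kind='weak' definition (percentage of values less than or equal to the score). The function name expanding_pooled_percentile is hypothetical, and the sketch assumes the frame contains no NaN values:

```python
import numpy as np
import pandas as pd


def expanding_pooled_percentile(df):
    """Weak percentile of each value against all values in rows up to its own.

    Assumes df is numeric and has no NaN values.
    """
    values = df.values.astype(float)
    result = np.empty(values.shape)
    for i in range(len(values)):
        # All columns of rows 0..i, flattened into a single 1D array
        pooled = values[: i + 1].ravel()
        # Broadcasting compares every pooled value with each value in row i;
        # the column-wise mean is the share of pooled values <= that value
        result[i] = 100.0 * np.mean(pooled[:, None] <= values[i][None, :], axis=0)
    return pd.DataFrame(result, index=df.index, columns=df.columns)
```

Like the loop above, this still rebuilds the pooled array on every row, so it is a readability improvement rather than a speed-up.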

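【Comments】:

I found this approach to be about as fast as my earlier method. Also, I still can't figure out why stats.percentileofscore() gives a different answer from pd.quantile()! I think I must be misunderstanding pd.quantile() vs stats.percentileofscore().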
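One possible source of the confusion (a guess, since the code being compared isn't shown): the two functions answer opposite questions. A quantile takes a probability and returns a value, while a percentile-of-score takes a value and returns a percentage. A minimal sketch, assuming "pd.quantile()" refers to Series.quantile and using a NumPy stand-in for percentileofscore with kind='weak':

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, only for reproducibility
data = pd.Series(np.random.uniform(0, 1, 1000))

# Quantile: probability in, value out
median_value = data.quantile(0.5)

# Inverse quantile: value in, percentage out
# (share of observations less than or equal to the value)
percent_at_or_below = 100.0 * np.mean(data.values <= median_value)
```

Feeding the result of one into the other should roughly recover the input (about 50 here), but the two outputs are on different scales and are not directly comparable.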