Pythonic Way of Reducing Factor Levels in Large Dataframe: Answers

【Question title】: Pythonic Way of Reducing Factor Levels in Large Dataframe
【Posted】: 2020-03-30 21:10:24
【Question description】:

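I am trying to reduce the number of factor levels in a column of a pandas dataframe, so that any level whose share of the column's rows falls below a defined threshold (1% by default) is bucketed into a new level labeled "Other". Below is the function I am using to do this: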

def condenseMe(df, column_name, threshold = 0.01, newLabel = "Other"):

    valDict = dict(df[column_name].value_counts() / len(df[column_name]))
    toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
    if 'Missing' in toCondense:
        toCondense.remove('Missing')
    df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)

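The problem is that I am working with a large dataset (about 18 million rows) and am trying to use this function on a column with more than 10,000 levels. As a result, running the function on this column takes a very long time to complete. Is there a more Pythonic way to reduce the number of factor levels that executes faster? Any help would be much appreciated!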

【Question discussion】:

Tags: python pandas categories bucket levels

【Solution 1】:

You can do it with a combination of groupby, transform and count:

def condenseMe(df, col, threshold = 0.01, newLabel="Other"):
    # Per-row share of the row's own level: group size divided by total rows
    counts = df.groupby(col)[col].transform('count') / len(df)
    # Create a 1D boolean mask based on threshold (ignoring "Missing")
    mask = (counts < threshold) & (df[col] != 'Missing')

    # Assign the masked rows a new label; .loc avoids chained assignment,
    # which can silently write to a temporary copy instead of df
    df.loc[mask, col] = newLabel
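
A variation that may also be worth timing is sketched below. It computes the frequencies once per level with value_counts instead of once per row, then uses isin for the membership test, which should avoid the per-row list scan in the original apply/lambda version. The function name condense_levels and the keep parameter are hypothetical, introduced only for this illustration, and the sketch assumes an object or string column rather than a categorical dtype.

```python
import pandas as pd


def condense_levels(df, col, threshold=0.01, new_label="Other", keep=("Missing",)):
    """Return df[col] with rare levels replaced by new_label (hypothetical helper)."""
    # Share of rows per level: one entry per distinct level, not per row.
    # Note: value_counts ignores NaN, so shares are relative to non-null rows.
    freq = df[col].value_counts(normalize=True)
    # Levels below the threshold, minus any protected labels such as "Missing".
    rare = freq.index[freq < threshold].difference(list(keep))
    # isin is a vectorized membership test over the whole column;
    # where() keeps values that are not rare and replaces the rest.
    return df[col].where(~df[col].isin(rare), new_label)


# Small made-up example: "C" and "Missing" each make up 1% of the rows.
df = pd.DataFrame({"city": ["A"] * 60 + ["B"] * 38 + ["C", "Missing"]})
df["city"] = condense_levels(df, "city", threshold=0.05)
print(df["city"].value_counts().to_dict())
```

Unlike condenseMe, this sketch returns a new Series and leaves the assignment back to the dataframe to the caller, which keeps the function free of in-place side effects. Whether it is actually faster on 18 million rows would need to be measured on the real data.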

【Discussion】: