【Question title】: Market Basket Analysis
【Posted】: 2026-02-12 20:40:01
【Question description】:

I have the following pandas dataset of transactions for a retail shop:

print(df)

product       Date                   Assistant_name
product_1     2017-01-02 11:45:00    John
product_2     2017-01-02 11:45:00    John
product_3     2017-01-02 11:55:00    Mark
...

I would like to create the following dataset for market basket analysis:

product       Date                   Assistant_name  Invoice_number
product_1     2017-01-02 11:45:00    John            1
product_2     2017-01-02 11:45:00    John            1
product_3     2017-01-02 11:55:00    Mark            2
    ...

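In short, I assume that transactions sharing the same Assistant_name and Date belong to the same invoice, and that each new combination of the two starts a new invoice number.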

【Question comments】:

    Tags: python python-3.x pandas market-basket-analysis


    【Solution 1】:

    The simplest approach is to factorize the two columns joined together as strings:

    df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
    print (df)
         product                 Date Assistant_name  Invoice
    0  product_1  2017-01-02 11:45:00           John        1
    1  product_2  2017-01-02 11:45:00           John        1
    2  product_3  2017-01-02 11:55:00           Mark        2
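
    For reference, here is a minimal self-contained sketch of the same factorize idea. The sample frame is rebuilt from the three rows shown in the question, and the "|" separator between the two parts of the key is an addition of this sketch, not part of the answer above.

```python
import pandas as pd

# Sample rows copied from the question; the real dataset is assumed to have this shape.
df = pd.DataFrame({
    "product": ["product_1", "product_2", "product_3"],
    "Date": pd.to_datetime([
        "2017-01-02 11:45:00",
        "2017-01-02 11:45:00",
        "2017-01-02 11:55:00",
    ]),
    "Assistant_name": ["John", "John", "Mark"],
})

# Build one string key per row. The "|" separator is an assumption of this sketch;
# it only keeps the date part and the name part visibly apart.
key = df["Date"].astype(str) + "|" + df["Assistant_name"]

# pd.factorize labels distinct keys 0, 1, 2, ... in order of first appearance;
# adding 1 makes the invoice numbers start at 1.
df["Invoice"] = pd.factorize(key)[0] + 1
print(df)
```

    Rows that share a key receive the same number, so the two John rows at 11:45 end up on one invoice and the Mark row on the next.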
    

    If performance matters, use pd.lib.fast_zip (an internal helper exposed under pd.lib in older pandas releases):

    df['Invoice']=pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0]+1
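
    Recent pandas releases no longer expose pd.lib, so the line above may fail with an AttributeError there. One possible substitute that stays within the public API is groupby(...).ngroup(); note that this is a different technique from fast_zip, and the timings below say nothing about its speed. The sample frame is rebuilt from the question's rows.

```python
import pandas as pd

# Sample rows copied from the question; the real dataset is assumed to have this shape.
df = pd.DataFrame({
    "product": ["product_1", "product_2", "product_3"],
    "Date": pd.to_datetime([
        "2017-01-02 11:45:00",
        "2017-01-02 11:45:00",
        "2017-01-02 11:55:00",
    ]),
    "Assistant_name": ["John", "John", "Mark"],
})

# ngroup() numbers each (Date, Assistant_name) group starting from 0.
# sort=False numbers the groups in order of first appearance instead of sorted key order.
df["Invoice"] = df.groupby(["Date", "Assistant_name"], sort=False).ngroup() + 1
print(df)
```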
    

    Timings

    #[30000 rows x 3 columns]
    df = pd.concat([df] * 10000, ignore_index=True)
    
    In [178]: %%timeit
         ...: df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
         ...: df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
         ...: 
    9.16 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [179]: %%timeit
         ...: df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
         ...: 
    11.2 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [180]: %%timeit 
         ...: df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
         ...: 
    6.27 ms ± 93.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    【Comments】:

    • Where can I find the documentation for pd.lib.fast_zip?
    【Solution 2】:

    Using pandas categoricals is one approach:

    df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
    df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
    
    #               product      Date Assistant_name  Invoice
    # product_1  2017-01-02  11:45:00           John        1
    # product_2  2017-01-02  11:45:00           John        1
    # product_3  2017-01-02  11:55:00           Mark        2
    

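    A benefit of this approach is that you can easily retrieve a dictionary of the mapping (take it from the column of tuples, i.e. before that column is overwritten with the integer codes):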

    dict(enumerate(df['Invoice'].astype('category').cat.categories, 1))
    # {1: ('11:45:00', 'John'), 2: ('11:55:00', 'Mark')}
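
    The two steps can be combined in one self-contained sketch. It assumes the sample rows from the question and keeps the tuples in a separate column, here given the hypothetical name Invoice_key, so that the mapping can still be read after the codes are assigned.

```python
import pandas as pd

# Sample rows copied from the question; the real dataset is assumed to have this shape.
df = pd.DataFrame({
    "product": ["product_1", "product_2", "product_3"],
    "Date": pd.to_datetime([
        "2017-01-02 11:45:00",
        "2017-01-02 11:45:00",
        "2017-01-02 11:55:00",
    ]),
    "Assistant_name": ["John", "John", "Mark"],
})

# Keep the (Date, Assistant_name) tuples in their own column.
# "Invoice_key" is a hypothetical name introduced for this sketch.
df["Invoice_key"] = list(zip(df["Date"], df["Assistant_name"]))

# Convert the tuples to a categorical; each distinct tuple becomes one category.
keys = df["Invoice_key"].astype("category")

# Category codes start at 0, so add 1 to get invoice numbers starting at 1.
df["Invoice"] = keys.cat.codes + 1

# Map each invoice number back to the tuple it stands for.
mapping = dict(enumerate(keys.cat.categories, 1))
print(mapping)
```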
    

    【Comments】: