【Question title】: Pandas - Count the number of purchases for each customer for each specific product
【Posted】: 2022-01-11 02:00:22
【Question】:

Input data: a transaction history stored in JSON files, one JSON object per line:

{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"}
{"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"}
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"}

The DataFrame:

    customer_id                                             basket            date_of_purchase
0          C4               [{'product_id': 'P31', 'price': 26}]  2021-09-06 05:47:08.505909
1         C13              [{'product_id': 'P36', 'price': 566}]  2021-09-06 03:52:08.505909
2         C15              [{'product_id': 'P02', 'price': 839}]  2021-09-06 05:48:08.505909
3         C22             [{'product_id': 'P37', 'price': 1235}]  2021-09-05 20:52:08.505909
4         C27  [{'product_id': 'P57', 'price': 154}, {'produc...  2021-09-06 04:52:08.505909

My code for reading the JSON into a DataFrame:

import glob
import pandas

def read_json_folder(json_folder: str):
    transactions_files = glob.glob("{}*/*.json".format(json_folder))

    # Each file holds one JSON object per line, hence lines=True.
    return pandas.concat(pandas.read_json(tf, lines=True) for tf in transactions_files)

From these transactions, I need each customer ID together with the number of times that customer purchased each specific product.

Expected output:

customer_id product_id purchase_count
C1          P2         11
C1          P3         5    
C2          P9         7

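【Comments】:

  • Do you already have the JSON in your DataFrame?
  • @user17242583 Yes, it is already in the DataFrame.
  • How did you get it in there? Like this? pd.json_normalize(j, record_path='basket', meta='customer_id') (where j is a list of JSON objects)

Tags: python pandas dataframe data-science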


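【Solution 1】:
  1. Build the DataFrame from the data

    • call read_json with lines=True
    • explode the basket list so that each basket item gets its own row
    • split the product info into product_id and price columns
    • drop the columns that are no longer needed
  2. Build the result DataFrame from df

    • group and count
    • rename the count column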
>>> import io
>>> import pandas as pd
>>> TESTDATA = """
... {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"}
... {"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"}
... {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"}
... """
>>> # Wrap the literal in StringIO: newer pandas versions deprecate passing raw JSON strings.
>>> df = pd.read_json(io.StringIO(TESTDATA), lines=True)
>>> df = df.explode('basket')
>>> df[['product_id', 'price']] = df['basket'].apply(pd.Series)
>>> df.drop(['basket', 'price'], axis=1, inplace=True)
>>> df2 = df.groupby(['customer_id', 'product_id'], as_index=False).count()
>>> df2.rename(columns={'date_of_purchase': 'purchase_count'}, inplace=True)
>>> df2
  customer_id product_id purchase_count
0          C1         P3              2
1          C1         P4              2
2         C27        P42              1
3         C27        P47              1
4         C27        P57              1
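
The explode, group and count steps can also be written more compactly. The following is a minimal sketch, not the answer's own code: it assumes a shortened, made-up two-line sample (`SAMPLE`) and uses `groupby(...).size()`, which counts rows per group directly, so there is no leftover column to rename.

```python
import io

import pandas as pd

# Hypothetical two-line sample in JSON Lines form, for illustration only.
SAMPLE = """
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}]}
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}]}
"""

df = pd.read_json(io.StringIO(SAMPLE), lines=True)

# One row per basket item, then pull product_id out of each item dict.
items = df.explode("basket")
items["product_id"] = items["basket"].map(lambda item: item["product_id"])

# size() counts the rows in each (customer_id, product_id) group.
counts = (
    items.groupby(["customer_id", "product_id"])
    .size()
    .reset_index(name="purchase_count")
)
print(counts)
```

Because `size()` does not depend on any other column, this variant keeps working if `date_of_purchase` is missing or contains nulls, whereas `count()` only counts non-null values.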

【Comments】:

  • The third column should be purchase_count, not date_of_purchase
  • @Casper2210, I added a line that renames it
【Solution 2】:

If your DataFrame looks like this:

import pandas as pd

shop_list = [
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"},
{"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"},
{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"}
]

shop = pd.DataFrame(shop_list)

First, collect all the products each customer bought:

customer_groupby = shop.groupby('customer_id')['basket'].apply(list).to_dict()
for k in customer_groupby.keys():
  customer_groupby[k] = [item['product_id'] for sublist in customer_groupby[k] for item in sublist]

Output:
#{'C1': ['P3', 'P4', 'P3', 'P4'], 'C27': ['P57', 'P42', 'P47']}

Then build the result table:

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first.
rows = []
for customer, value in customer_groupby.items():
  for item in set(value):
    rows.append({'customer_id': customer, 'product_id': item, 'purchase_count': value.count(item)})
table = pd.DataFrame(rows, columns=['customer_id', 'product_id', 'purchase_count'])

Final result:
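
As a sketch of what these steps produce, the same counting can be done in a single pass with `collections.Counter`. This is a swapped-in technique rather than the answer's original loop. It reuses the sample records from above, and the sorting is an added assumption so that the row order is deterministic.

```python
from collections import Counter

import pandas as pd

shop_list = [
    {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"},
    {"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"},
    {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"},
]

# Count every (customer_id, product_id) pair across all baskets.
pair_counts = Counter(
    (order["customer_id"], item["product_id"])
    for order in shop_list
    for item in order["basket"]
)

# Sort the pairs so the rows come out in a stable order.
table = pd.DataFrame(
    [(customer, product, n) for (customer, product), n in sorted(pair_counts.items())],
    columns=["customer_id", "product_id", "purchase_count"],
)
print(table)
#   customer_id product_id  purchase_count
# 0          C1         P3               2
# 1          C1         P4               2
# 2         C27        P42               1
# 3         C27        P47               1
# 4         C27        P57               1
```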

【Comments】:

  • Does this solution answer your question? @Casper2210
【Solution 3】:

Try this:

purchase_counts = df.groupby(['customer_id', 'product_id'], as_index=False).count()

Output:

>>> purchase_counts
  customer_id product_id  price
0          C1         P3      2
1          C1         P4      2
2         C27        P42      1
3         C27        P47      1
4         C27        P57      1
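
The groupby above only works if `df` already has one row per purchased product, with `customer_id`, `product_id` and `price` columns. The sketch below shows one way to get that shape, assuming the records are available as a list of dicts (called `records` here, a hypothetical name) and flattening them with `pd.json_normalize`, as suggested in the question comments. The final rename of `price` to `purchase_count` is an addition for readability.

```python
import pandas as pd

# Sample records from the question, as a list of dicts (hypothetical variable name).
records = [
    {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-09-01 11:09:00"},
    {"customer_id": "C27", "basket": [{"product_id": "P57", "price": 154}, {"product_id": "P42", "price": 349}, {"product_id": "P47", "price": 180}], "date_of_purchase": "2021-09-06 04:52:08.505909"},
    {"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, {"product_id": "P4", "price": 121}], "date_of_purchase": "2018-10-01 11:09:00"},
]

# One row per basket item; customer_id is carried along as metadata.
df = pd.json_normalize(records, record_path="basket", meta=["customer_id"])

# count() tallies the non-null price values in each (customer_id, product_id) group.
purchase_counts = (
    df.groupby(["customer_id", "product_id"], as_index=False)
    .count()
    .rename(columns={"price": "purchase_count"})
)
print(purchase_counts)
```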

【Comments】:

  • If my code doesn't work for you, could you add a sample of your DataFrame to the question?