How to deal with large data in the Apriori algorithm? Answers

【Question title】: How to deal with large data in Apriori algorithm?
【Posted】: 2021-11-29 22:49:50
【Question description】:

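I want to analyze customer data from my e-shop using association rules. These are the steps I took:

First: my DataFrame raw_data has three columns, ["id_customer","id_product","product_quantity"], and contains 700,000 rows.

Second: I reshape the DataFrame and get one with 680,000 rows and 366 columns: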

customer = (
    raw_data.groupby(["id_customer", "id_product"])["product_quantity"]
    .sum()
    .unstack()
    .reset_index()
    .fillna(0)
    .set_index("id_customer")
)
customer[customer != 0] = 1

Finally: I want to compute the frequent itemsets:

from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(customer, min_support=0.00001, use_colnames=True)

But now I get an error: MemoryError: Unable to allocate 686. GiB for an array with shape (66795, 2, 689587) and data type float64
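The numbers in the error message can be checked by hand. The sketch below assumes the 366 columns mentioned in the question are all product columns; under that assumption, the first dimension of the array equals the number of possible 2-item combinations, and the total size follows from 8 bytes per float64 element.

```python
import math

n_items = 366                       # product columns, taken from the question
n_pairs = math.comb(n_items, 2)     # number of candidate 2-itemsets

shape = (66795, 2, 689587)          # shape reported in the MemoryError
n_bytes = math.prod(shape) * 8      # float64 takes 8 bytes per element
gib = n_bytes / 2**30

print(n_pairs)         # 66795, same as the first dimension of the array
print(round(gib, 1))   # 686.4
```

So the allocation appears to correspond to evaluating every candidate pair against every transaction at once, which is why such a low min_support is expensive here.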

How can I solve this? Or how can I compute frequent_itemsets without using the apriori function?
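For reference, itemsets of size one and two can be counted in a single pass with the standard library alone, without building a dense array. This is only a minimal sketch: the function name and the sample baskets are made up for illustration, and it stops at pairs rather than implementing the full Apriori algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets_up_to_pairs(transactions, min_support):
    """Return {itemset: support} for 1- and 2-itemsets meeting min_support."""
    n_transactions = 0
    counts = Counter()
    for items in transactions:
        n_transactions += 1
        unique = sorted(set(items))              # ignore repeated items in a basket
        counts.update((item,) for item in unique)
        counts.update(combinations(unique, 2))   # every pair in this basket
    return {
        itemset: count / n_transactions
        for itemset, count in counts.items()
        if count / n_transactions >= min_support
    }


# Hypothetical sample data: one tuple of products per customer
baskets = [
    ("bread", "milk"),
    ("bread", "butter", "milk"),
    ("butter",),
    ("bread", "milk"),
]
result = frequent_itemsets_up_to_pairs(baskets, min_support=0.5)
print(result)
```

Memory use grows with the number of distinct pairs that actually occur, not with the number of transactions.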

【Question discussion】:

Tags: python apriori

【Solution 1】:

If your data is too large to fit in memory, you can pass a function that returns a generator instead of a list.

from efficient_apriori import apriori as ap

def data_generator(df):
    """
    Data generator, needs to return a generator to be called several times.
    Use this approach if data is too large to fit in memory.
    """
    def data_gen():
        # Yield one transaction (a tuple of items) per row,
        # instead of a single list holding every row
        for row in df.itertuples(index=False, name=None):
            yield row

    return data_gen


# df is assumed to hold one transaction per row
transactions = data_generator(df)
itemsets, rules = ap(transactions, min_support=0.9, min_confidence=0.6)
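The reason for passing a function rather than a generator object is that Apriori scans the data more than once, and a generator can only be consumed once. The sketch below shows the same pattern with the standard library only, grouping long-format (id_customer, id_product) rows into one transaction per customer. The name transaction_factory and the sample rows are hypothetical, and the rows are assumed to be sorted by customer.

```python
from itertools import groupby
from operator import itemgetter


def transaction_factory(rows):
    """rows: (id_customer, id_product) pairs, sorted by id_customer."""
    def generate():
        # A fresh generator is created on every call, so each pass
        # over the data starts from the beginning.
        for _, group in groupby(rows, key=itemgetter(0)):
            yield tuple(sorted({product for _, product in group}))
    return generate


# Hypothetical sample rows in the question's long format
rows = [(1, "A"), (1, "B"), (2, "A"), (3, "B"), (3, "C"), (3, "B")]
make_transactions = transaction_factory(rows)

first_pass = list(make_transactions())
second_pass = list(make_transactions())
print(first_pass)
```

A callable like make_transactions is the kind of object the answer passes to ap in place of a list.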

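【Discussion】:

Hello, thank you for your reply. Could you correct your code? There are several errors that make it impossible to reproduce. For example: 1. apriori returns only one variable. 2. apriori has no "min_confidence" parameter. 3. The apriori function raises AttributeError: 'function' object has no attribute 'size'
I edited my answer, because I used efficient-apriori. The syntax is fine. pypi.org/project/efficient-apriori
Thank you for the effort, but your solution with efficient-apriori does not work for my case. It needs different input, produces different output, and so on. If possible, I would prefer a mlxtend.frequent_patterns solution, as I wrote in the question.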