Create a bigram from a column in a pandas df

【Question title】: create a bigram from a column in pandas df
【Posted】: 2017-03-28 09:54:39
【Question description】:

I have this test table in a pandas DataFrame

   Leaf_category_id  session_id  product_id
0               111           1         987
3               111           4         987
4               111           1         741
1               222           2         654
2               333           3         321

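This is an extension of my previous question, which @jazrael answered.

So, after collecting the values in the product_id column (this is only a hypothetical, and it differs slightly from the output of my previous question):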

   |product_id               |
   ---------------------------
   |111,987,741,34,12        |
   |987,1232                 |
   |654,12,324,465,342,324   |
   |321,741,987              |
   |324,654,862,467,243,754  |
   |6453,123,987,741,34,12   |

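and so on, I want to create a new column in which all the values of a row are turned into bigrams: each value is paired with the next one, and the last value, which has no successor, is paired with the first value of the row. For example: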

   |product_id               |Bigram
   -------------------------------------------------------------------------
   |111,987,741,34,12        |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
   |987,1232                 |(987,1232),(1232,987)
   |654,12,324,465,342,32    |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
   |321,741,987              |(321,741),**(741,987)**,(987,321)
   |324,654,862              |(324,654),(654,862),(862,324)
   |123,987,741,34,12        |(123,987),(987,741),(34,12),(12,123)

Ignore the ** (I will explain below why I starred them).
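To make the wrap-around rule concrete, here is a minimal sketch in plain Python. The helper name `circular_bigrams` is hypothetical and does not appear in the original post; it assumes each row has already been parsed into a list of ids.

```python
def circular_bigrams(ids):
    """Pair each id with the next one; the last id wraps around to the first."""
    if len(ids) < 2:
        return []  # a single id has nothing to pair with
    # ids[1:] + ids[:1] is the same list rotated left by one position
    return list(zip(ids, ids[1:] + ids[:1]))


row = [111, 987, 741, 34, 12]
print(circular_bigrams(row))
# [(111, 987), (987, 741), (741, 34), (34, 12), (12, 111)]
```

Note that the code posted in the question uses `zip(x, x[1:])`, which produces only the consecutive pairs and omits the final wrap-around pair.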

The code that produces the bigrams is

for i in df.Leaf_category_id.unique(): 
    print (df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())

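From this df, I want to take the Bigram column and build one more column called frequency, which gives the number of times each bigram occurs.

Note*: (987,741) and (741,987) are treated as the same bigram, so the duplicate entry should be removed and the frequency of (987,741) should be 2. The same applies to (34,12): it occurs twice, so its frequency should be 2.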

   |Bigram
   ---------------
   |(111,987),
   |**(987,741)**
   |(741,34)
   |(34,12)
   |(12,111)
   |**(741,987)**
   |(987,321)
   |(34,12)
   |(12,123)

The final result should be:

   |Bigram       | frequency |
   --------------------------
   |(111,987)    |  1 
   |(987,741)    |  2
   |(741,34)     |  1
   |(34,12)      |  2
   |(12,111)     |  1
   |(987,321)    |  1
   |(12,123)     |  1
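The counting rule described above (a pair and its reverse count as one bigram) can be sketched with `collections.Counter`. This is only an illustration on the sample bigrams listed in the question, and `normalize` is a hypothetical helper. Because each pair is sorted before counting, (987,741) shows up under the key (741, 987).

```python
from collections import Counter

# Sample bigrams copied from the question's intermediate table
bigrams = [(111, 987), (987, 741), (741, 34), (34, 12), (12, 111),
           (741, 987), (987, 321), (34, 12), (12, 123)]


def normalize(pair):
    """Sort the pair so that (a, b) and (b, a) map to the same key."""
    return tuple(sorted(pair))


counts = Counter(normalize(p) for p in bigrams)

print(counts[(741, 987)])  # 2, covers both (987, 741) and (741, 987)
print(counts[(12, 34)])    # 2
```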

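I hope to find an answer here. Please help me; I have been as detailed as I can.

【Comments】:

How do you want to store the frequency? Within one row the Bigram column will contain several tuples, so there will be several frequencies.
@James: each tuple in a row should become a new row, as shown in the second-to-last table. Then, if there are duplicates, the frequency should change accordingly, as I mentioned.
So Bigram and frequency are in a separate data frame?
@James: the df contains only the bigrams, which you get from the code I posted. I want to create a new column called frequency that counts the occurrences of each bigram.
@jezrael could you take a look at this question?

Tags: python python-2.7 python-3.x pandas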

【Solution 1】:

Try this code

from itertools import combinations
import pandas as pd

df = pd.read_csv("data.csv", index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
#consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x,x[1:])]).reset_index()

df1=pd.DataFrame(grouped_consecutive_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]

For combinations (all possible bigrams within a group)

from itertools import combinations
import pandas as pd

df = pd.read_csv("data.csv", index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
#combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)]).reset_index()

df1=pd.DataFrame(grouped_combination_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]
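As a possible alternative to the `apply(pd.Series)` / `unstack` step, recent pandas versions (0.25 and later) provide `Series.explode`, which turns a column of lists into one element per row. The sketch below is not part of the original answer: the inline DataFrame is made-up sample data, and `consecutive_pairs` is a hypothetical helper.

```python
from collections import Counter

import pandas as pd

# Small made-up sample in the same shape as the question's table
df = pd.DataFrame({
    "Leaf_category_id": [111, 111, 111, 222, 222, 222, 333],
    "session_id":       [1, 1, 1, 2, 2, 2, 3],
    "product_id":       [987, 741, 34, 741, 987, 12, 321],
})


def consecutive_pairs(ids):
    """Return the sorted consecutive pairs of one group's product ids."""
    ids = list(ids)
    return [tuple(sorted(p)) for p in zip(ids, ids[1:])]


pairs = (
    df.groupby(["Leaf_category_id", "session_id"])["product_id"]
      .apply(consecutive_pairs)  # one list of bigrams per group
      .explode()                 # one bigram per row
      .dropna()                  # groups with a single product yield no bigram
)

# Count each distinct bigram and put the result in a DataFrame
counts = Counter(pairs)
bigram_frequency = pd.DataFrame(sorted(counts.items()), columns=["Bigram", "freq"])
print(bigram_frequency)
```

Swapping `zip(ids, ids[1:])` for `itertools.combinations(ids, 2)` in the helper would give the all-pairs variant instead.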

where data.csv contains

Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12

The resulting bigram_frequency_consecutive will be

         Bigram  freq
0      (12, 34)     2
1     (12, 324)     1
2     (12, 654)     1
3     (12, 987)     1
4     (32, 342)     1
5      (34, 87)     1
6     (34, 741)     1
7     (87, 741)     1
8    (111, 741)     1
9    (123, 987)     1
10   (321, 321)     1
11   (321, 741)     1
12   (324, 465)     1
13   (324, 654)     1
14   (324, 987)     1
15   (342, 465)     1
16   (654, 862)     1
17   (741, 987)     2
18  (987, 1232)     1

The resulting bigram_frequency_combinations will be

           Bigram  freq
0      (12, 32)     1
1      (12, 34)     2
2      (12, 87)     1
3     (12, 111)     1
4     (12, 123)     1
5     (12, 324)     1
6     (12, 342)     1
7     (12, 465)     1
8     (12, 654)     1
9     (12, 741)     2
10    (12, 987)     2
11    (32, 324)     1
12    (32, 342)     1
13    (32, 465)     1
14    (32, 654)     1
15     (34, 87)     1
16    (34, 111)     1
17    (34, 123)     1
18    (34, 741)     2
19    (34, 987)     2
20    (87, 111)     1
21    (87, 741)     1
22    (87, 987)     1
23   (111, 741)     1
24   (111, 987)     1
25   (123, 741)     1
26   (123, 987)     1
27   (321, 321)     1
28   (321, 324)     2
29   (321, 654)     2
30   (321, 741)     2
31   (321, 862)     2
32   (321, 987)     2
33   (324, 342)     1
34   (324, 465)     1
35   (324, 654)     2
36   (324, 741)     1
37   (324, 862)     1
38   (324, 987)     1
39   (342, 465)     1
40   (342, 654)     1
41   (465, 654)     1
42   (654, 741)     1
43   (654, 862)     1
44   (654, 987)     1
45   (741, 862)     1
46   (741, 987)     3
47   (862, 987)     1
48  (987, 1232)     1

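In both cases above, the data is grouped by both Leaf_category_id and session_id.

【Discussion】:

Very nice answer, +1
@Mr. A what is the difference between bigram_frequency_consecutive and bigram_frequency_combinations?
In bigram_frequency_consecutive, if a group has product ids [27,35,99], you get the bigrams [(27,35),(35,99)], whereas the bigrams formed by combinations are [(27,35),(27,99),(35,99)]. If you are doing any kind of product purchase analysis, you should use the combination bigrams. Since I don't know your exact use case, I provided two solutions: the first follows the code snippet you gave, and the second is the one most likely to be needed.
@Mr.A I can see that you used a loop in your code. My dataset is very large (> 1 GB), so a loop will cost me a lot of computation time. Can I define a function and get the same result?
@SRingne I have made the changes you asked for, please check.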
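The difference between the two variants discussed in the comments above can be checked directly on the [27,35,99] example. This is a small standalone illustration, not code from the thread.

```python
from itertools import combinations

ids = [27, 35, 99]

# Consecutive bigrams: each id paired with its immediate successor
consecutive = list(zip(ids, ids[1:]))

# Combination bigrams: every unordered pair of ids in the group
all_pairs = list(combinations(ids, 2))

print(consecutive)  # [(27, 35), (35, 99)]
print(all_pairs)    # [(27, 35), (27, 99), (35, 99)]
```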

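【Solution 2】:

We will pull the values out of product_id, build bigrams that are sorted (and therefore deduplicated with respect to order), count them to get the frequencies, and then populate a data frame.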

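from collections import Counter

import pandas as pd

# assuming your data frame is called 'df' and each product_id entry is a list of ids

# consecutive pairs within each row
bigrams = [list(zip(x, x[1:])) for x in df.product_id.values.tolist()]
# sort each pair so that (987, 741) and (741, 987) count as the same bigram
bigram_set = [tuple(sorted(xx)) for x in bigrams for xx in x]
freq_dict = Counter(bigram_set)
# one row per distinct bigram together with its count
df_freq = pd.DataFrame(list(freq_dict.items()), columns=['bigram', 'freq'])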
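The snippet above expects every product_id entry to already be a list of ids. If the column instead holds comma-separated strings, as in the second table of the question, the values would need to be split first. The following is a minimal sketch under that assumption; the sample DataFrame is made up for illustration.

```python
from collections import Counter

import pandas as pd

# Assumption: product_id is stored as comma-separated strings
df = pd.DataFrame({"product_id": ["111,987,741,34,12", "987,1232", "321,741,987"]})

# Turn each string into a list of integer ids
id_lists = df["product_id"].str.split(",").apply(lambda parts: [int(p) for p in parts])

# Sorted consecutive pairs across all rows
bigrams = [tuple(sorted(pair)) for ids in id_lists for pair in zip(ids, ids[1:])]
freq = Counter(bigrams)

df_freq = pd.DataFrame(list(freq.items()), columns=["bigram", "freq"])
print(df_freq)
```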

【Discussion】:

When I run **freq_dict = Counter(bigram_set)** I get this error: unhashable type: 'list'
The tuple function should have taken care of that
type(bigram_set) = list.
freq_dict = Counter(bigram_set) gives me the error: unhashable type: 'list'
Also, your code only runs for a single Leaf_category_id. I got a solution from @jazrael that works for each Leaf_category_id 'i'; can you adapt your code to it: for i in df.Leaf_category_id.unique():print (df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())