Convert a two-column data frame to an occurrence matrix in pandas: answers

【Question title】: Convert Two column data frame to occurrence matrix in pandas
【Posted】: 2015-07-20 14:19:54
【Question description】:

Hi all, I have a CSV file that contains data in the following format:

A   a
A   b
B   f
B   g
B   e
B   h
C   d
C   e
C   f

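The first column contains items, and the second column contains the features each item has, drawn from the feature vector = [a,b,c,d,e,f,g,h]. I want to convert this into an occurrence matrix like the one below: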

    a,b,c,d,e,f,g,h
A   1,1,0,0,0,0,0,0
B   0,0,0,0,1,1,1,1
C   0,0,0,1,1,1,0,0

Can anyone tell me how to do this using pandas?

【Question discussion】:

Tags: python pandas sparse-matrix

【Solution 1】:

Here is another way, using pd.get_dummies().

import pandas as pd

# your data
# =======================
df

  col1 col2
0    A    a
1    A    b
2    B    f
3    B    g
4    B    e
5    B    h
6    C    d
7    C    e
8    C    f

# processing
# ===================================
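# dtype=int keeps the 0/1 output shown below on recent pandas versions,
# and .max() replaces apply(max), which relied on pandas remapping the builtin
pd.get_dummies(df.col2, dtype=int).groupby(df.col1).max()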

      a  b  d  e  f  g  h
col1                     
A     1  1  0  0  0  0  0
B     0  0  0  1  1  1  1
C     0  0  1  1  1  0  0
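Note that the output above has no column for c, because no item has that feature, while the matrix requested in the question does. A minimal sketch of one way to add it, assuming the full feature vector is known in advance (the names `features` and `occurrence` are illustrative and not part of the original answer):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": list("AABBBBCCC"),
    "col2": list("abfgehdef"),
})

# Full feature vector from the question (assumed to be known up front)
features = list("abcdefgh")

occurrence = (
    pd.get_dummies(df["col2"], dtype=int)      # one indicator column per observed feature
    .groupby(df["col1"])                       # collapse rows per item
    .max()                                     # 1 if the feature occurs at least once
    .reindex(columns=features, fill_value=0)   # add unobserved features (here: c) as zeros
)
print(occurrence)
```

The reindex step also fixes the column order, which helps when several files must produce matrices with identical layouts.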

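【Discussion】:

Shouldn't the apply function use "sum" so that multiple occurrences of a particular pair are counted? It would also give the correct answer in the case above.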
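The difference between the two aggregations only shows up when a pair is repeated. A small sketch with made-up data (the frame `dup` is hypothetical and not from the question), where the pair (A, a) appears twice:

```python
import pandas as pd

# Hypothetical data: the pair (A, a) occurs twice
dup = pd.DataFrame({
    "col1": ["A", "A", "A", "B"],
    "col2": ["a", "a", "b", "a"],
})

dummies = pd.get_dummies(dup["col2"], dtype=int)

# max: binary presence/absence, duplicates collapse to 1
presence = dummies.groupby(dup["col1"]).max()

# sum: occurrence counts, duplicates are added up
counts = dummies.groupby(dup["col1"]).sum()

print(presence)
print(counts)
```

For the question's data every pair is unique, so both aggregations return the same matrix; which one to prefer depends on whether counts or presence flags are wanted.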

【Solution 2】:

It is unclear whether your data contains a typo, but you can use crosstab for this:

In [95]:
pd.crosstab(index=df['A'], columns = df['a'])

Out[95]:
a  b  d  e  f  g  h
A                  
A  1  0  0  0  0  0
B  0  0  1  1  1  1
C  0  1  1  1  0  0

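In your sample data, the value a in the second column was taken as the column name, but in your expected output it appears as one of the column values.

Edit

OK, I fixed your input data so that it produces the correct result: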

In [98]:
import pandas as pd
import io
t="""A   a
A   b
B   f
B   g
B   e
B   h
C   d
C   e
C   f"""
# raw string avoids an invalid-escape warning for \s
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, names=['A','a'])
df

Out[98]:
   A  a
0  A  a
1  A  b
2  B  f
3  B  g
4  B  e
5  B  h
6  C  d
7  C  e
8  C  f

In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct

Out[99]:
a  a  b  d  e  f  g  h
A                     
A  1  1  0  0  0  0  0
B  0  0  0  1  1  1  1
C  0  0  1  1  1  0  0

【Discussion】:

【Solution 3】:

This approach produces the same result faster, as a scipy sparse COO matrix:

from scipy import sparse

df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
                               (df.col1.cat.codes,
                                df.col2.cat.codes)))
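The sparse matrix itself carries no row or column labels; those live in the category lists. A minimal sketch of how the labels could be recovered to check the result against the dense answers, assuming the data is small enough to densify (the names `rows`, `cols` and `dense` are illustrative and not part of the original answer):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sample data from the question
df = pd.DataFrame({
    "col1": list("AABBBBCCC"),
    "col2": list("abfgehdef"),
})

rows = df["col1"].astype("category")
cols = df["col2"].astype("category")

# One entry of 1.0 per (item, feature) row; an explicit shape keeps the
# matrix dimensions tied to the number of categories
user_items = sparse.coo_matrix(
    (np.ones(len(df)), (rows.cat.codes, cols.cat.codes)),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# Densify only for inspection; category order gives the row/column labels
dense = pd.DataFrame(
    user_items.toarray().astype(int),
    index=rows.cat.categories,
    columns=cols.cat.categories,
)
print(dense)
```

Duplicate (item, feature) pairs are summed when a COO matrix is converted, so this yields counts rather than presence flags if the input contains repeats.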

【Discussion】: