Reformatting dataset for Sankey in Pandas

【Question title】: Reformatting dataset for Sankey in Pandas
【Posted】: 2021-02-19 22:19:04
【Question description】:

My data is in a melted (long-format) Pandas dataframe (code for the data is below):

student	course	order
Jerry	A	1
Jerry	B	2
Jerry	C	NaN
Jessy	C	1
Jessy	A	2
Jessy	B	3
Raphael	A	1
Raphael	C	2
Raphael	C	3
Raphael	B	4
Sally	A	1
Sally	B	2
Sally	C	NaN

Sankey needs a format like this:

course1	course2	course3	course4	count
A	B			2
A	C	C	B	1
C	A	B		1

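I can't figure out how to create a column for each level of order and fill it with the values of course, while also creating a count column that counts the number of students who took the same courses in the same order.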

If I try df.groupby('order')['course'].count(), it returns the groups as rows, not as the columns I need.

order
1.0    2682
2.0     578
3.0     197
4.0      89
5.0      27
6.0       8
7.0       1
Name: course, dtype: int64

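It also doesn't create the set of sequences needed to populate the final table.

Can someone help me reformat my long table into a table with counts of all the course sequences?

Any help is greatly appreciated.

Toy data: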

import numpy as np
import pandas as pd

student = ['Jerry','Jerry','Jerry','Jessy','Jessy','Jessy','Raphael','Raphael','Raphael','Raphael','Sally','Sally','Sally']
course = ['A','B','C','C','A','B','A','C','C','B','A','B','C']
# np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
order = [1,2,np.nan,1,2,3,1,2,3,4,1,2,np.nan]
df = pd.DataFrame({'student':student, 'course':course,'order':order})

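【Question comments】:

Just to make sure I'm following: I think of a Sankey as a set of source, target, value, similar to the Plotly setup at plotly.com/python/sankey-diagram . So, do you want the total number of people in each course across the 4 time periods?
@Docuemada: That's a good point. You're exactly right: getting the data into source-target-count form is what plotly needs. The code in the Medium article (medium.com/kenlok/…) can create that format from the table I'm asking for, so that's the direction I was heading. If you know how to do it, I'd gladly take help getting to the source-target-count format.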
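The source-target-count format discussed in these comments can also be built directly from the long table. The following is a minimal sketch, not the asker's or the Medium article's method: it assumes each student's courses are ordered by the order column, and the names steps and links are made up for illustration.

```python
import numpy as np
import pandas as pd

student = ['Jerry', 'Jerry', 'Jerry', 'Jessy', 'Jessy', 'Jessy',
           'Raphael', 'Raphael', 'Raphael', 'Raphael',
           'Sally', 'Sally', 'Sally']
course = ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C']
order = [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# Keep only rows with a known position and put each student's courses in sequence.
steps = df.dropna(subset=['order']).sort_values(['student', 'order'])

# The target of each step is the same student's next course
# (missing for the student's last course).
steps = steps.assign(
    source=steps['course'],
    target=steps.groupby('student')['course'].shift(-1),
)

# Count how many students made each source -> target transition.
links = (
    steps.dropna(subset=['target'])
         .groupby(['source', 'target'])
         .size()
         .reset_index(name='count')
)
print(links)
```

Note that this sketch pools transitions regardless of where they occur in the sequence, so A to B as a first step and A to B as a later step are counted together. If the stages should stay separate in the diagram, one option is to add the order value to the grouping keys or to the node labels.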

Tags: python pandas dataframe pandas-groupby sankey-diagram

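【Solution 1】:

There may be a way to do this in fewer steps, but I came up with the following process.

Drop the NaN values and add a column of position names (course1, course2, ...).
Pivot to wide format by position name.
Concatenate all the course names into a single string.
Aggregate by the combined course string.
Merge the original dataframe with the aggregated dataframe.
Drop duplicate rows and rename the columns.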

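# 1. Drop rows with no order and label each position course1, course2, ...
df.dropna(axis=0, how='any', inplace=True)
df['course_gp'] = df['order'].apply(lambda x: 'course' + str(int(x)))
# 2. Pivot to wide format: one row per student, one column per position
df = df.pivot(index='student', columns='course_gp', values='course')
df.fillna('', inplace=True)
# 3. Concatenate the courses into one string that identifies the sequence
df['course_all'] = df['course1'] + df['course2'] + df['course3'] + df['course4']
# 4. Count the students per sequence
dfc = df.groupby('course_all').count()
# 5. Merge the counts back onto the wide table
df = df.merge(dfc[['course1']], left_on='course_all', right_on='course_all', how='inner')
# 6. Drop duplicate sequences and rename the count column
df.drop_duplicates(keep='first', inplace=True)
df.rename({'course1_y':'count','course1_x':'course1'}, axis=1, inplace=True)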

	course1	course2	course3	course4	course_all	count
0	A	B			AB	2
2	C	A	B		CAB	1
3	A	C	C	B	ACCB	1
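A more compact variant is possible by grouping on the course columns themselves rather than on a concatenated string. This is only a sketch of an alternative, not the answer's own code: the names wide, course_cols and result are illustrative, and it assumes each (student, order) pair occurs at most once. Grouping on the separate columns also avoids ambiguity if course names are longer than one character (for example 'AB' + 'C' versus 'A' + 'BC').

```python
import numpy as np
import pandas as pd

student = ['Jerry', 'Jerry', 'Jerry', 'Jessy', 'Jessy', 'Jessy',
           'Raphael', 'Raphael', 'Raphael', 'Raphael',
           'Sally', 'Sally', 'Sally']
course = ['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'C']
order = [1, 2, np.nan, 1, 2, 3, 1, 2, 3, 4, 1, 2, np.nan]
df = pd.DataFrame({'student': student, 'course': course, 'order': order})

# One row per student, one column per position (course1, course2, ...).
# Pivoting on the integer order keeps the columns in numeric order.
wide = (
    df.dropna(subset=['order'])
      .astype({'order': int})
      .pivot(index='student', columns='order', values='course')
      .add_prefix('course')
      .fillna('')
)

# Students with an identical row took the same courses in the same order,
# so the size of each group is the count for that sequence.
course_cols = list(wide.columns)
result = wide.groupby(course_cols).size().reset_index(name='count')
print(result)
```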

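【Comments】:

pivot sets the index to email and returns this error: ValueError: Index contains duplicate entries, cannot reshape
I think pivot raises an error when there are duplicate index entries, so you will need to adapt the code to the data you actually want to process. Please add some sample data showing the actual data structure.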
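Since the asker's real data was never posted, the following is only a guess at what triggers the error: pivot raises it when the same (index, columns) pair appears more than once, for example a student with two rows sharing one order value. The data below is hypothetical. The sketch first lists the offending rows, then shows one possible workaround, renumbering the courses within each student. Whether that workaround suits the real data depends on what the duplicates mean.

```python
import pandas as pd

# Hypothetical data: Jerry has two rows with order == 1,
# which would make pivot raise "Index contains duplicate entries".
df = pd.DataFrame({
    'student': ['Jerry', 'Jerry', 'Jerry', 'Sally', 'Sally'],
    'course':  ['A', 'B', 'C', 'A', 'B'],
    'order':   [1, 1, 2, 1, 2],
})

# Show every row whose (student, order) pair occurs more than once.
dupes = df[df.duplicated(subset=['student', 'order'], keep=False)]
print(dupes)

# One possible workaround: sort by student and order, then renumber
# each student's courses 1, 2, 3, ... so every pair is unique.
df = df.sort_values(['student', 'order'])
df['order'] = df.groupby('student').cumcount() + 1

wide = (
    df.pivot(index='student', columns='order', values='course')
      .add_prefix('course')
      .fillna('')
)
print(wide)
```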