Plotting three dimensions of categorical data in Python

【Question title】: Plotting three dimensions of categorical data in Python
【Posted】: 2019-10-09 11:32:05
【Question】:

My data contains three categorical variables that I am trying to visualize:

City (one of five)
Occupation (one of four)
Blood type (one of four)

So far, I have managed to group the data in a way that I think will be easy to work with:

import numpy as np, pandas as pd

# Make data
cities = ['Tijuana','Las Vegas','Los Angeles','Anaheim','Atlantis']
occupations = ['Doctor','Lawyer','Engineer','Drone security officer']
bloodtypes = ['A','B','AB','O']
df = pd.DataFrame({'City': np.random.choice(cities,500),
                   'Occupation': np.random.choice(occupations,500),
                   'Blood Type':np.random.choice(bloodtypes,500)})

# You need to make a dummy column, otherwise the groupby returns an empty df
df['Dummy'] = np.ones(500)

# This is now what I'd like to plot
df.groupby(by=['City','Occupation','Blood Type']).count().unstack(level=1)

                       Dummy
Occupation             Doctor Drone security officer Engineer Lawyer
City        Blood Type
Anaheim     A               7                      7        7      7
            AB              6                     10        8      5
            B               2                     10        4      2
            O               4                      3        3      6
Atlantis    A               6                      5        5      7
            AB             12                      7        7     10
            B               7                      4        7      3
            O               7                      4        6      4
Las Vegas   A               8                      4        8      5
            AB              5                      6        8      9
            B               6                     10        6      6
            O               6                      9        5      9
Los Angeles A               7                      4        8      8
            AB              9                      8        8      8
            B               3                      6        4      1
            O               9                     11       11      9
Tijuana     A               3                      4        5      3
            AB              9                      5        5      7
            B               3                      6        4      9
            O               3                      5        5      8
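
As a side note on the grouping step above, the helper column may not be needed: `GroupBy.size()` counts rows per group directly. The following is a minimal sketch under that assumption, using a seeded generator (the seed and the `counts` name are illustrative choices, not part of the original post) so the table is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made only so the sketch is repeatable
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# size() counts the rows in each group, so no dummy column is required;
# fill_value=0 covers any combination that happens not to occur
counts = (df.groupby(['City', 'Occupation', 'Blood Type'])
            .size()
            .unstack(level=1, fill_value=0))
print(counts)
```

The result has the same layout as the table above (City and Blood Type in the index, Occupation across the columns), just without the extra `Dummy` column level.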

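My goal is to create something like the Seaborn swarmplot example from the Seaborn documentation. Seaborn applies jitter to the quantitative data so that you can see the individual data points and their hues.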

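With my data, I'd like to plot City on the x-axis and Occupation on the y-axis, apply jitter to each, and then color the points by Blood Type. However, sns.swarmplot requires one of the axes to be quantitative: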

sns.swarmplot(data=df,x='City',y='Occupation',hue='Blood Type')

This raises an error.
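
One possible way around this restriction is to do the jitter by hand: map each category to an integer code, add a small random offset on both axes, and draw one scatter call per blood type. The sketch below assumes that approach. `jittered_scatter` and its `spread` parameter are hypothetical names introduced here, and this is plain uniform jitter, not Seaborn's non-overlapping swarm layout.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seeded generator: an assumption made only so the sketch is repeatable
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})


def jittered_scatter(data, x, y, hue, spread=0.35):
    """Scatter two categorical columns with uniform jitter, colored by a third."""
    xcat = pd.Categorical(data[x])
    ycat = pd.Categorical(data[y])
    # Integer codes put each category on a grid point; jitter spreads points around it
    xs = xcat.codes + rng.uniform(-spread, spread, len(data))
    ys = ycat.codes + rng.uniform(-spread, spread, len(data))

    fig, ax = plt.subplots(figsize=(10, 6))
    for level in sorted(data[hue].unique()):
        mask = (data[hue] == level).to_numpy()
        ax.scatter(xs[mask], ys[mask], s=15, label=level)

    # Put the category names back on the integer positions
    ax.set_xticks(range(len(xcat.categories)))
    ax.set_xticklabels(xcat.categories)
    ax.set_yticks(range(len(ycat.categories)))
    ax.set_yticklabels(ycat.categories)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.legend(title=hue, bbox_to_anchor=(1.02, 1), loc='upper left')
    return fig, ax


fig, ax = jittered_scatter(df, 'City', 'Occupation', 'Blood Type')
```

With 500 points spread over 20 cells the clusters get dense, so a smaller marker size or a lower `spread` may be needed to keep neighboring cells visually separate.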

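An acceptable alternative might be to create 20 categorical bar plots, one for each intersection of City and Occupation. I would build them by looping over each category, but I can't work out how to feed the result to matplotlib subplots so that they end up in a 4x5 grid.

The most similar question I could find was in R, and the asker only wanted to indicate the most common value of the third variable, so I didn't get any good ideas from it.

Thanks for any help you can provide.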

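【Comments】:

Tags: python pandas seaborn

【Solution 1】:

Well, I got to work on the "acceptable alternative" today, and I found a solution that uses basically pure matplotlib (though I put the Seaborn styling on top of it, just because).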

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import get_cmap  # matplotlib.cm.get_cmap was removed in Matplotlib 3.9
from matplotlib.patches import Patch
import seaborn as sns

# Make data
cities = ['Tijuana','Las Vegas','Los Angeles','Anaheim','Atlantis']
occupations = ['Doctor','Lawyer','Engineer','Drone security officer']
bloodtypes = ['A','B','AB','O']
df = pd.DataFrame({'City': np.random.choice(cities,500),
                   'Occupation': np.random.choice(occupations,500),
                   'Blood Type':np.random.choice(bloodtypes,500)})

# Make a dummy column, otherwise the groupby returns an empty df
df['Dummy'] = np.ones(500)

# This is now what I'd like to plot
grouped = df.groupby(by=['City','Occupation','Blood Type']).count().unstack()

# List of blood types, to use later as categories in subplots
kinds = grouped.columns.levels[1]

# colors for bar graph
colors = [get_cmap('viridis')(v) for v in np.linspace(0,1,len(kinds))]

sns.set(context="talk")
nxplots = len(grouped.index.levels[0])
nyplots = len(grouped.index.levels[1])
fig, axes = plt.subplots(nxplots,
                         nyplots,
                         sharey=True,
                         sharex=True,
                         figsize=(10,12))

fig.suptitle('City, occupation, and blood type')

# plot the data
for a, b in enumerate(grouped.index.levels[0]):
    for i, j in enumerate(grouped.index.levels[1]):
        axes[a,i].bar(kinds,grouped.loc[b,j],color=colors)
        axes[a,i].xaxis.set_ticks([])

axeslabels = fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False)
plt.grid(False)
axeslabels.set_ylabel('City',rotation='horizontal',y=1,weight="bold")
axeslabels.set_xlabel('Occupation',weight="bold")

# x- and y-axis labels
for i, j in enumerate(grouped.index.levels[1]):
    axes[-1,i].set_xlabel(j)  # bottom row; indexing with nyplots only worked because the grid is 5x4
for i, j in enumerate(grouped.index.levels[0]):
    axes[i,0].set_ylabel(j)

# Tune this manually to make room for the legend
fig.subplots_adjust(right=0.82)

fig.legend([Patch(facecolor = i) for i in colors],
           kinds,
           title="Blood type",
           loc="center right")
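
For comparison, the same grid of count plots can probably be produced with much less code through Seaborn's `catplot`, which builds the facet grid itself. This is a hedged sketch, not the approach used above: the seeded data, the `height` and `aspect` values, and the title template are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Seeded generator: an assumption made only so the sketch is repeatable
rng = np.random.default_rng(0)
cities = ['Tijuana', 'Las Vegas', 'Los Angeles', 'Anaheim', 'Atlantis']
occupations = ['Doctor', 'Lawyer', 'Engineer', 'Drone security officer']
bloodtypes = ['A', 'B', 'AB', 'O']
df = pd.DataFrame({'City': rng.choice(cities, 500),
                   'Occupation': rng.choice(occupations, 500),
                   'Blood Type': rng.choice(bloodtypes, 500)})

# One count plot per (City, Occupation) cell: rows are cities, columns are occupations
g = sns.catplot(data=df, x='Blood Type', row='City', col='Occupation',
                kind='count', order=bloodtypes, height=2, aspect=1.2)
g.set_titles('{row_name} | {col_name}')
```

`catplot` returns a `FacetGrid`, so the individual axes remain available through `g.axes` if further matplotlib tweaks are wanted.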

If anyone can offer the preferred solution (the jittered plot), I would still appreciate it.

【Comments】: