Selecting Specific Data to Sum and Plot: Answers

【Question title】: Selecting Specific Data to Sum and Plot
【Posted】: 2019-05-06 17:25:03
【Question description】:

This is some of the data located in the Excel sheet.

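I want to select the musical shows (called "ID" in the code) whose cast had more minority members than Caucasians. Once these are identified, I want to put the information for the selected shows into a new data frame that holds only those shows, because it will be easier to manipulate. In the new data frame, I want the related cast ethnicities on the same row so that they can be compared with the audience ethnicities. I am then trying to plot this information.

In general, I want to add up the values in a given row only if that row meets a particular criterion. All of the data used in this project lives in one Excel sheet, which is converted to CSV and loaded as a data frame. I then want to plot the cast values in full and compare the ethnicity of the cast with the ethnicity of the audience.

I am using Python. I tried to remove the unneeded data by selecting columns with an if statement, so that the data frame contains only shows with more minority cast members than Caucasians, and then I tried to use that information in the plot. I am not sure whether I have to filter out all the unneeded columns if I do not use them in the calculation.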

import numpy as np
import pandas as pd
#first need to import numpy so that calculations can be made

from google.colab import files
uploaded = files.upload()
# df = pd.read_csv('/content/drive/My Drive/allTheaterDataV2.csv')

import io
df = pd.read_csv(io.BytesIO(uploaded['allTheaterDataV2.csv']))
# need to download excel sheet as csv and then upload into colab so that it can
# be manipulated as a dataframe 

# want to select shows(ID) that had more minorities than Caucasians in the cast
# once determined, the selected shows should be placed into a new data frame that 
# will only hold the shows and the related ethnicity, and compared to audience ethnicity
# this information should then be plotted 

# first we will determine the shows that have a majority ethnic cast

minorcal = list(df)
minorcal.remove('CAU')
minoritycastSUM = df[minorcal].sum(axis=1)

# print(minorcal)

# next, we determine how many people in the cast were Caucasian, so remove all others

caucasiancal = list(df)
# i first wanted to do caucasiancal.remove('AFRAM', 'ASIAM', 'LAT', 'OTH')
# but got the statement I could only have 1 argument so i just put each on their own line
caucasiancal.remove('AFRAM')
caucasiancal.remove('ASIAM')
caucasiancal.remove('LAT')
caucasiancal.remove('OTH')
idrowcaucal = df[caucasiancal].sum(axis=1)

minoritycompare = old.filter(['idrowcaucal','minoritycastSUM'])
print(minoritycompare)

# now compare the two values per line
if minoritycastSUM < caucasiancal:
  minoritydf = pd.df.minorcal.append()
  # plot new data frame per each show and compare to audience ethnicity
  df.plot(x=['AFRAM', 'ASIAM', 'CAU', 'LAT', 'OTH', 'WHT', 'BLK', 'ASN', 'HSP', 'MRO'], y = [''])
             # i am unsure how to call the specific value for each column
  plt.title('ID Ethnicity Comparison')
             # i am unsure how to call the specific show so that only one show is per plot so for now i just subbed in 'ID' 
  plt.xlabel('Ethnicity comparison')
  plt.ylabel('Number of Cast Members/Audience Members')
  plt.show()

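I expected to see a data frame containing only the shows that meet the criterion, followed by a plot for each of those shows. Right now I am getting errors about how I build the new data frame, and Python says the if statement cannot be used. [2]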

【Question comments】:

Tags: python pandas numpy dataframe

【Solution 1】:

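First of all, this will not be a complete answer, because:

I don't know what you imagine your final plot to look like.
I don't know what the columns in your DataFrame are (consider using more descriptive column labels, e.g. "caucasian actors" instead of "CAU"...).
It is not clear to me whether your data can show any trend, since the screenshot you posted shows the same audience composition for the first movies.

Nevertheless, I set up a DataFrame in this answer, and perhaps an initial plot of the "non-Caucasian / Caucasian ratio" per movie can point you in the right direction. You could build a similar set of sum and ratio columns for the audience columns and then plot the cast ratio as a function of the audience ratio, to see whether a more Caucasian audience prefers more or fewer Caucasian actors (I guess that is what you are after?).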

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'ID':['Billy Elliot','next to normal','shrek','guys and dolls',
                         'west side story', 'pal joey'],
                   'Season' : [20082009,20082009,20082009,
                               20082009,20082009,20082009],
                   'AFRAM' : [2,0,4,4,0,1],
                   'ASIAM' : [0,0,1,0,0,0],
                   'CAU' : [48,10,25,24,28,20],
                   'LAT' : [1,0,1,3,18,0],
                   'OTH' : [0,0,0,0,0,0],
                   'WHT' : [73.7,73.7,73.7,73.7,73.7,73.7]}) 

## define a sum column for non caucasian actors (I suppose?)
df['non_cau']=df[['AFRAM','ASIAM','LAT','OTH']].sum(axis=1)
## build a ratio of non caucasian to caucasian
df['cau_ratio']=df['non_cau']/df['CAU']

## make a quick plot
fig,ax=plt.subplots()
ax.scatter(df['ID'],df['cau_ratio'])
ax.set_ylabel('non cau / cau ratio')
plt.tight_layout()
plt.show()
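
The `if` problem described in the question comes from comparing whole columns: a comparison between two columns yields one True/False value per row, and a plain `if` statement cannot treat that as a single condition. A minimal sketch of the usual row-wise alternative, a boolean mask, is shown here on the cast columns of the sample DataFrame above. The names `minority_cols`, `mask`, `minority_df`, and `relaxed_df` are illustrative and do not come from the original post, and the 0.5 factor in the second filter is an arbitrary example value.

```python
import pandas as pd

# Cast counts copied from the sample DataFrame in this answer.
df = pd.DataFrame({
    'ID': ['Billy Elliot', 'next to normal', 'shrek', 'guys and dolls',
           'west side story', 'pal joey'],
    'AFRAM': [2, 0, 4, 4, 0, 1],
    'ASIAM': [0, 0, 1, 0, 0, 0],
    'CAU': [48, 10, 25, 24, 28, 20],
    'LAT': [1, 0, 1, 3, 18, 0],
    'OTH': [0, 0, 0, 0, 0, 0],
})

# Name the columns to add explicitly; other columns are simply ignored,
# so nothing has to be removed from the DataFrame first.
minority_cols = ['AFRAM', 'ASIAM', 'LAT', 'OTH']
df['non_cau'] = df[minority_cols].sum(axis=1)

# Comparing two columns gives one True/False per row (a boolean mask).
mask = df['non_cau'] > df['CAU']

# Indexing with the mask keeps only the rows where it is True.
minority_df = df.loc[mask, ['ID'] + minority_cols + ['CAU', 'non_cau']]
print(minority_df)

# Illustrative looser criterion: non-Caucasian cast above half the Caucasian cast.
relaxed_df = df[df['non_cau'] > 0.5 * df['CAU']]
print(relaxed_df[['ID', 'non_cau', 'CAU']])
```

With these six sample rows the strict filter returns an empty frame, because no show has more non-Caucasian than Caucasian cast members (West Side Story comes closest with 18 versus 28). The looser filter keeps only West Side Story.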

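【Comments】:

Thank you!! I edited the question to include a better description, but this is great. Follow-up question: how can I make the plot bigger? I tried several approaches without success.
Sorry for the poor explanation, but what I want is your calculation (thank you) and ['WHT', 'BLK', 'ASN', 'HSP', 'MRO'] on the same axes (cast versus audience). The reason this is difficult is that I don't want to add up all of the columns (which is why I removed the unneeded columns, so that I add only the ones I need).
@tnndynamite To change the figure size, use the figsize parameter, e.g. fig,ax=plt.subplots(figsize=(10,5)). Note that you can easily drop the rows (i.e. shows) below a certain "non-Caucasian" threshold, e.g. via df=df[(df['cau_ratio']>=0.2)]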
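
One possible way to put cast and audience on the same axes for a single show, with a larger figure via `figsize`, is sketched here. Several things in it are assumptions rather than facts from the thread: the pairing of cast columns with audience columns (`CAU` with `WHT`, `AFRAM` with `BLK`, and so on), the reading of the audience columns as percentages, and the helper name `plot_cast_vs_audience`. Only `WHT = 73.7` appears in the thread; the `BLK`, `ASN`, `HSP`, and `MRO` values are made-up placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# One show from the sample data. Only WHT = 73.7 comes from the thread;
# the other audience values are hypothetical placeholders.
show = pd.Series({
    'ID': 'west side story',
    'AFRAM': 0, 'ASIAM': 0, 'CAU': 28, 'LAT': 18, 'OTH': 0,
    'WHT': 73.7, 'BLK': 12.0, 'ASN': 5.0, 'HSP': 7.0, 'MRO': 2.3,
})

# Assumed pairing of cast columns (keys) with audience columns (values).
pairs = {'CAU': 'WHT', 'AFRAM': 'BLK', 'ASIAM': 'ASN', 'LAT': 'HSP', 'OTH': 'MRO'}


def plot_cast_vs_audience(row, pairs, figsize=(10, 5)):
    """Draw grouped bars of cast share vs. audience share for one show."""
    # Cast columns are head counts, so convert them to percentages first.
    cast = pd.Series({c: row[c] for c in pairs}, dtype=float)
    cast_pct = 100 * cast / cast.sum()
    # Audience columns are assumed to be percentages already.
    audience_pct = pd.Series({c: row[a] for c, a in pairs.items()}, dtype=float)
    table = pd.DataFrame({'cast %': cast_pct, 'audience %': audience_pct})

    fig, ax = plt.subplots(figsize=figsize)  # figsize is in inches
    table.plot.bar(ax=ax, rot=0)
    ax.set_title(f"{row['ID']}: cast vs. audience")
    ax.set_xlabel('Ethnicity (cast column label)')
    ax.set_ylabel('Share (%)')
    fig.tight_layout()
    return fig, table


fig, table = plot_cast_vs_audience(show, pairs)
print(table)
```

Calling `plt.show()` afterwards displays the figure. To get one plot per show, the same helper could be called for each row of the full DataFrame, for example inside `for _, row in df.iterrows():`.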