[Posted]: 2015-12-06 01:18:45
[Question]:
Based on this post on Stack Overflow, I tried a value-counts approach like this:
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
It works fine, except that my data has 22 unique genres, yet after the split I get 42 columns, which of course are not unique. Data sample:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
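(I pasted only the header and the first row.)
I suspect the problem comes from my raw data. My column (genres) is a list of lists, and the stored strings include the brackets.
Example: [Action,Indie]
So when Python reads it, it treats "[Action", "Action", and "Action]" as different values, and the output has 303 distinct values.
So what I did was: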
new = []
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        i = i[1:-1]  # drop the surrounding brackets
        new.append(i)
    else:
        new.append('nan')
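The loop above can probably also be written with pandas' vectorized string methods. This is only a sketch: the sample `df1` below is hypothetical and assumes each genres cell is a string such as `[Action,Indie]` or a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the bracketed strings described in the question.
df1 = pd.DataFrame({"genres": ["[Action,Indie]", np.nan, "[Action]"]})

# Strip '[' and ']' from both ends of every string in the column.
# Missing values stay NaN instead of becoming the string 'nan'.
cleaned = df1["genres"].str.strip("[]")
print(cleaned.tolist())
```

Unlike the loop, this keeps missing values as real NaN, so a later `fillna` or `dropna` still recognizes them.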
[Comments]:
-
You could try:
if pd.notnull(i): -
Can you show me your input data
df1, 5 to 6 rows? -
But I think you can use:
print(df['genres'].str.get_dummies(sep=',')) -
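To illustrate what the `str.get_dummies` suggestion does, here is a small sketch on made-up data (the `genres` Series is hypothetical and assumes brackets and stray spaces have already been removed). Each distinct token becomes one indicator column, and summing the columns gives a per-genre count.

```python
import pandas as pd

# Hypothetical, already-clean genre strings.
genres = pd.Series(["Action,Indie", "Indie", "Action,RPG"])

# One 0/1 column per distinct comma-separated token.
dummies = genres.str.get_dummies(sep=",")
print(dummies)

# Column sums give the number of rows that contain each genre.
print(dummies.sum())
```

Note that `get_dummies` compares tokens exactly, so any leftover whitespace or brackets would still produce separate columns.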
OK, I found the problem, but I don't know how to fix it. My header data, that is the genres, has a whitespace problem: Action shows up as "[space]Action", "Action", and "Action[space]".
-
You can remove that whitespace with the function
strip()
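One way the strip() suggestion might be applied in pandas is sketched below. The sample data is invented, and the chain (`str.strip`, `str.split`, `explode`, `str.strip`) is an assumption about how to combine the steps, not something confirmed in the thread.

```python
import pandas as pd

# Hypothetical data with brackets and stray spaces around genre names.
genres = pd.Series(["[Action, Indie]", "[ Action]", "[Indie ,Action ]"])

# Split into one raw token per row; whitespace variants are still distinct here.
raw = genres.str.strip("[]").str.split(",").explode()

# Strip leading/trailing whitespace so all variants collapse to one name.
tokens = raw.str.strip()

print(raw.nunique(), tokens.nunique())
counts = tokens.value_counts()
print(counts)
```

Comparing `raw.nunique()` with `tokens.nunique()` shows how whitespace alone can inflate the number of apparent genres.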