Combining Series in Pandas

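【Question title】: Combining Series in Pandas
【Posted】: 2014-11-16 09:15:29
【Question description】:

I need to combine multiple Pandas Series that contain string values. The Series are messages produced by several validation steps. I am trying to combine these messages into one Series so I can attach it to the DataFrame. The problem is that the result is empty.

Here is an example: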

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index

series = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series += df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)

print(series)
# >>> series
# 0    NaN
# 1    NaN

Update

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index

series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1)

# series3 causes a ValueError: cannot reindex from a duplicate axis
series = pd.concat([series1, series2, series3])
df['series'] = series
print(df)

Update 2

In this example, the indices seem to get mixed up.

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'a'].index
index2 = df[df['a'] == 'b'].index
index3 = df[df['a'] == 'c'].index

series1 = df.iloc[index1].apply(lambda x: x['a'] + '-aaa', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-bbb', axis=1)
series3 = df.iloc[index3].apply(lambda x: x['a'] + '-ccc', axis=1)

print(series1)
print()
print(series2)
print()
print(series3)
print()

df['series'] = pd.concat([series1, series2, series3], ignore_index=True)
print(df)
print()

df['series'] = pd.concat([series2, series1, series3], ignore_index=True)
print(df)
print()

df['series'] = pd.concat([series3, series2, series1], ignore_index=True)
print(df)
print()

This produces the following output:

0    a-aaa
dtype: object

1    b-bbb
dtype: object

2    c-ccc
dtype: object

   a   b series
0  a  aa  a-aaa
1  b  bb  b-bbb
2  c  cc  c-ccc
3  d  dd    NaN

   a   b series
0  a  aa  b-bbb
1  b  bb  a-aaa
2  c  cc  c-ccc
3  d  dd    NaN

   a   b series
0  a  aa  c-ccc
1  b  bb  b-bbb
2  c  cc  a-aaa
3  d  dd    NaN

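I expected only a in row 0, only b in row 1, and only c in row 2, but that is not the case...

Update 3

Here is a better example that should show the expected behavior. As I said, the use case is that, for a given DataFrame, a function evaluates each row and may return error messages for some rows as a Series (it contains some index labels and not others; if no errors are returned, the error Series is empty).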

In [12]:

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[])
pd.concat([s1, s2, s3, s4]).sort_index()

# I'd like to get:
#
# 0    a
# 1    b b
# 2    c
# 3    d
# 4    e
Out[12]:
0    a
1    b
1    b
2    c
3    d
4    e
dtype: object

【Question comments】:

Tags: python string pandas series

【Solution 1】:

How about concat?

s1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
s2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)


s = pd.concat([s1,s2])
print(s)

1    bb-bbb
0    a-aaa
dtype: object
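
For context, here is a minimal sketch of why `+=` in the question produced NaN while concat does not. It assumes two small stand-in Series (`s1`, `s2`, built by hand rather than with `apply`) and explicit `dtype=object` so the behavior does not depend on the pandas string dtype in use. Arithmetic aligns both operands on their index labels first, and a label present on only one side has nothing to be added to, whereas concat simply stacks the rows.

```python
import pandas as pd

# Hand-built stand-ins for the two message Series in the question.
s1 = pd.Series(['bb-bbb'], index=[1], dtype=object)
s2 = pd.Series(['a-aaa'], index=[0], dtype=object)

# Addition aligns on index labels. Labels 0 and 1 each exist on one side only,
# so every aligned pair has a missing operand and the result is all NaN.
added = s1 + s2

# concat does no pairing; it keeps every row with its original label.
stacked = pd.concat([s1, s2])

print(added)
print(stacked)
```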

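【Comments】:

Sorry, I have to take that back. I get a ValueError (see the updated example).

【Solution 2】:

By default, concat keeps the existing index labels. If those labels collide, assigning the result to a DataFrame column raises the ValueError you ran into, so you need to set ignore_index=True: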

In [33]:

series = pd.concat([series1, series2, series3], ignore_index=True)
df['series'] = series
print (df)
   a   b  series
0  a  aa  bb-bbb
1  b  bb   a-aaa
2  c  cc   a-ccc
3  d  dd     NaN
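
The following is a small sketch, under the assumption of hand-built stand-in Series (`s_b`, `s_a1`, `s_a2` are illustrative names, not from the question), of where the error actually arises and what `ignore_index=True` trades away. The concat call itself succeeds with duplicate labels; it is the column assignment that has to reindex and fails. Renumbering avoids the error, but the values are then placed by position rather than by their original row label.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd']})

# Stand-ins: one message for row 1 and two messages for row 0.
s_b = pd.Series(['bb-bbb'], index=[1])
s_a1 = pd.Series(['a-aaa'], index=[0])
s_a2 = pd.Series(['a-ccc'], index=[0])

# concat keeps the labels as they are, duplicates included: [1, 0, 0].
combined = pd.concat([s_b, s_a1, s_a2])

# Assigning to a column must align on the index, which fails for duplicate labels.
try:
    df['series'] = combined
    assignment_failed = False
except ValueError:
    assignment_failed = True

# ignore_index=True relabels the rows 0, 1, 2, so the assignment works,
# but 'bb-bbb' now lands on row 0 instead of row 1.
renumbered = pd.concat([s_b, s_a1, s_a2], ignore_index=True)
df['series'] = renumbered
print(df)
```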

Edit

I think I now understand what you want. You can get it by converting the Series to a DataFrame and then merging on the index:

In [96]:

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

index1 = df[df['a'] == 'b'].index
index2 = df[df['a'] == 'a'].index

series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1)
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1)
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1)
# we now don't ignore the index in order to preserve the identity of the row we want to merge back to later
series = pd.concat([series1, series2, series3])
# construct a dataframe from the series and give the column a name
df1 = pd.DataFrame({'series':series})
# perform an outer merge on both df's indices
df.merge(df1, left_index=True, right_index=True, how='outer')

Out[96]:
   a   b  series
0  a  aa   a-aaa
0  a  aa   a-ccc
1  b  bb  bb-bbb
2  c  cc     NaN
3  d  dd     NaN
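
As a possible shorthand for the same index-based merge, the sketch below uses `DataFrame.join` with a named Series. This is an alternative spelling, not what the answer itself used, and the message Series is built by hand here for brevity. Rows whose label appears more than once in the messages are repeated, and rows with no message get NaN.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

# Hand-built stand-in for pd.concat([series1, series2, series3]) with labels kept.
messages = pd.concat([
    pd.Series(['bb-bbb'], index=[1]),
    pd.Series(['a-aaa'], index=[0]),
    pd.Series(['a-ccc'], index=[0]),
])

# join matches on the index; the Series needs a name to become a column.
joined = df.join(messages.rename('series'))
print(joined)
```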

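【Comments】:

I don't think this works. series2 should add -aaa at index 0 (df['a'] == 'a'), not index 1.
Could you update your post to show exactly what you want? That would help clarify things; right now it is hard to know what you expect.
Basically, your problem is that you have a non-unique index that conflicts with the df you are trying to assign values to, right? It looks like you want the a row duplicated into two rows, one with a-aaa and the other with a-ccc. Can you confirm?
That is what I want to achieve, but with series = pd.concat([series1, series2, series3]) you run into the same problem I had originally: ValueError: cannot reindex from a duplicate axis (it just isn't exposed in this example).
@orange Sorry, are you now using my revised approach? You should be able to create a DataFrame from the concatenated Series and merge it back into your original DataFrame.

【Solution 3】:

I may have found a solution. I hope someone can comment on it...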

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], index=[])
pd.concat([s1, s2, s3, s4]).sort_index()


df1 = pd.DataFrame(s1)
df2 = pd.DataFrame(s2)
df3 = pd.DataFrame(s3)
df4 = pd.DataFrame(s4)

d = pd.DataFrame({0:[]})
d = pd.merge(df1, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])

d = pd.merge(df2, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])

d = pd.merge(df3, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])

d = pd.merge(df4, d, how='outer', left_index=True, right_index=True)
d = d.fillna('')
d = pd.DataFrame(d['0_x'] + d['0_y'])
print(d)
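
A more compact way to reach the output requested in Update 3 might be to concatenate the Series and then group by index label, joining the strings in each group. This is a different technique from the repeated merges above, offered as a sketch only; the space separator and the step that drops empty Series before concatenating are assumptions made here.

```python
import pandas as pd

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
s4 = pd.Series([], dtype=object)  # a validation step that reported no errors

# Drop empty Series first (an assumption) so they cannot affect the result's dtypes.
parts = [s for s in (s1, s2, s3, s4) if not s.empty]

# Stack all messages, then merge the ones that share an index label.
combined = pd.concat(parts).groupby(level=0).agg(' '.join)
print(combined)
```

Because the index of `combined` is unique, it can then be assigned to a DataFrame column without the duplicate-axis error.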

【Comments】: