Create a column in a pandas DataFrame based on a condition: answers

【Question title】: Create column in pandas dataframe based on condition
【Posted】: 2019-03-27 03:39:39
【Question description】:

I have a dataframe and want to create a third column, say col3, based on a condition: if the col2 value is present in col1 then 'Yes', otherwise 'No'.

import pandas as pd

data = [[[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)],76546],[[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],876546]]
test = pd.DataFrame(data, columns=['col1','col2'])

                                                col1    col2
0  [(330420, 0.9322496056556702), (76546, 0.93220...   76546
1  [(330420, 0.9322496056556702), (500826, 0.9322...  876546

Desired result:

data = [[[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)],76546, 'Yes'],[[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],876546,'No']]
test = pd.DataFrame(data, columns=['col1','col2', 'col3'])

                                                col1    col2 col3
0  [(330420, 0.9322496056556702), (76546, 0.93220...   76546  Yes
1  [(330420, 0.9322496056556702), (500826, 0.9322...  876546   No

My attempt:

test['col3'] = [entry for tag in test['col2'] for entry in test['col1'] if tag in entry]

This raises an error: ValueError: Length of values does not match length of index
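The length mismatch can be reproduced in isolation. The following is a minimal sketch, assuming the `test` frame from the question (the name `values` is introduced here for illustration). The nested comprehension visits every (tag, entry) pair and keeps only the matches, so its length has no fixed relation to the number of rows. In this data `tag` is an int while each `entry` is a list of tuples, so nothing matches and the list comes out empty.

```python
import pandas as pd

data = [
    [[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)], 76546],
    [[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)], 876546],
]
test = pd.DataFrame(data, columns=['col1', 'col2'])

# Same comprehension as in the question, stored in a variable instead of a column.
# `tag in entry` compares an int against tuples, so it is never True here.
values = [entry for tag in test['col2'] for entry in test['col1'] if tag in entry]

# The list length differs from the row count, which is what the assignment rejects.
print(len(values), len(test))
```

A list assigned to a column needs exactly one element per row, which is why the answers below build one value per (col1, col2) pair.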

【Question discussion】:

Tags: python pandas dataframe tuples

【Solution 1】:

Use any together with zip:

[any([int(z[0])==y for z in x]) for x, y in zip (test.col1,test.col2)]
Out[227]: [True, False]

【Discussion】:

Minor comment: the square brackets inside any(...) are an unnecessary pair of brackets.
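As a sketch of what the comment suggests, assuming the `test` frame from the question, `any` can take a generator expression directly, and the resulting booleans can be mapped to the 'Yes'/'No' labels the question asks for. The name `matches` is introduced here for illustration.

```python
import pandas as pd

data = [
    [[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)], 76546],
    [[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)], 876546],
]
test = pd.DataFrame(data, columns=['col1', 'col2'])

# One boolean per row: does any tuple in col1 start with the col2 value?
# The first tuple element is a string, so it is converted to int before comparing.
matches = [any(int(z[0]) == y for z in x) for x, y in zip(test['col1'], test['col2'])]

# Map the booleans to the labels requested in the question.
test['col3'] = ['Yes' if m else 'No' for m in matches]
print(test['col3'].tolist())
```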

【Solution 2】:

Use numpy where:

test['col3'] = test.apply(lambda x: np.where(str(x.col2) in [i[0] for i in x.col1],"yes", "no"), axis =1)
test['col3']
0    yes
1     no
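A possible variation, sketched here under the assumption that the `test` frame is the one from the question: build the boolean mask once and call `np.where` a single time on the whole mask, rather than once per row inside `apply`. The name `mask` is introduced for illustration.

```python
import numpy as np
import pandas as pd

data = [
    [[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)], 76546],
    [[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)], 876546],
]
test = pd.DataFrame(data, columns=['col1', 'col2'])

# Boolean mask, one entry per row: is str(col2) among the first tuple elements of col1?
mask = np.array([str(y) in [z[0] for z in x] for x, y in zip(test['col1'], test['col2'])])

# np.where picks "yes" where the mask is True and "no" elsewhere, for all rows at once.
test['col3'] = np.where(mask, 'yes', 'no')
print(test['col3'].tolist())
```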

【Discussion】:

【Solution 3】:

You should avoid storing lists in a Series. Let's try a vectorized solution:

# df is the question's DataFrame (named test there)
# extract array of values and reshape
arr = np.array(df.pop('col1').values.tolist()).reshape(-1, 4)

# join to dataframe and replace list of tuples
df = df.join(pd.DataFrame(arr, dtype=float))

# apply test via isin
df['test'] = df.drop('col2', axis=1).isin(df['col2']).any(axis=1)

print(df)

     col2         0        1         2       3   test
0   76546  330420.0  0.93225   76546.0  0.9322   True
1  876546  330420.0  0.93225  500826.0  0.9322  False
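The last step relies on how `DataFrame.isin` treats a Series argument: the Series is aligned on the index, so each row is compared only with its own target value. A small sketch of that behaviour on already flattened numeric data follows; the names `wide`, `target`, `cellwise` and `found` are hypothetical and not part of the answer.

```python
import pandas as pd

# Flattened ids, one row per original record (float, as in the answer's output).
wide = pd.DataFrame({0: [330420.0, 330420.0], 2: [76546.0, 500826.0]})

# The value to look for in each row, sharing the same index as `wide`.
target = pd.Series([76546, 876546], name='col2')

# With a Series argument, isin compares each row against the Series value
# that has the same index label.
cellwise = wide.isin(target)

# A row matches if any of its cells equals that row's target.
found = cellwise.any(axis=1)
print(found.tolist())
```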

【Discussion】:

【Solution 4】:

You can do this with .apply():

def sublist_checker(row):
    check_both = ['Yes' if str(row['col2']) in sublist else 'No' for sublist in row['col1']]
    check_any = 'Yes' if 'Yes' in check_both else 'No'
    return check_any

test['col3'] = test.apply(sublist_checker, axis=1)
print(test)

                                                   col1    col2 col3
0   [(330420, 0.932249605656), (76546, 0.932200312614)]   76546  Yes
1  [(330420, 0.932249605656), (500826, 0.932200312614)]  876546   No

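The function sublist_checker performs a row-wise check of each element in test['col2'] against each sublist in test['col1'] and returns 'Yes' or 'No' depending on whether that element is present in any of the sublists.

【Discussion】:

@user15051990 If you check the running time, you will find that the apply method is less efficient.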
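One way to check that claim yourself is a rough `timeit` comparison. This is a sketch only: the enlarged frame, the repeat count and the helper names `with_apply` and `with_zip` are assumptions made for illustration, and the timings depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

data = [
    [[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)], 76546],
    [[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)], 876546],
]
# Repeat the two sample rows to get a frame large enough to time (1000 rows).
test = pd.concat([pd.DataFrame(data, columns=['col1', 'col2'])] * 500, ignore_index=True)

def with_apply():
    # Row-wise apply: pandas builds a Series object for every row.
    return test.apply(
        lambda row: any(int(z[0]) == row['col2'] for z in row['col1']), axis=1
    ).tolist()

def with_zip():
    # Plain Python loop over the two columns in parallel.
    return [any(int(z[0]) == y for z in x) for x, y in zip(test['col1'], test['col2'])]

# Both approaches must agree before the timings mean anything.
assert with_apply() == with_zip()

print('apply:', timeit.timeit(with_apply, number=5))
print('zip:  ', timeit.timeit(with_zip, number=5))
```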