【Question title】: Create column in pandas dataframe based on condition
【Posted】: 2019-03-27 03:39:39
【Question】:

I have a dataframe and want to create a third column, say col3, based on a condition: 'Yes' if the col2 value is present in col1, otherwise 'No'.

import pandas as pd

data = [[[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)],76546],[[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],876546]]
test = pd.DataFrame(data, columns=['col1','col2'])

                                                col1    col2
0  [(330420, 0.9322496056556702), (76546, 0.93220...   76546
1  [(330420, 0.9322496056556702), (500826, 0.9322...  876546

Desired result:

data = [[[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)],76546, 'Yes'],[[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],876546,'No']]
test = pd.DataFrame(data, columns=['col1','col2', 'col3'])

                                                col1    col2 col3
0  [(330420, 0.9322496056556702), (76546, 0.93220...   76546  Yes
1  [(330420, 0.9322496056556702), (500826, 0.9322...  876546   No

My attempt:

test['col3'] = [entry for tag in test['col2'] for entry in test['col1'] if tag in entry]

This raises: ValueError: Length of values does not match length of index

【Question discussion】:

    Tags: python pandas dataframe tuples


    【Solution 1】:

    Use any together with zip:

    [any([int(z[0])==y for z in x]) for x, y in zip (test.col1,test.col2)]
    Out[227]: [True, False]
    

    【Discussion】:

    • Minor comment: you have an unnecessary pair of brackets in there.
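The bracket remark presumably refers to the list built inside any(). A minimal sketch of the same check with a generator expression, mapped to the 'Yes'/'No' labels the question asks for, is below. It runs on plain lists copied from the question so it needs no imports; the names flags and labels are illustrative only and are not part of the original answer.

```python
# Sample data copied from the question
col1 = [
    [('330420', 0.9322496056556702), ('76546', 0.9322003126144409)],
    [('330420', 0.9322496056556702), ('500826', 0.9322003126144409)],
]
col2 = [76546, 876546]

# any() accepts a generator expression directly, so the inner list is not needed
flags = [any(int(key) == value for key, _ in pairs) for pairs, value in zip(col1, col2)]

# Map the booleans to the labels requested in the question
labels = ['Yes' if flag else 'No' for flag in flags]

print(flags)   # [True, False]
print(labels)  # ['Yes', 'No']
```

With the question's dataframe, the same comprehension can iterate over zip(test.col1, test.col2) and the resulting list can be assigned to test['col3'], since it has exactly one entry per row.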
    【Solution 2】:

    Use numpy where:

    test['col3'] = test.apply(lambda x: np.where(str(x.col2) in [i[0] for i in x.col1],"yes", "no"), axis =1)
    test['col3']
    0    yes
    1     no
    

    【Discussion】:

      【Solution 3】:

      You should avoid a series of lists. Let's try a vectorised solution:

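      # df is the question's dataframe (named test there); needs numpy as np and pandas as pd
      # extract array of values and reshape to 4 values per row
      arr = np.array(df.pop('col1').values.tolist()).reshape(-1, 4)
      
      # join to dataframe and replace list of tuples
      df = df.join(pd.DataFrame(arr, dtype=float))
      
      # apply test via isin (axis passed by keyword; recent pandas rejects positional axis here)
      df['test'] = df.drop(columns='col2').isin(df['col2']).any(axis=1)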
      
      print(df)
      
           col2         0        1         2       3   test
      0   76546  330420.0  0.93225   76546.0  0.9322   True
      1  876546  330420.0  0.93225  500826.0  0.9322  False
      

      【Discussion】:

        【Solution 4】:

        You can do this with .apply():

        def sublist_checker(row):
            check_both = ['Yes' if str(row['col2']) in sublist else 'No' for sublist in row['col1']]
            check_any = 'Yes' if 'Yes' in check_both else 'No'
            return check_any
        
        test['col3'] = test.apply(sublist_checker, axis=1)
        print(test)
        
                                                           col1    col2 col3
        0   [(330420, 0.932249605656), (76546, 0.932200312614)]   76546  Yes
        1  [(330420, 0.932249605656), (500826, 0.932200312614)]  876546   No
        

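        The function sublist_checker works row by row: it checks the element in test['col2'] against each sublist in test['col1'] and returns 'Yes' or 'No' depending on whether the element appears in any of the sublists.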
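To make the two steps visible, here is a small trace, assuming plain dicts stand in for the dataframe rows (both support row['col1'] style access). The helper trace_row is hypothetical; it repeats the logic of sublist_checker but also returns the intermediate per-sublist list.

```python
# Rows copied from the question, written as dicts instead of dataframe rows
rows = [
    {'col1': [('330420', 0.9322496056556702), ('76546', 0.9322003126144409)], 'col2': 76546},
    {'col1': [('330420', 0.9322496056556702), ('500826', 0.9322003126144409)], 'col2': 876546},
]

def trace_row(row):
    # Step 1: one 'Yes'/'No' per tuple, testing membership of the stringified col2 value
    check_both = ['Yes' if str(row['col2']) in sublist else 'No' for sublist in row['col1']]
    # Step 2: collapse to a single answer for the row
    check_any = 'Yes' if 'Yes' in check_both else 'No'
    return check_both, check_any

for row in rows:
    print(trace_row(row))
# (['No', 'Yes'], 'Yes')
# (['No', 'No'], 'No')
```

The str() conversion matters here: the tuples hold the ids as strings, so the integer in col2 would never compare equal to them without it.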

        【Discussion】:

        • @user15051990 If you check the run time, you will find that the apply approach is less efficient.
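The run-time remark can be checked with a rough timing sketch like the one below. It assumes pandas is installed, repeats the question's two rows to get a measurable workload, and compares a row-wise apply against a zip-based list comprehension. The function names with_apply and with_zip and the repeat count are illustrative choices, and the measured times will vary by machine and pandas version.

```python
import timeit

import pandas as pd

data = [
    [[('330420', 0.9322496056556702), ('76546', 0.9322003126144409)], 76546],
    [[('330420', 0.9322496056556702), ('500826', 0.9322003126144409)], 876546],
]
# Repeat the two sample rows so the difference is measurable (4000 rows)
test = pd.DataFrame(data * 2000, columns=['col1', 'col2'])

def with_apply(df):
    # Row-wise apply: pandas builds a Series object for every row
    return df.apply(
        lambda row: 'Yes' if any(str(row['col2']) in pair for pair in row['col1']) else 'No',
        axis=1,
    ).tolist()

def with_zip(df):
    # Plain Python iteration over the two columns in parallel
    return [
        'Yes' if any(str(value) in pair for pair in pairs) else 'No'
        for pairs, value in zip(df['col1'], df['col2'])
    ]

# Both approaches should agree before their timings are compared
assert with_apply(test) == with_zip(test)

t_apply = timeit.timeit(lambda: with_apply(test), number=3)
t_zip = timeit.timeit(lambda: with_zip(test), number=3)
print(f"apply: {t_apply:.3f}s  zip: {t_zip:.3f}s")
```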