【Question title】: Pandas: merge dataframes without creating new columns inside a for operation
【Posted】: 2020-06-22 11:51:03
【Question】:

I am trying to enrich a dataframe with data collected from an API. So I do this:

for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k=get_financials_hnwi(df.cnpj[i]) # this is my API requesting function, working fine
        df=df.merge(k,on=["cnpj"],how="left") # here is my problem <-------------------------------

Because I run that merge inside the for statement, it produces suffixed columns (_x, _y). So I found this alternative here:

Pandas: merge dataframes without creating new columns

for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k=get_financials_hnwi(df.cnpj[i]) # this is my requesting function, working fine
        val = np.intersect1d(df.cnpj, k.cnpj)
        df_temp = pd.concat([df,k], ignore_index=True)
        df=df_temp[df_temp.cnpj.isin(val)]

However, it creates a new df, destroys the original index, and prevents this line from running: if pd.isnull(df.cnpj[i]) == True:

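Is there a good way to run merge/join/concat inside a for operation without creating new _x and _y columns? Or is there a way to combine the _x and _y columns and then drop them, condensing everything into a single column? I just want one column that holds all the values.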

Sample data and reproducible code

df=pd.DataFrame({'cnpj':[12,32,54,65],'co_name':['Johns Market','T Bone Gril','Superstore','XYZ Tech']})

#first API request:

k=pd.DataFrame({'cnpj':[12],'average_revenues':[687],'years':['2019,2018,2017']})
df=df.merge(k,on="cnpj", how='left')

#second API request:
k=pd.DataFrame({'cnpj':[32],'average_revenues':[456],'years':['2019,2017']})
df=df.merge(k,on="cnpj", how='left')

#third API request:
k=pd.DataFrame({'cnpj':[53],'average_revenues':[None],'years':[None]})
df=df.merge(k,on="cnpj", how='left')

#fourth API request:
k=pd.DataFrame({'cnpj':[65],'average_revenues':[4142],'years':['2019,2018,2015,2013,2012']})
df=df.merge(k,on="cnpj", how='left')

print(df)

Result:

   cnpj       co_name average_revenues_x         years_x  average_revenues_y  \
0    12  Johns Market              687.0  2019,2018,2017                 NaN   
1    32   T Bone Gril                NaN             NaN               456.0   
2    54    Superstore                NaN             NaN                 NaN   
3    65      XYZ Tech                NaN             NaN                 NaN   

     years_y average_revenues_x years_x  average_revenues_y  \
0        NaN               None    None                 NaN   
1  2019,2017               None    None                 NaN   
2        NaN               None    None                 NaN   
3        NaN               None    None              4142.0   

                    years_y  
0                       NaN  
1                       NaN  
2                       NaN  
3  2019,2018,2015,2013,2012  

Desired result:

   cnpj       co_name   average_revenues                     years
0    12  Johns Market              687.0            2019,2018,2017                 
1    32   T Bone Gril              456.0                 2019,2017               
2    54    Superstore               None                      None        
3    65      XYZ Tech             4142.0  2019,2018,2015,2013,2012                 

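【Comments】:

  • Please add some sample data and the expected output to illustrate your problem clearly. From reading your question, I would store the values in a dictionary and then map them onto your target dataframe.
  • OK, I'll add some sample data.
  • @Datanovice Since I am calling an API, producing sample data is quite tricky. Does this look OK, or would you suggest another way to sample it?

Tags: python pandas dataframe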


【Solution 1】:

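Since you are joining on a single column and mapping values, we can take advantage of the cnpj column by setting it as the index, and then use combine_first, update, or map to add your values to your dataframe.

Suppose k looks like this. If it does not, just update your function to return a dictionary that you can use with map.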

   cnpj  average_revenues           years
0    12               687  2019,2018,2017
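For the dictionary route mentioned above, here is a minimal sketch. It assumes the API results have already been gathered into dictionaries keyed by cnpj; the names revenues_by_cnpj and years_by_cnpj are hypothetical and only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cnpj": [12, 32, 54, 65],
    "co_name": ["Johns Market", "T Bone Gril", "Superstore", "XYZ Tech"],
})

# Hypothetical dictionaries keyed by cnpj, standing in for the API results.
revenues_by_cnpj = {12: 687, 32: 456, 65: 4142}
years_by_cnpj = {
    12: "2019,2018,2017",
    32: "2019,2017",
    65: "2019,2018,2015,2013,2012",
}

# Series.map looks up each cnpj in the dictionary; keys that are
# absent (54 here) come back as NaN, so no suffixed columns appear.
df["average_revenues"] = df["cnpj"].map(revenues_by_cnpj)
df["years"] = df["cnpj"].map(years_by_cnpj)

print(df)
```

Because each column is assigned once, there is nothing to merge and therefore no _x or _y suffix to clean up afterwards.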

Let's put this in a tidy function.

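def update_api_call(dataframe, api_call):
    # Index by cnpj unless that has already been done.
    if dataframe.index.name != 'cnpj':
        dataframe = dataframe.set_index('cnpj')

    # Fill missing values from the API response, aligned on cnpj.
    return dataframe.combine_first(api_call.set_index('cnpj'))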

Suppose your k variables are numbered 1-4, as in our test.

df1 = update_api_call(df,k1)

print(df1)

      average_revenues       co_name           years
cnpj                                                
12               687.0  Johns Market  2019,2018,2017
32                 NaN   T Bone Gril             NaN
54                 NaN    Superstore             NaN
65                 NaN      XYZ Tech             NaN


df2 = update_api_call(df1,k2)

print(df2)

      average_revenues       co_name           years
cnpj                                                
12               687.0  Johns Market  2019,2018,2017
32               456.0   T Bone Gril       2019,2017
54                 NaN    Superstore             NaN
65                 NaN      XYZ Tech             NaN

print(df4)
      average_revenues       co_name                     years
cnpj                                                          
12               687.0  Johns Market            2019,2018,2017
32               456.0   T Bone Gril                 2019,2017
53                 NaN           NaN                       NaN
54                 NaN    Superstore                       NaN
65              4142.0      XYZ Tech  2019,2018,2015,2013,2012
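
Note that combine_first keeps the union of both indexes, which is why cnpj 53 shows up as an extra row in the df4 output. If that is not wanted, one possible alternative is to collect every API response first and merge only once, so no suffixes are ever generated. The sketch below is only an illustration under assumptions: fake_api is a hypothetical stand-in for get_financials_hnwi and is assumed to return a one-row dataframe per cnpj.

```python
import pandas as pd

df = pd.DataFrame({
    "cnpj": [12, 32, 54, 65],
    "co_name": ["Johns Market", "T Bone Gril", "Superstore", "XYZ Tech"],
})


def fake_api(cnpj):
    # Hypothetical stand-in for get_financials_hnwi: one-row frame per cnpj.
    data = {
        12: (687, "2019,2018,2017"),
        32: (456, "2019,2017"),
        65: (4142, "2019,2018,2015,2013,2012"),
    }
    revenue, years = data.get(cnpj, (None, None))
    return pd.DataFrame(
        {"cnpj": [cnpj], "average_revenues": [revenue], "years": [years]}
    )


rows = []
for cnpj in df["cnpj"].dropna():  # skip missing cnpj values
    k = fake_api(cnpj)
    # Flatten each response into plain row dictionaries.
    rows.extend(k.to_dict("records"))

# Build one frame from all responses, then merge a single time.
financials = pd.DataFrame(rows)
result = df.merge(financials, on="cnpj", how="left")

print(result)
```

With how="left", only the cnpj values already in df are kept, and the original row order and default index are preserved.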

【Comments】:
