【Question title】: New column in pandas dataframe based on existing column values with conditions list
【Posted】: 2019-10-18 13:00:55
【Question】:

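Following on from this question: New column in pandas dataframe based on existing column values

I have a dataframe with a column named "Country" that lists several countries of the world. I need to create another column holding a region label such as "Europe". I have lists of countries grouped by region, so if the country in df['Country'] matches a country in the "Europe" list, the label "Europe" should be inserted into the new column df['Region'].

My data: https://sendeyo.com/up/d/2acd2eb849

The problem is that the solutions given in that question work on the example dataframe there but not on my data. My dataframe looks like this: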

Year    Country Population  GDP 
1870    Austria  4,520     8,419    
1870    Belgium  5,096     13,716   
1870    Denmark  1,888     3,782    
1870    Finland  1,754     1,999    
1870    France   38,440    72,100   

My lists:

Europa = ["Austria", "Belgium", "Denmark"]

RamasOccidentales = ["Australia","New Zealand","Canada","United States"]

Latinoamerica = ["Brazil","Chile","Uruguay"]

Asia = ["Indonesia","Japan","Sri Lanka"]

Expected result:

Year    Country Population  GDP Region
1870    Austria 4,520   8,419   Europa 
1870    Belgium 5,096   13,716  Europa 
1870    Denmark 1,888   3,782   Europa 
1870    Finland 1,754   1,999   Europa 
1870    France  38,440  72,100  Europa 

This is the code I tried:

def Continent(country):
    return "Europa" if country in Europa else "Unknown"

df['Region'] = df['Country'].apply(Continent)

Thanks.

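【Comments】:

  • Can you provide the code you tried and the output it produces?
  • def Continent(country): return "Europa" if country in Europa else "Unknown" followed by df['Region'] = df['Country'].apply(Continent)
  • You probably need to strip trailing whitespace: run df['Country'] = df['Country'].str.strip() before applying the mapping.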
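
The whitespace suggestion in the last comment can be checked with a small sketch. The padded country names below are hypothetical, chosen only to reproduce the symptom; the real file may turn out to have a different cause.

```python
import pandas as pd

Europa = ["Austria", "Belgium", "Denmark"]

# Hypothetical sample: names padded with stray spaces, as can happen
# when a file is exported with fixed-width or padded text columns.
df = pd.DataFrame({"Country": ["Austria ", " Belgium", "Denmark  "]})

def Continent(country):
    return "Europa" if country in Europa else "Unknown"

# Before cleaning, nothing matches because "Austria " != "Austria".
before = df["Country"].apply(Continent).tolist()

# Printing the unique values as a list shows the quotes, which makes
# hidden whitespace visible.
print(df["Country"].unique().tolist())

# Strip leading and trailing whitespace, then classify again.
df["Country"] = df["Country"].str.strip()
df["Region"] = df["Country"].apply(Continent)
print(df)
```

If the printed unique values show no padding, the mismatch is more likely a spelling or encoding difference between the data and the lists.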

Tags: python pandas dataframe


【Solution 1】:

Use Series.map with a dictionary created from the lists:

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]

d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
     'Latinoamerica':Latinoamerica,'Asia':Asia}

#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

df['Region'] = df['Country'].map(d1)

print (df)
   Year  Country Population     GDP  Region
0  1870  Austria      4,520   8,419  Europa
1  1870  Belgium      5,096  13,716  Europa
2  1870  Denmark      1,888   3,782  Europa
3  1870  Finland      1,754   1,999  Europa
4  1870   France     38,440  72,100  Europa

print (d1)

{'Austria': 'Europa', 'Belgium': 'Europa', 'Denmark': 'Europa', 
 'France': 'Europa', 'Finland': 'Europa', 
 'Australia': 'RamasOccidentales', 
 'New Zealand': 'RamasOccidentales', 
 'Canada': 'RamasOccidentales', 
 'United States': 'RamasOccidentales', 
 'Brazil': 'Latinoamerica', 'Chile': 'Latinoamerica', 
 'Uruguay': 'Latinoamerica', 'Indonesia': 'Asia',
 'Japan': 'Asia', 'Sri Lanka': 'Asia'}
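
One behaviour worth knowing: Series.map with a dictionary leaves NaN for any country that is not a key. A minimal sketch, assuming a label such as "Unknown" is wanted for those rows (the shortened lists and the "Peru" row are hypothetical):

```python
import pandas as pd

# Shortened, hypothetical region lists for illustration.
d = {"Europa": ["Austria", "Belgium"], "Asia": ["Japan"]}

# Invert {region: [countries]} into {country: region}.
d1 = {country: region for region, countries in d.items() for country in countries}

df = pd.DataFrame({"Country": ["Austria", "Japan", "Peru"]})

# "Peru" is not a key of d1, so map() yields a missing value for it.
mapped = df["Country"].map(d1)

# Replace the missing values with an explicit label.
df["Region"] = mapped.fillna("Unknown")
print(df)
```

Leaving the NaN in place is also reasonable when missing regions should stand out in later analysis.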

For 10k rows, map is about 2.58 times faster than the np.select variants:

import numpy as np
import pandas as pd

np.random.seed(2019)

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]

d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
     'Latinoamerica':Latinoamerica,'Asia':Asia}

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df = pd.DataFrame({'Country': np.random.choice(list(d1.keys()), size=10000)})

In [280]: %%timeit
     ...: d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
     ...: 
     ...: df['Region'] = df['Country'].map(d1)
     ...: 
3.04 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [281]: %%timeit
     ...: classification_countries={'Europa':Europa,
     ...:                           'RamasOccidentales':RamasOccidentales,
     ...:                           'Latinoamerica':Latinoamerica ,
     ...:                           'Asia':Asia}
     ...: 
     ...: cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
     ...: values=[ key for key in classification_countries]
     ...: 
     ...: df['Region']=np.select(cond,values)
     ...: 
7.86 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [282]: %%timeit
     ...: cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
     ...: values=['Europa','RamasOccidentales','Latinoamerica','Asia']
     ...: df['Region']=np.select(cond,values)
     ...: 
7.96 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [293]: %%timeit
     ...: classification_countries={'Europa':Europa,
     ...:                           'RamasOccidentales':RamasOccidentales,
     ...:                           'Latinoamerica':Latinoamerica ,
     ...:                           'Asia':Asia}
     ...: 
     ...: dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
     ...: 
     ...: 
     ...: df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
     ...: 
8.54 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
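
Outside IPython the same comparison can be sketched with the standard timeit module. The helper names with_map and with_select are hypothetical, and absolute timings will differ by machine and library version. The sketch also checks that both approaches produce the same column before timing them.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)

d = {
    "Europa": ["Austria", "Belgium", "Denmark", "France", "Finland"],
    "RamasOccidentales": ["Australia", "New Zealand", "Canada", "United States"],
    "Latinoamerica": ["Brazil", "Chile", "Uruguay"],
    "Asia": ["Indonesia", "Japan", "Sri Lanka"],
}
d1 = {country: region for region, countries in d.items() for country in countries}
df = pd.DataFrame({"Country": np.random.choice(list(d1), size=10000)})

def with_map():
    # Dictionary lookup per value.
    return df["Country"].map(d1)

def with_select():
    # One boolean mask per region, then pick the first matching label.
    cond = [df["Country"].isin(countries) for countries in d.values()]
    return pd.Series(np.select(cond, list(d), default=""), index=df.index)

# Both approaches should agree before their speed is compared.
same = with_map().tolist() == with_select().tolist()
print("identical results:", same)

print("map:   ", timeit.timeit(with_map, number=20))
print("select:", timeit.timeit(with_select, number=20))
```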

【Comments】:

    【Solution 2】:

    Use np.select + Series.isin

    Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
    
    RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
    
    Latinoamerica = ["Brazil","Chile","Uruguay"]
    
    Asia = ["Indonesia","Japan","Sri Lanka"]
    
    
    #using np.select
    cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
    values=['Europa','RamasOccidentales','Latinoamerica','Asia']
    df['Region']=np.select(cond,values)
    
    print(df)
    

       Year  Country Population     GDP  Region
    0  1870  Austria      4,520   8,419  Europa
    1  1870  Belgium      5,096  13,716  Europa
    2  1870  Denmark      1,888   3,782  Europa
    3  1870  Finland      1,754   1,999  Europa
    4  1870   France     38,440  72,100  Europa
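
A note on rows that match none of the lists: np.select fills them with its default argument, which is 0 unless specified. A minimal sketch passing an explicit string default follows; the shortened lists and the "Peru" row are hypothetical. Some newer NumPy releases are reported to raise a TypeError when the integer default is combined with string choices, so an explicit default may also be the safer choice there.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical region lists for illustration.
Europa = ["Austria", "Belgium"]
Asia = ["Japan"]

df = pd.DataFrame({"Country": ["Austria", "Japan", "Peru"]})

cond = [df["Country"].isin(Europa), df["Country"].isin(Asia)]
values = ["Europa", "Asia"]

# default= supplies the label for rows where no condition is True.
df["Region"] = np.select(cond, values, default="Unknown")
print(df)
```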
    

    You can also use a dictionary to build the conditions and the values. In the timings below this was marginally faster.

    classification_countries={'Europa':Europa,
                              'RamasOccidentales':RamasOccidentales,
                              'Latinoamerica':Latinoamerica ,
                              'Asia':Asia}
    
    dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
    
    
    df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
    print(df)
       Year  Country Population     GDP  Region
    0  1870  Austria      4,520   8,419  Europa
    1  1870  Belgium      5,096  13,716  Europa
    2  1870  Denmark      1,888   3,782  Europa
    3  1870  Finland      1,754   1,999  Europa
    4  1870   France     38,440  72,100  Europa
    

    classification_countries={'Europa':Europa,
                              'RamasOccidentales':RamasOccidentales,
                              'Latinoamerica':Latinoamerica ,
                              'Asia':Asia}
    
    cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
    values=[ key for key in classification_countries]
    
    df['Region']=np.select(cond,values)
    print(df)
    
       Year  Country Population     GDP  Region
    0  1870  Austria      4,520   8,419  Europa
    1  1870  Belgium      5,096  13,716  Europa
    2  1870  Denmark      1,888   3,782  Europa
    3  1870  Finland      1,754   1,999  Europa
    4  1870   France     38,440  72,100  Europa
    

    Timings, measured from dictionary creation through print(df), compared with jezrael's solution:

    %%timeit
    dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}   
    df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
    print(df)
    #5.06 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    %%timeit
    cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
    values=[ key for key in classification_countries]
    
    df['Region']=np.select(cond,values)
    print(df)
    #5.18 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    @jezrael's solution:

    %%timeit
    
    d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
    
    df['Region'] = df['Country'].map(d1)
    
    print (df)
    #7.88 ms ± 824 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

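    【Comments】:

      【Solution 3】:

      A very similar alternative is to use a dictionary-based lookup to determine the region. In this implementation you create a dictionary with the countries as keys and their corresponding regions as the values.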

      region_map = {
          'Austria': 'Europa',
          'Brazil': 'Latinoamerica',
          'Japan': 'Asia'  # so on and so forth
      }
      df['Region'] = df['Country'].apply(lambda c: region_map.get(c, 'Unknown'))
      

      This yields the region that the dictionary maps each country to, or the string 'Unknown' if the country has no entry.

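      【Comments】:

      • So the number of keys you need equals the number of countries in the world?
      • @ansev, yes. In that respect, this approach only works well when there is a known, relatively small set of possible values (such as the countries of the world). It trades redundancy for readability (i.e. it avoids things like cascading .isin() method calls).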
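
Since every country needs its own key, it can help to list the countries in the data that the dictionary does not yet cover. A minimal sketch, assuming a partial region_map like the one in the answer (the sample rows are hypothetical):

```python
import pandas as pd

# Partial mapping, as in the answer above.
region_map = {"Austria": "Europa", "Brazil": "Latinoamerica", "Japan": "Asia"}

# Hypothetical sample data containing countries with no entry yet.
df = pd.DataFrame({"Country": ["Austria", "Japan", "Finland", "Finland", "Peru"]})

# Set difference: countries present in the data but absent from the mapping.
missing = sorted(set(df["Country"]) - set(region_map))
print(missing)
```

An empty list means every country in the column has a region assigned.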