【Question title】: pandas dataframe insert values according to range of another column values
【Posted】: 2017-10-14 23:21:57
【Question】:

I have the dataframe below, and I want to insert a string based on the value in the sic2 column.

        conm                      sic2
115466  ALLEGION PLC              34.0
115471  AGILITY HEALTH INC        80.0
115473  NORDIC AMERICAN OFFSHORE  44.0
115474  AAD                       54.0
115477  DORIAN LPG LTD            44.0
115484  NOMAD FOODS LTD           20.0
115486  ATHENE HOLDING LTD        63.0
115490  MIDATECH PHARMA PLC       28.0
115495  MOTIF BIO PLC             28.0

The sic2 number ranges map to strings as follows.

1-9 Agriculture, Forestry and Fishing
10-14   Mining
15-17   Construction
18-19   not used
20-39   Manufacturing
40-49   Transportation, Communications, Electric, Gas and Sanitary service
50-51   Wholesale Trade
52-59   Retail Trade
60-67   Finance, Insurance and Real Estate
70-89   Services
91-97   Public Administration
99-99   Nonclassifiable
0 -1    Agricultural Production-Crops

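How can I make the pandas.DataFrame look like the one below, applied across a large dataset?

I tried several conditional approaches, but they always failed.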

        conm                      sic2  industry
115466  ALLEGION PLC              34.0  Manufacturing
115471  AGILITY HEALTH INC        80.0  Services
115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, Electric, Gas and Sanitary service
115474  AAD                       54.0  Retail Trade

【Comments】:

    Tags: python pandas dataframe range conditional-statements


    【Solution 1】:

    If you convert the SIC ranges into a dictionary, looking up the industry as needed is fairly straightforward:

    Code:

    sic = [x.strip().split(' ', 1) for x in """
        1-9 Agriculture, Forestry and Fishing
        10-14 Mining
        15-17 Construction
        18-19 not used
        20-39 Manufacturing
        40-49 Transportation, Communications, ...
        50-51 Wholesale Trade
        52-59 Retail Trade
        60-67 Finance, Insurance and Real Estate
        70-89 Services
        91-97 Public Administration
        99-99 Nonclassifiable
    """.split('\n')[1:-1]]
    
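    # Expand each inclusive "lo-hi" range into one dictionary entry per code.
    # range() excludes its stop value, so hi + 1 keeps the upper bound
    # (otherwise codes such as 9, 39 and 99 would be missing).
    sic_dict = {}
    for v, z in sic:
        lo, hi = (int(y) for y in v.split('-'))
        for x in range(lo, hi + 1):
            sic_dict[x] = z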
    

    Test code:

    import pandas as pd
    from io import StringIO

    df = pd.read_fwf(StringIO(u"""
        number  conm                      sic2
        115466  ALLEGION PLC              34.0
        115471  AGILITY HEALTH INC        80.0
        115473  NORDIC AMERICAN OFFSHORE  44.0
        115474  AAD                       54.0
        115477  DORIAN LPG LTD            44.0
        115484  NOMAD FOODS LTD           20.0
        115486  ATHENE HOLDING LTD        63.0
        115490  MIDATECH PHARMA PLC       28.0
        115495  MOTIF BIO PLC             28.0"""), header=1)
    
    df['industry'] = df.sic2.apply(lambda x: sic_dict[int(x)])
    
    print(df)
    

    Result:

       number                      conm  sic2                             industry
    0  115466              ALLEGION PLC  34.0                        Manufacturing
    1  115471        AGILITY HEALTH INC  80.0                             Services
    2  115473  NORDIC AMERICAN OFFSHORE  44.0  Transportation, Communications, ...
    3  115474                       AAD  54.0                         Retail Trade
    4  115477            DORIAN LPG LTD  44.0  Transportation, Communications, ...
    5  115484           NOMAD FOODS LTD  20.0                        Manufacturing
    6  115486        ATHENE HOLDING LTD  63.0   Finance, Insurance and Real Estate
    7  115490       MIDATECH PHARMA PLC  28.0                        Manufacturing
    8  115495             MOTIF BIO PLC  28.0                        Manufacturing
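The lookup `sic_dict[int(x)]` above raises `KeyError` for a code that falls in none of the ranges (90, for example) and `ValueError` for a missing value, because `int(nan)` fails. The following is a minimal sketch of a more tolerant lookup. The helper name `lookup_industry`, the shortened dictionary and the last two rows of the frame are hypothetical and only serve the illustration.

```python
import pandas as pd

# Illustrative subset of the code-to-industry dictionary (not the full table).
sic_dict = {code: "Manufacturing" for code in range(20, 40)}
sic_dict.update({code: "Services" for code in range(70, 90)})


def lookup_industry(code, default=None):
    """Return the industry for a SIC2 code, or `default` if it is unknown or NaN."""
    if pd.isna(code):
        return default
    return sic_dict.get(int(code), default)


# The last two rows are made up: one unmapped code and one missing code.
df = pd.DataFrame({
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "UNMAPPED CO", "NO CODE CO"],
    "sic2": [34.0, 80.0, 90.0, float("nan")],
})
df["industry"] = df["sic2"].apply(lookup_industry)
print(df)
```

Rows without a match end up with a missing value in `industry`, so they can be found afterwards with `df["industry"].isna()` instead of stopping the whole run.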
    

    【Comments】:

      【Solution 2】:
      import pandas as pd

      # Save your mapping table to a data frame
      
      df2 = pd.DataFrame({'id_end': {0: 9,  1: 14,  2: 17,  3: 19,  4: 39,  5: 49,  6: 51,  7: 59,  8: 67,  9: 89,  10: 97,  11: 99,  12: 1},
       'id_start': {0: 1,  1: 10,  2: 15,  3: 18,  4: 20,  5: 40,  6: 50,  7: 52,  8: 60,  9: 70,  10: 91,  11: 99,  12: 0},
       'industry': {0: 'Agriculture, Forestry and Fishing',  1: 'Mining',  2: 'Construction',  3: 'not used',  4: 'Manufacturing',
        5: 'Transportation, Communications, Electric, Gas and Sanitary service',
        6: 'Wholesale Trade',  7: 'Retail Trade',  8: 'Finance, Insurance and Real Estate',  9: 'Services',  
        10: 'Public Administration',  11: 'Nonclassifiable',  12: 'Agricultural Production Crops'}})
      
      df2 = df2.sort_values(by='id_end')
      
      Out[354]: 
          id_end  id_start                                           industry
      12       1         0                      Agricultural Production Crops
      0        9         1                  Agriculture, Forestry and Fishing
      1       14        10                                             Mining
      2       17        15                                       Construction
      3       19        18                                           not used
      4       39        20                                      Manufacturing
      5       49        40  Transportation, Communications, Electric, Gas ...
      6       51        50                                    Wholesale Trade
      7       59        52                                       Retail Trade
      8       67        60                 Finance, Insurance and Real Estate
      9       89        70                                           Services
      10      97        91                              Public Administration
      11      99        99                                    Nonclassifiable
      
      # Map each sic2 number to an industry name: take the first row
      # (sorted by id_end) whose id_end is >= the code
      df['industry'] = df['sic2'].astype(int).apply(lambda x: df2.loc[df2.id_end>=x,'industry'].iloc[0])
      
      
      Out[352]: 
                                  conm  sic2                                             industry
      115466              ALLEGION PLC  34.0                                        Manufacturing 
      115471        AGILITY HEALTH INC  80.0                                             Services 
      115473  NORDIC AMERICAN OFFSHORE  44.0    Transportation, Communications, Electric, Gas ... 
      115474                       AAD  54.0                                         Retail Trade 
      115477            DORIAN LPG LTD  44.0    Transportation, Communications, Electric, Gas ... 
      115484           NOMAD FOODS LTD  20.0                                        Manufacturing 
      115486        ATHENE HOLDING LTD  63.0                   Finance, Insurance and Real Estate 
      115490       MIDATECH PHARMA PLC  28.0                                        Manufacturing 
      115495             MOTIF BIO PLC  28.0                                        Manufacturing 
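The row-by-row lookup above only compares against `id_end`, so a code in a gap between ranges (68, 90 or 98) is assigned to the next range, and a code above 99 raises `IndexError`. A possible vectorized variant that checks both bounds is sketched below using `numpy.searchsorted`. It is not part of the original answer: the function name `industry_for` is hypothetical, and the overlapping `0 -1` entry from the question is left out on the assumption that ranges must not overlap.

```python
import numpy as np
import pandas as pd

# Inclusive (start, end, label) ranges from the question, sorted by start.
ranges = [
    (1, 9, "Agriculture, Forestry and Fishing"),
    (10, 14, "Mining"),
    (15, 17, "Construction"),
    (18, 19, "not used"),
    (20, 39, "Manufacturing"),
    (40, 49, "Transportation, Communications, Electric, Gas and Sanitary service"),
    (50, 51, "Wholesale Trade"),
    (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance and Real Estate"),
    (70, 89, "Services"),
    (91, 97, "Public Administration"),
    (99, 99, "Nonclassifiable"),
]
starts = np.array([r[0] for r in ranges])
ends = np.array([r[1] for r in ranges])
labels = np.array([r[2] for r in ranges], dtype=object)


def industry_for(codes):
    """Return an object array of labels; None where a code is in no range."""
    codes = np.asarray(codes, dtype=float)
    # Index of the last range whose start is <= code (-1 if there is none).
    pos = np.searchsorted(starts, codes, side="right") - 1
    safe = np.clip(pos, 0, len(ranges) - 1)
    # A code matches only if it also lies at or below that range's end.
    inside = (pos >= 0) & (codes <= ends[safe])
    return np.where(inside, labels[safe], None)


df = pd.DataFrame(
    {"conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "AAD"],
     "sic2": [34.0, 80.0, 54.0]},
    index=[115466, 115471, 115474],
)
df["industry"] = industry_for(df["sic2"])
print(df)
```

Because the whole column is processed in one call, this avoids filtering `df2` once per row, which may matter on a large dataset.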
      

      【Comments】:
