【Question title】: Pandas: dataframes won't merge
【Posted】: 2016-04-23 08:13:41
【Question description】:

I have the two dataframes below (they can be found here and here):

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \
                engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

I have only included the code that reads in df, because that file has some unusual formatting issues.

df.dtypes

SICcode     object
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

merged.dtypes

2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

df.columns.tolist()
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC']

merged.columns.tolist()
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode']

df.head(3)

    SICcode     Catcode     Category                          SICname   MultSIC
0   111         A1500   Wheat, corn, soybeans and cash grain    Wheat   X
1   112         A1600   Other commodities (incl rice, peanuts)  Rice    X
2   115         A1500   Wheat, corn, soybeans and cash grain    Corn    X

merged.sort_values('SICcode')

    2012 NAICS Code     2002to2007 NAICS    SICcode
89  212210                       212210     1011
93  212234                       212234     1021
92  212231                       212231     1031
90  212221                       212221     1041
91  212222                       212222     1044
96  212299                       212299     1061
94  212234                       212234     1061
119 213114                       213114     1081
1770    541360                   541360     1081
233     238910                   238910     1081
95  212291                       212291     1094
97  212299                       212299     1099
3   111140                       111140     111
6   111160                       111160     112
4   111150                       111150     115
0   111110                       111110     116

I am trying to merge them with the following code: merged=pd.merge(merged,df, how='right', on='SICcode')

The result is as follows:

2012 NAICS Code        0
2002to2007 NAICS       0
SICcode             1007
Catcode              991
Category            1007
SICname             1007
MultSIC              906
dtype: int64

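I suspect the problem lies in the formatting of df, but I don't know how to describe it (I have heard the term "white space", which may be relevant here) or how to fix it. Does anyone have any ideas?

【Question discussion】:

    Tags: python pandas dataframe merge formatting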


    【Solution 1】:

    I believe this row is the cause of your problem:

    In [47]: merged[merged.SICcode == 'Aux']
    Out[47]:
          2012 NAICS Code  2002to2007 NAICS SICcode
    1828         551114.0          551114.0     Aux
    

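    It makes the SICcode column in merged a string (object) column, so the two key columns end up with different data types: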

    In [61]: df.dtypes
    Out[61]:
    SICcode      int64
    Catcode     object
    Category    object
    SICname     object
    MultSIC     object
    dtype: object
    
    In [62]: merged.dtypes
    Out[62]:
    2012 NAICS Code     float64
    2002to2007 NAICS    float64
    SICcode              object
    dtype: object
    
    In [63]: df.SICcode.unique()
    Out[63]: array([ 111,  112,  115, ..., 9711, 9721, 9999], dtype=int64)
    
    In [64]: merged.SICcode.head(10).unique()
    Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object)
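
A minimal sketch of how a stray value like 'Aux' could be located in a key column. The small frame below is a made-up stand-in for `merged`, not the real data; only the 'Aux' row is taken from the output above.

```python
import pandas as pd

# Toy stand-in for the `merged` frame (illustrative values only).
merged = pd.DataFrame({
    "2012 NAICS Code": [111140.0, 111160.0, 551114.0],
    "SICcode": ["111", "112", "Aux"],
})

# With errors='coerce', anything that cannot be parsed as a number becomes NaN.
as_number = pd.to_numeric(merged["SICcode"], errors="coerce")

# Rows that had a value originally but failed the conversion.
bad_rows = merged[as_number.isna() & merged["SICcode"].notna()]
print(bad_rows)
```

Any row that shows up in `bad_rows` is enough to force the whole column to `object` when the file is read.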
    

    So you can convert the key column to numeric before merging:

    url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
    df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')
    
    url='https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
    merged = pd.read_csv(url, index_col=0)
    
    # clean the data: SICcode values that are not numeric (such as 'Aux') become NaN
    merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')
    
    mrg = df.merge(merged, on='SICcode', how='left')
    
    mrg.head()
    

    Output:

    In [51]: mrg.head()
    Out[51]:
       SICcode Catcode                                       Category  \
    0      111   A1500           Wheat, corn, soybeans and cash grain
    1      112   A1600  Other commodities (incl rice, peanuts, honey)
    2      115   A1500           Wheat, corn, soybeans and cash grain
    3      116   A1500           Wheat, corn, soybeans and cash grain
    4      119   A1500           Wheat, corn, soybeans and cash grain
    
                SICname MultSIC  2012 NAICS Code  2002to2007 NAICS
    0             Wheat       X         111140.0          111140.0
    1              Rice       X         111160.0          111160.0
    2              Corn       X         111150.0          111150.0
    3          Soybeans       X         111110.0          111110.0
    4  Cash grains, NEC       X         111120.0          111120.0
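
Since the question mentions white space, here is a hedged alternative sketch: cast both keys to stripped strings so they share one dtype, then use `indicator=True` to see which rows actually matched. The two frames are made-up stand-ins, and the trailing space in `"111 "` is an assumption added only to illustrate the stripping step.

```python
import pandas as pd

# Toy stand-ins for the two frames (illustrative values only).
df = pd.DataFrame({
    "SICcode": [111, 112, 115],
    "Catcode": ["A1500", "A1600", "A1500"],
})
merged = pd.DataFrame({
    "SICcode": ["111 ", "112", "Aux"],  # note the stray trailing space
    "2012 NAICS Code": [111140.0, 111160.0, 551114.0],
})

# Give both key columns the same dtype and remove surrounding white space.
df["SICcode"] = df["SICcode"].astype(str).str.strip()
merged["SICcode"] = merged["SICcode"].astype(str).str.strip()

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
check = df.merge(merged, on="SICcode", how="left", indicator=True)
print(check["_merge"].value_counts())
```

One caveat with this approach: `astype(str)` on a float column yields strings such as `'111.0'`, so it only lines up when the numeric side is stored as integers.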
    

    【Discussion】:

    • @MichaelPerdue, always glad to help :)