【Question title】: Why am I getting NaN when I join two DataFrames that have no NaN in them (multilevel index)?
【Posted】: 2016-03-31 09:27:22
【Question description】:

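I have two DataFrames, r1 and r2, that I want to join on a multilevel index. r1 is defined as

import numpy as np
import pandas as pd

a1=['iso3_o', 'iso3_d', 'year', 'ExportFoodAndLiveAnimals']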
a=np.array([['CAN', 'USA', '1995.0', '5918210.506'],
       ['CAN', 'USA', '1996.0', '6988508.727'],
       ['CAN', 'USA', '1997.0', '7792977.258'],
       ['CAN', 'USA', '1998.0', '8177456.631'],
       ['CAN', 'USA', '1999.0', '8773990.755'],
       ['CAN', 'USA', '2000.0', '9650783.071'],
       ['CAN', 'USA', '2001.0', '10800432.88'],
       ['CAN', 'USA', '2002.0', '11348837.38'],
       ['CAN', 'USA', '2003.0', '11313334.46'],
       ['CAN', 'USA', '2004.0', '12337588.35'],
       ['CAN', 'USA', '2005.0', '13227226.96'],
       ['CAN', 'USA', '2006.0', '14236699.34'],
       ['CAN', 'USA', '2007.0', '15638919.3'],
       ['CAN', 'USA', '2008.0', '17449901.08'],
       ['CAN', 'USA', '2009.0', '14813089.89'],
       ['CAN', 'USA', '2010.0', '16399733.82']])
r1 = pd.DataFrame(a, columns=a1)
r1

and r2 is defined as

a1=['iso3_o', 'iso3_d', 'year', 'contig']
a=np.array([['CAN', 'USA', 1995, 1],
       ['CAN', 'USA', 1996, 1],
       ['CAN', 'USA', 1997, 1],
       ['CAN', 'USA', 1998, 1],
       ['CAN', 'USA', 1999, 1],
       ['CAN', 'USA', 2000, 1],
       ['CAN', 'USA', 2001, 1],
       ['CAN', 'USA', 2002, 1],
       ['CAN', 'USA', 2003, 1],
       ['CAN', 'USA', 2004, 1],
       ['CAN', 'USA', 2005, 1],
       ['CAN', 'USA', 2006, 1],
       ['CAN', 'USA', 2007, 1],
       ['CAN', 'USA', 2008, 1],
       ['CAN', 'USA', 2009, 1],
       ['CAN', 'USA', 2010, 1]])
r2 = pd.DataFrame(a, columns=a1)
r2

Then I decided to join them on their multi-index levels.

To do that, I set the key columns as the index:

multi_r2 = r2.set_index(['iso3_o', 'iso3_d','year'])
multi_r1 = r1.set_index(['iso3_o', 'iso3_d','year'])
df = multi_r2.join(multi_r1)

When I join on 'iso3_o', 'iso3_d', 'year', the resulting DataFrame df contains NaN.

Why does this happen?

Thanks in advance.

【Question comments】:

Tags: python join pandas merge dataframe


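【Solution 1】:

The year columns in r1 and r2 are both str, but the values differ ('1995.0' in r1 versus '1995' in r2), so the index keys never match and the join fills the unmatched rows with NaN. Converting both columns to int fixes it: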

r1['year'] = [int(float(i)) for i in r1['year']]
r2['year'] = [int(i) for i in r2['year']]
multi_r1 = r1.set_index(['iso3_o', 'iso3_d','year'])
multi_r2 = r2.set_index(['iso3_o', 'iso3_d','year'])
df = multi_r2.join(multi_r1)
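A quick way to confirm this kind of mismatch before joining is to compare the key tuples directly. The following is a minimal sketch on two-row stand-ins for r1 and r2; the names keys, left_keys, right_keys and common are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

# Two-row stand-ins with the same layout as r1 and r2 in the question.
r1 = pd.DataFrame(
    np.array([['CAN', 'USA', '1995.0', '5918210.506'],
              ['CAN', 'USA', '1996.0', '6988508.727']]),
    columns=['iso3_o', 'iso3_d', 'year', 'ExportFoodAndLiveAnimals'])

# Mixing str and int inside np.array coerces the numbers to strings,
# so year ends up as text here as well.
r2 = pd.DataFrame(
    np.array([['CAN', 'USA', 1995, 1],
              ['CAN', 'USA', 1996, 1]]),
    columns=['iso3_o', 'iso3_d', 'year', 'contig'])

keys = ['iso3_o', 'iso3_d', 'year']  # hypothetical helper name

# Build the set of key tuples on each side and intersect them.
left_keys = set(map(tuple, r2[keys].values.tolist()))
right_keys = set(map(tuple, r1[keys].values.tolist()))
common = left_keys & right_keys

print(sorted(left_keys))
print(sorted(right_keys))
print(len(common))  # number of keys that a join could actually match
```

If the intersection is empty, every row of the left frame is unmatched, which is why the joined column comes back as NaN.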

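【Comments】:

    【Solution 2】:

    The problem I ran into looks simple, but I wanted to share it with you. As EdChum pointed out, I had to change the data type of year, so I went through the steps below. There may be a simpler way that I am not aware of; if you know one, please share it.

    Extract the values and store them in a numpy array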

    import numpy as np
    a=r1.values
    # np.delete(arr, obj, axis): drop column index 2 (year) along axis 1
    C = np.delete(a, 2, 1)
    

    Create a numeric array for the year variable and concatenate it with the new array

    n=np.array(range(1995,2011)).reshape(1,16)
    C1=np.concatenate((C, n.T), axis=1)
    C1
    

    Take the column names of the previous r1 and reorder them so that year comes last

    cols=list(r1)
    cols
    cols.insert(len(cols)-1, cols.pop(cols.index('year')))
    cols
    

    Recreate the DataFrame r1 as

    r1=pd.DataFrame(C1,columns= cols)
    r1
    

    Then repeat the same steps as before

    multi_r2 = r2.set_index(['iso3_o', 'iso3_d','year'])
    multi_r1 = r1.set_index(['iso3_o', 'iso3_d','year'])
    df = multi_r2.join(multi_r1)
    

    This works fine for me.

    【Comments】:

    • You can actually do this in one line: just cast the column to float and then to int. r1["year"] = r1.year.astype(float).astype(int)
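The one-line cast from the comment can be sketched end to end as follows. This assumes small two-row stand-ins for r1 and r2 whose year columns are strings, as in the question; keys and n_missing are illustrative names, not from the original post.

```python
import pandas as pd

# Stand-ins: year is text on both sides, formatted differently.
r1 = pd.DataFrame({
    'iso3_o': ['CAN', 'CAN'],
    'iso3_d': ['USA', 'USA'],
    'year': ['1995.0', '1996.0'],
    'ExportFoodAndLiveAnimals': ['5918210.506', '6988508.727'],
})
r2 = pd.DataFrame({
    'iso3_o': ['CAN', 'CAN'],
    'iso3_d': ['USA', 'USA'],
    'year': ['1995', '1996'],
    'contig': ['1', '1'],
})

# '1995.0' cannot be parsed as an int directly, so go through float first.
r1['year'] = r1['year'].astype(float).astype(int)
r2['year'] = r2['year'].astype(int)

keys = ['iso3_o', 'iso3_d', 'year']  # hypothetical helper name
df = r2.set_index(keys).join(r1.set_index(keys))

# Count missing cells to check that every key found a match.
n_missing = int(df.isna().sum().sum())
print(df)
print(n_missing)
```

Casting the key columns before calling set_index keeps the conversion in one place and avoids rebuilding the frame from a numpy array.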