【Question Title】: Pandas dataframes merge repeats values to get aligned
【Posted】: 2019-03-01 21:46:10
【Question】:

Here are the links to the original data sources: dataset for capacity, dataset for type

or the modified versions: dataset modified1, dataset modified2

I have 2 dataframes that I want to merge:

  import pandas as pd

  first_df = pd.DataFrame([['2001','Abu Dhabi','100-','462'],
                           ['2001','Abu Dhabi','100','44'],
                           ['2001','Abu Dhabi','200','462'],
                           ['2001','Dubai','100-','40'],
                           ['2001','Dubai','100','30'],
                           ['2001','Dubai','200','51'],
                           ['2002','Abu Dhabi','100-','300'],
                           ['2002','Abu Dhabi','100','220'],
                           ['2002','Abu Dhabi','200','56'],
                           ['2002','Dubai','100-','55'],
                           ['2002','Dubai','100','67'],
                           ['2002','Dubai','200','89']],
                          columns=['Year','Emirate','Capacity','Number'])

  second_df = pd.DataFrame([['2001','Abu Dhabi','Performed','45'],
                            ['2001','Abu Dhabi','Not Performed','76'],
                            ['2001','Dubai','Performed','90'],
                            ['2001','Dubai','Not Performed','50'],
                            ['2002','Abu Dhabi','Performed','78'],
                            ['2002','Abu Dhabi','Not Performed','45'],
                            ['2002','Dubai','Performed','76'],
                            ['2002','Dubai','Not Performed','58']],
                           columns=['Year','Emirate','Type','Value'])

So I set a MultiIndex on both dataframes:

first=first_df.set_index(['Year','Emirate'])
second=second_df.set_index(['Year','Emirate'])

and merged them:

merged=first.merge(second,how='outer',right_index=True,left_index=True)

The result is the following:

Merged

| Year , Emirate        | Capacity | Number | Type          | Value |
|:----------------------|:---------|-------:|:--------------|------:|
| ('2001', 'Abu Dhabi') | 100-     |    462 | Performed     |    45 |
| ('2001', 'Abu Dhabi') | 100-     |    462 | Not Performed |    76 |
| ('2001', 'Abu Dhabi') | 100      |     44 | Performed     |    45 |
| ('2001', 'Abu Dhabi') | 100      |     44 | Not Performed |    76 |
| ('2001', 'Abu Dhabi') | 200      |    657 | Performed     |    45 |
| ('2001', 'Abu Dhabi') | 200      |    657 | Not Performed |    76 |
| ('2001', 'Dubai')     | 100-     |     40 | Performed     |    90 |
| ('2001', 'Dubai')     | 100-     |     40 | Not Performed |    50 |
| ('2001', 'Dubai')     | 100      |     30 | Performed     |    90 |
| ('2001', 'Dubai')     | 100      |     30 | Not Performed |    50 |
| ('2001', 'Dubai')     | 200      |     51 | Performed     |    90 |
| ('2001', 'Dubai')     | 200      |     51 | Not Performed |    50 |
| ('2002', 'Abu Dhabi') | 100-     |    300 | Performed     |    78 |
| ('2002', 'Abu Dhabi') | 100-     |    300 | Not Performed |    45 |
| ('2002', 'Abu Dhabi') | 100      |    220 | Performed     |    78 |
| ('2002', 'Abu Dhabi') | 100      |    220 | Not Performed |    45 |
| ('2002', 'Abu Dhabi') | 200      |     56 | Performed     |    78 |
| ('2002', 'Abu Dhabi') | 200      |     56 | Not Performed |    45 |
| ('2002', 'Dubai')     | 100-     |     55 | Performed     |    76 |
| ('2002', 'Dubai')     | 100-     |     55 | Not Performed |    58 |
| ('2002', 'Dubai')     | 100      |     67 | Performed     |    76 |
| ('2002', 'Dubai')     | 100      |     67 | Not Performed |    58 |
| ('2002', 'Dubai')     | 200      |     89 | Performed     |    76 |
| ('2002', 'Dubai')     | 200      |     89 | Not Performed |    58 |

I also tried concatenating, with the following result:

joined=pd.concat([first,second])

Joined

| Year , Emirate        | Capacity | Number | Type          | Value |
|:----------------------|:---------|-------:|:--------------|------:|
| ('2001', 'Abu Dhabi') | 100-     |    462 | nan           |   nan |
| ('2001', 'Abu Dhabi') | 100      |     44 | nan           |   nan |
| ('2001', 'Abu Dhabi') | 200      |    657 | nan           |   nan |
| ('2001', 'Dubai')     | 100-     |     40 | nan           |   nan |
| ('2001', 'Dubai')     | 100      |     30 | nan           |   nan |
| ('2001', 'Dubai')     | 200      |     51 | nan           |   nan |
| ('2002', 'Abu Dhabi') | 100-     |    300 | nan           |   nan |
| ('2002', 'Abu Dhabi') | 100      |    220 | nan           |   nan |
| ('2002', 'Abu Dhabi') | 200      |     56 | nan           |   nan |
| ('2002', 'Dubai')     | 100-     |     55 | nan           |   nan |
| ('2002', 'Dubai')     | 100      |     67 | nan           |   nan |
| ('2002', 'Dubai')     | 200      |     89 | nan           |   nan |
| ('2001', 'Abu Dhabi') | nan      |    nan | Performed     |    45 |
| ('2001', 'Abu Dhabi') | nan      |    nan | Not Performed |    76 |
| ('2001', 'Dubai')     | nan      |    nan | Performed     |    90 |
| ('2001', 'Dubai')     | nan      |    nan | Not Performed |    50 |
| ('2002', 'Abu Dhabi') | nan      |    nan | Performed     |    78 |
| ('2002', 'Abu Dhabi') | nan      |    nan | Not Performed |    45 |
| ('2002', 'Dubai')     | nan      |    nan | Performed     |    76 |
| ('2002', 'Dubai')     | nan      |    nan | Not Performed |    58 |

So the two dataframes joined together should neither repeat values (as in the merge above) nor be shifted down (as in the concat variant). Is there a solution that aligns the 2 dataframes nicely?

Here is what the desired output should look like:

|    | Year | Emirate   | Capacity | Number | Type          | Value |
|---:|-----:|:----------|:---------|-------:|:--------------|------:|
|  0 |      |           | 100-     |    462 | Performed     |    45 |
|  1 |      | Abu Dhabi | 100      |     44 | Not Performed |    76 |
|  2 |      |           | 200      |    657 | NaN           |   nan |
|  3 | 2001 |           | 100-     |     40 | Performed     |    90 |
|  4 |      | Dubai     | 100      |     30 | Not Performed |    50 |
|  5 |      |           | 200      |     51 | NaN           |   nan |
|  6 |      |           | 100-     |    300 | Performed     |    78 |
|  7 |      | Abu Dhabi | 100      |    220 | Not Performed |    45 |
|  8 | 2002 |           | 200      |     56 | NaN           |   nan |
|  9 |      |           | 100-     |     55 | Performed     |    76 |
| 10 |      | Dubai     | 100      |     67 | Not Performed |    58 |
| 11 |      |           | 200      |     89 | NaN           |   nan |


【Question Comments】:

  • What is your expected output? Which rows of the merged dataframe do you consider duplicates?
  • I did merge1=first.merge(second,how='inner',right_index=True,left_index=True).drop_duplicates() and got the same number of rows. As already commented, please point out all the "duplicates" in the question
  • @Erfan Yes, I have added the expected output
  • @Ravi Yes, I did try drop_duplicates() before but didn't get a good alignment

Tags: python sql database pandas dataframe


【Solution 1】:

I see the problem here: when you join on ['Year','Emirate'], your data produces a cross join. For example, 2001 Abu Dhabi joins 2001 Abu Dhabi in the other dataframe for both "Performed" and "Not Performed". Essentially this is an m x n relational join of the datasets. Unless you specify a key that uniquely identifies each row, you will keep getting the same result.
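The m x n behaviour is easy to reproduce and detect. In this small sketch (using a subset of the question's data), three left rows and two right rows share the key ('2001', 'Abu Dhabi'), so the merge yields 3 × 2 = 6 rows; merge's validate parameter can be used to fail loudly instead of silently cross-joining:

```python
import pandas as pd

first_df = pd.DataFrame([['2001', 'Abu Dhabi', '100-', '462'],
                         ['2001', 'Abu Dhabi', '100', '44'],
                         ['2001', 'Abu Dhabi', '200', '462']],
                        columns=['Year', 'Emirate', 'Capacity', 'Number'])
second_df = pd.DataFrame([['2001', 'Abu Dhabi', 'Performed', '45'],
                          ['2001', 'Abu Dhabi', 'Not Performed', '76']],
                         columns=['Year', 'Emirate', 'Type', 'Value'])

# All 3 left rows match all 2 right rows on the key -> 3 * 2 = 6 result rows
merged = first_df.merge(second_df, on=['Year', 'Emirate'], how='outer')
print(len(merged))  # 6

# validate='one_to_one' raises MergeError when the join keys are not unique
try:
    first_df.merge(second_df, on=['Year', 'Emirate'], validate='one_to_one')
except pd.errors.MergeError as exc:
    print('not a one-to-one join:', exc)
```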

【Comments】:

  • Yes, that looks like a good idea; I use it in SQL. Can you show how a primary key works with pandas? I've never come across it, and I didn't find much after googling...
  • An index in pandas works like a primary key; here you used 2 columns, similar to a composite key (as in SQL). However, even your composite key cannot identify unique rows. For example, if you search your df for "2001 Abu Dhabi", you get 3 rows rather than a unique one.
  • Hmm, the thing is that all rows are unique -- no two rows are exactly the same; at least one value always differs between any two rows. That's why some values get repeated, to eventually produce unique matches.
  • Do you have any suggestions for handling this specifically with the pandas library?
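To make the primary-key point above concrete, here is a small sketch (using a subset of the question's first_df) showing that a pandas MultiIndex, unlike a SQL primary key, does not enforce uniqueness:

```python
import pandas as pd

first_df = pd.DataFrame([['2001', 'Abu Dhabi', '100-', '462'],
                         ['2001', 'Abu Dhabi', '100', '44'],
                         ['2001', 'Abu Dhabi', '200', '462']],
                        columns=['Year', 'Emirate', 'Capacity', 'Number'])
first = first_df.set_index(['Year', 'Emirate'])

# A pandas index may contain duplicate labels; nothing prevents this.
print(first.index.is_unique)                   # False
# Looking up one key therefore returns several rows, not a unique one.
print(len(first.loc[('2001', 'Abu Dhabi')]))   # 3
```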
【Solution 2】:

I assume your data is not yet correct, because your expected output is achievable, but it does not follow from your current logic.

You are missing a third key column in second_df, namely Capacity. If we add this column and perform a left merge, we can achieve your expected output.

By the way, we don't need to set the columns as the index, so the solution looks like this:

import pandas as pd

# Clean up and create the corrected dataframes
first_df=pd.DataFrame([['2001','Abu Dhabi','100-','462'],
                       ['2001','Abu Dhabi','100','44'],
                       ['2001','Abu Dhabi','200','657'],
                       ['2001','Dubai','100-','40'],
                       ['2001','Dubai','100','30'],
                       ['2001','Dubai','200','51'],
                       ['2002','Abu Dhabi','100-','300'],
                       ['2002','Abu Dhabi','100','220'],
                       ['2002','Abu Dhabi','200','56'],
                       ['2002','Dubai','100-','55'],
                       ['2002','Dubai','100','67'],
                       ['2002','Dubai','200','89']],columns=['Year','Emirate','Capacity','Number'])
second_df=pd.DataFrame([['2001','Abu Dhabi','100-','Performed','45'],
                        ['2001','Abu Dhabi','100','Not Performed','76'],
                        ['2001','Abu Dhabi','','',''],
                        ['2001','Dubai','100-','Performed','90'],
                        ['2001','Dubai','100','Not Performed','50'],
                        ['2001','Dubai','','',''],
                        ['2002','Abu Dhabi','100-','Performed','78'],
                        ['2002','Abu Dhabi','100','Not Performed','45'],
                        ['2002','Abu Dhabi','', '', ''],
                        ['2002','Dubai','100-','Performed','76'],
                        ['2002','Dubai','100','Not Performed','58'],
                        ['2002','Dubai', '', '', '']],columns=['Year','Emirate','Capacity','Type','Value'])

# Perform a left merge to get correct output
merged=first_df.merge(second_df,how='left',on=['Year', 'Emirate', 'Capacity'])

Output

    Year    Emirate     Capacity    Number  Type            Value
0   2001    Abu Dhabi   100-        462     Performed       45
1   2001    Abu Dhabi   100         44      Not Performed   76
2   2001    Abu Dhabi   200         657     NaN             NaN
3   2001    Dubai       100-        40      Performed       90
4   2001    Dubai       100         30      Not Performed   50
5   2001    Dubai       200         51      NaN             NaN
6   2002    Abu Dhabi   100-        300     Performed       78
7   2002    Abu Dhabi   100         220     Not Performed   45
8   2002    Abu Dhabi   200         56      NaN             NaN
9   2002    Dubai       100-        55      Performed       76
10  2002    Dubai       100         67      Not Performed   58
11  2002    Dubai       200         89      NaN             NaN
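If hand-editing second_df is not an option, a possible variation (my own sketch, not part of the answer above, and it assumes the rows of both frames appear in matching order within each (Year, Emirate) group) is to derive the missing key from the row position with groupby(...).cumcount():

```python
import pandas as pd

first_df = pd.DataFrame([['2001','Abu Dhabi','100-','462'], ['2001','Abu Dhabi','100','44'],
                         ['2001','Abu Dhabi','200','657'],  ['2001','Dubai','100-','40'],
                         ['2001','Dubai','100','30'],       ['2001','Dubai','200','51'],
                         ['2002','Abu Dhabi','100-','300'], ['2002','Abu Dhabi','100','220'],
                         ['2002','Abu Dhabi','200','56'],   ['2002','Dubai','100-','55'],
                         ['2002','Dubai','100','67'],       ['2002','Dubai','200','89']],
                        columns=['Year','Emirate','Capacity','Number'])
second_df = pd.DataFrame([['2001','Abu Dhabi','Performed','45'], ['2001','Abu Dhabi','Not Performed','76'],
                          ['2001','Dubai','Performed','90'],     ['2001','Dubai','Not Performed','50'],
                          ['2002','Abu Dhabi','Performed','78'], ['2002','Abu Dhabi','Not Performed','45'],
                          ['2002','Dubai','Performed','76'],     ['2002','Dubai','Not Performed','58']],
                         columns=['Year','Emirate','Type','Value'])

# Number each row within its (Year, Emirate) group and merge on that
# position as well; first_df rows with no positional partner get NaN.
first_df['pos'] = first_df.groupby(['Year', 'Emirate']).cumcount()
second_df['pos'] = second_df.groupby(['Year', 'Emirate']).cumcount()

merged = (first_df.merge(second_df, how='left', on=['Year', 'Emirate', 'pos'])
                  .drop(columns='pos'))
print(merged)
```

This yields the same 12-row frame as the answer's left merge, without editing second_df by hand.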

【Comments】:

  • This is an interesting solution, and the output looks good. My concern is that the original data has a much wider range of capacity types -- the first dataset has 8 capacities (e.g. 100-, 100, 200, 400, 600, 800, 1000, 1000+), and both datasets cover 7 emirates: Dubai, Abu Dhabi, Sharjah, etc...
  • I'll try to add links to the files