【Question title】: Concatenate two dataframes by column
【Posted】: 2019-01-05 12:30:51
【Question】:

I have 2 dataframes. The first one contains years, with every count set to 0:

    year  count
0    1890      0
1    1891      0
2    1892      0
3    1893      0
4    1894      0
5    1895      0
6    1896      0
7    1897      0
8    1898      0
9    1899      0
10   1900      0
11   1901      0
12   1902      0
13   1903      0
14   1904      0
15   1905      0
16   1906      0
17   1907      0
18   1908      0
19   1909      0
20   1910      0
21   1911      0
22   1912      0
23   1913      0
24   1914      0
25   1915      0
26   1916      0
27   1917      0
28   1918      0
29   1919      0
..    ...    ...
90   1980      0
91   1981      0
92   1982      0
93   1983      0
94   1984      0
95   1985      0
96   1986      0
97   1987      0
98   1988      0
99   1989      0
100  1990      0
101  1991      0
102  1992      0
103  1993      0
104  1994      0
105  1995      0
106  1996      0
107  1997      0
108  1998      0
109  1999      0
110  2000      0
111  2001      0
112  2002      0
113  2003      0
114  2004      0
115  2005      0
116  2006      0
117  2007      0
118  2008      0
119  2009      0

[120 rows x 2 columns]

The second dataframe has the same columns, but it covers fewer years and its counts are filled in:

  year  count
0   1970      1
1   1957      7
2   1947     19
3   1987     12
4   1979      7
5   1940      1
6   1950     19
7   1972      4
8   1954     15
9   1976     15
10  2006      3
11  1963     16
12  1980      6
13  1956     13
14  1967      5
15  1893      1
16  1985      5
17  1964      6
18  1949     11
19  1945     15
20  1948     16
21  1959     16
22  1958     12
23  1929      1
24  1965     12
25  1969     15
26  1946     12
27  1961      1
28  1988      1
29  1918      1
30  1999      3
31  1986      3
32  1981      2
33  1960      2
34  1974      4
35  1953      9
36  1968     11
37  1916      2
38  1955      5
39  1978      1
40  2003      1
41  1982      4
42  1984      3
43  1966      4
44  1983      3
45  1962      3
46  1952      4
47  1992      2
48  1973      4
49  1993     10
50  1975      2
51  1900      1
52  1991      1
53  1907      1
54  1977      4
55  1908      1
56  1998      2
57  1997      3
58  1895      1

I want to create a third dataframe, df3. For each row, if the years in df1 and df2 are equal, then df3["count"] = df2["count"]; otherwise df3["count"] = df1["count"]. I tried to do this with join:

df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)

But I get an error:

ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
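
For context, here is a hedged sketch of how this error can arise. `DataFrame.join(other, on=...)` matches the caller's column against the index of `other`, and it raises when the two frames share a column name and no suffix is given. The frames and the names `left`, `right`, `right_indexed`, and `count_right` below are made up for illustration and are not the frames from the question.

```python
import pandas as pd

# Hypothetical sample frames that both carry a "year" column
left = pd.DataFrame({"year": [1893, 1895], "count": [1, 1]})
right = pd.DataFrame({"year": [1890, 1893, 1895]})

message = None
try:
    # "year" exists in both frames and no suffix is given, so join raises
    left.join(right, on="year", how="left")
except ValueError as exc:
    message = str(exc)
    print(message)

# With the years moved into the index of the right-hand frame,
# no column names overlap and the join aligns on the year values
right_indexed = pd.DataFrame(
    {"count_right": [0, 0, 0]},
    index=pd.Index([1890, 1893, 1895], name="year"),
)
joined = left.join(right_indexed, on="year", how="left")
print(joined)
```

Adding suffixes silences the error, but the join still matches `year` against the other frame's index, so it only lines up as intended when that index actually holds the years.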

I found a solution for this error (Pandas join issue: columns overlap but no suffix specified) and ran the code with these changes:

df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)

But the output is not what I want:

    count  year
0     NaN  1890
1     NaN  1891
2     NaN  1892
3     NaN  1893
4     NaN  1894
5     NaN  1895
6     NaN  1896
7     NaN  1897
8     NaN  1898
9     NaN  1899
10    NaN  1900
11    NaN  1901
12    NaN  1902
13    NaN  1903
14    NaN  1904
15    NaN  1905
16    NaN  1906
17    NaN  1907
18    NaN  1908
19    NaN  1909
20    NaN  1910
21    NaN  1911
22    NaN  1912
23    NaN  1913
24    NaN  1914
25    NaN  1915
26    NaN  1916
27    NaN  1917
28    NaN  1918
29    NaN  1919
..    ...   ...
29    1.0  1918
30    3.0  1999
31    3.0  1986
32    2.0  1981
33    2.0  1960
34    4.0  1974
35    9.0  1953
36   11.0  1968
37    2.0  1916
38    5.0  1955
39    1.0  1978
40    1.0  2003
41    4.0  1982
42    3.0  1984
43    4.0  1966
44    3.0  1983
45    3.0  1962
46    4.0  1952
47    2.0  1992
48    4.0  1973
49   10.0  1993
50    2.0  1975
51    1.0  1900
52    1.0  1991
53    1.0  1907
54    4.0  1977
55    1.0  1908
56    2.0  1998
57    3.0  1997
58    1.0  1895

[179 rows x 2 columns]

The desired output is:

     year  count
0    1890      0
1    1891      0
2    1892      0
3    1893      1
4    1894      0
5    1895      1
6    1896      0
7    1897      0
8    1898      0
9    1899      0
10   1900      1
11   1901      0
12   1902      0
13   1903      0
14   1904      0
15   1905      0
16   1906      0
17   1907      1
18   1908      1
19   1909      0
20   1910      0
21   1911      0
22   1912      0
23   1913      0
24   1914      0
25   1915      0
26   1916      2
27   1917      0
28   1918      1
29   1919      0
..    ...    ...
90   1980      6
91   1981      2
92   1982      4
93   1983      3
94   1984      3
95   1985      5
96   1986      3
97   1987     12
98   1988      1
99   1989      0
100  1990      0
101  1991      1
102  1992      2
103  1993     10
104  1994      0
105  1995      0
106  1996      0
107  1997      3
108  1998      2
109  1999      3
110  2000      0
111  2001      0
112  2002      0
113  2003      1
114  2004      0
115  2005      0
116  2006      3
117  2007      0
118  2008      0
119  2009      0

[120 rows x 2 columns]

【Question comments】:

    Tags: python-3.x pandas dataframe join


    【Solution 1】:

    The problem is that you should use year as the index. Also, if you don't want to lose data, you should join with outer instead of left.

    Here is my code:

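    import numpy as np
    import pandas as pd

    # Random sample data standing in for the two frames in the question
    df = pd.DataFrame({
        "year": np.random.randint(1850, 2000, size=(100,)),
        "qty": np.random.randint(0, 10, size=(100,)),
    })

    df2 = pd.DataFrame({
        "year": np.random.randint(1850, 2000, size=(100,)),
        "qty": np.random.randint(0, 10, size=(100,)),
    })

    # Use year as the index so that join aligns the rows on it
    df = df.set_index("year")
    df2 = df2.set_index("year")

    # The outer join keeps the years from both frames; missing values become 0
    df3 = df.join(df2["qty"], how="outer", lsuffix="_left", rsuffix="_right")
    df3 = df3.fillna(0)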
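
To make the shape of the joined frame concrete, here is a small deterministic sketch of the same join. The sample values are made up to keep the output short; they are an assumption, not data from the question.

```python
import pandas as pd

# Made-up sample values, chosen only to keep the output small
df = pd.DataFrame({"year": [1890, 1891, 1892], "qty": [0, 0, 0]}).set_index("year")
df2 = pd.DataFrame({"year": [1891, 1893], "qty": [7, 1]}).set_index("year")

# Both frames have a "qty" column, so the suffixes tell the two apart
df3 = df.join(df2["qty"], how="outer", lsuffix="_left", rsuffix="_right")
df3 = df3.fillna(0)
print(df3)
```

The result has one row per year found in either frame, with `qty_left` holding the value from the first frame and `qty_right` the value from the second.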
    

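    At this point you have two columns, one with the values from df1 and one with the values from df2. I did not understand the merge rule you want. You said:

    • If df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
    • If df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]

    That means you want df1["qty"] every time, because in the first case df2["qty"] has the same value as df1["qty"]. Am I right?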

    Just in case, if you want to adapt the code, you can use apply like this:

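    def foo(x1, x2):
        # Returns x2 when the values are equal, otherwise x1
        if x1 == x2:
            return x2
        else:
            return x1

    # Pass both columns: the left value as x1 and the right value as x2
    df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_right"]), axis=1)
    df3.drop(["qty_left", "qty_right"], axis=1, inplace=True)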
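
Another reading of the rule in the question compares the years rather than the counts: take the count from df2 when the year appears there, otherwise keep the count from df1. A minimal sketch of that reading follows, using the first few rows from the question. It assumes every year occurs at most once in df2, and the name `lookup` is hypothetical.

```python
import pandas as pd

# First rows of df1 from the question, plus the two df2 rows in that range
df1 = pd.DataFrame({"year": [1890, 1891, 1892, 1893, 1894, 1895],
                    "count": [0, 0, 0, 0, 0, 0]})
df2 = pd.DataFrame({"year": [1893, 1895], "count": [1, 1]})

# Series indexed by year; assumes each year is unique in df2
lookup = df2.set_index("year")["count"]

df3 = df1.copy()
# map() yields NaN for years missing from df2; those fall back to df1's count
df3["count"] = df1["year"].map(lookup).fillna(df1["count"]).astype(int)
print(df3)
```

This keeps the row order and index of df1 and needs no apply, at the cost of relying on the uniqueness assumption above.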
    

    Hope this helps.

    Nicolas

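    【Comments】:

    • Thanks, this is exactly what I wanted. I just dropped the index, converted the "count" column to int, and renamed the column headers.
    • You're welcome :) Thanks for accepting the answer, by the way.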