【Question title】: Pandas: Transplant column values from one dataframe based on matching condition of another (and do it in vectorized form)
【Posted】: 2022-12-28 03:06:23
【Question description】:

I have two large dataframes. df1 has more rows than df2 because it records the logistics in question at a finer time resolution. I want to bring two value columns of df2 over to df1, so I created a time reference column using the df.dt.floor() function, which allows a surjective (many-to-one) mapping df1.time_ref == df2.time to be applied. Imagine something like this:

df1:                    df2:
    time    time_ref        time    sale    nbr
0    10.10    01.10        01.10    27344    4
1    17.10    01.10        01.11    31160    5
2    24.10    01.10        01.12    19482    3
3    31.10    01.10
4    07.11    01.11
5    14.11    01.11
6    21.11    01.11
7    28.11    01.11
8    05.12    01.12
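For anyone who wants to reproduce the example, here is a minimal sketch of how the two frames could be built. The year 2022 is an assumption (the tables only show day.month), and deriving time_ref with dt.to_period("M") is one possible way to get a month-start reference, not necessarily what was used in the original data.

```python
import pandas as pd

# Assumption: the dates fall in 2022; the question only shows day.month.
df1 = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-10-10", "2022-10-17", "2022-10-24", "2022-10-31",
        "2022-11-07", "2022-11-14", "2022-11-21", "2022-11-28",
        "2022-12-05",
    ])
})

# Map every weekly timestamp to the first day of its month.
df1["time_ref"] = df1["time"].dt.to_period("M").dt.to_timestamp()

df2 = pd.DataFrame({
    "time": pd.to_datetime(["2022-10-01", "2022-11-01", "2022-12-01"]),
    "sale": [27344, 31160, 19482],
    "nbr": [4, 5, 3],
})

print(df1)
print(df2)
```

Every value of df1.time_ref then appears exactly once in df2.time, which is the many-to-one relationship described above.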

The goal is to display each month's sale/nbr ratio next to every week of that month for reference. It should therefore end up like this:

df1:
    time    time_ref    monthlyObjAvg
0    10.10    01.10        6836
1    17.10    01.10        6836
2    24.10    01.10        6836
3    31.10    01.10        6836
4    07.11    01.11        6232
5    14.11    01.11        6232
6    21.11    01.11        6232
7    28.11    01.11        6232
8    05.12    01.12        6494

Though I have not thought it through, in SQL this would probably be really easy. Using some near-pseudo SQL, the operation would likely be something of this nature:

SELECT df1.*, df2.sale / df2.nbr AS "monthlyObjAvg"
FROM df1
LEFT JOIN df2 ON df1.time_ref = df2.time

In Pandas I had a much harder time solving and even researching this problem, since search engine results only lead to either .map() functions or conditional column selection problems. Note that a classical conditional selection like df1[df1["time_ref"] == df2["time"]]["sale"] cannot be applied, because Pandas refuses to compare two columns of different length element by element. My instinct was also that Pandas might detect the unambiguous surjective mapping and resolve such an expression on its own, but that turned out to be false.
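To illustrate the failure mode, here is a small sketch with made-up miniature frames (the shortened df1 and df2 below are assumptions for the demo, not the real data). The == comparison raises a ValueError because the two columns do not share the same index, whereas isin() works but only answers "is there a match?" without carrying any values across.

```python
import pandas as pd

df1 = pd.DataFrame({"time_ref": ["01.10", "01.10", "01.11", "01.11", "01.12"]})
df2 = pd.DataFrame({"time": ["01.10", "01.11", "01.12"],
                    "sale": [27344, 31160, 19482]})

# Elementwise comparison of a 5-row column with a 3-row column fails.
try:
    mask = df1["time_ref"] == df2["time"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print("comparison failed:", error_message)

# Membership testing works, but it only yields a boolean mask over df1;
# it does not tell which df2 row belongs to which df1 row.
present = df1["time_ref"].isin(df2["time"])
print(present.tolist())
```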

Note that I had already solved this problem with loops before this. It looks like this:

advIdx = 0  # df2 rows before this index can never match again
for n in range(df1.shape[0]):
    for m in range(advIdx, df2.shape[0]):
        if df1['time_ref'][n] == df2['time'][m]:
            df1.loc[n, 'monthlyObjAvg'] = df2.loc[m, 'sale'] / df2.loc[m, 'nbr']
            advIdx = m  # resume the next search from this match
            break

By employing a forward-moving index (since old times are never relevant again), one can even reduce the complexity from n*m to roughly n+m. Yet even with such a dramatic improvement, applying the loop solution to datasets of 10,000 to 1,000,000+ rows still takes anywhere from a couple of seconds to minutes, so it still calls for a proper vectorized solution.

【Question discussion】:

    Tags: python pandas dataframe


    【Solution 1】:

    It took a while, but I eventually figured out which Pandas function simulates the conditional argument. Although it can't do it very straightforwardly, pandas.merge pretty much implements the SQL statement above. To use the on= argument, the key column must have the same name in both frames, so first rename the column the "join"/merge will be applied to, which is df2.time here:

    df2.rename(columns = {'time':'time_ref'}, inplace=True)
    

    Then apply the left join:

    df1 = pd.merge(df1, df2, how='left', on=['time_ref'])
    

    And finally create the target column and drop the rest:

    df1['monthlyObjAvg'] = df1['sale'] / df1['nbr']
    # keep time_ref so the result matches the target table in the question
    df1.drop(["sale", "nbr"], axis=1, inplace=True)
    

    This produces the lightning-fast solution I was searching for (it runs in the millisecond range on a sample of 50,000+ rows), but it still seems kind of inelegant. I wanted to leave this here as a buoy for future me-like people who wonder about this and find nothing much with these search terms, since it is a proper solution after all.

    If anyone can offer a more elegant and intuitive way to do this (e.g. directly projecting the fraction into the place that fulfills the conditions, instead of copying everything, then calculating, then deleting the copies), feel welcome to do so.
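As a possible answer to that request, here is a sketch of two variants that compute the fraction on the small frame first, so nothing has to be copied and dropped afterwards. Both assume that df2.time contains each month exactly once; the frames are rebuilt from the tables in the question, and the names ratio, lookup, via_map and via_merge are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "time": ["10.10", "17.10", "24.10", "31.10", "07.11",
             "14.11", "21.11", "28.11", "05.12"],
    "time_ref": ["01.10"] * 4 + ["01.11"] * 4 + ["01.12"],
})
df2 = pd.DataFrame({
    "time": ["01.10", "01.11", "01.12"],
    "sale": [27344, 31160, 19482],
    "nbr": [4, 5, 3],
})

# Variant 1: build a lookup Series (index = month, value = sale / nbr)
# and project it onto df1 through its time_ref column.
monthly = df2.set_index("time")
ratio = monthly["sale"] / monthly["nbr"]
via_map = df1.assign(monthlyObjAvg=df1["time_ref"].map(ratio))

# Variant 2: compute the ratio inside df2 first and merge only that column.
# validate="many_to_one" makes pandas raise if a month is duplicated in df2.
lookup = (
    df2.assign(monthlyObjAvg=df2["sale"] / df2["nbr"])
       .rename(columns={"time": "time_ref"})[["time_ref", "monthlyObjAvg"]]
)
via_merge = df1.merge(lookup, on="time_ref", how="left", validate="many_to_one")

print(via_map)
```

Both variants leave df1 itself untouched and return a new frame with the columns time, time_ref and monthlyObjAvg. Rows of df1 whose time_ref has no counterpart in df2 would end up as NaN in either case.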

    【Discussion】:
