Python: most efficient way to do dictionary mapping in a pandas DataFrame

【Question title】: Python most efficient way to dictionary mapping in pandas dataframe
【Posted】: 2021-03-20 09:38:53
【Question description】:

I have a dictionary of dictionaries, each of which holds the mapping for one column of my dataframe.

My goal is to find the most efficient way to apply the mapping to my dataframe, which has 1 row and 300 columns.

My dataframe is randomly sampled from range(mapping_size); my dictionaries map values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2).

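From the answer provided by jpp I can see that map is probably the most efficient approach, but I am looking for something faster than map. Can you think of any? I am happy for the input data structure to be something other than a pandas dataframe.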

Here is the code that sets up the problem, followed by the results using map and replace:

# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300  # number of columns
nrow = 1  # number of rows
mapping_size = 10  # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}

random.seed(123)

for idx1 in range(ncol):
    # create empty dictionary
    mapping_dict['col_' + str(idx1)] = {}
    for inx2 in range(mapping_size):
        # fill a dictionary of length mapping_size: keys 1..mapping_size map to random.randint(mapping_size + 1, mapping_size * 2)
        mapping_dict['col_' + str(idx1)][inx2+1] = random.randint(mapping_size+1,mapping_size*2)
        
# Create a dataframe with values sampled from range(mapping_size)
d={}

np.random.seed(123)  # seed NumPy's RNG, which np.random.choice below draws from

for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size),nrow)
    
df = pd.DataFrame(data=d)

Results using map and replace:

%%timeit -n 20
df.replace(mapping_dict) #296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key]) #221ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]) #181ms
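
The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch of the same measurement in a plain script, the standard `timeit` module can be used instead. The small sizes and the helper name `map_with_fallback` are assumptions made for illustration, not part of the original setup. The helper works on a copy so that every repetition sees the same unmapped input, whereas the loops above overwrite `df` in place.

```python
import timeit

import numpy as np
import pandas as pd

# Small stand-in data (hypothetical sizes, not the 300-column setup above)
rng = np.random.default_rng(123)
ncol, nrow, mapping_size = 5, 1, 10
mapping_dict = {
    f"col_{i}": {
        k + 1: int(rng.integers(mapping_size + 1, mapping_size * 2 + 1))
        for k in range(mapping_size)
    }
    for i in range(ncol)
}
df = pd.DataFrame(
    {f"col_{i}": rng.integers(0, mapping_size, nrow) for i in range(ncol)}
)


def map_with_fallback(frame, mappings):
    """Map each column through its dict, keeping values that have no mapping."""
    out = frame.copy()
    for key, mapping in mappings.items():
        out[key] = out[key].map(mapping).fillna(out[key])
    return out


# timeit.timeit returns the total seconds taken by `number` calls
seconds = timeit.timeit(lambda: map_with_fallback(df, mapping_dict), number=20)
print(f"{seconds / 20 * 1e3:.3f} ms per call")
```

Timings from such a script depend on the machine and the pandas version, so they are only comparable to each other, not to the figures quoted above.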

【Question comments】:

Tags: python pandas dictionary mapping

【Solution 1】:

Use only pandas, with no Python for loops.

# runtime ~1 s (1000 rows)

# create a mapping Series with a MultiIndex of (column, key)
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()

# obj_dict

    # col_0    1     10
    #          2     14
    #          3     11
    # Length: 3000, dtype: int64

# convert df into labels for the mapping Series' index; df can have more than 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
result = obj_dict[idx]

# handle null values
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values

# transform into the result DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns

df_result
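
Note that `obj_dict[idx]` only works if pandas tolerates labels that are missing from the index; pandas 1.0 and later raise an error instead. A minimal sketch of the same stacked-lookup idea using `Series.reindex`, which returns NaN for missing labels, is given here. The tiny data and the variable names `lookup`, `flat` and `mapped` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in data (hypothetical values); 0 is deliberately left unmapped
mapping_dict = {
    "col_0": {1: 11, 2: 12},
    "col_1": {1: 15, 2: 16},
}
df = pd.DataFrame({"col_0": [1, 0, 2], "col_1": [2, 1, 0]})

# Stack the nested dict into one Series indexed by (column, key)
lookup = pd.DataFrame(mapping_dict).T.stack()

# Build one (column, value) pair per cell, in row-major order
flat = df.to_numpy().ravel()
idx = pd.MultiIndex.from_arrays([df.columns.to_list() * df.shape[0], flat])

# reindex yields NaN for pairs absent from the lookup instead of raising
mapped = lookup.reindex(idx).to_numpy(dtype=float)

# Fall back to the original value wherever no mapping exists
mapped = np.where(np.isnan(mapped), flat, mapped)

df_result = pd.DataFrame(
    mapped.reshape(df.shape), columns=df.columns, index=df.index
)
print(df_result)
```

Because NaN is used as the "missing" marker, the result comes back as floats; cast with `astype(int)` if integer output is needed.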

【Comments】:

Thanks for getting back to me! I got an error from result = obj_dict[idx] saying "Passing list-likes to .loc or [] with any missing labels is no longer supported."
I also tried running map on 1000 rows; it took 287 ms, which is faster than the 1 s.
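
Since the question allows input that is not a DataFrame, one option for the single-row case is to skip pandas entirely and use plain dict lookups. This is only a sketch and has not been benchmarked against the timings above; the names `row` and `mapped_row` and the sample values are hypothetical.

```python
# Hypothetical single-row input held as a plain dict instead of a DataFrame
mapping_dict = {
    "col_0": {1: 11, 2: 12},
    "col_1": {1: 15, 2: 16},
}
row = {"col_0": 1, "col_1": 0}  # 0 has no mapping in col_1

# dict.get(value, value) keeps the original value when no mapping exists
mapped_row = {col: mapping_dict[col].get(val, val) for col, val in row.items()}
print(mapped_row)  # {'col_0': 11, 'col_1': 0}
```

If the data already sits in a one-row DataFrame, `df.iloc[0].to_dict()` gives a dict of this shape, though the conversion itself adds some overhead.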