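With three columns of data, a small optimization is possible: sort each row, then pick the first or second column depending on whether the row contains a NaN. Sorting pushes any NaN to the end of its row, so plain slicing afterwards gives the desired median_low value for each row.
Putting it together as a vectorized solution -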
a = df.values
a_sorted = np.sort(a,1)
df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
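To see why the selection works, here is a minimal sketch on a tiny, made-up frame (the values and the `median_low` variable name are illustrative assumptions, not from the original post). It relies on `np.sort` placing NaNs at the end of each row:

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; rows have zero or one NaN
df = pd.DataFrame({
    'val1': [3.0, np.nan, 5.0],
    'val2': [1.0, 4.0, 5.0],
    'val3': [2.0, 7.0, np.nan],
})

# Row-wise sort: NaNs are pushed to the last column
a_sorted = np.sort(df.values, axis=1)
# rows are now [1, 2, 3], [4, 7, nan], [5, 5, nan]

# NaN in the last column -> only two valid values -> take the smaller one (column 0)
# otherwise three valid values -> take the middle one (column 1)
median_low = np.where(np.isnan(a_sorted[:, 2]), a_sorted[:, 0], a_sorted[:, 1])
print(median_low)  # [2. 4. 5.]
```

A row with two NaNs would also fall into the first branch and return its single valid value, which matches what median_low gives for one element.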
Runtime test
Approaches -
# Proposed in this post
def vectorized_app(df):
    a = df.values
    a_sorted = np.sort(a,1)
    df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
    return df

# @piRSquared's new soln
def vectorized_app2(df):
    v = np.sort(df.values, axis=1)
    n = np.count_nonzero(~np.isnan(v), axis=1)
    j = (n - 1) // 2
    i = np.arange(len(v))
    return df.assign(median_low=v[i, j])

# @piRSquared's old soln
from statistics import median_low
def apply_app(df):
    med = lambda x: median_low(x.dropna())
    return df.apply(med, 1)
Timings -
In [433]: # Setup input dataframe and set one per row as NaN
     ...: np.random.seed(0)
     ...: a = np.random.randint(0,9,(10000,3)).astype(float)
     ...: idx = np.random.randint(0,3,a.shape[0])
     ...: a[np.arange(a.shape[0]), idx] = np.nan
     ...: df = pd.DataFrame(a)
     ...: df.columns = ['val1','val2','val3']
     ...:
In [435]: %timeit vectorized_app(df)
1000 loops, best of 3: 481 µs per loop
In [436]: %timeit vectorized_app2(df)
1000 loops, best of 3: 892 µs per loop
In [434]: %timeit apply_app(df)
1 loop, best of 3: 1.15 s per loop
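As a sanity check alongside the timings, the three approaches can be compared on a small random input. This is only a sketch: the sample size, the `RandomState` seed, and the `out1`/`out2`/`out3` names are assumptions chosen for illustration, and the logic of each approach is inlined rather than calling the functions above.

```python
import numpy as np
import pandas as pd
from statistics import median_low

# Small input built the same way as the timing setup: one NaN per row
rng = np.random.RandomState(0)
a = rng.randint(0, 9, (200, 3)).astype(float)
a[np.arange(len(a)), rng.randint(0, 3, len(a))] = np.nan
df = pd.DataFrame(a, columns=['val1', 'val2', 'val3'])

# Sort-and-select approach
s = np.sort(df.values, axis=1)
out1 = np.where(np.isnan(s[:, 2]), s[:, 0], s[:, 1])

# Count-based indexing approach
n = np.count_nonzero(~np.isnan(s), axis=1)
out2 = s[np.arange(len(s)), (n - 1) // 2]

# Row-wise reference using statistics.median_low
out3 = df.apply(lambda x: median_low(x.dropna()), axis=1).values

print(np.array_equal(out1, out2), np.array_equal(out1, out3))
```

If both comparisons print True, the speed difference in the timings is not bought at the cost of different results, at least for inputs of this shape.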