这是一种 NumPy 方法 -
def group_duplicate_cols(df):
a = df.values
sidx = np.lexsort(a)
b = a[:,sidx]
m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
idx = np.flatnonzero(m[1:] != m[:-1])
C = df.columns[sidx].tolist()
return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
示例运行 -
In [100]: df
Out[100]:
A B C D E F
a1 1 2 1 2 3 1
a2 2 4 2 4 4 1
a3 3 2 3 2 2 1
a4 4 1 4 1 1 1
a5 5 9 5 9 2 1
In [101]: group_duplicate_cols(df)
Out[101]: [['A', 'C'], ['B', 'D']]
# Let's add one more duplicate into group containing 'A'
In [102]: df.F = df.A
In [103]: group_duplicate_cols(df)
Out[103]: [['A', 'C', 'F'], ['B', 'D']]
转换做同样的事情,但是对于行(索引),我们只需要沿着另一个轴切换操作,就像这样 -
def group_duplicate_rows(df):
a = df.values
sidx = np.lexsort(a.T)
b = a[sidx]
m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
idx = np.flatnonzero(m[1:] != m[:-1])
C = df.index[sidx].tolist()
return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
示例运行 -
In [260]: df2
Out[260]:
a1 a2 a3 a4 a5
A 3 5 3 4 5
B 1 1 1 1 1
C 3 5 3 4 5
D 2 9 2 1 9
E 2 2 2 1 2
F 1 1 1 1 1
In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]
基准测试
方法-
# @John Galt's soln-1
from itertools import combinations
def combinations_app(df):
return[x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
# @Abdou's soln
def pandas_groupby_app(df):
return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]
# @COLDSPEED's soln
def triu_app(df):
c = df.columns.tolist()
i, j = np.triu_indices(len(c), 1)
x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
return x
# @cmaher's soln
def lambda_set_app(df):
return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))
注意:@John Galt's soln-2 不包括在内,因为输入的大小为 (8000,500) 会与为那个建议的 broadcasting 相冲突。
时间安排 -
In [179]: # Setup inputs with sizes as mentioned in the question
...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
...: df.iloc[:,idx0] = df.iloc[:,idx1].values
...:
# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop
# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop
# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop
# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop
# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop
NumPy 的视图功能的超级提升
利用 NumPy 的视图功能,我们可以将每组元素视为一个 dtype,我们可以获得进一步显着的性能提升,就像这样 -
def view1D(a): # a is array
a = np.ascontiguousarray(a)
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
return a.view(void_dt).ravel()
def group_duplicate_cols_v2(df):
a = df.values
sidx = view1D(a.T).argsort()
b = a[:,sidx]
m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
idx = np.flatnonzero(m[1:] != m[:-1])
C = df.columns[sidx].tolist()
return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
时间安排 -
In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop
In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop
只是疯狂的加速!