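You can merge the two dataframes on 'Document #' and then use axis=1 to run the calculation you need across the categories they have in common. Here is a scalable way to do it: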
# Merge dataframes
df = df1.merge(df2, on='Document #')
# Get common category names across datasets
common_cat_cols = df1.columns[df1.columns.str.startswith('Category') & (df1.columns.isin(df2.columns))]
# Build the suffixed names of the common categories so they can be dropped later
common_cat_cols2remove = [cat + suffix for cat in common_cat_cols for suffix in ['_x','_y']]
# Calculate max value per common category
common_cat_max = {col: df[[f'{col}_x', f'{col}_y']].max(axis=1) for col in common_cat_cols}
# Insert the target-columns in the merged dataframe
df = df.assign(**common_cat_max).drop(columns=common_cat_cols2remove)
# Finally, reorder the columns to match the expected output
new_col_order = ['Document #', 'Document Text_x', 'Document Text_y'] + sorted(df.columns[df.columns.str.startswith('Category')])
df = df[new_col_order]
print(df)
Output:
   Document # Document Text_x Document Text_y  Category A  Category B  Category C  Category D  Category E  Category F  Category G
0           1               a               a           1           0           1           0           1           0           1
1           2              as              as           1           0           1           0           1           0           1
2           3             asd             asd           1           1           1           0           1           0           1
3           4            asdf            asdf           1           1           1           0           0           1           1
4           5           asdfa           asdfa           1           0           1           1           1           1           1
5           6          asdfas          asdfas           1           0           1           1           1           0           0
6           7         asdfasd         asdfasd           0           1           1           1           1           0           0
7           8        asdfasdf        asdfasdf           1           1           1           1           1           0           0
As you can see, I left the two 'Document Text' columns unchanged, since I don't know what you want to do with them. Best!
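If the two text columns are supposed to carry the same content, one option is to collapse them into a single column. The following is only a sketch under that assumption: the frame `merged` and its values are made up for illustration, and it prefers the left-hand text, falling back to the right-hand one where the left is missing.

```python
import pandas as pd

# Hypothetical merged frame holding the two suffixed text columns
merged = pd.DataFrame({
    'Document #': [1, 2, 3],
    'Document Text_x': ['a', 'as', None],
    'Document Text_y': ['a', 'as', 'asd'],
})

# Keep the left-hand text; where it is missing, take the right-hand text
merged['Document Text'] = merged['Document Text_x'].combine_first(merged['Document Text_y'])

# The suffixed columns are no longer needed
merged = merged.drop(columns=['Document Text_x', 'Document Text_y'])
print(merged)
```

If the texts can legitimately differ between the two sources, this would silently discard the right-hand version, so it only makes sense when both columns describe the same document text.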
Data:
import pandas as pd

df1 = pd.DataFrame({
    'Document #': range(1,9),
    'Document Text': ['a','as','asd','asdf','asdfa','asdfas','asdfasd','asdfasdf'],
    'Category A': [1,0,0,0,0,0,0,1],
    'Category B': [0,0,1,1,0,0,1,1],
    'Category C': [1,0,0,0,0,0,0,0],
    'Category D': [0,0,0,0,1,1,1,1],
    'Category E': [1,1,1,0,1,1,1,1]
})
df2 = pd.DataFrame({
    'Document #': range(1,9),
    'Document Text': ['a','as','asd','asdf','asdfa','asdfas','asdfasd','asdfasdf'],
    'Category A': [1,1,1,1,1,1,0,0],
    'Category C': [0,1,1,1,1,1,1,1],
    'Category F': [0,0,0,1,1,0,0,0],
    'Category G': [1,1,1,1,1,0,0,0]
})
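If you need this for more than one pair of dataframes, the same merge-then-max steps could be wrapped in a small helper. This is just a sketch: the function name `merge_with_category_max`, its parameters, and the tiny frames `a` and `b` are hypothetical, and it assumes the default '_x'/'_y' suffixes that merge applies to overlapping columns.

```python
import pandas as pd

def merge_with_category_max(left, right, key='Document #', prefix='Category'):
    """Merge on `key` and keep the row-wise max of the `prefix` columns both frames share."""
    merged = left.merge(right, on=key)
    # Columns starting with `prefix` that exist in both inputs
    shared = [c for c in left.columns if c.startswith(prefix) and c in right.columns]
    # Row-wise maximum of each suffixed pair
    maxima = {c: merged[[f'{c}_x', f'{c}_y']].max(axis=1) for c in shared}
    # Suffixed columns to drop once the maxima are in place
    suffixed = [f'{c}{s}' for c in shared for s in ('_x', '_y')]
    return merged.assign(**maxima).drop(columns=suffixed)

# Tiny made-up frames for illustration
a = pd.DataFrame({'Document #': [1, 2], 'Category A': [1, 0], 'Category B': [0, 1]})
b = pd.DataFrame({'Document #': [1, 2], 'Category A': [0, 0], 'Category C': [1, 1]})
result = merge_with_category_max(a, b)
print(result)
```

Note that the combined columns are appended at the end of the frame, so you would still reorder the columns afterwards if the order matters.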