We can try using Series.str.get_dummies to build an indicator DataFrame with one indicator column for each value in the ID column, then reduce it to one row per Patient with groupby max:
# Convert the ID column to binary indicators
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
indicator_df:
         71  72  74  SD75
Patient
A         0   1   1     1
B         1   0   0     0
C         0   1   0     0
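To see why the groupby max step is needed, here is a minimal sketch on the same sample data used in this answer. The names `raw` and `reduced` are only illustrative and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})

# One indicator row per original row, so Patient A appears four times
raw = df.set_index('Patient')['ID'].str.get_dummies()

# max collapses each Patient to one row: 1 if any of its rows had that ID
reduced = raw.groupby(level=0).max()

print(raw.shape)      # (6, 4)
print(reduced.shape)  # (3, 4)
```

Because the indicators are only 0 or 1, taking the max per Patient behaves like a logical "any" across that Patient's rows.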
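Now we can reindex to create the columns that appear in the expression but are missing from the data. np.unique ensures that repeated terms in the expression do not produce duplicate columns in indicator_df (it can be omitted if the expression is guaranteed to have no repeated terms):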
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Convert the ID column to binary indicators
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
# Ensure all expression terms are present
indicator_df = indicator_df.reindex(
    columns=np.unique(cols),  # prevent duplicate cols
    fill_value=0  # added cols are filled with 0
)
indicator_df:
         71  72  73  74  75  76  SD75  SD76
Patient
A         0   1   0   1   0   0     1     0
B         1   0   0   0   0   0     0     0
C         0   1   0   0   0   0     0     0
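As a small illustration of the de-duplication step, the sketch below uses a made-up expression (`exp_dup`, not from the question) in which the term 71 appears twice. It shows only the term extraction and np.unique, without touching the DataFrame.

```python
import re
import numpy as np

# Hypothetical expression in which the term 71 is repeated
exp_dup = '((71+72)*(71+SD75))'

# Replace every non-word character with a space, then split into terms
terms = re.sub(r'[^\w]', ' ', exp_dup).split()

# np.unique drops the repeated term and returns the terms sorted
unique_terms = np.unique(terms)

print(terms)                  # ['71', '72', '71', 'SD75']
print(unique_terms.tolist())  # ['71', '72', 'SD75']
```

Note that np.unique also sorts its output, so the reindexed columns come out in sorted order rather than in the order the terms appear in the expression.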
Now, if we modify exp slightly by wrapping these new column names in backticks (`), we can evaluate the expression with DataFrame.eval:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Create indicator_df (chained)
indicator_df = (
    df.set_index('Patient')['ID']
    .str.get_dummies()
    .groupby(level=0).max()
    .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add backticks around column names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
result:
  Patient  FinalVal
0       A         1
1       B         0
2       C         0
The backticks are necessary to indicate that these values are column names rather than numeric literals:
re.sub(r'(\w+)', r'`\1`', exp)
# (((`71`+`72`)*((`73`+`75`)+`SD75`))*((`74`+`76`)+`SD76`))
Note the difference between 71 with and without backticks:
# Column '71' + the number 71
pd.DataFrame({'71': [1, 2, 3]}).eval('B = `71` + 71')
   71   B
0   1  72
1   2  73
2   3  74
Alternatively, indicator_df can be built with crosstab and clip:
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
indicator_df = (
    pd.crosstab(df['Patient'], df['ID'])
    .clip(upper=1)  # Restrict upper bound to 1
    .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add backticks around column names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
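As a quick sanity check, the following sketch builds the indicator frame both ways on the sample data and compares them. The names `via_dummies` and `via_crosstab` are illustrative only. Values and labels are compared directly rather than with a strict frame comparison, on the assumption that metadata such as axis names may differ between the two constructions.

```python
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
cols = np.unique(re.sub(r'[^\w]', ' ', exp).split())

# Approach 1: get_dummies followed by groupby max
via_dummies = (
    df.set_index('Patient')['ID']
    .str.get_dummies()
    .groupby(level=0).max()
    .reindex(columns=cols, fill_value=0)
)

# Approach 2: crosstab counts clipped to at most 1
via_crosstab = (
    pd.crosstab(df['Patient'], df['ID'])
    .clip(upper=1)
    .reindex(columns=cols, fill_value=0)
)

# Compare cell values and row/column labels
same_values = np.array_equal(via_dummies.to_numpy(), via_crosstab.to_numpy())
same_labels = (
    list(via_dummies.index) == list(via_crosstab.index)
    and list(via_dummies.columns) == list(via_crosstab.columns)
)
print(same_values and same_labels)
```

The clip step matters here because Patient A has ID 74 twice, so crosstab alone would report a count of 2 for that cell.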
Setup and imports used:
import re
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})
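For reuse, the steps in this answer could be collected into a single function. This is only a sketch: `evaluate_expression` is a hypothetical helper name, and it assumes the input has 'Patient' and 'ID' columns and that every term in the expression is a plain word-character token, as in the sample data.

```python
import re
import numpy as np
import pandas as pd


def evaluate_expression(df, exp):
    """Hypothetical helper: evaluate exp per Patient using 0/1 ID indicators."""
    # Unique terms in the expression become the indicator columns
    cols = np.unique(re.sub(r'[^\w]', ' ', exp).split())
    indicator_df = (
        df.set_index('Patient')['ID']
        .str.get_dummies()
        .groupby(level=0).max()
        .reindex(columns=cols, fill_value=0)
    )
    # Backtick-quote each term so eval reads it as a column name
    quoted = re.sub(r'(\w+)', r'`\1`', exp)
    return indicator_df.eval(quoted).reset_index(name='FinalVal')


df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})
result = evaluate_expression(df, '(((71+72)*((73+75)+SD75))*((74+76)+SD76))')
print(result)
```

On the sample data this should reproduce the result table shown earlier in this answer.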