Evaluating an Expression using data frames: answers

【Question title】: Evaluating an Expression using data frames
【Posted】: 2021-11-22 17:45:12
【Question】:

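I have a df

Patient  ID
A        72
A        SD75
A        74
A        74
B        71
C        72

and I have an expression

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'

Now I need to evaluate this expression for each of the three patients A, B and C, replacing every ID with 1 if that patient has a matching row in df and with 0 otherwise.
A matches IDs 72, SD75 and 74, so the expressions should be

A - '(((0+1)*((0+0)+1))*((1+0)+0))'
B - '(((1+0)*((0+0)+0))*((0+0)+0))'
C - '(((0+1)*((0+0)+0))*((0+0)+0))'

My final df_output should look like this:

Patient  FinalVal
A        1
B        0
C        0

FinalVal can be obtained with eval(exp) once the IDs have been replaced with 1 and 0.

This is where I have got to so far. When I replace ID 75 with 0, SD75 becomes SD0, and that is where I am stuck.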

import pandas as pd
import re
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Split the expression into its ID tokens: ['71', '72', '73', '75', 'SD75', ...]
mylist = re.sub(r'[^\w]', ' ', exp).split()
distinct_pt = df.Patient.drop_duplicates().dropna()
df_output = pd.DataFrame(distinct_pt)
df_output['Exp'] = exp
for index, row in df_output.iterrows():
  new_df = df[df.Patient == row['Patient']]
  new_dfl = new_df['ID'].tolist()
  #print(new_dfl)
  for j in mylist:
    if j in new_dfl:
      #print(j)
      row['Exp'] = row['Exp'].replace(j,'1')
    else:
      # str.replace also hits the '75' inside 'SD75', which gives 'SD0'
      row['Exp'] = row['Exp'].replace(j,'0')
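
The SD0 symptom can be reproduced in isolation. A minimal sketch, using the expression with balanced parentheses; the variable names `broken` and `tokens` are illustrative and not part of the original code:

```python
import re

# Expression with balanced parentheses
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'

# str.replace substitutes every occurrence of the substring,
# including the "75" that is part of the token "SD75".
broken = exp.replace('75', '0')
print(broken)

# Splitting on non-word characters keeps "SD75" together as one token,
# so the tokens themselves are fine; only the substitution step is at fault.
tokens = re.sub(r'[^\w]', ' ', exp).split()
print(tokens)
```
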

【Comments】:

It can: for example, 72 appears under A and also under C.

Tags: python pandas dataframe

【Solution 1】:

Using re.sub instead of str.replace should work:

for j in mylist:
    # \b only matches at a word boundary, so '75' no longer matches inside 'SD75'
    if j in new_dfl:
        exp = re.sub(r'\b{}\b'.format(j), '1', exp)
    else:
        exp = re.sub(r'\b{}\b'.format(j), '0', exp)
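
A minimal sketch of how the word-boundary substitution might be wired into a complete per-patient evaluation. The helper name `evaluate_for`, the `ids_per_patient` grouping and the trailing `\b` are assumptions added for illustration, and the data frame is the one from the question with the IDs stored as strings:

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72'],
})
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
mylist = re.sub(r'[^\w]', ' ', exp).split()


def evaluate_for(ids):
    """Substitute 1/0 for every token, then evaluate the arithmetic."""
    expr = exp  # start from the untouched expression for each patient
    for token in mylist:
        # \b on both sides restricts the match to the whole token
        pattern = r'\b{}\b'.format(re.escape(token))
        expr = re.sub(pattern, '1' if token in ids else '0', expr)
    # After substitution the string holds only digits, parentheses, + and *
    return eval(expr)


# One set of IDs per patient, then one evaluated value per patient
ids_per_patient = df.groupby('Patient')['ID'].apply(set)
df_output = ids_per_patient.apply(evaluate_for).reset_index(name='FinalVal')
print(df_output)
```

Copying `exp` into a local variable matters here: reassigning `exp` itself inside the loop would leave no tokens to substitute for the second patient.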

Another approach that works for this exact case is to sort mylist in descending order, so that the items prefixed with SD are iterated before the others.

mylist = re.sub(r'[^\w]', ' ', exp).split()   
mylist.sort(reverse=True)
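
A small sketch of why the descending sort helps when plain str.replace is kept. The set `present` is a hand-written stand-in for patient A's IDs and is an assumption for illustration:

```python
import re

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
present = {'72', 'SD75', '74'}  # IDs recorded for patient A

mylist = re.sub(r'[^\w]', ' ', exp).split()
mylist.sort(reverse=True)  # 'S' sorts after the digits, so SD tokens come first
print(mylist)

expr = exp
for token in mylist:
    # By the time '75' is processed, 'SD75' has already been replaced
    expr = expr.replace(token, '1' if token in present else '0')
print(expr, eval(expr))
```

This relies on the ordering alone. A pair such as 75 and 175 would still collide, because '75' sorts before '175' in descending string order, so the word-boundary version is the more general fix.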

【Comments】:

Edited the previous answer to include another option.

【Solution 2】:

We can use Series.str.get_dummies to build an indicator DataFrame with one indicator column for each value in the ID column, then reduce it to one row per Patient with groupby max:

# Convert the ID column to binary indicator columns
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()

indicator_df:

         71  72  74  SD75
Patient                  
A         0   1   1     1
B         1   0   0     0
C         0   1   0     0

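Now we can reindex to create the columns for expression terms that are missing from the data. np.unique ensures that terms repeated in the expression do not produce duplicate columns in indicator_df (it can be omitted if the terms are guaranteed to be unique):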

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# Convert the ID column to binary indicator columns
indicator_df = df.set_index('Patient')['ID'].str.get_dummies()
# Reduce to 1 row per Patient
indicator_df = indicator_df.groupby(level=0).max()
# Ensure All expression terms are present
indicator_df = indicator_df.reindex(
    columns=np.unique(cols),  # prevent duplicate cols
    fill_value=0  # Added cols are filled with 0
)

indicator_df:

         71  72  73  74  75  76  SD75  SD76
Patient                                    
A         0   1   0   1   0   0     1     0
B         1   0   0   0   0   0     0     0
C         0   1   0   0   0   0     0     0

Now, if we alter exp slightly by wrapping each of these column names in backticks (`), we can use DataFrame.eval to compute the expression:

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
# create indicator_df (chained)
indicator_df = (
    df.set_index('Patient')['ID']
        .str.get_dummies()
        .groupby(level=0).max()
        .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add Backticks around columns names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')

result:

  Patient  FinalVal
0       A         1
1       B         0
2       C         0

The backticks are needed to indicate that these values are column names rather than numeric literals:

re.sub(r'(\w+)', r'`\1`', exp)

# (((`71`+`72`)*((`73`+`75`)+`SD75`))*((`74`+`76`)+`SD76`))

Note the difference between 71 with and without backticks:

# Column '71' + the number 71
pd.DataFrame({'71': [1, 2, 3]}).eval('B = `71` + 71')

   71   B
0   1  72
1   2  73
2   3  74

Alternatively, indicator_df can be created with crosstab and clip:

exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
# Extract terms from expression
cols = re.sub(r'[^\w]', ' ', exp).split()
indicator_df = (
    pd.crosstab(df['Patient'], df['ID'])
        .clip(upper=1)  # Restrict upperbound to 1
        .reindex(columns=np.unique(cols), fill_value=0)
)
# Eval the expression and create the resulting DataFrame
result = indicator_df.eval(
    # Add Backticks around columns names
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')

Setup and imports used:

import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72']
})

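【Comments】:

I really like this approach, but I wonder whether it would still work if I have about 1.2 million patients and about 6000 distinct IDs?
It is definitely more time-efficient than iterating, but not more space-efficient. The question is whether you have the space to hold a 1.2m x 6k indicator DataFrame. eval is slower than converting the expression into the actual computation, indicator_df['71'] + indicator_df['72'] and so on. However, I went with it because the computation depends entirely on exp.
That makes complete sense! This solves 80% of what I was looking for, but I cannot dynamically parse multiple expressions to get the result I want. I have created another question for that problem and would really appreciate it if you could take a look and point me in the right direction. stackoverflow.com/questions/69413841/…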
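
On the memory concern raised above, one option is to drop every row whose ID does not occur in the expression before building the indicators, so the frame has one column per expression term rather than one per distinct ID. This is a sketch under assumptions: patient D and ID 99 are made-up rows added to show a patient with no relevant IDs, and `relevant` and `all_patients` are illustrative names:

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'A', 'B', 'C', 'D'],
    'ID': ['72', 'SD75', '74', '74', '71', '72', '99'],  # D / 99 are hypothetical
})
exp = '(((71+72)*((73+75)+SD75))*((74+76)+SD76))'
cols = np.unique(re.findall(r'\w+', exp))

# Keep only the rows whose ID appears in the expression
relevant = df[df['ID'].isin(cols)]
# Patients with no relevant rows must still appear in the result
all_patients = pd.Index(df['Patient'].unique(), name='Patient')

indicator_df = (
    pd.crosstab(relevant['Patient'], relevant['ID'])
      .clip(upper=1)
      .reindex(index=all_patients, columns=cols, fill_value=0)
)
result = indicator_df.eval(
    re.sub(r'(\w+)', r'`\1`', exp)
).reset_index(name='FinalVal')
print(result)
```

With 6000 distinct IDs and an expression of eight terms, the indicator frame would then have eight columns instead of 6000, although the number of rows is unchanged.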

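【Solution 3】:

I would not try to parse that expression and evaluate it. Instead, I would create dummy or indicator variables for the ID column. (Indicator variables are also called one-hot encoded variables.) With those indicators you can compute the expression using standard functions.

Here is how to do it with pandas and scikit-learn. I am using scikit-learn's OneHotEncoder. An alternative might be pandas' get_dummies(), but OneHotEncoder lets you specify the categories.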

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

variables = [71, 72, 73, 74, 75, 76, "SD75", "SD76"]
# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
enc = OneHotEncoder(categories=[variables], sparse_output=False)
df = pd.DataFrame({
    "Patient": ["A"] * 4 + ["B", "C"],
    "ID": [72, "SD75", 74, 74, 71, 72]
})
# Create one-hot encoded variables, also called dummy or indicator variables
df_one_hot = pd.DataFrame(
    enc.fit_transform(df[["ID"]]),
    columns=variables,
    index=df.Patient
)

# Aggregate dummy or one-hot variables, so there's one row for each patient
# You may need to alter the aggregation function
# I chose max because it matched your example,
# but perhaps sum might be better (e.g. patient A has two entries for 74; should that be a value of 2 for variable 74?)
one_hot_patient = df_one_hot.groupby(level="Patient").max()

# Finally, evaluate your expression
# Create a function to calculate the output given a data frame
def my_expr(DF):
    out = (DF[71] + DF[72]) \
        * (DF[73] + DF[75] + DF["SD75"]) \
        * (DF[74]+DF[76]+DF["SD76"])
    return out
    
output = one_hot_patient.assign(FinalVal=my_expr)

Result

          71   72   73   74   75   76  SD75  SD76  FinalVal
Patient
A        0.0  1.0  0.0  1.0  0.0  0.0   1.0   0.0       1.0
B        1.0  0.0  0.0  0.0  0.0  0.0   0.0   0.0       0.0
C        0.0  1.0  0.0  0.0  0.0  0.0   0.0   0.0       0.0
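
For comparison, the pandas-only alternative mentioned above can also be given a fixed set of categories by converting the column to a categorical dtype first. This is a sketch, not part of the original answer, and it assumes all IDs are stored as strings rather than the mixed integers and strings used above:

```python
import pandas as pd

variables = ['71', '72', '73', '74', '75', '76', 'SD75', 'SD76']
df = pd.DataFrame({
    'Patient': ['A'] * 4 + ['B', 'C'],
    'ID': ['72', 'SD75', '74', '74', '71', '72'],
})

# With explicit categories, get_dummies emits a column for every category,
# including the ones that never occur in the data.
ids = df.set_index('Patient')['ID'].astype(pd.CategoricalDtype(categories=variables))
dummies = pd.get_dummies(ids, dtype=int)
# Use plain string labels so that a new column can be added later
dummies.columns = list(dummies.columns)

one_hot_patient = dummies.groupby(level='Patient').max()


def my_expr(DF):
    # Same arithmetic as the expression, written as column operations
    return ((DF['71'] + DF['72'])
            * (DF['73'] + DF['75'] + DF['SD75'])
            * (DF['74'] + DF['76'] + DF['SD76']))


output = one_hot_patient.assign(FinalVal=my_expr)
print(output)
```
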

【Comments】: