Answers to: Feature extraction in real-time prediction in SageMaker

【Question title】: Feature extraction in real-time prediction in SageMaker
【Posted】: 2021-05-06 16:37:27
【Question description】:

I want to use SageMaker to deploy a real-time prediction machine learning model for fraud detection.

I used a SageMaker Jupyter instance to:

- load my training data, which contains transactions, from S3
- preprocess the data and engineer features (I use category_encoders to encode the categorical values)
- train the model and configure the endpoint

For the inference step, I used a Lambda function that calls my endpoint to get a prediction for each real-time transaction.

Should I calculate all the features again for these real-time transactions in the Lambda function?

For the features where I use category_encoders with the fit_transform() function to convert my categorical features to numerical ones, what should I do, given that the result will not be the same as on the training set?

Is there another method that avoids redoing the feature calculation in the inference step?

【Question discussion】:

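Did you learn anything more about this? I am trying to do the same thing. I have read that you can build an "inference pipeline" that includes preprocessing (feature engineering), inference, and post-processing in the same endpoint. This inference pipeline can also be invoked from Lambda.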

Tags: amazon-web-services machine-learning lambda amazon-sagemaker fraud-prevention

【Solution 1】:

Should I calculate all the features again for these real-time transactions in the Lambda function?

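Yes. When running inference with a trained model (that is, predicting on real-time data), you should pass exactly the same list of features that was used to train the model. If you computed some features at training time (for example, the part of the day derived from a timestamp), you should compute those features at inference time as well.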
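One way to keep the two code paths identical is to put the feature logic in a single function and call it from both the training notebook and the Lambda handler. The following is only a minimal sketch: the function names, the transaction field names (`timestamp`, `amount`), and the bucket boundaries are assumptions made for illustration, not details from the question.

```python
from datetime import datetime


def part_of_day(ts: datetime) -> str:
    """Map a timestamp to a coarse bucket (the boundaries are assumed)."""
    hour = ts.hour
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def build_features(transaction: dict) -> dict:
    """Hypothetical helper shared by the training code and the Lambda handler."""
    ts = datetime.fromisoformat(transaction["timestamp"])
    return {
        "amount": float(transaction["amount"]),
        "part_of_day": part_of_day(ts),
    }


# The same call is used for a historical row and for a live transaction.
features = build_features({"timestamp": "2021-05-06T16:37:27", "amount": "42.50"})
print(features)
```

Because training and inference import the same function, a change to a feature definition cannot silently apply to only one side.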

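When I use category_encoders with the fit_transform() function to convert my categorical features to numerical ones, what should I do, given that the result will not be the same as on the training set?

You should store all the transformations that were fitted when training the model: numeric scalers, categorical encoders, and so on.

In Python, it looks like this: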

import joblib  # for dumping (serializing) fitted transformers
import category_encoders as ce

# 1. While training the model
# fit the encoder on historical data
encoder = ce.OneHotEncoder(cols=[...])
encoder.fit(X, y)
# and dump it
joblib.dump(encoder, 'filename.joblib')

# 2. At inference time with the trained model
# load the fitted encoder
encoder = joblib.load('filename.joblib')
# and apply the transformation to new data (do not fit again)
X_new_encoded = encoder.transform(X_new)
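The reason for calling transform() rather than fit_transform() at inference can be seen in a small standard-library sketch. `TinyOrdinalEncoder` below is a hypothetical stand-in written for illustration, not the category_encoders API, and the merchant categories are made up. It learns a category-to-integer mapping once, the fitted state is serialized (JSON is used here only to keep the sketch dependency-free; joblib plays that role in the code above), and the restored encoder is applied to a single new transaction.

```python
import json


class TinyOrdinalEncoder:
    """Hypothetical stand-in for a fitted encoder (not the category_encoders API)."""

    def __init__(self):
        self.mapping = {}

    def fit(self, values):
        # Assign integer codes in sorted order so the mapping is deterministic.
        self.mapping = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Categories that were not seen during fit get code 0 instead of raising.
        return [self.mapping.get(v, 0) for v in values]


# Training time: fit on historical data and serialize the fitted state.
train_merchants = ["grocery", "travel", "electronics", "travel"]
encoder = TinyOrdinalEncoder().fit(train_merchants)
blob = json.dumps(encoder.mapping)

# Inference time: restore the fitted state and only call transform().
restored = TinyOrdinalEncoder()
restored.mapping = json.loads(blob)
new_transaction = ["travel"]
reused_code = restored.transform(new_transaction)

# Re-fitting on the single new row builds a mapping unrelated to training.
refit_code = TinyOrdinalEncoder().fit(new_transaction).transform(new_transaction)

print(reused_code, refit_code)
```

The restored encoder gives "travel" the code it had during training, while the re-fitted one assigns a code based only on the single new row, so the model would receive a value it never learned from.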

【Discussion】: