Data Leakage in Machine Learning

Reference: https://www.kaggle.com/dansbecker/data-leakage

There are two main types of leakage: Leaky Predictors and Leaky Validation Strategies.

Leaky Predictors

This occurs when your predictors include data that will not be available at the time you make predictions.

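The model uses features or data that are not available before the prediction is made. Validation accuracy then looks very high, but accuracy drops sharply once the model is deployed, because that data cannot be obtained in the real environment.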

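For example, when predicting pneumonia, using "took antibiotics" as a feature is exactly this case: patients generally take antibiotics because they have already been diagnosed with pneumonia, so the feature is a consequence of the target. A pneumonia prediction model should not use "took antibiotics" as a feature.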
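The pneumonia example can be sketched with synthetic data. Everything below is an assumption made for illustration: the record layout, the 40% prevalence, the 80% and 95% agreement rates, and the helper names `make_patients`, `accuracy` and `fit_stump` are all hypothetical, and the "model" is only a one-feature rule rather than a real classifier.

```python
import random

FEATURES = ("fever", "took_antibiotics")


def make_patients(n, seed, at_prediction_time):
    """Generate hypothetical patient records as dicts of booleans."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        has_pneumonia = rng.random() < 0.4
        # Fever is known before diagnosis and matches the label 80% of the time.
        fever = has_pneumonia if rng.random() < 0.8 else not has_pneumonia
        if at_prediction_time:
            # Nobody has been diagnosed yet, so no antibiotics were prescribed.
            took_antibiotics = False
        else:
            # Historical records: antibiotics were prescribed after diagnosis.
            took_antibiotics = (
                has_pneumonia if rng.random() < 0.95 else not has_pneumonia
            )
        rows.append({
            "fever": fever,
            "took_antibiotics": took_antibiotics,
            "has_pneumonia": has_pneumonia,
        })
    return rows


def accuracy(rows, feature):
    """Accuracy of the rule 'predict pneumonia exactly when `feature` is True'."""
    hits = sum(row[feature] == row["has_pneumonia"] for row in rows)
    return hits / len(rows)


def fit_stump(rows, candidates):
    """Pick the single feature whose rule is most accurate on the training rows."""
    return max(candidates, key=lambda feature: accuracy(rows, feature))


history = make_patients(2000, seed=0, at_prediction_time=False)
train, valid = history[:1000], history[1000:]
deployed = make_patients(1000, seed=1, at_prediction_time=True)

leaky_model = fit_stump(train, FEATURES)       # allowed to use the leaky feature
honest_model = fit_stump(train, ("fever",))    # leaky feature excluded

leaky_valid_acc = accuracy(valid, leaky_model)
leaky_deployed_acc = accuracy(deployed, leaky_model)
honest_deployed_acc = accuracy(deployed, honest_model)

print("leaky model picked:", leaky_model)
print("leaky model, validation accuracy:", leaky_valid_acc)
print("leaky model, deployed accuracy:", leaky_deployed_acc)
print("honest model, deployed accuracy:", honest_deployed_acc)
```

Under these assumptions the leaky rule looks best on held-out historical data, yet once the antibiotics column is empty at prediction time it falls behind the rule that only uses information available before diagnosis.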

Leaky Validation Strategies

This occurs when the validation data is allowed to influence the model's parameters at any point in the modeling process.

For example, this happens if you run preprocessing (like fitting the Imputer for missing values) before calling train_test_split.

In that case the preprocessing step (such as the Imputer) is fitted on data that still contains the rows that become validation data after the split.
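A minimal sketch of the difference, using only the Python standard library so the mechanism is visible. The single feature column, the 20% missing rate, the 150/50 split and the helpers `fit_mean_imputer` and `impute` are hypothetical stand-ins for a real imputer and for train_test_split.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical feature column; None marks a missing value (about 20% of rows).
column = [rng.gauss(50, 10) if rng.random() > 0.2 else None for _ in range(200)]


def fit_mean_imputer(values):
    """Learn the fill value: the mean of the observed (non-missing) entries."""
    return mean(v for v in values if v is not None)


def impute(values, fill):
    """Replace every missing entry with the previously learned fill value."""
    return [fill if v is None else v for v in values]


# Leaky order: the fill value is learned from every row before splitting,
# so rows that later land in the validation set contribute to it.
leaky_fill = fit_mean_imputer(column)

# Leak-free order: split first, learn the fill value from training rows only,
# then reuse that same value on the validation rows.
shuffled = column[:]
rng.shuffle(shuffled)
train, valid = shuffled[:150], shuffled[150:]
clean_fill = fit_mean_imputer(train)
train_imputed = impute(train, clean_fill)
valid_imputed = impute(valid, clean_fill)

print("fill value learned from all rows:     ", leaky_fill)
print("fill value learned from training rows:", clean_fill)
```

The two fill values differ, which is the leak: in the first ordering the validation rows have already shaped a parameter of the preprocessing. With scikit-learn, the same leak-free ordering is obtained by calling train_test_split first and fitting the imputer on the training portion only, or by placing the imputer and the model in a Pipeline so that each fit sees only training rows.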