Python classification without having to impute missing values

【Question title】: Python classification without having to impute missing values
【Posted】: 2016-02-27 15:54:46
【Question】:

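I have a dataset that works well in Weka. It has many missing values, represented by "?". Using a decision tree, I am able to handle the missing values.

However, in scikit-learn I find that the estimators cannot be used on data with missing values. Is there an alternative library I can use that supports this?

Otherwise, is there a way to work around this in scikit-learn?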

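【Comments】:

I don't want to flag your question as a duplicate of stackoverflow.com/questions/9365982/…, but hopefully it has already answered your question.
@AnthonyKong Yes, I have seen that post. But the answers all seem to suggest imputation as the solution, which is what I want to avoid.
According to the documentation, there seems to be no other way: scikit-learn.org/stable/modules/…
Some packages in R support this.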

Tags: python machine-learning scikit-learn

【Solution 1】:

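The py-earth package supports missing data. It is still under development and not yet on PyPI, but at this point it is quite usable, well tested, and works well with scikit-learn. Missingness is handled as described in this paper. It does not assume that values are missing at random; in fact, missingness is treated as potentially predictive. The important assumption is that the distribution of missingness in your training data must be the same as in any data you use the model on in operation.

The Earth class provided by py-earth is a regressor. To create a classifier, you need to put it in a pipeline with some other scikit-learn classifier (I usually use LogisticRegression for this). Here is an example: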
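
To illustrate what "missingness treated as potentially predictive" can mean, here is a minimal sketch in plain Python. It is not py-earth's internal method, only the general idea: keep missing entries marked as NaN and add one 0/1 flag per feature, so a model can learn from the fact that a value is absent. The helper names `parse_row` and `with_missing_flags` and the toy rows are hypothetical, and the "?" marker follows the Weka-style data described in the question.

```python
import math

MISSING = "?"  # missing-value marker, as in the question's Weka-style data


def parse_row(raw_row):
    """Convert string fields to floats, mapping the missing marker to NaN."""
    return [math.nan if field.strip() == MISSING else float(field)
            for field in raw_row]


def with_missing_flags(row):
    """Append one 0/1 flag per feature recording whether it was missing."""
    flags = [1.0 if math.isnan(value) else 0.0 for value in row]
    return row + flags


# Hypothetical toy rows with three features each
raw = [["5.1", "?", "1.4"], ["?", "3.0", "?"]]
rows = [with_missing_flags(parse_row(r)) for r in raw]
```

Each output row holds the original features (with NaN where a value was absent) followed by the flags. If the pattern of missingness differs between training and production data, those flags would mislead the model, which is why the assumption about the missingness distribution matters.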

from pyearth import Earth
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# X and y are some training data (numpy arrays, pandas DataFrames, or
# similar) and X may have some values that are missing (nan, None, or 
# some other standard signifier of missingness)
from your_data import X, y

# Create an Earth-based classifier that accepts missing data
earth_classifier = Pipeline([('earth', Earth(allow_missing=True)),
                             ('logistic', LogisticRegression())])

# Fit on the training data
earth_classifier.fit(X, y)

The Earth model handles the missingness in a nice way, and LogisticRegression only sees the transformed data coming out of Earth.transform.

Disclaimer: I am the author of py-earth.
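
The structure of that pipeline can be tried out without installing py-earth. The sketch below swaps Earth for a hypothetical stand-in transformer, `mask_and_flag`, that zeroes out NaNs and appends missingness flags. That is an assumption made purely for illustration and does not reproduce Earth's basis functions. What it does show is that the final LogisticRegression never receives NaN, only the transformed features. The toy data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def mask_and_flag(X):
    """Hypothetical stand-in for Earth.transform (not py-earth's algorithm):
    replace NaN with 0 and append one missingness flag per column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    return np.hstack([np.where(missing, 0.0, X), missing.astype(float)])


# Made-up data in which the label coincides with "first feature is missing"
X = np.array([[1.0, 2.0], [np.nan, 1.5], [0.5, np.nan], [np.nan, 0.2],
              [2.0, 1.0], [np.nan, 3.0], [1.5, 0.1], [np.nan, np.nan]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

clf = Pipeline([
    ("features", FunctionTransformer(mask_and_flag)),  # NaN-tolerant step
    ("logistic", LogisticRegression()),                # sees only finite values
])
clf.fit(X, y)
pred = clf.predict(X)
```

The logistic step ends up with four coefficients (two masked features plus two flags), which confirms that it was fitted on the transformer's output and not on the raw array containing NaN.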

【Discussion】: