Cloud ML Engine and Scikit-Learn: 'LatentDirichletAllocation' object has no attribute 'predict'

【Question title】: Cloud ML Engine and Scikit-Learn: 'LatentDirichletAllocation' object has no attribute 'predict'
【Posted】: 2018-12-31 00:26:48
【Question】:

I'm implementing a simple Scikit-Learn Pipeline to run LatentDirichletAllocation in Google Cloud ML Engine. The goal is to predict topics for new data. Here is the code that builds the pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
train, test = train_test_split(dataset.data[:2000])

pipeline = Pipeline([
    ('CountVectorizer', CountVectorizer(
        max_df          = 0.95,
        min_df          = 2,
        stop_words      = 'english')),
    ('LatentDirichletAllocation', LatentDirichletAllocation(
        n_components    = 10,
        learning_method ='online'))
])

pipeline.fit(train)

Now (if I have understood correctly), to predict topics for the test data I can run:

pipeline.transform(test)
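
As a minimal sketch of what that call returns: for a pipeline ending in LatentDirichletAllocation, transform gives one row per document and one column per topic, and taking the argmax of each row is one common way to pick a single "predicted" topic. The tiny corpus, the names demo_pipeline, doc_topic and dominant_topic, and the choice of n_components=2 are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs without downloading 20 newsgroups.
train_docs = [
    "the rocket launch was delayed by the space agency",
    "the space shuttle reached orbit after launch",
    "the hockey team won the game in overtime",
    "the goalie saved the game for the hockey team",
]
new_docs = [
    "a new rocket will launch into orbit",
    "the team lost the hockey game",
]

demo_pipeline = Pipeline([
    ("CountVectorizer", CountVectorizer(stop_words="english")),
    ("LatentDirichletAllocation", LatentDirichletAllocation(
        n_components=2, learning_method="online", random_state=0)),
])
demo_pipeline.fit(train_docs)

# Document-topic matrix: shape (n_documents, n_components).
doc_topic = demo_pipeline.transform(new_docs)

# Index of the highest-weight topic for each document.
dominant_topic = doc_topic.argmax(axis=1)

print(doc_topic.shape)
print(dominant_topic)
```

The topic indices themselves carry no fixed meaning; they depend on the fitted model, so they are only comparable within one fitted pipeline.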

However, after uploading the pipeline to Google Cloud Storage and trying to use it to produce local predictions with Google Cloud ML Engine, I get an error saying that LatentDirichletAllocation has no attribute predict.

gcloud ml-engine local predict \
    --model-dir=$MODEL_DIR \
    --json-instances $INPUT_FILE \
    --framework SCIKIT_LEARN
...
"Exception during sklearn prediction: " + str(e)) cloud.ml.prediction.prediction_utils.PredictionError: Failed to run the provided model: Exception during sklearn prediction: 'LatentDirichletAllocation' object has no attribute 'predict' (Error code: 2)

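The lack of a predict method can also be seen in the docs, so I guess this isn't the way to go. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

So the question is: what is the way to go? How can I use LatentDirichletAllocation (or something similar) in Scikit-Learn Pipelines with Google Cloud ML Engine?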

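【Comments】:

Interesting case... The fact is that CountVectorizer doesn't have a predict method either (it has a transform method), yet it doesn't produce an error...
@desertnaut From the Pipeline docs I understand that predict is applied only to the last estimator. That is why CountVectorizer doesn't produce an error. scikit-learn.org/stable/modules/generated/…
(Disclaimer: I'm no Python expert..) I looked into the source code, and BaseEstimator does indeed not have a predict() method (and neither does LatentDirichletAllocation itself). But the mixins used alongside BaseEstimator do mention a predict() method. So it's a bit challenging to see how/where predict() is implemented. So is the error returned by Google appEngine valid?
@pipo. Per my answer below, this is not currently supported, but we have some possible workarounds coming soon. Would you be willing to discuss your use case over email? If so, please email cloudml-feedback@ and reference this post.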
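
The point made in the comments, that a Pipeline only forwards predict to its final step, can be checked locally without Cloud ML Engine. The sketch below is an illustration only: the step names and the MultinomialNB comparison pipeline are assumptions added here, and neither pipeline needs to be fitted for the attribute check.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Final step is a transformer: it has transform() but no predict().
lda_pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("lda", LatentDirichletAllocation(n_components=2)),
])

# Final step is a classifier: it has predict().
clf_pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Pipeline exposes predict only when its last step does; intermediate
# steps such as CountVectorizer are only ever asked to transform.
lda_has_predict = hasattr(lda_pipeline, "predict")
lda_has_transform = hasattr(lda_pipeline, "transform")
clf_has_predict = hasattr(clf_pipeline, "predict")

print(lda_has_predict, lda_has_transform, clf_has_predict)
```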

Tags: python machine-learning scikit-learn text-classification google-cloud-ml

【Solution 1】:

Currently, the last estimator of the pipeline must implement the predict method.
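
To illustrate what that requirement means on the scikit-learn side, here is a minimal sketch of a final step that does implement predict, by subclassing LatentDirichletAllocation and forwarding predict to transform. The class name LDAWithPredict is hypothetical, and this is not the workaround the answer refers to: the thread does not say whether Cloud ML Engine can load a pickled pipeline that contains a custom class, and unpickling would in any case need the class to be importable in the serving environment. Treat it as an untested assumption for that service.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class LDAWithPredict(LatentDirichletAllocation):
    """Hypothetical wrapper: LDA whose predict() returns the topic distribution."""

    def predict(self, X):
        # Forward to transform() so callers that only know predict()
        # receive the document-topic matrix.
        return self.transform(X)


# Tiny made-up corpus, used only to exercise the wrapper.
docs = [
    "the rocket launch was delayed by the space agency",
    "the space shuttle reached orbit after launch",
    "the hockey team won the game in overtime",
    "the goalie saved the game for the hockey team",
]

wrapped_pipeline = Pipeline([
    ("CountVectorizer", CountVectorizer(stop_words="english")),
    ("LatentDirichletAllocation", LDAWithPredict(
        n_components=2, learning_method="online", random_state=0)),
])
wrapped_pipeline.fit(docs)

# The pipeline now exposes predict(), which vectorizes the text and then
# calls LDAWithPredict.predict() on the resulting counts.
predicted = wrapped_pipeline.predict(docs)
print(predicted.shape)
```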

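【Comments】:

Any update on the workarounds? I'm interested too.
We have a solution that is currently in alpha testing. We'd appreciate it if you gave it a try and sent us feedback. Please contact us at cloudml-feedback@google.com to learn how to get started.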