GCP AI Platform custom prediction routine fails downloading nltk resources

[Question title]: GCP AI Platform custom prediction routine fails downloading nltk resources
[Posted]: 2020-07-14 00:08:00
[Question description]:

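I created a custom prediction routine for an email classifier. I use nltk in the preprocessing step. The model is created successfully, but when I send a request, GCP fails to download the nltk files it needs. When my preprocessing file looks like this: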

import nltk

class MyPreprocess(object):
    def __init__(self):
        pass
    
    def to_sentences(self, text):
        sentences = nltk.sent_tokenize(text)
        return sentences

I get the following error:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
nltk.download('punkt')
Attempted to load tokenizers/punkt/PY3/english.pickle

If I add nltk.download('punkt') after the import statement, I get a different error:

ERROR:root:Unexpected error when loading the model: problem in predictor - OSError: [Errno 30] Read-only file system: '/root/nltk_data'

[Comments]:

Tags: google-cloud-platform nltk google-cloud-ml

[Solution 1]:

Update: current solution

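Apparently, in an AI Platform model the working directory is ''. This directory is read-only, so nothing can be downloaded into it. I changed the download path for nltk punkt to /tmp and it worked: nltk.download('punkt', download_dir='/tmp')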
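
As a minimal sketch of how this could fit into the preprocessing class from the question: the helper names pick_download_dir and ensure_punkt are hypothetical, and appending the directory to nltk.data.path is an assumption that is not part of the answer. It is included because NLTK's default search locations may not contain /tmp, in which case the downloaded resource would not be found.

```python
import os
import tempfile


def pick_download_dir(candidates=("/tmp",)):
    """Return the first candidate that exists and is writable (hypothetical helper)."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    # Fall back to the platform's own temporary directory.
    return tempfile.gettempdir()


def ensure_punkt(download_dir=None):
    """Download punkt into a writable directory and make NLTK search it."""
    import nltk  # imported lazily so the helper above works without NLTK installed

    download_dir = download_dir or pick_download_dir()
    # Assumption: the directory has to be on NLTK's search path to be found.
    if download_dir not in nltk.data.path:
        nltk.data.path.append(download_dir)
    nltk.download("punkt", download_dir=download_dir, quiet=True)
    return download_dir


class MyPreprocess(object):
    def __init__(self):
        # Fetch punkt once, when the predictor is loaded.
        ensure_punkt()

    def to_sentences(self, text):
        import nltk
        return nltk.sent_tokenize(text)
```

Downloading at load time means every new instance of the model needs network access when it starts; whether that is acceptable depends on the deployment.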

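Previous temporary solution

I came up with a temporary workaround, but I don't think it is a good way to solve this problem:

Before deploying the model, I downloaded nltk punkt locally and located the english.punkt file. Then I loaded it manually in my preprocessing code.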

from nltk.data import load

class MyPreprocess(object):
    def __init__(self):
        self.nltk_tokenizer = load('english.punkt')
    
    def sentence_tokenizer(self, text):
        return self.nltk_tokenizer.tokenize(text)

This way, the already-loaded tokenizer is packaged into the pickle file, so punkt does not need to be downloaded during deployment.

[Comments]: