【Question title】: Not able to use JohnSnowLabs pretrained model in Zeppelin
【Posted】: 2019-09-08 17:22:32
【Question description】:

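I want to use the JohnSnowLabs pretrained spell-check module in my Zeppelin notebook. As mentioned here, I added com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.3 to the Dependencies section of the Zeppelin interpreter settings.

However, when I try to run the following simple code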

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.Finisher

val df = Seq("tiolt cde", "eefg efa efb").toDF("names")

val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected"),
  new Finisher().setInputCols("corrected")
))

df.transform(df => nlpPipeline.fit(df).transform(df)).show(false)

I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, xxx.xxx.xxx.xxx, executor 0): java.io.FileNotFoundException: File file:/root/cache_pretrained/spell_fast_en_1.6.2_2_1534781328404/metadata/part-00000 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
...

How can I add this JohnSnowLabs pretrained spell-check model in Zeppelin? The code above works when run directly in spark-shell.

【Comments】:

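  • Isn't that a configuration problem unrelated to Zeppelin? I guess you are using a cluster, but the default file system somehow resolves to the local file system.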
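
One way to check the file-system point raised in the comment is to print the Hadoop default file system from a Zeppelin paragraph. This is only a minimal sketch: it assumes a `%spark` paragraph in which Zeppelin has already defined the `sc` SparkContext, and it inspects only the driver-side configuration.

```scala
// Run inside a Zeppelin %spark paragraph, where `sc` is predefined.
// Reads the Hadoop setting that decides how scheme-less paths
// (such as /root/cache_pretrained/...) are resolved.
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
println(s"fs.defaultFS = $defaultFs")
```

If the printed value starts with `file:`, scheme-less paths are resolved against each machine's local disk. That would be consistent with the FileNotFoundException above, where an executor looks for the cached model under file:/root/cache_pretrained/.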

Tags: apache-spark apache-zeppelin johnsnowlabs-spark-nlp


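【Solution 1】:

Whenever your environment prevents pretrained models or pipelines from being downloaded automatically, you can always load them manually.

Here is an example that loads a French model (the concept is the same for any other annotator):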

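import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel

// Load the model from a directory that was downloaded and extracted beforehand
val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
      .setInputCols("document", "token")
      .setOutputCol("pos")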

Source: https://nlp.johnsnowlabs.com/docs/en/models
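
Applied to the pipeline in the question, the same idea might look like the sketch below. It assumes the spell-check model directory named in the error message (spell_fast_en_1.6.2_2_1534781328404) has been copied to storage that every executor can read. The `hdfs:///models/` prefix and the `modelPath` and `spellChecker` names are hypothetical, chosen only for illustration, and the snippet relies on Zeppelin or spark-shell having already imported the Spark implicits needed for `toDF`.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

// Hypothetical location: a copy of the model directory on a file system
// that is visible to the driver and to all executors. Adjust as needed.
val modelPath = "hdfs:///models/spell_fast_en_1.6.2_2_1534781328404"

// Load the model from that path instead of calling pretrained(),
// which resolves the model through the local cache directory.
val spellChecker = NorvigSweetingModel.load(modelPath)
  .setInputCols("tokens")
  .setOutputCol("corrected")

// Same sample data as in the question (toDF needs the Spark implicits,
// which Zeppelin and spark-shell import automatically).
val df = Seq("tiolt cde", "eefg efa efb").toDF("names")

val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  spellChecker,
  new Finisher().setInputCols("corrected")
))

nlpPipeline.fit(df).transform(df).show(false)
```

The only change from the question's code is that `NorvigSweetingModel.pretrained()` is replaced by an explicit `load` from a path with a file-system scheme, so the executors no longer depend on a cache directory that exists on a single machine.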

【Comments】:
