【Question Title】: Running sparknlp DocumentAssembler on EMR
【Posted】: 2022-05-17 22:32:48
【Question】:

I am trying to run sparknlp on EMR. I logged in to my Zeppelin notebook and ran the following commands:

import sparknlp
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("BBC Text Categorization")\
    .config("spark.driver.memory","8G")\
    .config("spark.memory.offHeap.enabled",True)\
    .config("spark.memory.offHeap.size","8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.4.5")\
    .config("spark.kryoserializer.buffer.max", "1000M")\
    .config("spark.network.timeout","3600s")\
    .getOrCreate()
from sparknlp.base import DocumentAssembler
documentAssembler = DocumentAssembler()\
     .setInputCol("description") \
     .setOutputCol('document')

This results in the following error:

Fail to execute line 1: documentAssembler = DocumentAssembler()\
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4581426413302524147.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/base.py", line 148, in __init__
    super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
    return func(self, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 72, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 67, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

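To understand the problem, I logged in to the master node and ran the same commands in the pyspark console. Everything works, and I do not get the error above, if I start the console with: pyspark --packages JohnSnowLabs:spark-nlp:2.4.5

But if I start the console with plain pyspark, I get the same error as before.

How can I make this work in my Zeppelin notebook?

Setup details: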

EMR 5.27.0
spark 2.4.4
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)

This is my bootstrap script:

#!/bin/bash
sudo yum install -y python36-devel python36-pip python36-setuptools python36-virtualenv

sudo python36 -m pip install --upgrade pip

sudo python36 -m pip install pandas

sudo python36 -m pip install boto3

sudo python36 -m pip install re

sudo python36 -m pip install spark-nlp==2.7.2

【Comments】:

Tags: python apache-spark pyspark amazon-emr


【Solution 1】:
  1. Make sure you are using a supported EMR release (see the list of supported versions).

  2. Your bootstrap script should contain

#!/bin/bash
set -x -e

echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc

sudo python3 -m pip install awscli boto spark-nlp

set +x
exit 0
  3. Provide a configuration file; it can be stored in S3 and passed to the cluster
[{
  "Classification": "spark-env",
  "Configurations": [{
    "Classification": "export",
    "Properties": {
      "PYSPARK_PYTHON": "/usr/bin/python3"
    }
  }]
},
{
  "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.stagingDir": "hdfs:///tmp",
      "spark.yarn.preserve.staging.files": "true",
      "spark.kryoserializer.buffer.max": "2000M",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.driver.maxResultSize": "0",
      "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4"
    }
}
]
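
If you prefer to generate the configuration file rather than maintain it by hand, the JSON above can be produced from Python. This is only a sketch: the function name `build_emr_configurations` and its two parameters are hypothetical, and the default values are simply the ones used in this answer. It builds the `spark.jars.packages` coordinate from a Scala version and a Spark NLP version so that the JAR version stays in one place.

```python
import json


def build_emr_configurations(spark_nlp_version="3.4.4", scala_version="2.12"):
    """Return the EMR configuration list shown above as plain Python data.

    Hypothetical helper: the defaults mirror the values used in this answer.
    """
    # Maven coordinate of the Spark NLP JAR that Spark should download.
    package = f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{spark_nlp_version}"
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},
                }
            ],
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.yarn.stagingDir": "hdfs:///tmp",
                "spark.yarn.preserve.staging.files": "true",
                "spark.kryoserializer.buffer.max": "2000M",
                "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
                "spark.driver.maxResultSize": "0",
                "spark.jars.packages": package,
            },
        },
    ]


if __name__ == "__main__":
    # Print the JSON; redirect it to sparknlp-config.json and upload it.
    print(json.dumps(build_emr_configurations(), indent=2))
```

It is probably worth keeping the `spark-nlp` version installed by pip in the bootstrap script equal to the version in this coordinate; the question installs 2.7.2 with pip while requesting the 2.4.5 JAR.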
  4. Finally, launch the EMR cluster, for example from the CLI
aws emr create-cluster \
--name "Spark NLP 3.4.4" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
--instance-count 3 \
--use-default-roles \
--log-uri "s3://<S3_BUCKET>/" \
--bootstrap-actions Path=s3://<S3_BUCKET>/emr-bootstrap.sh,Name=custome \
--configurations "https://<public_access>/sparknlp-config.json" \
--ec2-attributes KeyName=<your_ssh_key>,EmrManagedMasterSecurityGroup=<security_group_with_ssh>,EmrManagedSlaveSecurityGroup=<security_group_with_ssh> \
--profile <aws_profile_credentials>
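
The same launch can be scripted from Python with boto3. The following is a rough sketch, not part of the original answer: it assumes boto3 is installed, the helper names `build_run_job_flow_args` and `launch_cluster` are hypothetical, and the role names are the defaults I assume `--use-default-roles` refers to. Unlike the CLI, `run_job_flow` takes the configuration list inline rather than as a URL, so the parsed JSON has to be passed in.

```python
def build_run_job_flow_args(bucket, configurations, key_name, security_group):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Hypothetical helper mirroring the create-cluster command above.
    `configurations` is the parsed content of the configuration JSON file.
    """
    return {
        "Name": "Spark NLP 3.4.4",
        "ReleaseLabel": "emr-6.2.0",
        "Applications": [{"Name": name} for name in ("Hadoop", "Spark", "Hive")],
        "LogUri": f"s3://{bucket}/",
        "Instances": {
            "MasterInstanceType": "m4.4xlarge",
            "SlaveInstanceType": "m4.4xlarge",
            "InstanceCount": 3,
            "Ec2KeyName": key_name,
            "EmrManagedMasterSecurityGroup": security_group,
            "EmrManagedSlaveSecurityGroup": security_group,
            # Keep the cluster running after bootstrap instead of terminating.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "custome",
                "ScriptBootstrapAction": {"Path": f"s3://{bucket}/emr-bootstrap.sh"},
            }
        ],
        "Configurations": configurations,
        # Assumed equivalents of --use-default-roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(args, profile_name=None):
    """Start the cluster and return its id. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builder works without it

    session = boto3.Session(profile_name=profile_name)
    return session.client("emr").run_job_flow(**args)["JobFlowId"]
```

Building the arguments separately from the call makes it possible to inspect or log them before anything is created in your account.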

See this tutorial as well.
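
Once the cluster is up, it can help to confirm from the notebook that the Spark NLP JAR actually reached the JVM. `TypeError: 'JavaPackage' object is not callable` typically means it did not: py4j resolves a dotted name it cannot find as a class to a `JavaPackage` placeholder, and calling that placeholder fails. The check below is a minimal sketch; the function name `jvm_class_is_loaded` is hypothetical, and it relies on the private `_jvm` attribute of the Spark context, which is an implementation detail.

```python
def jvm_class_is_loaded(jvm, class_name="com.johnsnowlabs.nlp.DocumentAssembler"):
    """Return True if `class_name` resolves to something other than a JavaPackage.

    `jvm` is expected to be a py4j JVM view, e.g. spark.sparkContext._jvm.
    The type is compared by name so this function does not need to import py4j.
    """
    obj = jvm
    # Walk the dotted name one attribute at a time, as py4j does.
    for part in class_name.split("."):
        obj = getattr(obj, part)
    return type(obj).__name__ != "JavaPackage"


# Usage in a notebook paragraph (assumes an existing SparkSession named `spark`):
#   jvm_class_is_loaded(spark.sparkContext._jvm)
```

If this returns False, the session was most likely created before `spark.jars.packages` took effect, which is why the answer sets the package in `spark-defaults` at cluster creation rather than in the notebook.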

【Comments】:
