【Question Title】: Databricks connect & PyCharm & remote SSH connection
【Posted】: 2020-10-31 12:42:23
【Question Description】:

Hey StackOverflowers!

I have run into a problem.

I have set up PyCharm to connect to an (Azure) VM over an SSH connection.


  1. First I created the configuration for the SSH connection.

  2. I set up the path mappings.

  3. I created a conda environment by opening a terminal in the VM, then installed and configured databricks-connect. I tested it in the terminal and it works fine.

  4. I set up the console in the PyCharm configuration.

But when I try to start a Spark session (spark = SparkSession.builder.getOrCreate()), databricks-connect searches for the .databricks-connect file in the wrong folder and gives me the following error:

Caused by: java.lang.RuntimeException: Config file /root/.databricks-connect not found. Please run databricks-connect configure to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: Copyright (2018) Databricks, Inc.

And the full error plus some warnings:

20/07/10 17:23:05 WARN Utils: Your hostname, george resolves to a loopback address: 127.0.0.1; using 10.0.0.4 instead (on interface eth0)
20/07/10 17:23:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/07/10 17:23:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Traceback (most recent call last):
  File "/anaconda/envs/py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-23fe18298795>", line 1, in <module>
    runfile('/home/azureuser/code/model/check_vm.py')
  File "/home/azureuser/.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/azureuser/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/azureuser/code/model/check_vm.py", line 13, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/sql/session.py", line 185, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 373, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 137, in __init__
    conf, jsc, profiler_cls)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 199, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 312, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py", line 1525, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:99)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:250)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Config file /root/.databricks-connect not found. Please run `databricks-connect configure` to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: Copyright (2018) Databricks, Inc.
This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). This Software shall be deemed part of the “Subscription Services” under the Agreement, or if the Agreement does not define Subscription Services, then the term in such Agreement that refers to the applicable Databricks Platform Services (as defined below) shall be substituted herein for “Subscription Services.”  Licensee's use of the Software must comply at all times with any restrictions applicable to the Subscription Services, generally, and must be used in accordance with any applicable documentation. If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software.  This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms.
Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used.
Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company.
To accept this agreement and start using Databricks Connect, run `databricks-connect configure` in a shell.
    at com.databricks.spark.util.DatabricksConnectConf$.checkEula(DatabricksConnectConf.scala:41)
    at org.apache.spark.SparkContext$.<init>(SparkContext.scala:2679)
    at org.apache.spark.SparkContext$.<clinit>(SparkContext.scala)
    ... 13 more

However, I do not have access rights to that folder, so I cannot put the databricks-connect file there.

What is also strange is that it works if I run it via: PyCharm -> SSH terminal -> activate conda env -> python.

Is there a way to:

1. Point Java to where the databricks-connect file is?

2. Configure databricks-connect some other way, through the script or environment variables inside PyCharm?

3. Some other way?

Or am I missing something?
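A sketch of option 2: Databricks Connect documents `DATABRICKS_*` environment variables and matching `spark.databricks.service.*` properties that can stand in for the `~/.databricks-connect` file. The helper function below is illustrative, not part of the library, and the usage at the end assumes a machine where databricks-connect is installed and configured.

```python
import os

# Map the documented DATABRICKS_* environment variables onto the matching
# spark.databricks.service.* properties, so the settings can be passed to
# SparkSession.builder instead of being read from ~/.databricks-connect.
ENV_TO_PROP = {
    "DATABRICKS_ADDRESS": "spark.databricks.service.address",
    "DATABRICKS_API_TOKEN": "spark.databricks.service.token",
    "DATABRICKS_CLUSTER_ID": "spark.databricks.service.clusterId",
    "DATABRICKS_ORG_ID": "spark.databricks.service.orgId",  # Azure only
    "DATABRICKS_PORT": "spark.databricks.service.port",
}

def databricks_conf_from_env(environ=None):
    """Collect whichever DATABRICKS_* variables are set in the environment."""
    environ = os.environ if environ is None else environ
    return {prop: environ[var] for var, prop in ENV_TO_PROP.items() if var in environ}

# Usage sketch (uncomment where databricks-connect is installed):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for key, value in databricks_conf_from_env().items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
print(databricks_conf_from_env({"DATABRICKS_PORT": "15001"}))
```

The environment variables could also be set in PyCharm's run configuration, which avoids putting the token in code.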

【Question Comments】:

  • I guess it looks in the /root/ directory because you are running as root. Have you tried running the command as a regular user? Does it look in the same directory then? Can you install databricks-connect in the virtual environment?
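The comment above can be checked directly: on POSIX systems, Python's `os.path.expanduser` and (typically) the JVM's `user.home` both follow the `HOME` variable the console exports, so printing the resolved path from the PyCharm console shows which home directory the lookup will use. A minimal diagnostic sketch; the helper name is made up:

```python
import os

# Diagnostic sketch: where will .databricks-connect be resolved?
# The JVM's user.home usually follows the same HOME the console exports.
def expected_config_path(environ=None):
    environ = os.environ if environ is None else environ
    home = environ.get("HOME", os.path.expanduser("~"))
    return os.path.join(home, ".databricks-connect")

print(expected_config_path())  # a /root/... result would explain the error
```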

Tags: python ssh pycharm databricks-connect


【Solution 1】:

This appears to be the official tutorial for what you are trying to do (i.e., Databricks Connect).

Most likely, your .databricks-connect file is for the wrong versions.

You need Java 8 rather than 11, Databricks Runtime 5.5 LTS or Databricks Runtime 6.1-6.6, and your Python version should be the same on both ends.

Here are the steps they give:

conda create --name dbconnect python=3.5
pip uninstall pyspark
pip install -U databricks-connect==5.5.*  # or 6.*.* to match your cluster version. 6.1-6.6 are supported

Then you need the URL, token, cluster ID, org ID, and port. Finally, run these commands in the terminal:

databricks-connect configure
databricks-connect test

There is more to do after that, but hopefully this works. Keep in mind that you need to make sure everything you use is compatible with everything else. Once it is all set up, try configuring the IDE (PyCharm) to make it work.
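The compatibility rule in the steps above can be sketched with a tiny helper: the databricks-connect client package is pinned to the cluster's Databricks Runtime major.minor version (`databricks-connect==X.Y.*`). The function itself is hypothetical, written only to make the pinning scheme concrete:

```python
# Sketch of the version pairing described above: pin the databricks-connect
# client to the cluster's Databricks Runtime major.minor version.
def connect_requirement(runtime_version: str) -> str:
    """Map a runtime string like '5.5' or '6.4' to a pip requirement."""
    major, minor = runtime_version.split(".")[:2]
    return f"databricks-connect=={major}.{minor}.*"

print(connect_requirement("5.5"))  # databricks-connect==5.5.*
```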

【Discussion】:

  • I have followed this tutorial and everything works fine on my local machine. It also works when I do it in the VM from a terminal. The problem appears when I want to use PyCharm and the SSH connection.
  • Could you try another IDE (such as Spyder) and see whether the error reproduces?
  • I tried VS Code and it works fine. So it must be related to the configuration in PyCharm.
  • Then most likely you are right and there is a problem with your PyCharm configuration, or PyCharm still has some bugs when connecting to Databricks. Maybe you should contact JetBrains?
【Solution 2】:

From the error I can see that you need to accept Databricks' terms and conditions; second, follow these Databricks instructions for the PyCharm IDE:

  • CLI

    Run

    databricks-connect configure
    

    The license displays:

    Copyright (2018) Databricks, Inc.

    This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant to an Agreement...

    Accept the license and supply configuration values.

    Do you accept the above agreement? [y/N] y
    

    Set new configuration values (leave input empty to accept the default):
    Databricks Host [no current value, must start with https://]:
    Databricks Token [no current value]:
    Cluster ID (e.g., 0921-001415-jelly628) [no current value]:
    Org ID (Azure-only, see ?o=orgId in URL) [0]:
    Port [15001]:

  • The Databricks Connect configuration script automatically adds the package to your project configuration.

    For a Python 3 cluster, go to Run > Edit Configurations.

    Add PYSPARK_PYTHON=python3 as an environment variable.

    Python 3 cluster configuration
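A quick way to confirm that the variables added in Run > Edit Configurations actually reach the remote interpreter is to check them from the console. The helper below is illustrative, not part of Databricks Connect:

```python
import os

# Illustrative check: list which required variables the interpreter is missing.
def missing_vars(required, environ=None):
    environ = os.environ if environ is None else environ
    return [name for name in required if name not in environ]

print(missing_vars(["PYSPARK_PYTHON"]))  # [] once the run configuration applies
```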

【Discussion】:

  • Thanks for the answer, Yash. Unfortunately, I followed these steps, and while it works fine in the terminal, it does not work through PyCharm.
  • I guess some configuration files may be interfering. Could you uninstall it, removing all configuration, then reinstall to restore the defaults and try again?
【Solution 3】:

In the end, did you manage to set up a remote PyCharm SSH interpreter on Databricks? I am currently evaluating whether Databricks can do the job for a project I am working on.

As far as I understand, databricks-connect only helps launch Spark jobs on the remote machine, while the rest of your non-Spark code executes locally...

【Discussion】:
