【Question Title】: spark-examples job failed on a single-node Kubernetes cluster due to java.net.UnknownHostException: kubernetes.default.svc
【Posted】: 2019-07-02 19:01:32
【Question】:

I submitted the example Spark job (SparkPi, shipped with the Spark distribution) to my k8s cluster, and it failed with java.net.UnknownHostException: kubernetes.default.svc. I would really appreciate any help resolving this.

My environment:

  • Ubuntu 18.04 LTS amd64 bionic image, built on 2019-06-17
  • 2 vCPUs, 7.5 GB memory
  • Cloud provider: Google Compute Engine
  • A single master node only (no worker nodes)

How to reproduce my problem:

$ kubectl cluster-info
Kubernetes master is running at https://10.128.0.10:6443
KubeDNS is running at https://10.128.0.10:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ bin/spark-submit \
    --master k8s://https://10.128.0.10:6443 \
    --deploy-mode cluster \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=yohei1126/spark:v2.3.3 \
    --class org.apache.spark.examples.SparkPi \
    --name spark-pi \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.3.jar

Error log:

  • KubeDNS is running, but name resolution does not seem to work.
    $ kubectl logs spark-pi-67ed1ddda23e32799371677bf1e795c4-driver
    ...
    2019-06-24 08:40:16 INFO  SparkContext:54 - Successfully stopped SparkContext
    Exception in thread "main" org.apache.spark.SparkException: External scheduler
    cannot be instantiated
    ...
    Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation:
    [get]  for kind: [Pod]  with name: [spark-pi-67ed1ddda23e32799371677bf1e795c4-driver]
    in namespace: [default]  failed.
    ...
    Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
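
The UnknownHostException above means the driver pod could not resolve the cluster-internal DNS name of the API server. One way to confirm whether in-cluster DNS works at all is to resolve the same name from a throwaway pod (illustrative debugging commands I am adding, not from the original post; they assume a running cluster and kubectl access):

```shell
# Try to resolve the API server's service name from inside the cluster.
# busybox:1.28 is commonly used here because its nslookup is reliable.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
  -- nslookup kubernetes.default.svc

# If that fails too, check that the cluster DNS pods are healthy:
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns
```

If the throwaway pod cannot resolve the name either, the problem is cluster DNS or the pod network, not Spark itself.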

How I installed k8s on a clean Ubuntu instance:

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/

$ apt-get update && apt-get install -y apt-transport-https curl
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
$ cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
$ apt-get update
$ apt-get install -y kubelet kubeadm kubectl
$ apt-mark hold kubelet kubeadm kubectl

I also installed Docker CE, since kubeadm requires it.

$ sudo apt update
$ sudo apt install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
$ sudo apt update
$ sudo apt install -y docker-ce

How I initialized the cluster:

  • Specified a network range for --pod-network-cidr.
    $ sudo kubeadm init --pod-network-cidr=10.128.0.0/20
    $ mkdir -p $HOME/.kube
    $ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    $ sudo chown $(id -u):$(id -g) $HOME/.kube/config

    $ sudo sysctl net.bridge.bridge-nf-call-iptables=1
    $ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    $ kubectl taint nodes test-k8s node-role.kubernetes.io/master:NoSchedule-
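
One thing worth double-checking (my observation, not stated in the original post): flannel's stock kube-flannel.yml configures the pod network as 10.244.0.0/16, while the value passed to --pod-network-cidr above, 10.128.0.0/20, contains the node's own IP 10.128.0.10. A pod CIDR that overlaps the host network can break in-cluster DNS. A small shell helper to check for such an overlap:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr IP CIDR -- exit status 0 if IP falls inside CIDR.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# The node IP 10.128.0.10 sits inside the pod CIDR used above:
in_cidr 10.128.0.10 10.128.0.0/20 && echo "overlap"
# Flannel's default pod network does not collide with it:
in_cidr 10.128.0.10 10.244.0.0/16 || echo "no overlap"
```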

How I built the Docker image:

  • I used the pre-built Spark tarball.
$ wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
$ tar zxvf spark-2.3.3-bin-hadoop2.7.tgz
$ cd spark-2.3.3-bin-hadoop2.7
$ sudo bin/docker-image-tool.sh -r yohei1126 -t v2.3.3 build
$ sudo bin/docker-image-tool.sh -r yohei1126 -t v2.3.3 push
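
Before submitting, it may also be worth verifying that the jar referenced by the local:// URL actually exists inside the image (a sanity check I am adding, not part of the original steps; it assumes the image built above is available locally):

```shell
# local:///opt/spark/... paths in spark-submit are resolved inside the
# container image, so the example jar must be present there:
docker run --rm --entrypoint ls yohei1126/spark:v2.3.3 \
  /opt/spark/examples/jars/spark-examples_2.11-2.3.3.jar
```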

【Comments】:

    Tags: docker apache-spark kubernetes pyspark google-compute-engine


    【Solution 1】:

    Could you share which kube-system log entries show up in Stackdriver Logging, following this documentation [1]? I have seen the same issue before, and it was related to either a 403 permission error or an IO timeout.

    You could also try recreating the node pool; that can resolve the issue.


    [1] https://cloud.google.com/monitoring/kubernetes-engine/legacy-stackdriver/logging

    【Discussion】:

    • It is customary to ask the original poster for clarification/additional information in the comments section rather than in an answer.
    • I am not using GKE.
    【Solution 2】:

    I found a way to build a single-node Kubernetes cluster on Ubuntu 18.04 LTS using minikube.

    • Machine type: n1-standard-4 (4 vCPUs, 15 GB memory)
    • Disk size: 30 GB
    • CPU platform: Intel Haswell
    • Ubuntu 18.04 LTS
    • Docker v18.09.7
    • minikube v1.2.0
    • kubectl v1.15.0

    Install Docker

    $ sudo apt-get update
    $ sudo apt-get install -y \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg-agent \
        software-properties-common
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    $ sudo apt-key fingerprint 0EBFCD88
    $ sudo add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) \
       stable"
    $ sudo apt-get update
    $ sudo apt-get install -y docker-ce docker-ce-cli containerd.io
    

    Install kubectl

    $ sudo snap install kubectl --classic
    

    Install minikube

    $ curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
      && chmod +x minikube
    $ sudo install minikube /usr/local/bin
    

    Start the Kubernetes cluster

    • Note: the example Spark job needs 8 GB of memory.
    $ sudo minikube start --vm-driver=none --cpus 4 --memory 8192
    $ sudo mv /home/yohei/.kube /home/yohei/.minikube $HOME
    $ sudo chown -R $USER $HOME/.kube $HOME/.minikube
    

    Create a Kubernetes service account for the Spark job

    $ kubectl create serviceaccount spark
    $ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
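
    If the job later fails with authorization errors, the binding can be verified with kubectl auth can-i (a sanity check I am adding, not part of the original answer; the driver's Kubernetes client needs to get and watch pods):

```shell
# Confirm the service account exists and is allowed to work with pods.
kubectl get serviceaccount spark
kubectl auth can-i get pods --as=system:serviceaccount:default:spark
kubectl auth can-i watch pods --as=system:serviceaccount:default:spark
```

    Both auth can-i calls should print "yes" once the edit clusterrole is bound.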
    

    Check the cluster's IP address

    $ kubectl cluster-info
    Kubernetes master is running at https://10.128.0.11:6443
    KubeDNS is running at https://10.128.0.11:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
    To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
    

    Run the Spark job

    $ sudo apt-get install openjdk-8-jdk -y
    $ wget https://www-us.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
    $ tar zxvf spark-2.3.3-bin-hadoop2.7.tgz
    $ cd spark-2.3.3-bin-hadoop2.7
    $ bin/spark-submit \
      --master k8s://https://10.128.0.11:6443 \
      --deploy-mode cluster \
      --conf spark.executor.instances=3 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=yohei1126/spark:v2.3.3 \
      --class org.apache.spark.examples.SparkPi \
      --name spark-pi \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.3.jar
    

    Check the result

    2019-07-02 08:57:56 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state: 
             pod name: spark-pi-e39a8e8f7faf3c9fa861ae024e93b742-driver
             namespace: default
             labels: spark-app-selector -> spark-d0860239ee0f4118aeb8fee83bd00fa2, spark-role -> driver
             pod uid: 01e8f4c0-ae85-4252-92f7-11dbdd2e2b0d
             creation time: 2019-07-02T08:57:10Z
             service account name: spark
             volumes: spark-token-bnm7w
             node name: minikube
             start time: 2019-07-02T08:57:10Z
             container images: yohei1126/spark:v2.3.3
             phase: Succeeded
             status: [ContainerStatus(containerID=docker://c8c7584c7b704b8f2321943967f84d58267a9ca9d1e1852c2ac9eafb76816dc1, image=yohei1126/spark:v2.3.3, imageID=docker-pullable://yohei11
    26/spark@sha256:d3524f24fe199dcb78fd3e1d640261e5337544aefa4aa302ac72523656fe2af1, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=s
    park-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://c8c7584c7b704b8f2321943967f84d58267a9ca
    9d1e1852c2ac9eafb76816dc1, exitCode=0, finishedAt=Time(time=2019-07-02T08:57:56Z, additionalProperties={}), message=null, reason=Completed, signal=null, startedAt=Time(time=2019-07-02T
    08:57:21Z, additionalProperties={}), additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
    2019-07-02 08:57:56 INFO  LoggingPodStatusWatcherImpl:54 - Container final statuses:
             Container name: spark-kubernetes-driver
             Container image: yohei1126/spark:v2.3.3
             Container state: Terminated
             Exit code: 0
    2019-07-02 08:57:56 INFO  Client:54 - Application spark-pi finished.
    2019-07-02 08:57:56 INFO  ShutdownHookManager:54 - Shutdown hook called
    2019-07-02 08:57:56 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-9cbaba30-9277-4bc9-9c55-8acf78711e1d
    $ kubectl get pods
    NAME                                               READY   STATUS      RESTARTS   AGE
    spark-pi-e39a8e8f7faf3c9fa861ae024e93b742-driver   0/1     Completed   0          43m
    yohei_onishi@test-k8s:~$ kubectl logs spark-pi-e39a8e8f7faf3c9fa861ae024e93b742-driver
    ...
    Pi is roughly 3.141395706978535
    ...
    

    【Discussion】:
