【Question Title】: Building fat spark jars & bundles for kubernetes deployment
【Posted】: 2026-01-07 18:05:01
【Question Description】:

For quite a while now I have been building fat jars for spark-submit, and they work like a charm.

Now I want to deploy spark jobs on kubernetes.

The approach described on the spark website (https://spark.apache.org/docs/latest/running-on-kubernetes.html) simply calls the docker-image-tool.sh script to bundle the base jars into a docker image.

What I am wondering is:

Would it be better to use sbt-native-packager and/or sbt-assembly to build a docker image that contains all the code needed to start the spark driver and run the job (with all libraries bundled), and that perhaps also provides a way to bundle classpath libraries (like the postgres jar) into a single image?

A pod run this way would start the spark k8s master (client mode or cluster mode, whichever works best), trigger the creation of worker pods, spark-submit the local jar (including all needed libraries) and run to completion.

Maybe I am missing why this would not work, or why it is a bad idea, but it feels like the configuration would be far more centralized and straightforward than the current approach?

Or is there another best practice?

【Question Discussion】:

    Tags: scala docker apache-spark kubernetes


    【Solution 1】:

    So in the end I got everything working using helm, the spark-on-k8s-operator and sbt-docker.

    First I extracted some of the configuration into variables in build.sbt, so that both the assembly and the docker generators can use them:

    // define some dependencies that should not be compiled into the fat jar,
    // but downloaded into the docker image instead
    val externalDependencies = Seq(
      "org.postgresql" % "postgresql" % postgresVersion,
      "io.prometheus.jmx" % "jmx_prometheus_javaagent" % jmxPrometheusVersion
    )
    
    // Settings
    val team = "hazelnut"
    val importerDescription = "..."
    val importerMainClass = "..."
    val targetDockerJarPath = "/opt/spark/jars"
    // a ModuleID renders as "organization:name:version"; derive the maven
    // repository layout (organization dirs, jar file name) for each dependency
    val externalPaths = externalDependencies.map { module =>
      val parts = module.toString().split(""":""")
      val orgDir = parts(0).replaceAll("""\.""", """/""")
      val moduleName = parts(1).replaceAll("""\.""", """/""")
      val version = parts(2)
      val jarFile = s"$moduleName-$version.jar"
      (orgDir, moduleName, version, jarFile)
    }
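
    As a quick sanity check of the parsing above: for the postgres dependency (assuming a hypothetical postgresVersion of "42.2.5"), the derivation works out like this:

    // sketch only; the version is a placeholder
    "org.postgresql:postgresql:42.2.5".split(""":""")
    // -> Array("org.postgresql", "postgresql", "42.2.5")
    // which externalPaths turns into the tuple
    // ("org/postgresql", "postgresql", "42.2.5", "postgresql-42.2.5.jar")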
    

    Next I defined the assembly settings to create the fat jar (these can be whatever you need):

    lazy val assemblySettings = Seq(
      // Assembly options; Scala itself is left out of the fat jar because the
      // Spark base image already ships the Scala library
      assembly / assemblyOption := (assembly / assemblyOption).value.copy(includeScala = false),
      assembly / assemblyMergeStrategy := {
        case PathList("reference.conf") => MergeStrategy.concat
        case PathList("META-INF", _ @ _*) => MergeStrategy.discard
        case "log4j.properties" => MergeStrategy.concat
        case _ => MergeStrategy.first
      },
      assembly / logLevel := sbt.util.Level.Error,
      assembly / test := {},
      pomIncludeRepository := { _ => false }
    )
    

    Then the docker settings:

    lazy val dockerSettings = Seq(
      docker / imageNames := Seq(
        ImageName(s"$team/${name.value}:latest"),
        ImageName(s"$team/${name.value}:${version.value}")
      ),
      docker / dockerfile := {
        // The assembly task generates the fat JAR file
        val artifact: File = assembly.value
        val artifactTargetPath = s"$targetDockerJarPath/$team-${name.value}.jar"
        // download each external dependency from maven central into the image,
        // then add the fat jar on top of the base image
        externalPaths.map {
          case (extOrgDir, extModuleName, extVersion, jarFile) =>
            val url = List("https://repo1.maven.org/maven2", extOrgDir, extModuleName, extVersion, jarFile).mkString("/")
            val target = s"$targetDockerJarPath/$jarFile"
            Instructions.Run.exec(List("curl", url, "--output", target, "--silent"))
        }
          .foldLeft(new Dockerfile {
            // https://hub.docker.com/r/lightbend/spark/tags
            from(s"lightbend/spark:${openShiftVersion}-OpenShift-${sparkVersion}-ubuntu-${scalaBaseVersion}")
          }) {
            case (df, run) => df.addInstruction(run)
          }.add(artifact, artifactTargetPath)
      }
    )
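
    For a single external dependency, the fold above boils down to a Dockerfile along these lines (a sketch with placeholder versions, using only the DSL calls already shown; /opt/spark/jars is where the image keeps its classpath jars):

    new Dockerfile {
      // base image with a spark distribution preinstalled
      from(s"lightbend/spark:$openShiftVersion-OpenShift-$sparkVersion-ubuntu-$scalaBaseVersion")
      // becomes: RUN ["curl", "<maven central url>", "--output", "<target>", "--silent"]
      addInstruction(Instructions.Run.exec(List(
        "curl",
        "https://repo1.maven.org/maven2/org/postgresql/postgresql/<version>/postgresql-<version>.jar",
        "--output", s"$targetDockerJarPath/postgresql-<version>.jar",
        "--silent")))
    }.add(artifact, artifactTargetPath) // becomes: ADD <fat jar> /opt/spark/jars/hazelnut-etl-importer.jar

    Downloading the runtime-only jars with curl keeps them out of the fat jar while still landing them on the image's classpath.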
    

    I then created a Task to generate the helm chart and values files:

    lazy val createImporterHelmChart: Def.Initialize[Task[Seq[File]]] = Def.task {
      val chartFile = baseDirectory.value / "../helm" / "Chart.yaml"
      val valuesFile = baseDirectory.value / "../helm" / "values.yaml"
      // map each external dependency to its (pre-quoted) local:// path inside the image
      val jarDependencies = externalPaths.map {
        case (_, extModuleName, _, jarFile) =>
          extModuleName -> s""""local://$targetDockerJarPath/$jarFile""""
      }.toMap
    
      val chartContents =
        s"""# Generated by build.sbt. Please don't manually update
           |apiVersion: v1
           |name: $team-${name.value}
           |version: ${version.value}
           |description: $importerDescription
           |""".stripMargin
    
      val valuesContents =
        s"""# Generated by build.sbt. Please don't manually update
           |version: ${version.value}
           |sparkVersion: $sparkVersion
           |image: $team/${name.value}:${version.value}
           |jar: local://$targetDockerJarPath/$team-${name.value}.jar
           |mainClass: $importerMainClass
           |jarDependencies: [${jarDependencies.values.mkString(", ")}]
           |fileDependencies: []
           |jmxExporterJar: ${jarDependencies.getOrElse("jmx_prometheus_javaagent", "null").replace("local://", "")}
           |""".stripMargin
    
      IO.write(chartFile, chartContents)
      IO.write(valuesFile, valuesContents)
      Seq(chartFile, valuesFile)
    }
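
    With the two external dependencies above, jarDependencies comes out as a map along these lines (versions are placeholders); the extra embedded quotes are there so the generated YAML list contains quoted strings:

    Map(
      "postgresql"               -> """"local:///opt/spark/jars/postgresql-<version>.jar"""",
      "jmx_prometheus_javaagent" -> """"local:///opt/spark/jars/jmx_prometheus_javaagent-<version>.jar""""
    )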
    

    Finally, it is all combined into a project definition in build.sbt:

    lazy val importer = (project in file("importer"))
      .enablePlugins(JavaAppPackaging)
      .enablePlugins(sbtdocker.DockerPlugin)
      .enablePlugins(AshScriptPlugin)
      .dependsOn(util)
      .settings(
        commonSettings,
        testSettings,
        assemblySettings,
        dockerSettings,
        scalafmtSettings,
        name := "etl-importer",
        Compile / mainClass := Some(importerMainClass),
        Compile / resourceGenerators += createImporterHelmChart.taskValue
      )
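
    Since the generator is registered as a resource generator, Chart.yaml and values.yaml are rewritten whenever resources are collected (packaging, run, etc.). To also run it on its own, the same task can be exposed under a dedicated key (a sketch; the key name here is my own):

    lazy val generateImporterHelmChart = taskKey[Seq[File]]("Writes the helm Chart.yaml and values.yaml")
    // then inside the importer project's .settings(...):
    //   generateImporterHelmChart := createImporterHelmChart.value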
    

    Finally, add values files per environment plus a helm template:

    apiVersion: sparkoperator.k8s.io/v1beta1
    kind: SparkApplication
    metadata:
      name: {{ .Chart.Name | trunc 64 }}
      labels:
        name: {{ .Chart.Name | trunc 63 | quote }}
        release: {{ .Release.Name | trunc 63 | quote }}
        revision: {{ .Release.Revision | quote }}
        sparkVersion: {{ .Values.sparkVersion | quote }}
        version: {{ .Chart.Version | quote }}
    spec:
      type: Scala
      mode: cluster
      image: {{ .Values.image | quote }}
      imagePullPolicy: {{ .Values.imagePullPolicy }}
      mainClass: {{ .Values.mainClass | quote }}
      mainApplicationFile: {{ .Values.jar | quote }}
      sparkVersion: {{ .Values.sparkVersion | quote }}
      restartPolicy:
        type: Never
      deps:
        {{- if .Values.jarDependencies }}
        jars:
        {{- range .Values.jarDependencies }}
          - {{ . | quote }}
        {{- end }}
        {{- end }}
    ...
    

    I can now build the images with

    sbt [project name]/docker

    and deploy them with

    helm install ./helm -f ./helm/values-minikube.yaml --namespace=[ns] --name [name]

    It could probably be made prettier, but for now it works like a charm.

    【Discussion】: