【Question title】: How to create a table from avro schema (.avsc)?
【Posted】: 2019-06-03 18:43:20
【Question】:

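I have an Avro schema file and need to create a table in Databricks through pyspark. I don't need to load any data, I only want to create the table. The simple approach is to load the JSON string, take "name" and "type" from the fields array, and generate a CREATE SQL query from them. I would like to know whether there is a programmatic way to do this with an existing API. Sample schema: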

{
  "type" : "record",
  "name" : "kylosample",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "registration_dttm",
    "type" : "string",
    "doc" : "Type inferred from '2016-02-03T07:55:29Z'"
  }, {
    "name" : "id",
    "type" : "long",
    "doc" : "Type inferred from '1'"
  }, {
    "name" : "first_name",
    "type" : "string",
    "doc" : "Type inferred from 'Amanda'"
  }, {
    "name" : "last_name",
    "type" : "string",
    "doc" : "Type inferred from 'Jordan'"
  }, {
    "name" : "email",
    "type" : "string",
    "doc" : "Type inferred from 'ajordan0@com.com'"
  }, {
    "name" : "gender",
    "type" : "string",
    "doc" : "Type inferred from 'Female'"
  }, {
    "name" : "ip_address",
    "type" : "string",
    "doc" : "Type inferred from '1.197.201.2'"
  }, {
    "name" : "cc",
    "type" : [ "null", "long" ],
    "doc" : "Type inferred from '6759521864920116'",
    "default" : null
  }, {
    "name" : "country",
    "type" : "string",
    "doc" : "Type inferred from 'Indonesia'"
  }, {
    "name" : "birthdate",
    "type" : "string",
    "doc" : "Type inferred from '3/8/1971'"
  }, {
    "name" : "salary",
    "type" : [ "null", "double" ],
    "doc" : "Type inferred from '49756.53'",
    "default" : null
  }, {
    "name" : "title",
    "type" : "string",
    "doc" : "Type inferred from 'Internal Auditor'"
  }, {
    "name" : "comments",
    "type" : "string",
    "doc" : "Type inferred from '1E+02'"
  } ]
}
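
For reference, the manual approach described above (read the fields array, map each Avro type to a SQL type, emit a CREATE TABLE statement) might look roughly like the sketch below. The type mapping, the helper names `sql_type` and `avsc_to_create_table`, and the table name `db.kylosample` are assumptions made for illustration, and only primitive types and nullable unions such as [ "null", "long" ] are handled.

```python
import json

# Assumed mapping from Avro primitive types to Spark SQL column types.
AVRO_TO_SQL = {
    "string": "STRING",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "bytes": "BINARY",
}


def sql_type(avro_type):
    """Map an Avro primitive type or a ["null", T] union to a SQL type name."""
    if isinstance(avro_type, list):
        # A nullable field is written as a union with "null"; keep the other branch.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"unsupported union: {avro_type}")
        avro_type = non_null[0]
    if not isinstance(avro_type, str) or avro_type not in AVRO_TO_SQL:
        raise ValueError(f"unsupported Avro type: {avro_type}")
    return AVRO_TO_SQL[avro_type]


def avsc_to_create_table(avsc_json, table_name):
    """Build a CREATE TABLE statement from the text of an .avsc file."""
    schema = json.loads(avsc_json)
    columns = ",\n".join(
        f"  `{field['name']}` {sql_type(field['type'])}"
        for field in schema["fields"]
    )
    return f"CREATE TABLE {table_name} (\n{columns}\n)"


if __name__ == "__main__":
    # Trimmed version of the sample schema from the question.
    sample = """
    {
      "type": "record",
      "name": "kylosample",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "cc", "type": ["null", "long"], "default": null},
        {"name": "salary", "type": ["null", "double"], "default": null}
      ]
    }
    """
    print(avsc_to_create_table(sample, "db.kylosample"))
```

The resulting string could then be passed to spark.sql(...). Nested records, arrays, maps, and logical types would need additional handling that this sketch does not attempt.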

【Comments】:

    Tags: python pyspark avro databricks


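    【Answer 1】:

    This does not appear to be available through the Python API yet. In the past I have done it through Spark SQL by creating an external table that points to the exported .avsc, since you only want to create the table and not load any data. Example: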

    spark.sql("""
    create external table db.table_name
    STORED AS AVRO
    LOCATION 'PATH/WHERE/DATA/WILL/BE/STORED'
    TBLPROPERTIES('avro.schema.url'='PATH/TO/SCHEMA.avsc')
    """)
    

    It looks like the native Scala API in Spark 2.4 can now read using an .avsc schema. Since you are on Databricks, you can switch the language of a notebook cell with %scala, %python, or %sql. Scala example:

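    import java.io.File
    import org.apache.avro.Schema
    
    // Parse the Avro schema from the .avsc file
    val schema = new Schema.Parser().parse(new File("user.avsc"))
    
    // Read the Avro data using the parsed schema instead of the embedded one
    spark
      .read
      .format("avro")
      .option("avroSchema", schema.toString)
      .load("/tmp/episodes.avro")
      .show()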
    

    Spark 2.4 Avro integration reference docs:

    https://spark.apache.org/docs/latest/sql-data-sources-avro.html#configuration

    https://databricks.com/blog/2018/11/30/apache-avro-as-a-built-in-data-source-in-apache-spark-2-4.html

    【Comments】:

    • Getting the error: unable to infer schema. The schema specification is required to create the table
    • Did you generate the Avro schema from your Avro file? For example: avro-tools getschema part-m-00000.avro > orders.avsc