【Question Title】: Writing JSON data into BigQuery table as just one field of type string
【Posted】: 2018-02-20 06:49:00
【Description】:

My input data looks like this:

[someGarbagevalue]{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}
[someGarbagevalue]{"Id": 2, "Address": {"City":"Mumbai"}}
[someGarbagevalue]{"Id": 3, "Address": {"Street":"XYZ Road"}}
[someGarbagevalue]{"Id": 4}
[someGarbagevalue]{"Id": 5, "PhoneNumber": 12345678, "Address": {"Street":"ABCD Road", "City":"Bangalore"}}

After reading the data, I strip off the `[someGarbagevalue]` prefix and then try to write to BigQuery:

class processFunction(beam.DoFn):
  def process(self, element):
    global line
    line = element[element.find(']') + 1:].strip()
    return [line]

def run(argv=None):
    pipeline_options = PipelineOptions()
    p = beam.Pipeline(options=pipeline_options)
    first = p | 'read' >> ReadFromText(wordcount_options.input)
    second = (first
              | 'process' >> beam.ParDo(processFunction())
              | 'write' >> beam.io.WriteToBigQuery(
                  'myBucket:tableFolder.test_table'))
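
The prefix-stripping step can be checked on its own, outside of Beam. A minimal sketch (`strip_prefix` is a hypothetical helper name, not part of the pipeline; it uses the same slicing expression as the DoFn above):

```python
# Minimal sketch of the prefix-stripping step, independent of Beam.
# strip_prefix is a hypothetical helper name mirroring processFunction's logic.
def strip_prefix(element):
    # Drop everything up to and including the first ']', then trim whitespace.
    return element[element.find(']') + 1:].strip()

print(strip_prefix('[someGarbagevalue]{"Id": 4}'))  # {"Id": 4}
```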

Questions

  1. How do I write each line to BigQuery as type STRING?
  2. If I write the data as one line per row, how would I query the BigQuery table?

Current error:

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.

【Question Comments】:

    Tags: python google-bigquery google-cloud-dataflow apache-beam


    【Solution 1】:

    Your code has a few things missing/wrong:

    1. Why use `global line` in processFunction? It is not needed there.
    2. You should specify the BigQuery table schema in WriteToBigQuery.
    3. processFunction should return a dictionary containing the field from your schema. The value of that field should be your string.

    Your code should look more or less like this:

    class processFunction(beam.DoFn):
      def process(self, element):
        line = element[element.find(']') + 1:].strip()
        return {
            "line": line
        }

    def run(argv=None):
        pipeline_options = PipelineOptions()
        p = beam.Pipeline(options=pipeline_options)
        first = p | 'read' >> ReadFromText(wordcount_options.input)
        second = (first
                  | 'process' >> beam.ParDo(processFunction())
                  | 'write' >> beam.io.WriteToBigQuery(
                      'myBucket:tableFolder.test_table',
                      schema="line:STRING"))
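
    Once each row is stored as a single STRING column, question 2 can be answered with BigQuery's JSON functions, e.g. `SELECT JSON_EXTRACT_SCALAR(line, '$.Address.City') FROM tableFolder.test_table` in standard SQL. A Python sketch of what that extraction does, using sample rows from the question (`extract_city` is a hypothetical name for illustration):

```python
import json

# Mimics what JSON_EXTRACT_SCALAR(line, '$.Address.City') would return
# for each stored string. extract_city is a hypothetical helper name.
def extract_city(line):
    # Returns None when the JSON path is absent, as JSON_EXTRACT_SCALAR does.
    return json.loads(line).get("Address", {}).get("City")

rows = [
    '{"Id": 1, "Address": {"Street":"MG Road","City":"Pune"}}',
    '{"Id": 4}',
]
print([extract_city(r) for r in rows])  # ['Pune', None]
```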
    

    【Discussion】:

    • I removed global line and specified the schema in WriteToBigQuery. I still get the same error: Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1., error: Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.
    • Did you create the BigQuery table yourself? Check the types. Also check that your input data is correct.
    • Yes, I created the table: [ { "name": "line", "type": "STRING", "mode": "nullable" } ]. The input data is like the data in the question; I strip the initial bracketed prefix and return the nested JSON to WriteToBigQuery.
    • What output do you get when you write to a file instead of BigQuery? Is it the expected output?
    • Ah! I just tried that. The output in my text file is the word line. How do I fix this?
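
    The symptom in the last comment follows from how `DoFn.process` is consumed: Beam treats the returned value as an iterable of output elements, and iterating a bare dict yields its keys, so `return {"line": line}` emits the string `line` as the element. Wrapping the dict in a list (or using `yield`) emits the dict itself. A minimal sketch of the difference:

```python
# Iterating a dict yields its keys -- this is why the pipeline wrote "line".
record = {"line": '{"Id": 4}'}
print(list(record))    # ['line']  -> what Beam saw with `return record`
print(list([record]))  # [{'line': '{"Id": 4}'}]  -> one dict element, as intended
```

    So in the solution's code, `return [{"line": line}]` (or `yield {"line": line}`) is what makes WriteToBigQuery receive dictionaries matching the schema.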