【Question Title】: Save REST API GET method response as a JSON document
【Posted】: 2020-12-09 02:02:41
【Question】:

I am using the code below in PySpark to read from a REST API, write the response to a JSON document, and save the file to Azure Data Lake Gen2. The code works fine when the response contains no blank data, but when I try to pull back all of the data I run into the following error.

Error message: ValueError: Some of types cannot be determined after inferring

Code:

import requests
from pyspark.sql import Row

# Call the REST API and parse the JSON body into a list of dicts
response = requests.get('https://apiurl.com/demo/api/v3/data',
                        auth=('user', 'password'))
data = response.json()

# Build a DataFrame (one Row per record) and write it out as JSON
df = spark.createDataFrame([Row(**i) for i in data])
df.show()
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")

Response:

[
    {
        "ProductID": "156528",
        "ProductType": "Home Improvement",
        "Description": "",
        "SaleDate": "0001-01-01T00:00:00",
        "UpdateDate": "2015-02-01T16:43:18.247"
    },
    {
        "ProductID": "126789",
        "ProductType": "Pharmacy",
        "Description": "",
        "SaleDate": "0001-01-01T00:00:00",
        "UpdateDate": "2015-02-01T16:43:18.247"
    }
]

I tried to fix it with a schema, as below.

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductType", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("SaleDate", StringType(), True),
    StructField("UpdateDate", StringType(), True)
])
df = spark.createDataFrame([[None, None, None, None, None]], schema=schema)
df.show()

I'm not sure how to create the DataFrame and write the data out as a JSON document.

【Comments】:

    Tags: pyspark azure-databricks azure-data-lake-gen2


    【Solution 1】:

    You can pass the data and schema variables to spark.createDataFrame(), and Spark will create the DataFrame from them.

    Example:

    from pyspark.sql.functions import *
    from pyspark.sql import *
    from pyspark.sql.types import *
    
    
    data=[
        {
            "ProductID": "156528",
            "ProductType": "Home Improvement",
            "Description": "",
            "SaleDate": "0001-01-01T00:00:00",
            "UpdateDate": "2015-02-01T16:43:18.247"
        },
        {
            "ProductID": "126789",
            "ProductType": "Pharmacy",
            "Description": "",
            "SaleDate": "0001-01-01T00:00:00",
            "UpdateDate": "2015-02-01T16:43:18.247"
        }
    ]
    
    schema = StructType([
        StructField("ProductID", StringType(), True),
        StructField("ProductType", StringType(), True),
        StructField("Description", StringType(), True),
        StructField("SaleDate", StringType(), True),
        StructField("UpdateDate", StringType(), True)
    ])
    
    
    df = spark.createDataFrame(data, schema=schema)
    
    df.show()
    #+---------+----------------+-----------+-------------------+--------------------+
    #|ProductID|     ProductType|Description|           SaleDate|          UpdateDate|
    #+---------+----------------+-----------+-------------------+--------------------+
    #|   156528|Home Improvement|           |0001-01-01T00:00:00|2015-02-01T16:43:...|
    #|   126789|        Pharmacy|           |0001-01-01T00:00:00|2015-02-01T16:43:...|
    #+---------+----------------+-----------+-------------------+--------------------+
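
    As a side note, this ValueError typically shows up when some records are missing keys or hold only null values, so Spark cannot infer a type for those columns. The explicit schema pins the types down, but it can still help to make every record carry every expected field first. The sketch below uses a hypothetical normalize_records helper (not part of the original post) to fill missing keys with None before calling spark.createDataFrame:

```python
# Sketch (assumption): make each record carry every expected key so the rows
# line up with the explicit StructType schema; missing keys become None.
FIELDS = ["ProductID", "ProductType", "Description", "SaleDate", "UpdateDate"]

def normalize_records(records, fields=FIELDS):
    """Return copies of the records in which every expected key is present."""
    return [{f: rec.get(f) for f in fields} for rec in records]

# Records with gaps, as a raw API response might contain:
raw = [
    {"ProductID": "156528", "ProductType": "Home Improvement"},
    {"ProductID": "126789", "Description": ""},
]

clean = normalize_records(raw)
# Every dict now has all five keys; the absent ones are filled with None.
# The cleaned list can then be passed to Spark together with the schema:
#   df = spark.createDataFrame(clean, schema=schema)
#   df.write.mode("overwrite").json("wasbs://...")
```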
    

    【Discussion】:

    • Thank you very much.