[Question Title]: U-SQL with Python to convert JSON to CSV in Azure Data Lake store
[Posted]: 2017-03-25 13:27:16
[Question]:

We need to convert some large files stored in Azure Data Lake Store from nested JSON to CSV. Since Azure Data Lake Analytics supports the Python modules pandas and numpy in addition to the standard modules, I believe this should be achievable with Python. Does anyone have Python code that accomplishes this?

Source format:

{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}

Expected output:

TDM,location,location/80:7a:bf:d4:d6:50,974851970,1490004475,OP_UPDATE,151002334,xxxxxxx,gHq/1NZQ,977.7259,638.8827,490,1,n5THo6IINuOSVZ/cTidNVA==,7hY/jVh9NRqqxF6gbqT7Jw==,LV/ZiQRQMS2wwKiKTvYNBQ==,H5rrAD/jg1Fnkmo1Zmquau/Qn1U=,ALGORITHM_ESTIMATION,FEET

[Discussion]:

    Tags: python json csv azure azure-data-lake


    [Solution 1]:

    Based on your description, my understanding is that your key requirement is to convert data stored in Azure Data Lake Store from JSON format to CSV format in Python using the pandas/numpy packages. So I looked at your source data and, assuming there are no array types in the JSON, designed the code below for a sample data conversion.

    Here is my sample code for a JSON-formatted object string. For reference, I added some comments to explain my approach; the key part is the `flattern` method, which converts a structure like {"A": 0, "B": {"C": 1}} into [["A", "B.C"], [0, 1]].

    import json
    import pandas as pd

    # Source data string
    json_raw = '''{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}'''

    # Load the source data string into a Python dict
    json_data = json.loads(json_raw)

    # The key method `flattern` for converting a nested `dict` to a 2D list
    # of [keys, values]; nested keys are joined with "." (e.g. "location.staEthMac.addr")
    def flattern(data, key):
        keys = []
        values = []
        for subkey, value in data.items():
            # Top-level keys (key is None) stay bare; nested keys get a dotted prefix
            name = subkey if key is None else key + "." + subkey
            if isinstance(value, dict):
                # Recurse once and reuse the result for both keys and values
                subkeys, subvalues = flattern(value, name)
                keys.extend(subkeys)
                values.extend(subvalues)
            else:
                keys.append(name)
                values.append(value)
        return [keys, values]

    list2D = flattern(json_data, None)
    df = pd.DataFrame([list2D[1]], columns=list2D[0])

    # To extract items such as `Loc` & `Topic` & nested ones like
    # `location.staEthMac.addr`, just put their column names in a list.
    selected = ["Loc", "Topic"]
    # `DataFrame.ix` has been removed from pandas; use `loc` for label-based selection
    result = df.loc[:, selected]
    # Transform the DataFrame into a CSV string; str() handles non-string values
    csv_raw = "\n".join(",".join(str(v) for v in row) for row in result.values)
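    The code above assumes there are no array types in the JSON. A minimal sketch of how the same dotted-key flattening could be extended to lists, using the element index as a key segment (the `flattern_with_lists` name and index-in-key convention are my own choices, not part of the original answer):

```python
# Sketch: flatten nested dicts AND lists into dotted keys;
# list elements get their index as a key segment, e.g. "B.C.0"
def flattern_with_lists(data, key=None):
    keys, values = [], []
    items = (
        data.items() if isinstance(data, dict)
        else enumerate(data)  # lists: use the index as the subkey
    )
    for subkey, value in items:
        name = str(subkey) if key is None else key + "." + str(subkey)
        if isinstance(value, (dict, list)):
            subkeys, subvalues = flattern_with_lists(value, name)
            keys.extend(subkeys)
            values.extend(subvalues)
        else:
            keys.append(name)
            values.append(value)
    return [keys, values]

print(flattern_with_lists({"A": 0, "B": {"C": [1, 2]}}))
# [['A', 'B.C.0', 'B.C.1'], [0, 1, 2]]
```

    Note that indexed columns only line up across rows if every record has the same array lengths, which may not hold for real telemetry.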
    

    Hope it helps.
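    As a side note, pandas itself ships a flattener for this shape of data: `pandas.json_normalize` (available at the top level since pandas 0.25) produces the same dotted column names without hand-rolled recursion. A minimal sketch on a trimmed-down sample of the source record (the sample values are abbreviated, not the full record):

```python
import json
import pandas as pd

# Trimmed-down sample in the same shape as the source record
json_raw = '{"Loc":"TDM","Topic":"location","location":{"staEthMac":{"addr":"/xxxxx"},"unit":"FEET"}}'

# json_normalize flattens nested dicts into dotted column names,
# e.g. "location.staEthMac.addr"
df = pd.json_normalize(json.loads(json_raw), sep=".")

# Select columns by their dotted names and emit a CSV string
selected = ["Loc", "Topic", "location.staEthMac.addr"]
csv_raw = df.loc[:, selected].to_csv(index=False, header=False).strip()
print(csv_raw)  # TDM,location,/xxxxx
```

    Using `to_csv` also takes care of quoting values that contain commas, which a plain `",".join(...)` does not.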

    [Discussion]:
