【Question Title】: JSON / struct column type in AWS Glue + AWS Athena / Hive?
【Posted】: 2022-01-01 19:39:37
【Question】:

I have a CSV file that was created from nested JSON. It has regular typed columns (e.g. int, string) as well as JSON columns created from the nested JSON:

attributes;business_id;categories;city;days_open;latitude;longitude;name;review_count;stars;state
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "beer_and_wine", "Ambience": {"casual": True, "classy": False, "divey": False, "hipster": False, "intimate": False, "romantic": False, "touristy": False, "trendy": False, "upscale": False}, "BYOB": False, "BikeParking": True, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": False, "lot": False, "street": True, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": True, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": False, "GoodForMeal": {"breakfast": False, "brunch": False, "dessert": False, "dinner": False, "latenight": False, "lunch": False}, "HappyHour": True, "HasTV": True, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": True, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": True, "RestaurantsPriceRange": 2, "RestaurantsReservations": False, "RestaurantsTableService": True, "RestaurantsTakeOut": True, "Smoking": "no", "WheelchairAccessible": True, "WiFi": "free"};6iYb2HFDywm3zjuRg0shjw;["Gastropubs", "Food", "Beer Gardens", "Restaurants", "Bars", "American (Traditional)", "Beer Bar", "Nightlife", "Breweries"];Boulder;["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];40.0175444;-105.2833481;Oskar Blues Taproom;86;4.0;CO
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "beer_and_wine", "Ambience": {"casual": True, "classy": False, "divey": False, "hipster": False, "intimate": False, "romantic": False, "touristy": False, "trendy": False, "upscale": False}, "BYOB": False, "BikeParking": False, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": True, "lot": False, "street": False, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": True, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": True, "GoodForMeal": {"breakfast": True, "brunch": False, "dessert": False, "dinner": False, "latenight": False, "lunch": True}, "HappyHour": False, "HasTV": False, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": False, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": False, "RestaurantsPriceRange": 2, "RestaurantsReservations": False, "RestaurantsTableService": True, "RestaurantsTakeOut": True, "Smoking": "no", "WheelchairAccessible": False, "WiFi": "free"};tCbdrRPZA0oiIYSmHG3J0w;["Salad", "Soup", "Sandwiches", "Delis", "Restaurants", "Cafes", "Vegetarian"];Portland;["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];45.5889058992;-122.5933307507;Flying Elephants at PDX;126;4.0;OR
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "none", "Ambience": None, "BYOB": False, "BikeParking": False, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": False, "lot": False, "street": True, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": False, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": False, "GoodForMeal": None, "HappyHour": False, "HasTV": False, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": False, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": False, "RestaurantsPriceRange": 2, "RestaurantsReservations": True, "RestaurantsTableService": True, "RestaurantsTakeOut": False, "Smoking": "no", "WheelchairAccessible": False, "WiFi": "no"};bvN78flM8NLprQ1a1y5dRg;["Antiques", "Fashion", "Used", "Vintage & Consignment", "Shopping", "Furniture Stores", "Home & Garden"];Portland;["Thursday", "Friday", "Saturday", "Sunday"];45.5119069956;-122.6136928797;The Reclaimory;13;4.5;OR

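Is it possible to process this file with AWS Glue as input for AWS Athena / Hive (which Athena uses internally)? In particular, how do I specify the data type of the JSON columns? Do I have to do this manually? Is the JSON well-formed, or should it be reformatted?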

【Comments】:

    Tags: json csv hive aws-glue amazon-athena


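    【Answer 1】:

    I will try to answer all of your questions.
    Is it possible to process this file with AWS Glue as input for AWS Athena / Hive (which Athena uses internally)?
    It should be. Any CSV file can be loaded, provided the Hive table is defined correctly.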

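    How do I specify the data type of the JSON columns?
    In Hive, you can store them as string. Given your JSON structure, you can then access elements easily with an expression like this: get_json_object(json_col_str,'$.BusinessParking.garage')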

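    Do I have to do this manually?
    I think so, unless you have a utility that generates the DDL automatically. You can paste a few sample rows into Excel to work out the table structure.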

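    Is the JSON well-formed, or should it be reformatted?
    From the sample you gave, I checked the last row, and the JSON object looks fine to me. I also checked it with https://jsonformatter.curiousconcept.com/, which validates it and pretty-prints it. You can use that tool if you find any discrepancies.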
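A caveat on that check: the sample rows as pasted in the question spell booleans and nulls as True, False and None, which is Python literal syntax rather than strict JSON (true, false, null). If the real file looks the same, a strict JSON parser will likely reject the attributes column as-is. The following is a minimal Python sketch, not part of the original answer, that tests a cell with json.loads and, if that fails, re-serializes it through ast.literal_eval. The function name to_strict_json and the shortened sample cell are made up for illustration.

```python
import ast
import json


def to_strict_json(cell: str) -> str:
    """Return `cell` as strict JSON text.

    If the cell already parses as JSON it is returned unchanged; otherwise it
    is assumed to be a Python literal (True/False/None) and is re-serialized.
    """
    try:
        json.loads(cell)
        return cell
    except json.JSONDecodeError:
        # literal_eval only evaluates literals, so it does not execute code.
        value = ast.literal_eval(cell)
        return json.dumps(value)


# Shortened, hypothetical cell in the same style as the sample rows.
sample = '{"AcceptsInsurance": False, "Music": None, "WiFi": "free"}'
fixed = to_strict_json(sample)
print(fixed)
```

Running a conversion like this over the JSON columns before uploading the file is one way to "reformat" the data so that JSON functions in the query engine can parse it.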

    【Comments】:

    • Thank you, the answer to the second question really made it clear for me!
    【Answer 2】:

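    As long as none of the JSON columns contain a semicolon, you should be able to query this data with Athena. Define your table as CSV with a semicolon as the field delimiter, and use string as the type of the JSON columns.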
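The semicolon caveat can be checked before the table is defined. Here is a minimal Python sketch, assuming the header row lists every column and that rows are split on the bare ; with no quote handling (roughly how a plain delimited-text table would see the file). The helper find_bad_rows and the tiny in-memory sample are hypothetical.

```python
import io


def find_bad_rows(lines, delimiter=";"):
    """Return line numbers whose field count differs from the header.

    Splitting is done on the raw delimiter with no quote handling, so a
    semicolon inside a JSON value shows up as an extra field.
    """
    lines = iter(lines)
    expected = len(next(lines).rstrip("\n").split(delimiter))
    bad = []
    for number, line in enumerate(lines, start=2):  # header is line 1
        if len(line.rstrip("\n").split(delimiter)) != expected:
            bad.append(number)
    return bad


# Tiny made-up sample: the third line has a semicolon inside the JSON value.
sample = io.StringIO(
    'attributes;city;stars\n'
    '{"WiFi": "free"};Boulder;4.0\n'
    '{"Note": "a;b"};Portland;4.5\n'
)
print(find_bad_rows(sample))
```

For a real file, an open file object can be passed in place of the StringIO sample. An empty result suggests that no field contains a stray semicolon.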

    When querying this table, you can use the JSON functions to query the JSON columns, for example:

    SELECT json_extract_scalar(attributes, '$.AcceptsInsurance')
    …
    

    【Comments】:
