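【Posted】: 2021-09-17 22:39:43
【Question】:
I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline-delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like: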
{
"resourceType": "Bundle",
"id": "som-id",
"type": "searchset",
"link": [
{
"relation": "next",
"url": "https://fhirserver/?ct=token"
},
{
"relation": "self",
"url": "https://fhirserver/"
}
],
"entry": [
{
"fullUrl": "https://fhirserver/Organization/1234",
"resource": {
"resourceType": "Organization",
"id": "1234",
// More fields
}
},
{
"fullUrl": "https://fhirserver/Organization/456",
"resource": {
"resourceType": "Organization",
"id": "456",
// More fields
}
},
// More resources
]
}
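Basically a bundle of resources. I would like to transform that into a newline-delimited (aka ndjson) file where each line is just the JSON for one resource: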
{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
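To make the target output concrete, the flattening described above can be sketched in plain Python, outside of Azure Data Factory. This is only an illustration of the desired result, not an ADF feature; the function name `bundle_to_ndjson` and the trimmed sample bundle are assumptions made for this example.

```python
import json


def bundle_to_ndjson(bundle: dict) -> str:
    """Return one compact JSON document per line, one line per entry resource."""
    lines = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource")
        if resource is not None:  # skip entries that carry no resource
            lines.append(json.dumps(resource, separators=(",", ":")))
    # End the file with a newline only when there is at least one line.
    return "\n".join(lines) + ("\n" if lines else "")


# Trimmed version of the searchset bundle shown in the question (hypothetical data).
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {
            "fullUrl": "https://fhirserver/Organization/1234",
            "resource": {"resourceType": "Organization", "id": "1234"},
        },
        {
            "fullUrl": "https://fhirserver/Organization/456",
            "resource": {"resourceType": "Organization", "id": "456"},
        },
    ],
}

ndjson_text = bundle_to_ndjson(sample_bundle)
print(ndjson_text, end="")
```

Each output line is an independent JSON document, so a consumer can parse the file line by line without loading the whole bundle.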
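I am able to set up the REST connector and it can query the FHIR server (including pagination), but no matter what I try, I cannot seem to generate the output I want. I set up an Azure Blob storage dataset: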
{
"name": "AzureBlob1",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": "myout.json",
"folderPath": "outfhirfromadf"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
And configured a copy activity:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "RestSource",
"httpRequestTimeout": "00:01:40",
"requestInterval": "00.00:00:00.010"
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"resource": "resource"
},
"collectionReference": "$.entry"
}
},
"inputs": [
{
"referenceName": "FHIRSource",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob1",
"type": "DatasetReference"
}
]
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
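But in the end (despite the schema mapping being configured), the result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.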
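For reference, the behaviour I am after (follow each `next` link, then emit every `entry[*].resource` as its own line) looks roughly like the Python sketch below. It is not an ADF configuration. `fetch_page` is a hypothetical callable standing in for the HTTP request, and `fake_pages` is made-up data so the sketch runs without a server.

```python
import json
from typing import Callable, Iterator, Optional


def next_link(bundle: dict) -> Optional[str]:
    """Return the URL of the link whose relation is "next", or None."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def iter_resources(first_url: str, fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every entry resource from every page, following "next" links."""
    url: Optional[str] = first_url
    while url is not None:
        bundle = fetch_page(url)
        for entry in bundle.get("entry", []):
            if "resource" in entry:
                yield entry["resource"]
        url = next_link(bundle)


# Hypothetical two-page result set keyed by URL; replaces a real FHIR server.
fake_pages = {
    "https://fhirserver/": {
        "resourceType": "Bundle",
        "link": [
            {"relation": "next", "url": "https://fhirserver/?ct=token"},
            {"relation": "self", "url": "https://fhirserver/"},
        ],
        "entry": [{"resource": {"resourceType": "Organization", "id": "1234"}}],
    },
    "https://fhirserver/?ct=token": {
        "resourceType": "Bundle",
        "link": [{"relation": "self", "url": "https://fhirserver/?ct=token"}],
        "entry": [{"resource": {"resourceType": "Organization", "id": "456"}}],
    },
}

# One JSON document per resource, across both pages.
all_lines = [
    json.dumps(resource)
    for resource in iter_resources("https://fhirserver/", fake_pages.__getitem__)
]
print("\n".join(all_lines))
```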
Any suggestions would be much appreciated.
【Comments】:
-
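So far, my experience with Azure Data Factory's Copy activity has been that it only copies data from one place to another, and every time I needed some kind of transformation, I got hurt :) Would you consider using Databricks with a Python/Scala script to do the transformation you need?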
-
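@Kzrystof, thanks for the comment. That would be a possible option. I guess I wanted to see how far I could get with ADF on this path. So yes, definitely an option, but I would also like to know whether it is (or will be) possible in ADF.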
-
Oh, I see what you mean :) Actually, a few weeks ago I asked a similar question about how the Copy activity could exclude rows that do not meet a certain condition...
Tags: azure azure-data-factory hl7-fhir