【Question title】: Extract data from invalid JSON using bash, sed, grep or awk?
【Posted】: 2021-11-28 05:15:10
【Question description】:

I am trying to parse invalid JSON in bash:

x="{componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVi, componentName: Versions, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVj, componentName: Approves, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVe, componentName: activityThreads, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVf, componentName: Attachments, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVh, componentName: Details, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}"

using the following script:

for each in $(echo $x | sed 's/{componentId: /\n/g' ); do
    echo "Each: $each"
    echo [[ $each == 0Rb* ]]
    if [[ $each == 0Rb* ]]; then
        component=echo $each | awk -v FS="(componentName: |,|referenceName: |,)" '{print $3}'
        reference=echo $each | awk -v FS="(componentName: |,|referenceName: |,)" '{print $6}'
        echo "component: $component"
        echo "reference: $component"
    fi
done

But it does not work, and I don't understand why. When I run this line in the console,

echo $x | sed 's/{componentId: /\n/g' 

I can see the invalid JSON being split into lines correctly, but when I feed it into the for loop, each variable receives a much smaller chunk:

Each: 00N5E000005vm9e,

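I am confused.

What I want to do is extract, for each item in the invalid JSON, the value between componentName: and the following comma, and the value between referenceName: and the following comma, but only for items that pass the componentId prefix filter (the script above keeps IDs matching 0Rb*). Is there a way to do this?

I also tried jq -n $x, but it fails with jq: error: syntax error, unexpected IDENT, expecting '}' (Unix shell quoting issues?) at <top-level>, line 1: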

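【Comments on the question】:

  • With "for var in value", splitting happens on whitespace (spaces, tabs, newlines...), not just on newlines. Pipe your sed command into a while IFS= read line loop instead.
  • When I try echo $x | while IFS= read -r each; do, it takes the whole JSON as a single variable.
  • Just think of it as non-JSON rather than invalid JSON, and write a tool to parse it. "component=echo": why are you assigning echo to component? Please check your script with shellcheck and fix the errors. "invalid JSON": why is it invalid JSON, and what makes it invalid? "do is to extract the value": to extract values in a shell, regular expressions with awk or sed are the usual choice, but to parse a file in some format, write a parser in a better-suited language such as python or perl. "jq -n $x": all of your variable expansions are missing quotes. Please check your script with shellcheck.
  • "I am confused.": the result of an unquoted command substitution $(...) undergoes word splitting, which depends on IFS; by default that is whitespace (space, tab, newline). The result of $(...) is split into words at every space or newline, so $each becomes one word at a time. To read lines, see mywiki.wooledge.org/BashFAQ/001
  • Maybe you can use a YAML parser, because this data is YAML rather than JSON.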
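
The word-splitting behaviour described in these comments can be reproduced on a small input. The following is a minimal sketch, assuming GNU sed (for "\n" in the replacement) and a shortened, hypothetical two-object sample instead of the real $x; the function names are made up for illustration. It also shows why echo $x | while read sees a single line: $x contains no newline until sed inserts one.

```shell
#!/usr/bin/env bash
# Shortened, hypothetical sample in the same shape as $x (two objects only).
sample='{componentId: 00N1, componentName: Field}, {componentId: 0Rb2, componentName: Versions}'

# Split the string before each object, as the sed command in the question does.
# The "\n" in the replacement is a GNU sed feature.
split_objects() {
    printf '%s\n' "$1" | sed 's/{componentId: /\n/g'
}

# Chunks seen by `for`: the unquoted $(...) is split on every space and newline.
count_for_chunks() {
    local n=0 chunk
    for chunk in $(split_objects "$1"); do
        n=$((n + 1))
    done
    printf '%s\n' "$n"
}

# Lines seen by `while IFS= read -r`: one per object (empty lines are skipped).
count_read_lines() {
    split_objects "$1" | {
        n=0
        while IFS= read -r line; do
            if [ -n "$line" ]; then
                n=$((n + 1))
            fi
        done
        printf '%s\n' "$n"
    }
}

count_for_chunks "$sample"   # prints 6
count_read_lines "$sample"   # prints 2
```

The for loop sees six space-separated words, while the read loop sees one line per object, which matches the "Each: 00N5E000005vm9e," output reported in the question.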

Tags: json bash yaml


【Solution 1】:

Treating the data as JSON

Use sed to convert it back into valid JSON, for example:

# Remove redundant space (assuming the text is in the `x` variable)
<<<"$x" sed 's/: /:/g; s/, /,/g' |

# Quote all "words"
sed -E 's/[^"{}:,]+/"&"/g'       |

# Separate objects
sed 's/},{/}\n{/g'               |

# Parse json
jq .

Output:

{
  "componentId": "00N5E000005vm9e",
  "componentName": "Field",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "00N5E000005vm9e",
  "componentName": "Field",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "00N5E000005vm9e",
  "componentName": "Field",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "0Rb5E000000BGVi",
  "componentName": "Versions",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "0Rb5E000000BGVj",
  "componentName": "Approves",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "0Rb5E000000BGVe",
  "componentName": "activityThreads",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "0Rb5E000000BGVf",
  "componentName": "Attachments",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
{
  "componentId": "0Rb5E000000BGVh",
  "componentName": "Details",
  "referenceId": "0M05E0000002XbV",
  "referenceName": "RecordPageName1",
  "referenceUrl": "null",
  "message": "Component is in use by another component in your organization.",
  "reasonCode": "10"
}
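
The three sed stages above can be checked without jq on a shortened input. This is a minimal sketch, assuming GNU sed and a hypothetical two-object sample; to_json_lines is an illustrative name, not part of the original answer.

```shell
#!/usr/bin/env bash
# Shortened, hypothetical sample in the same shape as $x.
sample='{componentId: 0Rb1, message: In use by another component., reasonCode: 10}, {componentId: 00N2, message: In use., reasonCode: 10}'

# Same three sed stages as above, without the final jq call
# (GNU sed assumed for "\n" in the replacement).
to_json_lines() {
    printf '%s\n' "$1" |
        sed 's/: /:/g; s/, /,/g' |       # drop the space after ":" and ","
        sed -E 's/[^"{}:,]+/"&"/g' |     # wrap every key and value in quotes
        sed 's/},{/}\n{/g'               # put one object on each line
}

to_json_lines "$sample"
# prints:
# {"componentId":"0Rb1","message":"In use by another component.","reasonCode":"10"}
# {"componentId":"00N2","message":"In use.","reasonCode":"10"}
```

Because every run of characters between the separators is quoted, all values come out as strings ("10", "null"), as in the output above. The conversion also relies on no value containing ": ", ", " or a double quote; a message with a comma followed by a space, for instance, would be split into separate tokens and produce invalid JSON.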

To iterate over componentId and referenceId, you can use jq's @tsv format operator, for example:

... | jq -r '[ .componentId, .referenceId ] | @tsv'

Output:

00N5E000005vm9e 0M05E0000002XbV
00N5E000005vm9e 0M05E0000002XbV
00N5E000005vm9e 0M05E0000002XbV
0Rb5E000000BGVi 0M05E0000002XbV
0Rb5E000000BGVj 0M05E0000002XbV
0Rb5E000000BGVe 0M05E0000002XbV
0Rb5E000000BGVf 0M05E0000002XbV
0Rb5E000000BGVh 0M05E0000002XbV

Treating the data as YAML

As @léa mentioned, you can use yq to parse this string as a YAML array. Here is my take on that approach, using version 4.13.2 of Mike Farah's yq:

<<<"[$x]" yq e '.[] | .componentId + " " + .referenceId' -

Output:

00N5E000005vm9e 0M05E0000002XbV
00N5E000005vm9e 0M05E0000002XbV
00N5E000005vm9e 0M05E0000002XbV
0Rb5E000000BGVi 0M05E0000002XbV
0Rb5E000000BGVj 0M05E0000002XbV
0Rb5E000000BGVe 0M05E0000002XbV
0Rb5E000000BGVf 0M05E0000002XbV
0Rb5E000000BGVh 0M05E0000002XbV

Parsing the variables in a bash loop

You can pipe the result of either solution above into a while read loop, for example:

... | while read -r componentId referenceId; do
  : Do your processing here with $componentId and $referenceId
done
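
As a concrete version of the loop above, here is a minimal sketch that filters rows by the 0Rb prefix used in the question. The rows function is a hypothetical stand-in for the jq @tsv output (two rows copied from the output shown earlier), so the example runs without jq or yq; process_rows is likewise an illustrative name.

```shell
#!/usr/bin/env bash
# Stand-in for the jq @tsv output: two tab-separated columns per row.
rows() {
    printf '%s\t%s\n' \
        00N5E000005vm9e 0M05E0000002XbV \
        0Rb5E000000BGVi 0M05E0000002XbV
}

# Read each row into two variables and keep only component IDs starting with 0Rb.
process_rows() {
    local tab componentId referenceId
    tab=$(printf '\t')
    while IFS=$tab read -r componentId referenceId; do
        case $componentId in
            0Rb*) printf 'component=%s reference=%s\n' "$componentId" "$referenceId" ;;
        esac
    done
}

rows | process_rows   # prints: component=0Rb5E000000BGVi reference=0M05E0000002XbV
```

Setting IFS to a tab keeps values that contain spaces intact; for the space-separated yq output the default IFS is enough. Note that in bash a loop at the end of a pipeline normally runs in a subshell, so variables assigned inside it are not visible after done.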

【Comments】:

  • Did you mean echo $x | sed 's/: /:/g'?
  • I mean that sed 's/: /:/g' should be used instead of sed 's/: //g'.
  • @Patlatus: Yes, only the space should be removed. That was a copy error, sorry.
【Solution 2】:

This input string is the contents of a YAML array of objects, minus the enclosing array. So parse it with a YAML parser.

Using Python:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys
import yaml
import json

# Your input invalid JSON but valid YAML elements part of an array
x = "{componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVi, componentName: Versions, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVj, componentName: Approves, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVe, componentName: activityThreads, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVf, componentName: Attachments, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVh, componentName: Details, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}"

# Compose yamlstring from x by adding the missing data array container
yamlstring = "data: [" + x + "]"

# Load data from the yamlstring
data = yaml.load(yamlstring, yaml.SafeLoader)

# Output data as JSON
json.dump(data, sys.stdout, indent=2)

Or from the shell, using yq as the parser:

#!/usr/bin/env sh

x="{componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 00N5E000005vm9e, componentName: Field, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVi, componentName: Versions, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVj, componentName: Approves, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVe, componentName: activityThreads, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVf, componentName: Attachments, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}, {componentId: 0Rb5E000000BGVh, componentName: Details, referenceId: 0M05E0000002XbV, referenceName: RecordPageName1, referenceUrl: null, message: Component is in use by another component in your organization., reasonCode: 10}"

yamlstring="data: [$x]"

printf %s "$yamlstring" | yq -I 4 -o json e '.' -

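【Comments】:

  • Isn't yq part of the bash distribution? Do I have to install it? How do I install yq?
  • @Patlatus yq is not part of the Bash distribution. It has to be installed. The python3 yaml and json libraries are readily available, whereas yq is much less integrated into distributions and often needs a snap or a docker image. Besides, python is a scripting language far better suited than shell for handling this kind of structured data.
  • Sounds like an interesting approach, but I just want a simple script without python, and yq is hard to install.
【Solution 3】: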

Thanks to the comments, it looks like I have figured it out.

# IFS= (empty) keeps each line intact; IFS=\n would set IFS to the letter "n"
echo "$x" | sed 's/{componentId: /\n/g' | while IFS= read -r each; do
    #echo "Each: $each"
    #echo [[ $each == 0Rb* ]]
    if [[ $each == 0Rb* ]]; then
        component=$(echo "$each" | awk -v FS="(componentName: |,|referenceName: |,)" '{print $3}')
        reference=$(echo "$each" | awk -v FS="(componentName: |,|referenceName: |,)" '{print $6}')
        echo "component: $component"
        echo "reference: $reference"
    fi
done
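
The loop above picks fields by position ($3, $6), which depends on the key order staying fixed. As an alternative, here is a minimal awk-only sketch that looks values up by key name. It is an assumption-laden illustration rather than the asker's method: it uses a shortened, hypothetical sample, a made-up function name (extract_names), and it assumes no value contains "}" or ", ".

```shell
#!/usr/bin/env bash
# Shortened, hypothetical sample in the same shape as $x.
sample='{componentId: 00N1, componentName: Field, referenceName: Page1, reasonCode: 10}, {componentId: 0Rb2, componentName: Versions, referenceName: Page1, reasonCode: 10}'

# Print "componentName<TAB>referenceName" for objects whose componentId
# starts with the given prefix (second argument, default 0Rb).
extract_names() {
    printf '%s\n' "$1" | awk -v RS='}' -v prefix="${2:-0Rb}" '
        {
            split("", field)                 # clear values from the previous object
            sub(/^[^{]*[{]/, "")             # drop everything up to the opening brace
            n = split($0, pair, ", ")        # "key: value" pairs
            for (i = 1; i <= n; i++) {
                key = pair[i];   sub(/: .*/, "", key)
                value = pair[i]; sub(/^[^:]*: /, "", value)
                field[key] = value
            }
            if (index(field["componentId"], prefix) == 1)
                print field["componentName"] "\t" field["referenceName"]
        }'
}

extract_names "$sample"   # prints Versions and Page1 separated by a tab
```

Using "}" as the record separator gives awk one object per record, so no separate sed step or shell loop is needed.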

【Comments】:
