【Question title】: Split JSON file into sentences
【Posted】: 2019-11-12 04:33:55
【Question】:

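I have a JSON file generated by speech-to-text. It returns every detected word, with punctuation marks as separate entries. Now I want to build sentences from it.

I can write a while loop that runs until a period is detected, appends all the words to a list, and returns a sentence from that list. But this while loop stops at the first period. How can I make the loop continue to the end of the JSON file?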

import json

with open(json_file) as f:
    data = json.load(f)

i = 0
sentenceList = []
for word in data['words']:
    # i stops advancing at the first '.', so later iterations add nothing
    while not data['words'][i]['name'] == '.':
        sentenceList.append(data['words'][i]['name'])
        i += 1
    sentence = ' '.join(word for word in sentenceList)
print(sentence)

JSON sample:

"words": [
    {
      "duration": "0.18", 
      "confidence": "0.990", 
      "name": "Is", 
      "time": "0.80"
    }, 
    {
      "duration": "0.27", 
      "confidence": "1.000", 
      "name": "dit", 
      "time": "0.99"
    }, 
    {
      "duration": "0.24", 
      "confidence": "1.000", 
      "name": "met", 
      "time": "1.50"
    }, 
    {
      "duration": "0.54", 
      "confidence": "0.990", 
      "name": "vaart", 
      "time": "1.86"
    }, 
    {
      "duration": "0.33", 
      "confidence": "0.990", 
      "name": ".", 
      "time": "2.40"
    }, 
    {
      "duration": "0.06", 
      "confidence": "0.910", 
      "name": "We", 
      "time": "2.73"
    }, 
    {
      "duration": "0.21", 
      "confidence": "1.000", 
      "name": "hebben", 
      "time": "2.79"
    }, 
    {
      "duration": "0.09", 
      "confidence": "1.000", 
      "name": "het", 
      "time": "3.00"
    }, 
    {
      "duration": "0.42", 
      "confidence": "1.000", 
      "name": "vandaag", 
      "time": "3.09"
    }, 
    {
      "duration": "0.30", 
      "confidence": "1.000", 
      "name": "over", 
      "time": "3.51"
    }, 
    {
      "duration": "0.60", 
      "confidence": "1.000", 
      "name": "België", 
      "time": "3.81"
    }, 
    {
      "duration": "0.18", 
      "confidence": "1.000", 
      "name": ".", 
      "time": "4.50"
    }

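【Comments】:

  • You could change the inner while loop (the while not) into a conditional, and clear the contents of sentenceList once all the required logic has run. There are probably many ways to do this, including using a lambda function.
  • Can you post the expected output?

Tags: python json loops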


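【Solution 1】:

I think the solution is simple. You say, "But this while loop stops at the first period." That is what while does: it keeps looping as long as its condition holds, so here it runs until it reaches a period. So simply replace it with an if.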

import json

# json_file: path to the speech-to-text output
with open(json_file) as f:
    data = json.load(f)

sentenceList = []
for word in data['words']:
    # Check if it's a word or a dot
    if word['name'] != '.':
        # If word, add it to the list
        sentenceList.append(word['name'])
# All words are appended, now join.
# Note: this gives one string with the dots dropped, not one string per sentence.
sentence = ' '.join(sentenceList)
print(sentence)

【Comments】:

    【Solution 2】:

    In your case a simple if statement is enough to detect the end of a sentence (since every word sequence in the input ends with "name": "."):

    sentenceList = []
    for word in data['words']:
        if word['name'] == '.':
            # End of sentence: print the collected words and start over
            sentence = ' '.join(sentenceList)
            sentenceList = []
            print(sentence)
        else:
            sentenceList.append(word['name'])
    

    Output:

    Is dit met vaart
    We hebben het vandaag over België
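
The same if-based approach can be wrapped in a function that returns the sentences instead of printing them. This is only a sketch: the name split_sentences and the terminator parameter are hypothetical, and it assumes a trailing run of words without a final period should still count as a sentence.

```python
def split_sentences(words, terminator='.'):
    """Group word entries into sentences, ending each one at `terminator`."""
    sentences = []
    current = []
    for word in words:
        if word['name'] == terminator:
            # Close the current sentence, skipping empty ones (e.g. "..")
            if current:
                sentences.append(' '.join(current))
            current = []
        else:
            current.append(word['name'])
    # Keep trailing words that were never closed by a terminator
    if current:
        sentences.append(' '.join(current))
    return sentences


# Shortened sample with the same shape as data['words']
sample = [{'name': n} for n in ['Is', 'dit', 'met', 'vaart', '.', 'We', 'hebben', 'het']]
print(split_sentences(sample))  # ['Is dit met vaart', 'We hebben het']
```

Returning a list keeps the splitting logic separate from the output, so the caller decides whether to print, store, or post-process the sentences.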
    

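    【Comments】:

    • I'm also looking for a way to add the start time of each sentence. That is, loop through the JSON file and take the time of the first word. Output: "0.80 Is dit met vaart", "2.73 We hebben het vandaag over België"
    • @user2811144, assuming the first word has time 0 (as the starting point)?
    • In this case it would be 0.80, but it could be anything (depending on what is in the JSON file)
    • @user2811144, if the first item is 0.8, where do you set the time origin?
    • In the JSON I get a time for every word. Here the first word, 'Is', starts at 0.8 s, so that is the start of the first sentence. For the first word of the second sentence (the first word after the period), the start time is 2.73 s, and so on.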
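
One possible way to get the start time requested in the comments above is to remember the "time" of the first word seen after each period. The sketch below is not taken from any of the answers; the function name sentences_with_start is hypothetical, and the time values are kept as the strings found in the JSON.

```python
def sentences_with_start(words, terminator='.'):
    """Return (start_time, sentence) pairs; start_time is the first word's 'time'."""
    results = []
    current = []
    start = None
    for word in words:
        if word['name'] == terminator:
            if current:
                results.append((start, ' '.join(current)))
            current, start = [], None
        else:
            if start is None:
                # First word of a new sentence: remember when it starts
                start = word['time']
            current.append(word['name'])
    # Keep trailing words that were never closed by a terminator
    if current:
        results.append((start, ' '.join(current)))
    return results


# Shortened sample with the same shape as data['words']
words = [
    {'name': 'Is', 'time': '0.80'},
    {'name': 'dit', 'time': '0.99'},
    {'name': '.', 'time': '2.40'},
    {'name': 'We', 'time': '2.73'},
    {'name': 'hebben', 'time': '2.79'},
    {'name': '.', 'time': '4.50'},
]
for start, sentence in sentences_with_start(words):
    print(start, sentence)
# 0.80 Is dit
# 2.73 We hebben
```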
    【Solution 3】:

    Using itertools.groupby:

    data = '''{"words": [
        {
          "duration": "0.18",
          "confidence": "0.990",
          "name": "Is",
          "time": "0.80"
        },
        {
          "duration": "0.27",
          "confidence": "1.000",
          "name": "dit",
          "time": "0.99"
        },
        {
          "duration": "0.24",
          "confidence": "1.000",
          "name": "met",
          "time": "1.50"
        },
        {
          "duration": "0.54",
          "confidence": "0.990",
          "name": "vaart",
          "time": "1.86"
        },
        {
          "duration": "0.33",
          "confidence": "0.990",
          "name": ".",
          "time": "2.40"
        },
        {
          "duration": "0.06",
          "confidence": "0.910",
          "name": "We",
          "time": "2.73"
        },
        {
          "duration": "0.21",
          "confidence": "1.000",
          "name": "hebben",
          "time": "2.79"
        },
        {
          "duration": "0.09",
          "confidence": "1.000",
          "name": "het",
          "time": "3.00"
        },
        {
          "duration": "0.42",
          "confidence": "1.000",
          "name": "vandaag",
          "time": "3.09"
        },
        {
          "duration": "0.30",
          "confidence": "1.000",
          "name": "over",
          "time": "3.51"
        },
        {
          "duration": "0.60",
          "confidence": "1.000",
          "name": "België",
          "time": "3.81"
        },
        {
          "duration": "0.18",
          "confidence": "1.000",
          "name": ".",
          "time": "4.50"
        }
    ]}'''
    
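    import json
    from itertools import groupby

    d = json.loads(data)
    # groupby splits the words into alternating runs of non-period and period entries;
    # keep only the non-period runs (v is True) and join each one into a sentence.
    lst = [' '.join(i['name'] for i in g) + '.'
           for v, g in groupby(d['words'], lambda w: w['name'] != '.') if v]

    print(lst)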
    

    Prints:

    ['Is dit met vaart.', 'We hebben het vandaag over België.']
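
To see why the groupby approach works, it can help to look at the intermediate groups. The sketch below uses a shortened, hypothetical list of plain strings rather than the full word dictionaries; groupby emits one (key, run) pair each time the key function's result changes.

```python
from itertools import groupby

names = ['Is', 'dit', '.', 'We', 'hebben', '.']

# The key is True for words and False for periods, so consecutive
# words end up in one run and each period forms its own run.
groups = [(is_word, list(run)) for is_word, run in groupby(names, key=lambda n: n != '.')]
print(groups)
# [(True, ['Is', 'dit']), (False, ['.']), (True, ['We', 'hebben']), (False, ['.'])]
```

Filtering on the True key, as the list comprehension in the answer does, leaves only the runs of words.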
    

    【Comments】:
