【Question title】: In Python, how do I create an array of words from multiple lists, with word occurrences [duplicate]
【Posted】: 2019-06-27 23:41:40
【Question】:

I have a JSON file containing multiple objects, each with a text field:

{
"messages": 
[
    {"timestamp": "123456789", "timestampIso": "2019-06-26 09:51:00", "agentId": "2001-100001", "skillId": "2001-20000", "agentText": "That customer was great"},
    {"timestamp": "123456789", "timestampIso": "2019-06-26 09:55:00", "agentId": "2001-100001", "skillId": "2001-20001", "agentText": "That customer was stupid\nI hope they don't phone back"},
    {"timestamp": "123456789", "timestampIso": "2019-06-26 09:57:00", "agentId": "2001-100001", "skillId": "2001-20002", "agentText": "Line number 3"},
    {"timestamp": "123456789", "timestampIso": "2019-06-26 09:59:00", "agentId": "2001-100001", "skillId": "2001-20003", "agentText": ""}
]
}

I'm only interested in the "agentText" field.

I basically need to pull out every word in the agentText field and count how many times each word occurs.

So my Python code:

import json

with open('20190626-101200-text-messages.json') as f:
    data = json.load(f)

for message in data['messages']:
    splittext= message['agentText'].strip().replace('\n',' ').replace('\r',' ')
    if len(splittext)>0:
        splittext2 = splittext.split(' ')
        print(splittext2)

gives me this:

['That', 'customer', 'was', 'great']
['That', 'customer', 'was', 'stupid', 'I', 'hope', 'they', "don't", 'phone', 'back']
['Line', 'number', '3']

How can I add each word to an array along with its count? Something like:

That 2
customer 2
was 2
great 1
..

and so on?

【Comments】:

    Tags: python json


    【Solution 1】:

    Take a look at this.

    data = {
        "messages": 
            [
                {"timestamp": "123456789", "timestampIso": "2019-06-26 09:51:00", "agentId": "2001-100001", "skillId": "2001-20000", "agentText": "That customer was great"},
                {"timestamp": "123456789", "timestampIso": "2019-06-26 09:55:00", "agentId": "2001-100001", "skillId": "2001-20001", "agentText": "That customer was stupid\nI hope they don't phone back"},
                {"timestamp": "123456789", "timestampIso": "2019-06-26 09:57:00", "agentId": "2001-100001", "skillId": "2001-20002", "agentText": "Line number 3"},
                {"timestamp": "123456789", "timestampIso": "2019-06-26 09:59:00", "agentId": "2001-100001", "skillId": "2001-20003", "agentText": ""}
            ]
    }
    
    var = []
    
    for row in data['messages']:
        new_row = row['agentText'].split()
        if new_row:
            var.append(new_row)
    
    temp = dict()
    
    for e in var:
        for j in e:
            if j in temp:
                temp[j] = temp[j] + 1
            else:
                temp[j] = 1
    
    for key, value in temp.items():
        print(f'{key}: {value}')
    

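    【Comments】:

    • This sort of works, but my 3 message lines aren't separated by commas (,); they just come out one line after another in the for loop. How do I append one line to the next, separated by commas?
    • data.split() will help you separate them with commas
    • See above, I've made a change to my reply. Use split to separate by commas
    【Solution 2】: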
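For the follow-up in the comments about getting one comma-separated result instead of a separate list per message, a minimal sketch might look like the following. It assumes the goal is a single flat list of words; the names `all_words` and `joined` are illustrative and not from the original answer, and the sample data is trimmed to the `agentText` field.

```python
# Illustrative data: only the agentText field from the question's sample.
data = {
    "messages": [
        {"agentText": "That customer was great"},
        {"agentText": "That customer was stupid\nI hope they don't phone back"},
        {"agentText": "Line number 3"},
        {"agentText": ""},
    ]
}

# Collect the words of every message into one flat list.
all_words = []
for row in data["messages"]:
    # split() with no arguments splits on any whitespace, including "\n",
    # and returns an empty list for an empty string.
    all_words.extend(row["agentText"].split())

# Join the flat list into a single comma-separated string.
joined = ", ".join(all_words)
print(joined)
```

Using `extend` rather than `append` is what keeps the result flat: `append` would add each message's word list as a nested list, which is why the lists print one after another.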
    data = '''{"messages":
    [
        {"timestamp": "123456789", "timestampIso": "2019-06-26 09:51:00", "agentId": "2001-100001", "skillId": "2001-20000", "agentText": "That customer was great"},
        {"timestamp": "123456789", "timestampIso": "2019-06-26 09:55:00", "agentId": "2001-100001", "skillId": "2001-20001", "agentText": "That customer was stupid I hope they don't phone back"},
        {"timestamp": "123456789", "timestampIso": "2019-06-26 09:57:00", "agentId": "2001-100001", "skillId": "2001-20002", "agentText": "Line number 3"},
        {"timestamp": "123456789", "timestampIso": "2019-06-26 09:59:00", "agentId": "2001-100001", "skillId": "2001-20003", "agentText": ""}
    ]
    }
    '''
    
    import json
    from collections import Counter
    from pprint import pprint
    
    def words(data):
        for m in data['messages']:
            yield from m['agentText'].split()
    
    c = Counter(words(json.loads(data)))
    pprint(c.most_common())
    

    Prints:

    [('That', 2),
     ('customer', 2),
     ('was', 2),
     ('great', 1),
     ('stupid', 1),
     ('I', 1),
     ('hope', 1),
     ('they', 1),
     ("don't", 1),
     ('phone', 1),
     ('back', 1),
     ('Line', 1),
     ('number', 1),
     ('3', 1)]
    

    【Comments】:

    • It doesn't seem to like: c = Counter(words(json.loads(data))) pprint(c.most_common()). The pprint has a red underline.
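On the red underline under `pprint`: this is most likely the editor reporting an undefined name, which happens when the `from pprint import pprint` line is missing. That is an assumption, since the comment does not quote the actual error. The sketch below avoids `pprint` altogether and prints in the `word count` format the question asks for. The helper name `count_words` and the trimmed sample data are illustrative, not part of the original answer.

```python
from collections import Counter


def count_words(messages):
    """Count whitespace-separated words across all agentText fields."""
    counts = Counter()
    for message in messages:
        # .get() tolerates a message with no agentText key (assumption:
        # such messages should simply contribute no words).
        counts.update(message.get("agentText", "").split())
    return counts


# Illustrative data: only the agentText field from the question's sample.
data = {
    "messages": [
        {"agentText": "That customer was great"},
        {"agentText": "That customer was stupid\nI hope they don't phone back"},
        {"agentText": "Line number 3"},
        {"agentText": ""},
    ]
}

counts = count_words(data["messages"])

# most_common() yields (word, count) pairs, highest count first.
for word, count in counts.most_common():
    print(word, count)
```

With the file from the question, `data` would come from `json.load(f)` instead of the inline dictionary.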