【问题标题】:How to Read a Text File of Dictionaries into a DataFrame如何将字典的文本文件读入 DataFrame
【发布时间】:2019-06-26 15:04:04
【问题描述】:

我有一个来自 Kaggle of Clash Royale 统计数据的文本文件。它采用 Python 字典的格式。我正在努力找出如何以有意义的方式将其读入文件。好奇最好的方法是什么。这是一个相当复杂的带有列表的字典。

这里的原始数据集: https://www.kaggle.com/s1m0n38/clash-royale-matches-dataset

{'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '@LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}

【问题讨论】:

    标签: python pandas dictionary text json-normalize


    【解决方案1】:

    我将您的数据保存到 .json 文件中,然后只需要遍历每一行并将其视为自己的 JSON 文件,然后我使用 pandas.json_normalize 将其加载到 DataFrame 中,我做了一些猜测您希望 df 的外观如何,但我想出了这个:

    注意: 正确的JSON 需要有双引号而不是单引号,所以我使用替换来解决这个问题。小心不要使用 this 破坏里面的数据。

    注意: 我得到这个工作的方式,我必须合并 'right''left' 所以你丢失了这些数据。如果需要,您可以使用 dict comp 作为解决方法

    import json
    import pandas as pd
    
    with open('cr.json', 'r') as f:
        df = None
        for line in f:
            data = json.loads(line.replace("'", '"'))
            #needed to put the right and left keys together, maybe you can find a way around this, I wasn't
            df1 = pd.json_normalize([data['players']['right'], data['players']['left']],
                         'deck',
                         ['name', 'trophy', 'clan'],
                         meta_prefix='player.',
                         errors='ignore')
            df = pd.concat([df, df1])
        df.rename(columns={0: 'player.troop.name', 1: 'player.troop.level'}, 
                  inplace=True)
        print(df)
    

    打印如下:

       player.troop.name player.troop.level player.name      player.clan  \
    0        Mega Minion                  9    gpa raid       TwoFiveOne   
    1     Electro Wizard                  3    gpa raid       TwoFiveOne   
    2             Arrows                 11    gpa raid       TwoFiveOne   
    3          Lightning                  5    gpa raid       TwoFiveOne   
    4          Tombstone                  9    gpa raid       TwoFiveOne   
    5            The Log                  2    gpa raid       TwoFiveOne   
    6              Giant                  9    gpa raid       TwoFiveOne   
    7             Bowler                  5    gpa raid       TwoFiveOne   
    8           Fireball                  9       Supr4         battusai   
    9            Archers                 12       Supr4         battusai   
    10           Goblins                 12       Supr4         battusai   
    11           Minions                 11       Supr4         battusai   
    12            Bomber                 12       Supr4         battusai   
    13           The Log                  2       Supr4         battusai   
    14        Barbarians                 12       Supr4         battusai   
    15       Royal Giant                 13       Supr4         battusai   
    0         Ice Spirit                 10       TITAN       The Wolves   
    1           Valkyrie                  9       TITAN       The Wolves   
    2          Hog Rider                  9       TITAN       The Wolves   
    3      Inferno Tower                  9       TITAN       The Wolves   
    4            Goblins                 12       TITAN       The Wolves   
    5          Musketeer                  9       TITAN       The Wolves   
    6                Zap                 12       TITAN       The Wolves   
    7           Fireball                  9       TITAN       The Wolves   
    8        Royal Giant                 13       Supr4         battusai   
    9         Ice Wizard                  2       Supr4         battusai   
    10            Bomber                 12       Supr4         battusai   
    11            Knight                 12       Supr4         battusai   
    12          Fireball                  9       Supr4         battusai   
    13        Barbarians                 12       Supr4         battusai   
    14           The Log                  2       Supr4         battusai   
    15           Archers                 12       Supr4         battusai   
    0              Miner                  3      Victor  @LA PERLA NEGRA   
    1          Ice Golem                  9      Victor  @LA PERLA NEGRA   
    2      Spear Goblins                 12      Victor  @LA PERLA NEGRA   
    3       Minion Horde                 12      Victor  @LA PERLA NEGRA   
    4      Inferno Tower                  8      Victor  @LA PERLA NEGRA   
    5            The Log                  2      Victor  @LA PERLA NEGRA   
    6      Skeleton Army                  6      Victor  @LA PERLA NEGRA   
    7           Fireball                 10      Victor  @LA PERLA NEGRA   
    8        Royal Giant                 13       Supr4         battusai   
    9         Ice Wizard                  2       Supr4         battusai   
    10            Bomber                 12       Supr4         battusai   
    11            Knight                 12       Supr4         battusai   
    12          Fireball                  9       Supr4         battusai   
    13        Barbarians                 12       Supr4         battusai   
    14           The Log                  2       Supr4         battusai   
    15           Archers                 12       Supr4         battusai   
    
       player.trophy  
    0           4258  
    1           4258  
    2           4258  
    3           4258  
    4           4258  
    5           4258  
    6           4258  
    7           4258  
    8           4325  
    9           4325  
    10          4325  
    11          4325  
    12          4325  
    13          4325  
    14          4325  
    15          4325  
    0           4237  
    1           4237  
    2           4237  
    3           4237  
    4           4237  
    5           4237  
    6           4237  
    7           4237  
    8           4296  
    9           4296  
    10          4296  
    11          4296  
    12          4296  
    13          4296  
    14          4296  
    15          4296  
    0           4300  
    1           4300  
    2           4300  
    3           4300  
    4           4300  
    5           4300  
    6           4300  
    7           4300  
    8           4267  
    9           4267  
    10          4267  
    11          4267  
    12          4267  
    13          4267  
    14          4267  
    15          4267
    

    df.iloc[0]如下:

    player.troop.name Mega Minion
    player.troop.level         9
    player.name         gpa raid
    player.trophy           4258
    player.clan       TwoFiveOne
    Name: 0, dtype: object
    

    您可以修改 json_normalize 参数,使其符合您的要求,但我希望这足以让您继续前进

    【讨论】:

      【解决方案2】:

      根据这个数据集的synopsis on kaggle,每个字典代表两个玩家之间的比赛。我觉得让数据框中的每一行代表单个匹配的所有特征是有意义的。

      这可以通过几个简短的步骤来完成。

      1. 将所有匹配字典(来自 kaggle 的数据集的每一行)存储在一个列表中:
      matches = [
      {'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'},
      {'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'},
      {'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '@LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}
      ]
      
      1. 从上面的列表中创建一个数据框,它将自动填充包含匹配的typetimeresult 信息的列:
      df = pd.DataFrame(matches)
      
      1. 然后,使用一些简单的逻辑来填充包含比赛中左右球员的decktrophyclanname 信息的列:
      sides = ['right', 'left']
      player_keys = ['deck', 'trophy', 'clan', 'name']
      
      for side in sides:
          for key in player_keys:
              for i, row in df.iterrows():
                  df[side + '_' + key] = df['players'].apply(lambda x: x[side][key])
      
      df = df.drop('players', axis=1) # no longer need this after populating the other columns
      
      df = df.iloc[:, ::-1] # made sense to display columns in order of player info from left to right,
                            # followed by general match info at the far right of the dataframe
      

      生成的数据框如下所示:

          left_name   left_clan   left_trophy   left_deck                                           right_name    right_clan  right_trophy    right_deck                                          type    time         result
      0   Supr4       battusai           4325   [[Fireball, 9], [Archers, 12], [Goblins, 12], ...   gpa raid      TwoFiveOne          4258    [[Mega Minion, 9], [Electro Wizard, 3], [Arrow...   ladder  2017-07-12   [2, 0]
      1   Supr4       battusai           4296   [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ...   TITAN The     Wolves              4237    [[Ice Spirit, 10], [Valkyrie, 9], [Hog Rider, ...   ladder  2017-07-12   [1, 0]
      2   Supr4       battusai           4267   [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ...   Victor        @LA PERLA NEGRA     4300    [[Miner, 3], [Ice Golem, 9], [Spear Goblins, 1...   ladder  2017-07-12   [0, 1]
      

      【讨论】:

        【解决方案3】:
        • 其他答案仅适用于玩具数据,如 OP 中所述。这个答案涉及来自 Kaggle 的实际文件,以及如何清理它。
        • Kaggle filematches.txt,是嵌套的行dicts
          • 在文件中,每行有 4 个顶级键,['players', 'type', 'result', 'time']
        • 读入文件,这将使每一行成为str类型
        • 将其从str 转换为dict 类型与ast.literal_eval
          • 某些行的格式不正确,将导致SyntaxError
        • 可以使用pandas.json_normalize将数据转换为数据帧

        进口

        import pandas as pd
        from ast import literal_eval
        

        清理文件

        # store the data
        data = list()
        
        # store the broken rows
        broken_row = list()
        
        # read in the file
        with open('matches.txt', 'r', encoding='utf-8') as f:  
            
            # read the rows
            rows = f.readlines()
            for row in rows:
                
                # try to convert a row from a string to dict
                try:
                    row = literal_eval(row)
                    data.append(row)
                except SyntaxError:
                    broken_row.append(row)
                    continue
        

        data转换为长DataFrame

        • 对于每场比赛,每个'players.right.deck', 'players.left.deck' 都有一个单独的行。
        # convert data to a dataframe
        players = pd.json_normalize(data)
        
        # add a unique id for each row, which can be used to identify players for a particular game
        df['id'] = df.index
        
        # split the list of lists in right.deck and left.deck to separate rows
        players = df[['id', 'players.right.deck', 'players.left.deck']].apply(pd.Series.explode).reset_index(drop=True)
        
        # drop the original columns
        df.drop(columns=['players.right.deck', 'players.left.deck'], inplace=True)
        
        # right.deck and left.deck are still a list with two values, which need to have separate columns
        players[['right.deck.name', 'right.deck.number']] = pd.DataFrame(players.pop('players.right.deck').values.tolist())
        players[['left.deck.name', 'left.deck.number']] = pd.DataFrame(players.pop('players.left.deck').values.tolist())
        
        # separate the result column into two columns
        df[['right.result', 'left.result']] = pd.DataFrame(df.pop('result').values.tolist())
        
        # merge df with players
        df = df.merge(players, on='id')
        

        df.head(8)

             type        time players.right.trophy players.right.clan players.right.name players.left.trophy players.left.clan players.left.name  id right.result left.result right.deck.name right.deck.number left.deck.name left.deck.number
        0  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0     Mega Minion                 9       Fireball                9
        1  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0  Electro Wizard                 3        Archers               12
        2  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0          Arrows                11        Goblins               12
        3  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0       Lightning                 5        Minions               11
        4  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0       Tombstone                 9         Bomber               12
        5  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0         The Log                 2        The Log                2
        6  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0           Giant                 9     Barbarians               12
        7  ladder  2017-07-12                 4258         TwoFiveOne           gpa raid                4325          battusai             Supr4   0            2           0          Bowler                 5    Royal Giant               13
        

        data 转换为宽数据帧

        • 此选项使用flatten_json 函数。
        • 对于每个匹配项,每个 'players.right.deck', 'players.left.deck' 都有一个单独的列。
        # convert data to a wide dataframe
        df = pd.DataFrame([flatten_json(x) for x in data])
        
        # display(df.head(3))
          players_right_deck_0_0 players_right_deck_0_1 players_right_deck_1_0 players_right_deck_1_1 players_right_deck_2_0 players_right_deck_2_1 players_right_deck_3_0 players_right_deck_3_1 players_right_deck_4_0 players_right_deck_4_1 players_right_deck_5_0 players_right_deck_5_1 players_right_deck_6_0 players_right_deck_6_1 players_right_deck_7_0 players_right_deck_7_1 players_right_trophy players_right_clan players_right_name players_left_deck_0_0 players_left_deck_0_1 players_left_deck_1_0 players_left_deck_1_1 players_left_deck_2_0 players_left_deck_2_1 players_left_deck_3_0 players_left_deck_3_1 players_left_deck_4_0 players_left_deck_4_1 players_left_deck_5_0 players_left_deck_5_1 players_left_deck_6_0 players_left_deck_6_1 players_left_deck_7_0 players_left_deck_7_1 players_left_trophy players_left_clan players_left_name    type result_0 result_1        time
        0            Mega Minion                      9         Electro Wizard                      3                 Arrows                     11              Lightning                      5              Tombstone                      9                The Log                      2                  Giant                      9                 Bowler                      5                 4258         TwoFiveOne           gpa raid              Fireball                     9               Archers                    12               Goblins                    12               Minions                    11                Bomber                    12               The Log                     2            Barbarians                    12           Royal Giant                    13                4325          battusai             Supr4  ladder        2        0  2017-07-12
        1             Ice Spirit                     10               Valkyrie                      9              Hog Rider                      9          Inferno Tower                      9                Goblins                     12              Musketeer                      9                    Zap                     12               Fireball                      9                 4237         The Wolves              TITAN           Royal Giant                    13            Ice Wizard                     2                Bomber                    12                Knight                    12              Fireball                     9            Barbarians                    12               The Log                     2               Archers                    12                4296          battusai             Supr4  ladder        1        0  2017-07-12
        2                  Miner                      3              Ice Golem                      9          Spear Goblins                     12           Minion Horde                     12          Inferno Tower                      8                The Log                      2          Skeleton Army                      6               Fireball                     10                 4300    @LA PERLA NEGRA             Victor           Royal Giant                    13            Ice Wizard                     2                Bomber                    12                Knight                    12              Fireball                     9            Barbarians                    12               The Log                     2               Archers                    12                4267          battusai             Supr4  ladder        0        1  2017-07-12
        

        【讨论】:

          猜你喜欢
          • 1970-01-01
          • 1970-01-01
          • 2013-12-23
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          相关资源
          最近更新 更多