【Question title】: How to make a list of lists from a pandas DataFrame?
【Posted】: 2018-03-11 23:51:16
【Question】:

I have a pandas DataFrame with words and tags:

  words   tags
0 I       WW
1 am      XX
2 newbie  YY
3 .       ZZ
4 You     WW
5 are     XX
6 cool    YY
7 .       ZZ

Is there a way to create a list of lists from the DataFrame, like this:

[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.','ZZ')], 
 [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.','ZZ')]]

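It is a list of lists of tuples. Each inner list ends with ('.', 'ZZ'), which marks the end of a sentence.

I could iterate over every row of the DataFrame, build a list, and append it whenever the condition is true, but is there a more "pandas" way to solve this?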

【Comments】:

    Tags: python list pandas dataframe tuples


    【Solution 1】:

    Here is one way:

    In [5149]: dft = df.apply(tuple, 1)
    
    In [5150]: parts = (dft == ('.', 'ZZ')).shift().cumsum().bfill()
               # alternative, from Alexander's answer: parts = (dft.shift() == ('.', 'ZZ')).cumsum()
    
    In [5151]: [x.values.tolist() for _, x in dft.groupby(parts)]
    Out[5151]:
    [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
     [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
    

    Or,

    In [5152]: dft.groupby(parts).apply(list).tolist()
    Out[5152]:
    [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
     [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
    

    Or,

    In [5165]: list(dft.groupby(parts).apply(list))
    Out[5165]:
    [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
     [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
    

    Details

    In [5153]: parts
    Out[5153]:
    0    0.0
    1    0.0
    2    0.0
    3    0.0
    4    1.0
    5    1.0
    6    1.0
    7    1.0
    dtype: float64
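
    The group key above can also be built from plain column comparisons. The following is a minimal sketch, not the answerer's exact code: it assumes the example frame from the question, and the names is_end, starts_new, and pairs are introduced here only for illustration. Passing fill_value=False to shift keeps the mask boolean, so the labels come out as integers and no bfill() is needed.

```python
import pandas as pd

# Example frame from the question (reconstructed here so the sketch runs on its own).
df = pd.DataFrame({
    "words": ["I", "am", "newbie", ".", "You", "are", "cool", "."],
    "tags": ["WW", "XX", "YY", "ZZ", "WW", "XX", "YY", "ZZ"],
})

# True on the rows that end a sentence.
is_end = (df["words"] == ".") & (df["tags"] == "ZZ")

# Shift down by one row so the flag lands on the first row of the next sentence.
# fill_value=False keeps the Series boolean instead of introducing NaN.
starts_new = is_end.shift(1, fill_value=False)

# The running count of sentence starts yields one integer label per sentence.
parts = starts_new.cumsum()

# Build the (word, tag) tuples once, then collect them per label.
pairs = pd.Series(list(zip(df["words"], df["tags"])), index=df.index)
result = [group.tolist() for _, group in pairs.groupby(parts)]

print(parts.tolist())
print(result)
```

    Shifting before the cumulative sum is what keeps each terminator inside the sentence it closes rather than at the start of the next one.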
    

    【Comments】:

      【Solution 2】:

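      The first part (df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum())) groups the DataFrame into runs of consecutive rows, each running up to and including a row whose 'words' value is '.' and whose 'tags' value is 'ZZ'. This is a variation of the shift-cumsum pattern (search SO for pandas shift cumsum and you should find many variants).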

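      The second part (.apply(lambda group: list(zip(group['words'], group['tags'])))) builds the (word, tag) tuples for the rows of each group. In Python 3, zip returns a lazy iterator, so it is wrapped in list() to get actual lists, e.g.

      0     [(I, WW), (am, XX), (newbie, YY), (., ZZ)]
      1    [(You, WW), (are, XX), (cool, YY), (., ZZ)]
      dtype: object
      

      The last part (.values.tolist()) converts the resulting Series into the desired list of lists.

      >>> df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum()).apply(
              lambda group: list(zip(group['words'], group['tags']))).values.tolist()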
      [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
       [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
      

      【Comments】:

      • I think Alex assumed that, since these are NLP tags, '.' will always be tagged 'ZZ'.
      • But I have modified it to match the specific requirement.
      【Solution 3】:

      You can also use np.array_split, i.e.

      li = list(filter(None,[i.apply(tuple,1).values.tolist() \
           for i in np.array_split(df,df[(df['words'] == '.') & (df['tags'] == 'ZZ')].index+1)]))
      

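      # Alternatively, build the tuples first and split the resulting Series.
      # len(i) > 0 drops only the empty trailing chunk, so one-token sentences are kept.
      x = df.apply(tuple,1)
      li = [ i.tolist() for i in np.array_split(x,x[x==('.','ZZ')].index+1) if len(i) > 0]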
      

      Output:

      [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
       [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
      

      【Comments】:

        【Solution 4】:

        If performance matters, you can first create tuples from all the values and then split them into sublists:

        from  itertools import groupby
        
        L = list(zip(df['words'], df['tags']))
        print (L)
        [('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), 
         ('.', 'ZZ'), ('You', 'WW'), ('are', 'XX'), 
         ('cool', 'YY'), ('.', 'ZZ')]
        
        sep = ('.','ZZ')
        new_L = [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k] 
        print (new_L)
        
        [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')], 
         [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
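
        The groupby one-liner above assumes that every sentence ends with the separator and that separators never occur back to back. As a hedged alternative, here is a minimal single-pass sketch in plain Python; split_sentences is a hypothetical helper name introduced for illustration, and it assumes the pairs were already built with list(zip(df['words'], df['tags'])).

```python
def split_sentences(pairs, sep=(".", "ZZ")):
    """Split a flat list of (word, tag) tuples into sentences.

    Each sentence keeps its closing separator. Tokens left over after the
    last separator are returned as a final, unterminated sentence.
    """
    sentences = []
    current = []
    for pair in pairs:
        current.append(pair)
        if pair == sep:
            # The separator closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens with no closing separator.
        sentences.append(current)
    return sentences


# Stand-in for list(zip(df['words'], df['tags'])) on the example frame.
L = [("I", "WW"), ("am", "XX"), ("newbie", "YY"), (".", "ZZ"),
     ("You", "WW"), ("are", "XX"), ("cool", "YY"), (".", "ZZ")]

print(split_sentences(L))
```

        No timing is claimed for this sketch; its point is predictable behaviour on input that does not end with ('.', 'ZZ').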
        

        Timings

        df = pd.concat([df]*1000).reset_index(drop=True)
        
        def zero(df):
            dft = df.apply(tuple, 1)
            return ([x.values.tolist() for _, x in dft.groupby((dft == ('.', 'ZZ')).shift().cumsum().bfill())])
        
        In [55]: %timeit ([list(g) + [('.','ZZ')] for k, g in groupby(list(zip(df['words'], df['tags'])), lambda x: x==('.','ZZ')) if not k] )
        100 loops, best of 3: 4.14 ms per loop
        
        def pir(df):
            v = df.values
            return ([list(map(tuple, x)) for x in np.split(v, np.where((v == ['.', 'ZZ']).all(1)[:-1])[0] + 1)])
        
        In [68]: %timeit (pir(df))
        10 loops, best of 3: 21.9 ms per loop
        
        
        In [56]: %timeit (zero(df))
        1 loop, best of 3: 328 ms per loop
        
        In [57]: %timeit (df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum()).apply(lambda group: list(zip(group['words'], group['tags']))).values.tolist())
        1 loop, best of 3: 286 ms per loop
        
        In [58]: %timeit (list(filter(None,[i.apply(tuple,1).values.tolist() for i in np.array_split(df,df[(df['words'] == '.') & (df['tags'] == 'ZZ')].index+1)])))
        1 loop, best of 3: 1.31 s per loop
        

        I posted a separate question about creating the sublists; the solutions from it are timed below:

        def jez_coldspeed(df):
            L = list(zip(df['words'], df['tags']))
            L2 = []
            for i in L[::-1]:
                if i == ('.','ZZ'):
                    L2.append([])
        
                L2[-1].append(i)
        
            return [x[::-1] for x in L2[::-1]]
        
        def jez_coldspeed1(df):
            L = list(zip(df['words'], df['tags']))
            L2 = []
            sep = ('.','ZZ')
            for i in reversed(L):
                 if i == sep:
                     L2.append([])
        
                 L2[-1].append(i)
        
            return [x[::-1] for x in reversed(L2)]
        
        
        In [74]: %timeit (jez_coldspeed(df))
        100 loops, best of 3: 2.96 ms per loop
        
        In [75]: %timeit (jez_coldspeed1(df))
        100 loops, best of 3: 2.95 ms per loop
        

        def jez_theBuzzyCoder(df):
            L = list(zip(df['words'], df['tags']))
            a = list()
            start = 0
            sep = ('.', 'ZZ')
        
            while start < len(L):
                try:
                    # list.index raises ValueError (it never returns -1) when sep is absent
                    end = L.index(sep, start) + 1
                except ValueError:
                    break
                a.append(L[start:end])
                start = end
            return a
        
        
        print (jez_theBuzzyCoder(df))
        
        In [81]: %timeit (jez_theBuzzyCoder(df))
        100 loops, best of 3: 3.16 ms per loop
        

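        【Comments】:

        • This method is definitely the fastest.
        • Really fast indeed.
        • Aha! Really fast indeed!
        • Wow! Mind-blowing. Thanks, everyone! (Especially you, jezrael xD)
        【Solution 5】: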
        v = df.values
        
        [
            list(map(tuple, x))
            for x in np.split(v, np.where((v == ['.', 'ZZ']).all(1)[:-1])[0] + 1)
        ]
        
        [[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
         [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
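
        The answer above comes without an explanation, so here is a minimal sketch of the same idea broken into steps. It assumes the example data from the question, written out as an object array in place of df.values; the names sep, is_end, and cut_points are introduced only for illustration.

```python
import numpy as np

# Stand-in for df.values on the example frame from the question.
v = np.array([
    ["I", "WW"], ["am", "XX"], ["newbie", "YY"], [".", "ZZ"],
    ["You", "WW"], ["are", "XX"], ["cool", "YY"], [".", "ZZ"],
], dtype=object)

sep = np.array([".", "ZZ"], dtype=object)

# Row-wise test: True where both columns match the separator.
is_end = (v == sep).all(axis=1)

# Ignore the flag on the final row: cutting after the last row would
# leave an empty trailing chunk. Then cut just after each terminator.
cut_points = np.where(is_end[:-1])[0] + 1

# np.split cuts the array before each index in cut_points.
chunks = np.split(v, cut_points)
result = [list(map(tuple, chunk)) for chunk in chunks]

print(cut_points)
print(result)
```

        Dropping the last flag with [:-1] plays the same role as the filter(None, ...) call in the np.array_split solution: both avoid an empty final piece.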
        

        【Comments】:
