【Question Title】:Find elements between values in an array in a Pandas DataFrame
【Posted】:2022-06-29 22:32:41
【Question】:

I have a DataFrame:

import pandas as pd
data = {'token_1': [['cat', 'bag', 'sitting'],
                    ['dog', 'eats', 'bowls'],
                    ['mouse', 'mustache', 'tail'],
                   ['dog', 'eat', 'meat']],
        'token_2': [['cat', 'from', 'bag', 'cat', 'in', 'bag', 'sitting', 'whole', 'day'],
                    ['dog', 'eats', 'from', 'bowls', 'dog', 'eats', 'always', 'from', 'bowls', 'eats', 'bowl'],
                   ['mouse', 'with', 'a', 'big', 'tail', 'and,' 'ears', 'a', 'mouse', 'with', 'a', 'mustache', 'and', 'a', 'tail' ,'runs', 'fast'],
                   ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl','dog','see','meat','eat']]}

df = pd.DataFrame(data)

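The token_1 column contains no conjunctions or prepositions. I want to recover them from the token_2 column, i.e. find the words that sit between the matching tokens. As I understand it, there are several steps:

  1. Find the first match for token_1[0] in token_2
  2. Check whether the next word is shorter than 4 characters; if so, add it to the list. If not, move on to the first match for token_1[1]
  3. Again check whether the next word is shorter than 4 characters
  4. Repeat the process until we reach the last token, token_1[2]
  5. If there is nothing between the tokens, return the tokens as they are

Or is there a simpler way? In the end, I want to get a new_tokens column: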

+-----------------------+---------------------------------+--------------------------------------------------------------------------------------------+
|token_1                |new_tokens                       |token_2                                                                                     |
+-----------------------+---------------------------------+--------------------------------------------------------------------------------------------+
|[cat, bag, sitting]    |[cat, in, bag, sitting]          |[cat, from, bag, cat, in, bag, sitting, whole, day]                                         |
|[dog, eats, bowls]     |[dog, eats, from, bowls]         |[dog, eats, from, bowls, dog, eats, always, from, bowls, eats, bowl]                        |
|[mouse, mustache, tail]|[mouse, with,mustache, and, tail]|[mouse, with, a, big, tail, and,ears, a, mouse, with, a, mustache, and, a, tail, runs, fast]|
|[dog, eat, meat]       |[dog, eat, meat]                 |[dog, eat, meat, chicken, from, bowl, dog, see, meat, eat]                                  |
+-----------------------+---------------------------------+--------------------------------------------------------------------------------------------+
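
The steps above can be read in more than one way, so the following is only a sketch of one interpretation, not a definitive answer. For each pair of neighbouring tokens in token_1, it looks for the first place in token_2 where the pair is separated only by short words that are not themselves tokens, and keeps those words. The names fill_between, _find_gap and max_len are hypothetical. The stated rule ("shorter than 4") and the desired table do not fully agree, since "from" has four characters yet appears in the second row of new_tokens, so the threshold is left as a parameter.

```python
import pandas as pd


def _find_gap(a, b, words, token_set, max_len):
    """Return the short words between the first usable (a ... b) span in words."""
    for i, word in enumerate(words):
        if word != a:
            continue
        gap = []
        for nxt in words[i + 1:]:
            if nxt == b:
                return gap          # reached b: this span is usable
            if nxt in token_set or len(nxt) > max_len:
                break               # another token or a long word: try the next a
            gap.append(nxt)
    return []                       # no usable span: nothing to insert


def fill_between(tokens, words, max_len=3):
    """Insert short connecting words from `words` between neighbouring tokens.

    max_len is an assumed threshold (3 means "shorter than 4 characters").
    """
    if not tokens:
        return []
    token_set = set(tokens)
    result = [tokens[0]]
    for a, b in zip(tokens, tokens[1:]):
        result.extend(_find_gap(a, b, words, token_set, max_len))
        result.append(b)
    return result


# Two rows taken from the question's data.
df = pd.DataFrame({
    'token_1': [['cat', 'bag', 'sitting'],
                ['dog', 'eat', 'meat']],
    'token_2': [['cat', 'from', 'bag', 'cat', 'in', 'bag', 'sitting', 'whole', 'day'],
                ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl', 'dog', 'see', 'meat', 'eat']],
})

# A plain list comprehension avoids row-wise apply on list-valued columns.
df['new_tokens'] = [fill_between(t1, t2)
                    for t1, t2 in zip(df['token_1'], df['token_2'])]
print(df['new_tokens'].tolist())
# [['cat', 'in', 'bag', 'sitting'], ['dog', 'eat', 'meat']]
```

With max_len=3 this reproduces the first and fourth rows of the desired table, and calling fill_between with max_len=4 on the second row gives [dog, eats, from, bowls]. The third row is not reproduced exactly under either threshold, because this sketch keeps every short word in a gap rather than only the first one.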

【Question Discussion】:

    Tags: python pandas dataframe


    【Solution 1】:

    Use set intersection together with pandas apply:

    import pandas as pd

    data = {'token_1': [['cat', 'bag', 'sitting'],
                        ['dog', 'eats', 'bowls'],
                        ['mouse', 'mustache', 'tail'],
                       ['dog', 'eat', 'meat']],
            'token_2': [['cat', 'from', 'bag', 'cat', 'in', 'bag', 'sitting', 'whole', 'day'],
                        ['dog', 'eats', 'from', 'bowls', 'dog', 'eats', 'always', 'from', 'bowls', 'eats', 'bowl'],
                       ['mouse', 'with', 'a', 'big', 'tail', 'and,' 'ears', 'a', 'mouse', 'with', 'a', 'mustache', 'and', 'a', 'tail' ,'runs', 'fast'],
                       ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl','dog','see','meat','eat']]}
    
    df = pd.DataFrame(data)
    df.reset_index(inplace=True)
    
    # For each row, keep the tokens that occur in both lists (a set, so order is lost)
    df['intersect'] = df.apply(lambda x: set(x['token_1']).intersection(set(x['token_2'])), axis=1)
    print(df)
    

    Output:

    index                  token_1  \
    0      0      [cat, bag, sitting]   
    1      1       [dog, eats, bowls]   
    2      2  [mouse, mustache, tail]   
    3      3         [dog, eat, meat]   
    
                                                 token_2                intersect  
    0  [cat, from, bag, cat, in, bag, sitting, whole,...      {sitting, cat, bag}  
    1  [dog, eats, from, bowls, dog, eats, always, fr...       {dog, bowls, eats}  
    2  [mouse, with, a, big, tail, and,ears, a, mouse...  {tail, mouse, mustache}  
    3  [dog, eat, meat, chicken, from, bowl, dog, see...         {dog, meat, eat}  
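
Because a set is unordered, the intersect column above does not keep the tokens in their token_1 order. If the order matters, a list-based variant is one option. This is a minimal sketch that swaps the set intersection for a membership test; the helper name ordered_intersection and the column name intersect_ordered are hypothetical, and it uses a list comprehension instead of apply.

```python
import pandas as pd

# Two rows taken from the question's data.
df = pd.DataFrame({
    'token_1': [['cat', 'bag', 'sitting'],
                ['dog', 'eat', 'meat']],
    'token_2': [['cat', 'from', 'bag', 'cat', 'in', 'bag', 'sitting', 'whole', 'day'],
                ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl', 'dog', 'see', 'meat', 'eat']],
})


def ordered_intersection(tokens, words):
    """Return the tokens that also occur in words, in their original order."""
    present = set(words)  # set lookup keeps each membership test cheap
    return [tok for tok in tokens if tok in present]


df['intersect_ordered'] = [ordered_intersection(t1, t2)
                           for t1, t2 in zip(df['token_1'], df['token_2'])]
print(df['intersect_ordered'].tolist())
# [['cat', 'bag', 'sitting'], ['dog', 'eat', 'meat']]
```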
    

    【Discussion】:

    • I need to find not only the intersection, but also the elements between the intersecting words