【Question title】: Python - Looping through list and adding an OR option
【Posted】: 2021-05-17 04:14:51
【Question】:

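I have 2 lists in Python. I want to check whether a keyword appears in my text; if it does, I extract the sentence, and if not, I enter "not found". I have another list that holds the column names for all the keywords.

My understanding is that the length of the data going into the dataframe and the length of the list need to be the same.

I want to check whether the text contains the word "parrot" or "parrots" and, if it does, add the sentence to the dataframe under the same column.

I don't want an extra column, because parrot and parrots are so similar that they can go under the same column.

I'm not sure how to do this: should I put them in a dictionary or in a nested list? Could someone please advise?

Dummy code below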

df_cols=[column1, column2, column3, column4]

#here I have added parrot OR parrots to explain the example

keywords =['cat','dog','parrot', 'parrots','sheep']

text ='the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'

code:
lst=[]
for i in text.split('.')):
    if j in i:
       lst.append('.'.join(text.split('.')) 
    if j == 'parrot' or j == 'parrots':

    #getting an error here - I want to check if parrot/parrots is in my text and 
    #then join it to append one element to my output list

       lst.append(' '.join(i.split('.')[0]))
    else:
       lst.append('not found')
    break

Desired output:

lst = ['the cat was here today', 'dog is so cute', 'parrots are so colorful', 'not found']

Desired dataframe:


column1     the cat was here today 
column2     dog is so cute
column3     parrots are so colorful
column4     not found

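Thanks

【Comments】:

  • A few questions: why are you using a dataframe? Why not just use ['cat', 'dog', 'parrot', 'parrots', 'sheep'] so that both parrot and parrots are included? Why is there a break at the end of the for loop?
  • I added the break because I only want the first instance where a keyword is found in my text. I'm using a dataframe because I need to output this information to Excel for further analysis. So essentially, it loops over each keyword and puts the output into a list, which I then put into a dataframe - hope that makes sense? If I include both parrot and parrots, they are appended to the list as 2 separate items - I need them appended as one item so that I can assign the output under one column.
  • So what is the full structure of the dataframe/table? How do you know what columns 1, 2, 3 and 4 correspond to?
  • Column1 is the first item in the for loop - so that would be the cat was here today, and so on until all the columns are done.
  • You're asking some unusual questions, which suggests there may be a better way to achieve what you're trying to do. What exactly are you trying to accomplish?

Tags: python python-3.x list dataframe for-loop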


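【Solution 1】:

I don't know whether this is the best code, but it does what you asked. Basically, it checks whether each keyword is a str or a set and performs the necessary check for each case.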

keywords =['cat','dog',{'parrot', 'parrots'},'sheep']

text ='the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'
text_list = text.split('.')

ls = []
for i in keywords:
    if type(i) is str:
        isFound = False
        for j in text_list:
            if i in j:
                ls.append(j)
                isFound = True
                break

        if isFound == False:
            ls.append('not found')
    elif type(i) is set:
        isFound = False
        for x in i:
            for j in text_list:
                if x in j:
                    ls.append(j)
                    isFound = True
                    break

            if isFound == True:
                break

        if isFound == False:
            ls.append('not found')
    

The result looks like this:

['the cat was here today', ' dog is so cute', ' parrots are so colorful', 'not found']
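
The two branches above differ only in whether there is one keyword or several, so they can be folded together. Below is a minimal sketch of that idea, assuming every entry is written as a set of alternatives; the names keyword_groups and first_match are hypothetical and not part of the answer. Unlike the code above, it strips the whitespace around each sentence, and for a group it returns the earliest sentence that contains any of the alternatives.

```python
# Each entry is a group of alternative keywords that share one output column.
# (keyword_groups and first_match are illustrative names, not from the answer.)
keyword_groups = [{'cat'}, {'dog'}, {'parrot', 'parrots'}, {'sheep'}]

text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'

# Split into sentences, dropping empty pieces and surrounding whitespace.
sentences = [s.strip() for s in text.split('.') if s.strip()]


def first_match(group, sentences, default='not found'):
    """Return the first sentence containing any keyword in group (substring match)."""
    return next((s for s in sentences if any(k in s for k in group)), default)


ls = [first_match(group, sentences) for group in keyword_groups]
print(ls)
# ['the cat was here today', 'dog is so cute', 'parrots are so colorful', 'not found']
```

Like the original, this is a plain substring test, so a keyword such as cat would also match inside a longer word such as category.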

【Comments】:

    【Solution 2】:

    You can use regular expressions:

    import re
    import pandas as pd
    
    keywords = ['cat','dog', ('parrot', 'parrots'), 'sheep']
    sentence_regex = r"[^\.]+\s*(\w*)\s*[^\.]+"  # regex to split sentences
    animal_regex = re.compile(r"^.*\s+({})\s+.*$".format("|".join(w if isinstance(w, str) else "(?:{})".format("|".join(w)) for w in keywords)))
    
    from_animal_to_list = {ww: i for i, w in enumerate(keywords) for ww in (w if isinstance(w, tuple) else [w])}
    
    text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'
    data_for_df = [[] for _ in range(max(from_animal_to_list.values()) + 1)]
    for sentence, m in map(lambda s: (s.group(0), animal_regex.match(s.group(0))), re.finditer(sentence_regex, text)):
        animal = m.groups()[0] if m is not None else 'not found'
        if animal != 'not found':
            data_for_df[from_animal_to_list[animal]].append(sentence)
    data_for_df = [d if len(d) else ['not found'] for d in data_for_df]
    
    sentence_df = pd.DataFrame(
        data=data_for_df,
        # index=keywords,  # uncomment here if you prefer the name of the animals as index
        index=[f'column{i}' for i, _ in enumerate(data_for_df, start=1)],
        columns=['sentence']
    )  # you can transpose this dataframe if you prefer, adding '.T' after ')'
    
    valid_sentences = [d[0] if len(d) else 'not found' for d in data_for_df]
    

    valid_sentences is ['the cat was here today', ' dog is so cute', ' parrots are so colorful', 'not found'], and sentence_df is:

                             sentence
    column1    the cat was here today
    column2            dog is so cute
    column3   parrots are so colorful
    column4                 not found
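
One thing to be aware of in the pattern above: animal_regex requires whitespace on both sides of the keyword, so a keyword that is the very first or very last word of a sentence would not match. A possible variation, sketched here using only the standard library, is to use word boundaries instead. The helper name group_pattern is hypothetical, and this sketch keeps only the first matching sentence per keyword group rather than building the dataframe.

```python
import re

keywords = ['cat', 'dog', ('parrot', 'parrots'), 'sheep']
text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'

# Split into sentences, dropping empty pieces and surrounding whitespace.
sentences = [s.strip() for s in text.split('.') if s.strip()]


def group_pattern(entry):
    """Compile a whole-word pattern for one keyword or a tuple of alternatives."""
    words = (entry,) if isinstance(entry, str) else entry
    alternatives = '|'.join(re.escape(w) for w in words)
    # \b matches at word boundaries, so the keyword may start or end a sentence.
    return re.compile(r'\b(?:{})\b'.format(alternatives))


valid_sentences = []
for entry in keywords:
    pattern = group_pattern(entry)
    # First sentence in which the pattern occurs, or the fallback string.
    match = next((s for s in sentences if pattern.search(s)), 'not found')
    valid_sentences.append(match)

print(valid_sentences)
# ['the cat was here today', 'dog is so cute', 'parrots are so colorful', 'not found']
```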
    

    【Comments】:

      【Solution 3】:

      How about something like this:

      >>> keywords ={1:['cat'], 2:['dog'], 3:['parrot', 'parrots'], 4:['sheep']}
      >>> def fun(text,split="."):
              for t in text.split(split):
                  for c,v in keywords.items():
                      if any(k in t for k in v):
                          yield c,t
                          break
      
      >>>
      >>> import collections
      >>> res = collections.defaultdict(list)
      >>> for c,t in fun(text):
              res[c].append(t)
      
          
      >>> for c in keywords:
              if c not in res:
                  res[c].append("not found")
      
              
      >>> print(res)
      defaultdict(<class 'list'>, {1: ['the cat was here today'], 2: [' dog is so cute'], 3: [' parrots are so colorful'], 4: ['not found']})
      >>> 
      

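      Because we can have multiple keywords for the same column, we build a dictionary holding the keywords for each column, then simply check whether any of them appears in the text and yield the result. I also yield the column each sentence belongs to, build the result from that, and finally fill in the missing entries.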
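
The interactive session above relies on text already being defined from the question. As a rough sketch, the same steps can be written as a standalone script that ends with the flat list the question asked for. The names lst and rows, and the column1..column4 labels, are assumptions taken from the question's desired output rather than from this answer.

```python
import collections

# Column number -> alternative keywords that share that column.
keywords = {1: ['cat'], 2: ['dog'], 3: ['parrot', 'parrots'], 4: ['sheep']}
text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'


def fun(text, split='.'):
    """Yield (column, sentence) for each sentence that contains a keyword."""
    for t in text.split(split):
        for c, v in keywords.items():
            if any(k in t for k in v):
                yield c, t
                break


res = collections.defaultdict(list)
for c, t in fun(text):
    res[c].append(t)

# Keep only the first sentence found per column; fall back to 'not found'.
# Using "c in res" avoids creating empty entries in the defaultdict.
lst = [res[c][0].strip() if c in res else 'not found' for c in keywords]

# Hypothetical column labels, following the question's desired dataframe.
rows = {f'column{c}': sentence for c, sentence in zip(keywords, lst)}

print(lst)
print(rows)
```

If pandas is available, a mapping like rows could then be passed to it, for example as pd.Series(rows), before exporting to Excel.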

        【Comments】:
