Python - Looping through list and adding an OR option

【Question title】: Python - Looping through list and adding an OR option
【Posted】: 2021-05-17 04:14:51
【Question description】:

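I have 2 lists in Python. I want to check whether a keyword appears in my text; if it does, I extract the sentence, and if it does not, I enter 'not found'. I have another list that holds the column names for all the keywords.

My understanding is that the length of the data going into the dataframe and the length of the list need to be the same.

I want to check whether the text contains the word 'parrot' or 'parrots' and, if it does, add the sentence to the dataframe under the same column.

I don't want an extra column, because parrot and parrots are very similar, so they can go under the same column.

I am not sure how to do this: should I add them in a dictionary or in a nested list? Can anyone advise, please?

Dummy code below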

df_cols=[column1, column2, column3, column4]

#here I have added parrot OR parrots to explain the example

keywords =['cat','dog','parrot', 'parrots','sheep']

text ='the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'

code:
lst=[]
for i in text.split('.')):
    if j in i:
       lst.append('.'.join(text.split('.')) 
    if j == 'parrot' or j == 'parrots':

    #getting an error here - I want to check if parrot/parrots is in my text and 
    #then join it to append one element to my output list

       lst.append(' '.join(i.split('.')[0]))
    else:
       lst.append('not found')
    break

Desired output:

lst = [the cat was here today,dog is so cute, parrots are so colorful, not found]

Desired dataframe:


column1     the cat was here today 
column2     dog is so cute
column3     parrots are so colorful
column4     not found

Thanks

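【Question comments】:

A few questions: why are you using a dataframe? Why not just use ['cat', 'dog', 'parrot', 'parrots', 'sheep'] so that both parrot and parrots are included? And why is there a break at the end of the for loop?
I added the break because I only want the first instance where the keyword is found in my text. I am using a dataframe because I need to output this information to Excel for further analysis. So essentially, it loops through each keyword and puts the output into a list, which I then load into a dataframe. Hope this makes sense? If I include both parrot and parrots, they are appended to the list as 2 separate items; I need them appended as one item so that I can assign the output under one column.
So what is the full structure of the dataframe/table? How do you know what columns 1, 2, 3 and 4 belong to?
Column1 is the first item in the for loop, so that would be the cat was here today, and it continues until all the columns are done.
You are asking some unusual questions, which suggests there may be a better way to achieve what you are trying to do. What exactly are you trying to accomplish?

Tags: python python-3.x list dataframe for-loop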

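【Solution 1】:

I don't know whether this is the best code, but it does what you asked for. Basically, it checks whether each keyword is a str or a set and performs the appropriate search for each case.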

keywords =['cat','dog',{'parrot', 'parrots'},'sheep']

text ='the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'
text_list = text.split('.')

ls = []
for i in keywords:
    if type(i) is str:
        isFound = False
        for j in text_list:
            if i in j:
                ls.append(j)
                isFound = True
                break

        if isFound == False:
            ls.append('not found')
    elif type(i) is set:
        isFound = False
        for x in i:
            for j in text_list:
                if x in j:
                    ls.append(j)
                    isFound = True
                    break

            if isFound == True:
                break

        if isFound == False:
            ls.append('not found')

The result looks like this:

['the cat was here today', ' dog is so cute', ' parrots are so colorful', 'not found']
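
The str and set branches above run the same search twice. As a minimal sketch, assuming every entry is first written as a tuple of alternatives, a single helper can cover both cases. The names `keyword_groups` and `first_match` are hypothetical and not part of the answer. The sketch also strips the leading space that splitting on '.' leaves behind, so its output differs slightly from the result shown above.

```python
# Each column gets a tuple of alternative keywords (an assumption of this sketch).
keyword_groups = [('cat',), ('dog',), ('parrot', 'parrots'), ('sheep',)]

text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'


def first_match(group, sentences):
    """Return the first sentence containing any keyword in group, else 'not found'."""
    for sentence in sentences:
        if any(keyword in sentence for keyword in group):
            return sentence
    return 'not found'


# Split into sentences, dropping surrounding whitespace and the empty tail piece.
sentences = [s.strip() for s in text.split('.') if s.strip()]

ls = [first_match(group, sentences) for group in keyword_groups]
print(ls)
```

Like the original, this is plain substring matching, so 'cat' would also match a sentence containing 'category'.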

【Comments】:

【Solution 2】:

You can use a regular expression:

import re
import pandas as pd

keywords = ['cat','dog', ('parrot', 'parrots'), 'sheep']
sentence_regex = r"[^\.]+\s*(\w*)\s*[^\.]+"  # regex to split sentences
# raw string, so that \s is passed to the regex engine instead of being read as a string escape
animal_regex = re.compile(r"^.*\s+({})\s+.*$".format("|".join(w if isinstance(w, str) else "(?:{})".format("|".join(w)) for w in keywords)))

from_animal_to_list = {ww: i for i, w in enumerate(keywords) for ww in (w if isinstance(w, tuple) else [w])}

text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'
data_for_df = [[] for _ in range(max(from_animal_to_list.values()) + 1)]
for sentence, m in map(lambda s: (s.group(0), animal_regex.match(s.group(0))), re.finditer(sentence_regex, text)):
    animal = m.groups()[0] if m is not None else 'not found'
    if animal != 'not found':
        data_for_df[from_animal_to_list[animal]].append(sentence)
data_for_df = [d if len(d) else ['not found'] for d in data_for_df]

sentence_df = pd.DataFrame(
    data=data_for_df,
    # index=keywords,  # uncomment here if you prefer the name of the animals as index
    index=[f'column{i}' for i, _ in enumerate(data_for_df, start=1)],
    columns=['sentence']
)  # you can transpose this dataframe if you prefer, adding '.T' after ')'

valid_sentences = [d[0] if len(d) else 'not found' for d in data_for_df]

valid_sentences is ['the cat was here today', ' dog is so cute', ' parrots are so colorful', 'not found'], and sentence_df is:

                         sentence
column1    the cat was here today
column2            dog is so cute
column3   parrots are so colorful
column4                 not found
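
The pattern above needs whitespace on both sides of the keyword, so it misses a keyword that is the last word of a sentence. A possible variation, sketched here under the assumption that whole-word matching is what is wanted, builds one pattern per column with `\b` word boundaries. The helper name `group_pattern` is hypothetical, and the DataFrame step is left out to keep the sketch standard-library only.

```python
import re

keywords = ['cat', 'dog', ('parrot', 'parrots'), 'sheep']
text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'


def group_pattern(entry):
    """Compile a whole-word pattern for a single keyword or a tuple of alternatives."""
    words = (entry,) if isinstance(entry, str) else entry
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:{})\b'.format(alternatives))


patterns = [group_pattern(entry) for entry in keywords]
sentences = [s.strip() for s in text.split('.') if s.strip()]

valid_sentences = []
for pattern in patterns:
    # Take the first sentence that matches, falling back to 'not found'.
    hit = next((s for s in sentences if pattern.search(s)), 'not found')
    valid_sentences.append(hit)

print(valid_sentences)
```

With word boundaries, 'cat' no longer matches inside a longer word such as 'category', which plain substring tests would accept.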

【Comments】:

【Solution 3】:

How about something like this:

>>> keywords ={1:['cat'], 2:['dog'], 3:['parrot', 'parrots'], 4:['sheep']}
>>> def fun(text,split="."):
        for t in text.split(split):
            for c,v in keywords.items():
                if any(k in t for k in v):
                    yield c,t
                    break

>>>
>>> import collections
>>> res = collections.defaultdict(list)
>>> for c,t in fun(text):
        res[c].append(t)

    
>>> for c in keywords:
        if c not in res:
            res[c].append("not found")

        
>>> print(res)
defaultdict(<class 'list'>, {1: ['the cat was here today'], 2: [' dog is so cute'], 3: [' parrots are so colorful'], 4: ['not found']})
>>>

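Since the same column can have more than one keyword, I put the keywords into a dictionary keyed by column. The generator then checks whether any of a column's keywords occurs in each piece of text and yields the match together with the column it belongs to. The result is built from those pairs, and finally the missing columns are filled in with "not found".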
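
A self-contained variation on the same idea, as a rough sketch: it assumes the dictionary is keyed by the column names from the question and that only the first matching sentence per column is wanted, as the asker states in the comments. The function name `first_sentences` is hypothetical.

```python
keywords = {
    'column1': ['cat'],
    'column2': ['dog'],
    'column3': ['parrot', 'parrots'],
    'column4': ['sheep'],
}
text = 'the cat was here today. my data is very long. dog is so cute. parrots are so colorful.'


def first_sentences(text, keywords, sep='.'):
    """Map each column to the first sentence containing one of its keywords."""
    # Start every column at 'not found' so missing ones need no second pass.
    result = {column: 'not found' for column in keywords}
    for sentence in text.split(sep):
        sentence = sentence.strip()
        for column, words in keywords.items():
            still_empty = result[column] == 'not found'
            if still_empty and any(w in sentence for w in words):
                result[column] = sentence
                break  # each sentence is assigned to at most one column
    return result


row = first_sentences(text, keywords)
for column, sentence in row.items():
    print(column, sentence, sep='\t')
```

A dict of this shape should be easy to hand to pandas as a single row (for example `pd.DataFrame([row])`) if the Excel export is still needed, though that step is not shown here.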

【Comments】: