How to filter a row based on multiple indexes and multiple conditions?

【Question title】: How to filter a row based on multiple indexes and multiple conditions?
【Posted】: 2021-05-06 16:20:09
【Question description】:

I have a file that looks like this:

#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-10,3,London,London,Manchester,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London1,1
2020-09-07T00:00:03.230+02:00,ID-30,3,Madrid,Sevila,Sevilla,1,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
2020-09-07T00:00:03.230+02:00

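Index [2] of each row shows how many cities are in that particular row. So the first row has the value 3 at index [2], i.e. London, Manchester, London.

I am trying to do the following:

For each row, I need to check whether row[3] plus the cities that follow it (based on the city count) are present in cities_to_filter. But this only needs to be done when row[2] is a number. I also need to handle the fact that some rows contain fewer than 2 items.

Here is my code: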

path = r'c:\data\ELK\Desktop\test_data_countries.txt'

cities_to_filter = ['Sevilla', 'Manchester']

def filter_row(row):
    if row[2].isdigit():
        amount_of_cities = int(row[2]) if len(row) > 2 else True
        
    cities_to_check = row[3:3+amount_of_cities]
    
    condition_1 =  any(city in cities_to_check for city in cities_to_filter)    
    return condition_1

with open (path, 'r') as output_file:
    reader = csv.reader(output_file, delimiter = ',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

Now I get the following error:

UnboundLocalError: local variable 'condition_1' referenced before assignment

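【Comments】:

@mhawke. That is the problem.
Why are you accessing row[2] in the main for loop? Shouldn't int(row[2]) be protected by the isdigit() check in filter_row()? If you want to print cities_to_check, do it in filter_row(). If you do that, you will not see that error any more. You will, however, see a NameError raised from the next line when the nonexistent amount_of_cities variable is referenced.
How do you plan to handle the rows that contain GGG in row[2], where the city count should be? Do you want to ignore those rows, or still try to check whether the cities should be filtered? What about the rows that contain only a timestamp? Silently ignore them, dump them to stderr, or abort the program?
@mhawke I solved it. Thanks, bro. I appreciate your help.
I'm not sure that you have :) if row[2].isdigit(): amount_of_cities = int(row[2]) if len(row) > 2 else True is muddled. len(row) is guaranteed to be greater than 2 at that point, because the if condition has already accessed the third item in the row. Also, setting amount_of_cities to True effectively sets it to 1, because True is 1 when treated as an int. Check my answer for a suggested solution.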
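
Both points in the last comment can be checked directly. The snippet below is only a minimal sketch: row is the first sample line from the question, and short_row is an illustrative timestamp-only row like the last two lines of the file.

```python
# bool is a subclass of int, so True behaves like 1 in arithmetic and slicing.
assert int(True) == 1

row = ['2020-09-07T00:00:03.230+02:00', 'ID-10', '3',
       'London', 'Manchester', 'London', '1', '1', '1']

# With the real count (3) the slice covers all three cities.
cities_with_count = row[3:3 + int(row[2])]

# With True as the count the slice covers only the first city.
cities_with_true = row[3:3 + True]

# The length check has to happen before row[2] is touched: a timestamp-only
# row raises IndexError as soon as row[2] is evaluated.
short_row = ['2020-09-07T00:00:03.230+02:00']
try:
    short_row[2].isdigit()
    raised = False
except IndexError:
    raised = True

print(cities_with_count)  # ['London', 'Manchester', 'London']
print(cities_with_true)   # ['London']
print(raised)             # True
```

So the len(row) > 2 test only helps if it runs before the isdigit() call, not inside the branch that the call guards.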

Tags: python list if-statement filter

【Solution 1】:

You could do it like this:

import csv
import sys

def filter_row(row):
    '''Returns True if the row should be removed'''
    if len(row) > 2:
        if row[2].isdigit():
            amount_of_cities = int(row[2]) 
            cities_to_check = row[3:3+amount_of_cities]
        else:
            # don't have valid city count, just try the rest of the row
            cities_to_check = row[3:]
        return any(city in cities_to_check for city in cities_to_filter)

    print(f'Invalid row: {row}', file=sys.stderr)
    return True

with open(path, 'r') as input_file:
    reader = csv.reader(input_file, delimiter=',')
    next(reader)
    for row in reader:
        if filter_row(row):
            print(row)

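filter_row() checks the row length to make sure that a possible city count exists at row[2]. If the count is a number, it is used to compute the upper bound of the slice that extracts the cities to check. Otherwise the row is processed from index 3 to the end, which will include the trailing numeric values, though those are unlikely to match any city name.

If there are too few fields, the row is filtered by returning True and an error message is printed.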
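
For anyone who wants to try the logic without the file on disk, here is a self-contained sketch of the same idea. It rests on a few assumptions: io.StringIO stands in for the real file, only some of the sample rows from the question are included, row_matches is a hypothetical name, and the city list is passed as a parameter instead of being read from a global. Unlike the code above, this variant returns False for rows that are too short, so they are reported on stderr but never printed as matches.

```python
import csv
import io
import sys

# A few sample rows copied from the question; io.StringIO replaces the file.
SAMPLE = """#This is TEST-data
2020-09-07T00:00:03.230+02:00,ID-10,3,London,Manchester,London,1,1,1
2020-09-07T00:00:03.230+02:00,ID-20,2,London,London,1,1
2020-09-07T00:00:03.230+02:00,ID-30,GGG,Madrid,Sevilla,Madrid,1
2020-09-07T00:00:03.230+02:00,ID-40,GGG,Madrid,Barcelona,1,1,1,1
2020-09-07T00:00:03.230+02:00
"""


def row_matches(row, cities_to_filter):
    """Return True if any listed city appears in the row's city fields.

    Rows too short to hold a city count are reported on stderr and
    treated as non-matching (a choice made for this sketch only).
    """
    if len(row) <= 2:
        print(f'Invalid row: {row}', file=sys.stderr)
        return False
    if row[2].isdigit():
        # Valid count: look only at that many fields after the count.
        cities = row[3:3 + int(row[2])]
    else:
        # No valid count: fall back to the rest of the row.
        cities = row[3:]
    return any(city in cities for city in cities_to_filter)


reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the '#This is TEST-data' header line
matched_ids = [row[1] for row in reader
               if row_matches(row, ['Sevilla', 'Manchester'])]
print(matched_ids)  # ['ID-10', 'ID-30']
```

Whether invalid rows should count as matches is a policy decision; flip the return value in the first branch to get the behaviour of the original answer.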

【Comments】:

【Solution 2】:

I suggest you filter before optimizing everything. Here is the start of a path you should explore:

import numpy as np
import pandas as pd

test_data = pd.DataFrame({'ID':['ID-10','ID-10','ID-20','ID-20','ID-30','ID-30','ID-40'],'id':[3,3,2,2,3,'GGG','GGG'],'cities':[['London','Manchester','London',1,1,1],['London','Manchester','London',1,1],['London','London',1,1],['London','London',1,1],['Madrid','Sevilla','Sevilla',1,1,1],['Madrid','Sevilla','Sevilla',1],['Madrid','Barçelona',1]]})

cities_to_filter = ['Sevilla', 'Manchester']
_condition1 = test_data.index.isin(test_data[test_data.id.str.isnumeric() != False][test_data[test_data.id.str.isnumeric() != False].id > 2].index)
test_data['results'] = np.where( _condition1,1,0)
test_data

Output:

Then you apply an "any() in" check to filter the cities, but there are many ways to do it.
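
One possible way to apply that check is a row-wise DataFrame.apply. This is only a sketch, not necessarily what the answer had in mind: the small frame, the column names count and cities, and the helper has_filtered_city are all made up for illustration, and a row-wise apply is slow on large frames.

```python
import numbers

import pandas as pd

# Hypothetical miniature of the data: a count column that may hold a
# non-numeric marker, and a list of fields per row.
df = pd.DataFrame({
    'count': [3, 2, 'GGG'],
    'cities': [
        ['London', 'Manchester', 'London', 1, 1, 1],
        ['London', 'London', 1, 1],
        ['Madrid', 'Sevilla', 'Madrid', 1],
    ],
})
cities_to_filter = ['Sevilla', 'Manchester']


def has_filtered_city(record):
    """Check the first `count` fields if count is an integer, else all fields."""
    count = record['count']
    if isinstance(count, numbers.Integral):
        cities = record['cities'][:count]
    else:
        cities = record['cities']
    return any(city in cities for city in cities_to_filter)


# axis=1 passes each row to the helper as a Series.
df['match'] = df.apply(has_filtered_city, axis=1)
print(df['match'].tolist())  # [True, False, True]
```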

【Comments】: