【Question title】: How to find duplicates in a Python list that are adjacent to each other and list them with respect to their indices?
【Posted】: 2015-09-11 21:36:48
【Question】:

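I have a program that reads a .csv file, checks each row for a column-count mismatch (by comparing it to the header fields), and returns everything it finds as a list (which is then written to a file). What I want to do with this list is report the results as follows:

row numbers where the same mismatch was found : number of columns in those rows

E.g.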

rows: n-m : y

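where n and m are the first and last row numbers of a run of rows that share the same number of columns, and that number does not match the header.

I have looked into these topics, and while the information is useful, they do not answer the question: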

Find and list duplicates in a list?

Identify duplicate values in a list in Python

This is where I am now:

import csv

# `data` is the already opened .csv file and `file` is the output file
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # adds column length to a list
    columns.append(len(row))

for a in range(len(columns)):
    # checks if the current member matches the header length of columns
    if columns[a] != columns[0]:
        # if it doesn't, write the row and the amount of columns in that row to a file
        file.write("row  " + str(a + 1) + ": " + str(columns[a]) + " \n")

The file output looks like this:

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0 
row  7224: 0 
row  7225: 1 
row  7226: 1

whereas the desired end result is

rows 7220 - 7224 : 0
rows 7225 - 7226 : 1

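So what I essentially need is a dictionary where the key is the rows sharing the duplicate value and the value is the number of columns in said mismatch. What I basically thought I needed (in horribly written pseudocode, which makes no sense now that I read it years after writing this question) is here: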

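def pseudoList(originalList):
    # group the indices of adjacent equal values into runs
    ListOfLists = []
    if not originalList:
        return ListOfLists
    duplicateList = [0]
    i = 1
    while i < len(originalList):
        if originalList[i] == originalList[i-1]:
            duplicateList.append(i)
        else:
            # the value changed, so close the current run and start a new one
            ListOfLists.append(duplicateList)
            duplicateList = [i]
        i += 1
    ListOfLists.append(duplicateList)
    return ListOfLists


def PseudocreateDict(ListOfLists, originalList):
    pseudoDict = {}
    for x in ListOfLists:
        a = x[0]    # this is the first index in the run
        b = x[-1]   # this is the last index in the run
        # key: 1-based row range, value: the column count shared by the run
        pseudoDict['{} - {}'.format(a + 1, b + 1)] = originalList[a]
    return pseudoDict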

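However, this seems like a very complicated way of doing what I want, so I was wondering whether there is a) a more efficient way or b) a simpler way to do this?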

【Question comments】:

Tags: python list csv dictionary duplicates

【Solution 1】:

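You can use a list comprehension to return the elements of the columns list that differ from the element before them; these are the start points of your ranges. Then enumerate these ranges and print/write out the ones whose value differs from the first (header) element. One extra element is appended to the list of ranges to mark the end index of the list, which avoids indexing out of range.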

columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

# each entry is [0-based index where a new run starts, value of that run]
ranges = [[i+1, v] for i,v in enumerate(columns[1:]) if columns[i] != columns[i+1]]
ranges.append([len(columns),0]) # special case for last element
for i,v in enumerate(ranges[:-1]):
    if v[1] != columns[0]:
        # a run ends just before the next one starts
        print("rows {} - {} : {}".format(v[0]+1, ranges[i+1][0], v[1]))

Output:

rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
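
The same grouping of adjacent equal values can also be sketched with `itertools.groupby` from the standard library, which is a different technique from the list comprehension above. This is only a minimal sketch: the helper name `mismatch_runs` is hypothetical and not part of the original answer, and it assumes `columns[0]` is the header's column count.

```python
from itertools import groupby


def mismatch_runs(columns):
    """Yield (first_row, last_row, count) for each run of adjacent rows
    whose column count differs from the header (columns[0]).
    Row numbers are 1-based."""
    if not columns:
        return
    header = columns[0]
    row = 1  # 1-based number of the first row in the current run
    for count, group in groupby(columns):
        length = sum(1 for _ in group)  # how many adjacent rows share `count`
        if count != header:
            yield row, row + length - 1, count
        row += length


columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
for first, last, count in mismatch_runs(columns):
    print("rows {} - {} : {}".format(first, last, count))
```

For the sample list this prints the same four lines as the output above.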

【Comments】:

【Solution 2】:

You can also try the following code -

b = len(columns)
start = None  # 1-based row where the current mismatch run began, None if no run is open
check = None  # column count shared by the rows of the current run
for a in range(b + 1):
    # the extra iteration (a == b) flushes a run that reaches the end of the list
    current = columns[a] if a < b else None
    if start is not None and current == check:
        continue
    if start is not None:
        # the run ended at row a (1-based), so write it out with its own column count
        if start != a:
            file.write("row  " + str(start) + " - " + str(a) + ": " + str(check) + " \n")
        else:
            file.write("row  " + str(start) + ": " + str(check) + " \n")
        start = None
    if a < b and current != columns[0]:
        # the row does not match the header length, so a new run starts here
        start = a + 1
        check = current
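
The question asks for a dictionary keyed by row ranges. As a minimal sketch of a single-pass loop in the same spirit as the code above, the function below collects the runs into a dict instead of writing them to a file. The name `runs_to_dict` and the `'first - last'` key format are assumptions made for illustration, not part of the original answer.

```python
def runs_to_dict(columns):
    """Map 'first - last' row ranges (1-based) to the shared column count
    for every run of adjacent rows that differs from the header."""
    result = {}
    if not columns:
        return result
    header = columns[0]
    start = 0  # 0-based index where the current run began
    for a in range(1, len(columns) + 1):
        # a == len(columns) acts as a sentinel that closes the last run
        if a == len(columns) or columns[a] != columns[start]:
            if columns[start] != header:
                result["{} - {}".format(start + 1, a)] = columns[start]
            start = a
    return result


print(runs_to_dict([3, 3, 0, 0, 0, 1, 1, 3]))
# {'3 - 5': 0, '6 - 7': 1}
```

A run of a single row gets a key such as `'4 - 4'`; whether that should be shortened to `'4'` is a formatting choice left open here.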

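【Comments】:

【Solution 3】:

What you want to do is a map/reduce operation, but without the sorting that normally happens between the map and the reduce steps.

If you output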

row  7220: 0 
row  7221: 0 
row  7222: 0 
row  7223: 0

to standard output, you can pipe this data into another Python program that produces the groups you want.

The second Python program could look like this:

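import sys
import re


# the first line opens the first group
line = sys.stdin.readline()
first_rowid, last_diff = re.findall(r'(\d+)', line)
prev_rowid = first_rowid

for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    # a group ends when the value changes or the row numbers stop being consecutive
    if diff != last_diff or int(rowid) != int(prev_rowid) + 1:
        print("rows {} - {} : {}".format(first_rowid, prev_rowid, last_diff))
        first_rowid, last_diff = rowid, diff
    prev_rowid = rowid

# flush the final group
print("rows {} - {} : {}".format(first_rowid, prev_rowid, last_diff))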

You can run them like this in a unix environment to save the output to a file:

python yourprogram.py | python myprogram.py > youroutputfile.dat

If you cannot run it in a unix environment, you can still use the algorithm I wrote inside your own program with a few modifications.

【Comments】:
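
One possible way to use the same streaming algorithm inside a single program is sketched below. It assumes the mismatches are available as `(row_number, column_count)` pairs sorted by row number; the function name `group_mismatches` is hypothetical, and the check for consecutive row numbers is an added assumption so that two separate runs with the same count are not merged.

```python
def group_mismatches(pairs):
    """Group (row_number, column_count) pairs into runs of consecutive
    rows that share the same count. `pairs` must be sorted by row number.
    Returns a list of (first_row, last_row, column_count) tuples."""
    groups = []
    first = prev = value = None
    for row, count in pairs:
        if first is not None and count == value and row == prev + 1:
            prev = row  # the row extends the current run
            continue
        if first is not None:
            groups.append((first, prev, value))  # close the previous run
        first, prev, value = row, row, count
    if first is not None:
        groups.append((first, prev, value))  # flush the final run
    return groups


pairs = [(7220, 0), (7221, 0), (7222, 0), (7223, 0), (7224, 0),
         (7225, 1), (7226, 1)]
for first, last, value in group_mismatches(pairs):
    print("rows {} - {} : {}".format(first, last, value))
```

With the sample rows from the question this prints `rows 7220 - 7224 : 0` followed by `rows 7225 - 7226 : 1`.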