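Openpyxl - combine matching rows of two tables into one long row

[Question title]: Openpyxl - combine matching rows of two tables into one long row
[Posted]: 2021-06-16 01:41:24
[Question description]:

In one Excel file I have two large tables. Table A ("Dissection", 409 rows x 25 columns) contains unique entries, each identified by a unique ID. Table B ("Damage", 234 rows x 39 columns) uses the IDs of table A in its first cell and extends them. To analyze the data in Minitab, all data must be in one long row, which means the "Damage" values must follow the "Dissection" values. The whole thing looks like this: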

Table A - i.e. Dissection
- ID1 [valueTabA] [valueTabA]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA]

Table B - i.e. Damage
- ID1 [valueTabB1] [valueTabB1]
- ID1 [valueTabB2] [valueTabB2]
- ID4 [valueTabB] [valueTabB]

They should be combined into something like this:

Table A
- ID1 [valueTabA] [valueTabA] [valueTabB1] [valueTabB1] [valueTabB2] [valueTabB2]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA] [valueTabB] [valueTabB]

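What is the best way to do this?

Below I describe my two approaches. Both use the same data in the same tables, but in two different files, so that I can test both scenarios.

The first approach uses a file with both tables in the same worksheet; the second uses a file with the tables in different worksheets.

Scenario 1: both tables are in the same worksheet, and I try to move the rows as a range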

current_row = 415 # start without headers of table A
current_line = 2 # start without headers of table B


for row in ws.iter_rows(min_row=415, max_row=647):
    # loop through damage

    id_A = ws.cell(row=current_row, column=1).value
    max_col = 25

    for line in ws.iter_rows(min_row=2, max_row=409):
        # loop through dissection

        id_B = ws.cell(row=current_line, column=1).value

        if id_A == id_B:
            copy_range = ((ws.cell(row=current_line, column=2)).column_letter + str(current_line) + ":" +
                          (ws.cell(row=current_line, column=39)).column_letter + str(current_line))

            ws.move_range(copy_range, rows=current_row, cols=max_col+1)
            print("copied range: " + copy_range +" to: " + str(current_row) + ":"+str(max_col+1))
            count += 1
            break

        if current_line > 409:
            current_line = 2
        else:
            current_line += 1

    current_row += 1

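-> Here I am struggling to append the range to the correct row of table A without overwriting the previously appended range (see ID1 in the example above).

Scenario 2: the two tables are in different worksheets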

    dissection = wb["Dissection"]
    damage = wb["Damage"]
    recovery = wb["Recovery"]
    
    current_row, current_line = 2, 2
    
    for row in damage.iter_rows():
        # loop through first table
    
        id_A = damage.cell(row=current_row, column=1).value
    
        for line in dissection.iter_rows():
            # loop through second table
    
            id_B = dissection.cell(row=current_line, column=1).value
            copyData = []
    
            if id_A == id_B:
    
                for col in range(2, 39):
                    # add data to the list, skipping the ID
                    copyData.append(damage.cell(row=current_line, column=col).value)
    
                # print(copyData) for debugging purposes
    
                for item in copyData:
                    column_count = dissection.max_column
    
                    dissection.cell(row=current_row, column=column_count).value = item
                    column_count += 1
    
                current_row += 1
                break
    
            if not current_line > 409:
                # prevent looping out of range
                current_line += 1
            else:
                current_line = 2

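-> Same problem as in 1. At some point it no longer adds the damage values to copyData but None instead, and in the end it simply does not paste the items (the cells stay blank).

I have tried everything Excel-related I could find, but unfortunately nothing worked. Would pandas be more useful here, or am I just missing something?

Thanks for taking the time to read this :)

[Question comments]:

Tags: python python-3.x excel pandas openpyxl

[Solution 1]:

I strongly recommend using pandas in this case. It is not clear how your data is laid out in the Excel file, but given your second option I assume the tables are on different worksheets of the same file. I also assume that the first row contains the table title (e.g. Table A - i.e. Dissection). If that is not the case, just remove skiprows=1: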

import pandas as pd

df = pd.concat(pd.read_excel("filename.xlsx", sheet_name=None, skiprows=1, header=None), axis=1, ignore_index=True)
df.to_excel('combined_data.xlsx')  # save to Excel

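read_excel loads the Excel file into pandas DataFrames. sheet_name=None means that all worksheets are loaded into a dictionary (OrderedDict) of DataFrames. pd.concat joins these DataFrames into a single DataFrame (axis=1 places them side by side, along the columns). You can explore the data with df.head(), or save the DataFrame to Excel with df.to_excel.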
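
To make the behavior of that call concrete, here is a minimal sketch that replaces the Excel file with two small in-memory DataFrames. The sample values and the names dissection, damage and sheets are made up for illustration; the dict only mimics what read_excel returns with sheet_name=None.

```python
import pandas as pd

# Hypothetical stand-ins for the two worksheets (no Excel file needed).
dissection = pd.DataFrame([["ID1", "a1"], ["ID2", "a2"], ["ID3", "a3"]])
damage = pd.DataFrame([["ID1", "b1"], ["ID3", "b3"]])

# Same shape as the result of read_excel(..., sheet_name=None): sheet name -> DataFrame.
sheets = {"Dissection": dissection, "Damage": damage}

# axis=1 puts the frames side by side and aligns them on the row index,
# i.e. by row position here, not by the ID stored in the first column.
combined = pd.concat(sheets, axis=1, ignore_index=True)
print(combined)
```

Note that the second Damage row (ID3) ends up next to ID2, because the rows are paired by position rather than by ID.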

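[Comments]:

I edited the question to clarify. I am using two files (containing the same data) to test the two scenarios. The first approach has both tables in the same worksheet, the second has them in different worksheets. Your answer does concatenate the tables, but it pastes table B after table A without arranging the rows of table B according to their ID.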
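
One possible way to get the ID-based layout the comment asks for with pandas is sketched below. It is not part of either answer: the column names (ID, a1, a2, b1, b2) and the sample values are assumptions, and the real sheets would be loaded with read_excel instead. The idea is to number repeated IDs in the Damage table, spread each occurrence into its own block of columns, and then join the result onto the Dissection table by ID.

```python
import pandas as pd

# Hypothetical sample data shaped like the example in the question.
dissection = pd.DataFrame(
    {"ID": ["ID1", "ID2", "ID3", "ID4"], "a1": [1, 2, 3, 4], "a2": [5, 6, 7, 8]}
)
damage = pd.DataFrame(
    {"ID": ["ID1", "ID1", "ID4"], "b1": [10, 20, 30], "b2": [11, 21, 31]}
)

# Number the occurrences of each ID (0, 1, ...) so repeated IDs can be told apart.
damage = damage.assign(n=damage.groupby("ID").cumcount())

# One row per ID; the columns become (value column, occurrence) pairs.
wide = damage.set_index(["ID", "n"]).unstack("n")

# Reorder so that all values of one Damage row stay next to each other
# (sorted() is stable, so the original column order is kept within an occurrence).
order = sorted(wide.columns, key=lambda col: col[1])
wide = wide[order]
wide.columns = [f"{name}_{n + 1}" for name, n in wide.columns]

# Attach the Damage values to the matching Dissection row; IDs without damage get NaN.
merged = dissection.join(wide, on="ID")
print(merged)
```

The merged frame can then be written out with merged.to_excel(...). IDs that occur only in the Damage table are dropped by this left join.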

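[Solution 2]:

I ended up using scenario 2 (one file, two worksheets), but this code should also work for scenario 1 (one file, one worksheet).

I copied the rows of table B using the code from here.
And I handled the offset using the code from here.

In addition, I added some extra features to my solution to make it more generic: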

import openpyxl, os
from openpyxl.utils import range_boundaries

# Introduction
print("Welcome!\n[!] Advice: Always have a backup of the file you want to sort.\n[+] Please put the file to be sorted in the same directory as this program.")
print("[+] This program assumes that the value to be sorted by is located in the first column of the outgoing table.")

# File listing
while True:
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
    # formats openpyxl can load; the legacy .xls/.xlt formats are not supported
    valid_types = ["xlsx", "xlsm", "xltx", "xltm"]
    print("\n[+] Current directory: " + os.getcwd())
    print("[+] Excel files in the current directory: ")
    for f in files:
        # splitext also copes with names that have no dot or several dots
        if os.path.splitext(f)[1].lstrip(".").lower() in valid_types:
            print(f)
    file = input("\nWhich file would you like to sort: ")
    if os.path.splitext(file)[1].lstrip(".").lower() in valid_types:
        break
    print("Please only enter Excel files.")
wb = openpyxl.load_workbook(file)

# Handling Worksheets
print("\nAvailable Worksheets: " + str(wb.sheetnames))
print("Which sheet would you like to sort? (please copy the name without the quotes)")
outgoing_sheet = wb[input("Outgoing sheet: ")]
print("\nAvailable Worksheets: " + str(wb.sheetnames))
print("Which is the receiving sheet? (please copy the name without the quotes)")
receiving_sheet = wb[input("Receiving sheet: ")]


# Declaring functions
def copy_row(source_range, target_start, source_sheet, target_sheet):
    # Define start Range(target_start) in the new Worksheet
    min_col, min_row, max_col, max_row = range_boundaries(target_start)

    # Iterate Range you want to copy
    for row, row_cells in enumerate(source_sheet[source_range], min_row):
        for column, cell in enumerate(row_cells, min_col):
            # Copy Value from Copy.Cell to given Worksheet.Cell
            target_sheet.cell(row=row, column=column).value = cell.value


def ask_yes_no(prompt):
    """
    :param prompt: The question to be asked
    :return: Value to check
    """
    while True:
        answer = input(prompt + " (y/n): ")

        if answer == "y":
            return True
        elif answer == "n":
            return False

        print("Please only enter y or n.")


def ask_integer(prompt):
    while True:
        try:
            answer = int(input(prompt + ": "))
            break
        except ValueError:
            print("Please only enter integers (e.g. 1, 2 or 3).")
    return answer


def scan_empty(index):
    print("Scanning for empty cells...")
    scan, fill = False, False
    min_col = outgoing_sheet.min_column
    max_col = outgoing_sheet.max_column
    cols = range(min_col, max_col+1)
    break_loop = False
    count = 0

    if not scan:
        search_index = index
        for row in outgoing_sheet.iter_rows():
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                else:
                    choice = ask_yes_no("\n[!] Empty cells found, would you like to fill them? (recommended)")
                    if choice:
                        fill = input("Fill with: ")
                        scan = True
                        break_loop = True
                        break
                    else:
                        print("[!] Attention: This can lead to mismatches in the sorting algorithm.")
                        confirm = ask_yes_no("[>] Are you sure you don't want to fill them?\n[+] Hint: You can also enter spaces.\n(n)o I really don't want to\noka(y) I'll enter something, just let me sort already.\n")
                        if confirm:
                            fill = input("Fill with: ")
                            scan = True
                            break_loop = True
                            break
                        else:
                            print("You have chosen not to fill the empty cells.")
                            scan = True
                            break_loop = True
                            break
            if break_loop:
                break
            search_index += 1

    if fill:
        search_index = index
        for row in outgoing_sheet.iter_rows(max_row=outgoing_sheet.max_row-1):
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                elif cell != 0:
                    # falsy but not 0, i.e. None or an empty string: fill it
                    count += 1
                    outgoing_sheet.cell(row=search_index, column=n).value = fill

            search_index += 1

        print("Filled " + str(count) + " cells with: " + fill)

    return fill, count


# Declaring basic variables
first_value = ask_yes_no("Is the first row containing values the 2nd in both tables?")
if first_value:
    current_row, current_line = 2, 2
else:
    current_row = ask_integer("Sorting table first row")
    current_line = ask_integer("Receiving table first row")
verbose = ask_yes_no("Verbose output?")
reset = current_line
rec_max = receiving_sheet.max_row
scan_empty(current_row)
count = 0

print("\nSorting: " + str(outgoing_sheet.max_row - 1) + " rows...")
for row in outgoing_sheet.iter_rows():
    # loop through first table - Table you want to sort
    id_A = outgoing_sheet.cell(row=current_row, column=1).value

    if verbose:
        print("\nCurrently at: " + str(current_row - 1) + "/" + str(outgoing_sheet.max_row - 1) + "")
        try:
            print("Sorting now: " + id_A)
        except TypeError:
            # Handling None type exceptions
            pass

    for line in receiving_sheet.iter_rows():
        # loop through second table - The receiving table
        id_B = receiving_sheet.cell(row=current_line, column=1).value

        if id_A == id_B:
            # calculate the offset: one column past the last non-empty cell of the receiving row
            # (default=0 covers a completely empty row, where max() would otherwise raise ValueError)
            offset = max((cell.column for cell in receiving_sheet[current_line] if cell.value is not None), default=0) + 1

            start_paste_from = receiving_sheet.cell(row=current_line, column=offset).column_letter + str(current_line)
            copy_Range = ((outgoing_sheet.cell(row=current_row, column=2)).column_letter + str(current_row) + ":" +
                          (outgoing_sheet.cell(row=current_row, column=outgoing_sheet.max_column)).column_letter + str(current_row))
            #  Don't copy the ID, alternatively set damage.min_column for the first and damage.max_column for the second

            copy_row(copy_Range, start_paste_from, outgoing_sheet, receiving_sheet)
            count += 1
            current_row += 1

            if verbose:
                print("Copied " + copy_Range + " to: " + str(start_paste_from))

            break

        if not current_line > rec_max:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = reset

wb.save(file)
print("\nSorted: " + str(count) + " rows.")
print("Saving the file to: " + os.getcwd())
print("Done.")
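
For readers who only need the core idea without the interactive prompts, the following is a condensed sketch and not the script above. It builds a small workbook in memory with made-up values, and it tracks the next free column per row in a dictionary instead of recomputing the offset from the cell contents. It assumes that row 1 holds the headers and that table A is rectangular, so every row starts appending at the same column.

```python
from openpyxl import Workbook

wb = Workbook()
receiving = wb.active
receiving.title = "Dissection"
outgoing = wb.create_sheet("Damage")

# Hypothetical sample data; row 1 holds the headers in both sheets.
for row in [["ID", "a1", "a2"], ["ID1", 1, 2], ["ID2", 3, 4], ["ID4", 5, 6]]:
    receiving.append(row)
for row in [["ID", "b1", "b2"], ["ID1", 10, 11], ["ID1", 20, 21], ["ID4", 30, 31]]:
    outgoing.append(row)

# Map every ID in the receiving sheet to its row number (one pass, no nested loop).
row_of = {row[0].value: row[0].row for row in receiving.iter_rows(min_row=2)}
# Next free column per receiving row, so repeated IDs append instead of overwriting.
next_col = {r: receiving.max_column + 1 for r in row_of.values()}

for rid, *values in outgoing.iter_rows(min_row=2, values_only=True):
    target = row_of.get(rid)
    if target is None:
        continue  # ID has no matching row in the receiving sheet
    for value in values:
        receiving.cell(row=target, column=next_col[target], value=value)
        next_col[target] += 1

result = [list(r) for r in receiving.iter_rows(min_row=2, values_only=True)]
print(result)
```

With a real file, the workbook would come from openpyxl.load_workbook and be written back with wb.save, as in the script above.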

Note: In my case the values of table B ("Damage") were sorted by ID, although this is not required. If you want to sort them anyway, you can do it with pandas.

import pandas as pd

df = pd.read_excel("excel/separated.xlsx","Damage")
# open the correct worksheet

# sort_values returns a new DataFrame, so the result has to be assigned
df = df.sort_values(by="Identification")
df.to_excel("sorted.xlsx")
