【Question title】: Openpyxl - combine matching rows of two tables into one long row
【Posted】: 2021-06-16 01:41:24
【Question】:

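In one Excel file I have two large tables. Table A ("Dissection", 409 rows x 25 columns) contains unique entries, each identified by a unique ID. Table B ("Damage", 234 rows x 39 columns) uses the ID from Table A in its first cell and extends that entry. To analyze the data in Minitab, all data for an entry must be in one long row, which means the "Damage" values must follow the "Dissection" values. The whole thing looks like this: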

Table A - i.e. Dissection
- ID1 [valueTabA] [valueTabA]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA]

Table B - i.e. Damage
- ID1 [valueTabB1] [valueTabB1]
- ID1 [valueTabB2] [valueTabB2]
- ID4 [valueTabB] [valueTabB]

They should be combined into something like this:

Table A
- ID1 [valueTabA] [valueTabA] [valueTabB1] [valueTabB1] [valueTabB2] [valueTabB2]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA] [valueTabB] [valueTabB]

What is the best way to do this?


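My two approaches are described below. Both use the same data in the same tables, but in two different files, so that I can test both scenarios.

The first approach uses a file with both tables in the same worksheet; the second uses a file with the two tables in different worksheets.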


  1. Scenario: both tables are in the same worksheet, and I try to move the rows as a range
current_row = 415 # start without headers of table A
current_line = 2 # start without headers of table B


for row in ws.iter_rows(min_row=415, max_row=647):
    # loop through damage

    id_A = ws.cell(row=current_row, column=1).value
    max_col = 25

    for line in ws.iter_rows(min_row=2, max_row=409):
        # loop through dissection

        id_B = ws.cell(row=current_line, column=1).value

        if id_A == id_B:
            copy_range = ((ws.cell(row=current_line, column=2)).column_letter + str(current_line) + ":" +
                          (ws.cell(row=current_line, column=39)).column_letter + str(current_line))

            ws.move_range(copy_range, rows=current_row, cols=max_col+1)
            print("copied range: " + copy_range +" to: " + str(current_row) + ":"+str(max_col+1))
            count += 1
            break

        if current_line > 409:
            current_line = 2
        else:
            current_line += 1

    current_row += 1

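-> Here I am struggling to append the range to the correct row of Table A without overwriting what was appended before (see ID1 in the example above)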


  2. Scenario: the two tables are in different worksheets
    dissection = wb["Dissection"]
    damage = wb["Damage"]
    recovery = wb["Recovery"]
    
    current_row, current_line = 2, 2
    
    for row in damage.iter_rows():
        # loop through first table
    
        id_A = damage.cell(row=current_row, column=1).value
    
        for line in dissection.iter_rows():
            # loop through second table
    
            id_B = dissection.cell(row=current_line, column=1).value
            copyData = []
    
            if id_A == id_B:
    
                for col in range(2, 39):
                    # add data to the list, skipping the ID
                    copyData.append(damage.cell(row=current_line, column=col).value)
    
                # print(copyData) for debugging purposes
    
                for item in copyData:
                    column_count = dissection.max_column
    
                    dissection.cell(row=current_row, column=column_count).value = item
                    column_count += 1
    
                current_row += 1
                break
    
            if not current_line > 409:
                # prevent looping out of range
                current_line += 1
            else:
                current_line = 2

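-> Same problem as in 1. In addition, at some point it stops adding the damage values to copyData and adds None instead, and in the end it simply does not paste the items (the cells stay blank)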


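I have tried everything Excel-related that I could find, but unfortunately nothing worked. Would pandas be more useful here, or am I just missing something?

Thank you for taking the time to read this :)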

【Comments】:

    Tags: python python-3.x excel pandas openpyxl


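    【Solution 1】:

    I strongly recommend using pandas in this case. It is not clear how your data is laid out in the Excel file, but given your second option I assume the tables are on different worksheets of the same file. I also assume that the first row contains the table title (e.g. Table A - i.e. Dissection). If that is not the case, just remove skiprows=1.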

    import pandas as pd
    
    df = pd.concat(pd.read_excel("filename.xlsx", sheet_name=None, skiprows=1, header=None), axis=1, ignore_index=True)
    df.to_excel('combined_data.xlsx')  # save to Excel
    

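    read_excel loads the Excel file into pandas DataFrames. sheet_name=None means that all worksheets are loaded into a dict of DataFrames keyed by sheet name (an OrderedDict in older pandas versions). pd.concat joins these DataFrames into a single DataFrame (axis=1 places them side by side, i.e. column-wise). You can browse the data with df.head(), or save the DataFrame to Excel with df.to_excel.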
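
To make the effect of axis=1 concrete, here is a minimal sketch that skips the Excel file and uses two small in-memory DataFrames as hypothetical stand-ins for the sheets (the values are made up). It shows that the frames are lined up by row position, not by the ID in the first column:

```python
import pandas as pd

# Hypothetical stand-ins for the dict returned by read_excel(..., sheet_name=None, header=None)
sheets = {
    "Dissection": pd.DataFrame([["ID1", "a1"], ["ID2", "a2"], ["ID3", "a3"], ["ID4", "a4"]]),
    "Damage": pd.DataFrame([["ID1", "b1"], ["ID1", "b2"], ["ID4", "b3"]]),
}

# Same call shape as in the answer: the frames are placed side by side,
# aligned on the row index (0, 1, 2, ...), and the columns are renumbered 0..3.
df = pd.concat(sheets, axis=1, ignore_index=True)
print(df)
```

In the printed frame, the second row pairs ID2 from the first sheet with the second ID1 row of the other sheet, and the last row is padded with NaN because the second table is shorter.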

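    【Comments】:

    • I edited the question to clarify. I am using two files (containing the same data) to test the two scenarios. The first approach has both tables in the same worksheet, the second has the tables in different worksheets. Your answer does concatenate the tables, but it pastes Table B behind Table A without matching the rows of Table B to the IDs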
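
One possible way to get ID-based matching with pandas is sketched below. This is not the code from either answer; the column names (ID, A1, B1, ...) and the sample values are assumptions for illustration. The idea is to number repeated IDs in Table B, spread each ID's rows into one wide row, and left-merge the result onto Table A:

```python
import pandas as pd

# Hypothetical sample data shaped like the example in the question
table_a = pd.DataFrame(
    [["ID1", "a11", "a12"], ["ID2", "a21", "a22"], ["ID3", "a31", "a32"], ["ID4", "a41", "a42"]],
    columns=["ID", "A1", "A2"],
)
table_b = pd.DataFrame(
    [["ID1", "b11", "b12"], ["ID1", "b21", "b22"], ["ID4", "b31", "b32"]],
    columns=["ID", "B1", "B2"],
)

# Number repeated IDs: 0 for the first row of an ID, 1 for the second, ...
table_b = table_b.assign(n=table_b.groupby("ID").cumcount())

# One row per ID; columns become (field, n) pairs
wide = table_b.set_index(["ID", "n"]).unstack("n")

# Reorder to (n, field) so each original Table B row stays contiguous
wide = wide.swaplevel(axis=1).sort_index(axis=1)
wide.columns = [f"{field}_{n + 1}" for n, field in wide.columns]

# Left merge keeps every Table A row, matched or not
combined = table_a.merge(wide.reset_index(), on="ID", how="left")
print(combined)
```

IDs without a match in Table B (ID2 and ID3 here) end up with NaN in the added columns, which is written to Excel as empty cells.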
    【Solution 2】:

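    I ended up using scenario 2 (one file, two worksheets), but this code should also work for scenario 1 (one file, one worksheet).

    1. I copied the rows of Table B using the code from here.
    2. And handled the offset using the code from here.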
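
The core of these two steps (copy the non-ID values of a Table B row, then paste them after the last filled cell of the matching Table A row) can be sketched without any spreadsheet I/O. The function below is an illustration, not the code used in this answer: it swaps the nested row scan for a dictionary lookup, and `combine_rows` and the sample values are hypothetical. With openpyxl, the inputs could come from `sheet.iter_rows(min_row=2, values_only=True)` and each combined row could be written to a new sheet with `sheet.append(row)`.

```python
from collections import defaultdict


def combine_rows(table_a, table_b):
    """Append the non-ID values of each Table B row to the Table A row with the same ID."""
    extras = defaultdict(list)
    for row_id, *values in table_b:
        # Repeated IDs keep accumulating, in the order they appear in Table B
        extras[row_id].extend(values)
    # Table A rows without a match are returned unchanged
    return [list(row) + extras.get(row[0], []) for row in table_a]


# Hypothetical sample data shaped like the example in the question
table_a = [("ID1", "a11", "a12"), ("ID2", "a21", "a22"), ("ID3", "a31", "a32"), ("ID4", "a41", "a42")]
table_b = [("ID1", "b11", "b12"), ("ID1", "b21", "b22"), ("ID4", "b31", "b32")]

combined = combine_rows(table_a, table_b)
for row in combined:
    print(row)
```

Because the values are appended to a list per ID, the "offset" never has to be computed, and Table B does not need to be sorted by ID first.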

    In addition, I added some extra features to my solution to make it more generic:

    import openpyxl, os
    from openpyxl.utils import range_boundaries
    
    # Introduction
    print("Welcome!\n[!] Advice: Always have a backup of the file you want to sort.\n[+] Please put the file to be sorted in the same directory as this program.")
    print("[+] This program assumes that the value to be sorted by is located in the first column of the outgoing table.")
    
    # File listing
    while True:
        files = [f for f in os.listdir('.') if os.path.isfile(f)]
        valid_types = ["xlsx", "xlsm", "xltx", "xltm"]  # formats openpyxl can load (legacy .xls/.xlt are not supported)
        print("\n[+] Current directory: " + os.getcwd())
        print("[+] Excel files in the current directory: ")
        for f in files:
            if os.path.splitext(f)[1].lstrip(".").lower() in valid_types:  # also safe for names without an extension
                print(f)
        file = input("\nWhich file would you like to sort: ")
        try:
            ending = file.split(".")[1]
        except IndexError:
            print("please only enter excel files.")
            continue
        if ending in valid_types:
            break
        else:
            print("Please only enter excel files")
    wb = openpyxl.load_workbook(file)
    
    # Handling Worksheets
    print("\nAvailable Worksheets: " + str(wb.sheetnames))
    print("Which sheet would you like to sort? (please copy the name without the brackets)")
    outgoing_sheet = wb[input("Outgoing sheet: ")]
    print("\nAvailable Worksheets: " + str(wb.sheetnames))
    print("Which is the receiving sheet? (please copy the name without the brackets)")
    receiving_sheet = wb[input("Receiving sheet: ")]
    
    
    # Declaring functions
    def copy_row(source_range, target_start, source_sheet, target_sheet):
        # Define start Range(target_start) in the new Worksheet
        min_col, min_row, max_col, max_row = range_boundaries(target_start)
    
        # Iterate Range you want to copy
        for row, row_cells in enumerate(source_sheet[source_range], min_row):
            for column, cell in enumerate(row_cells, min_col):
                # Copy Value from Copy.Cell to given Worksheet.Cell
                target_sheet.cell(row=row, column=column).value = cell.value
    
    
    def ask_yes_no(prompt):
        """
        :param prompt: The question to be asked
        :return: Value to check
        """
        while True:
            answer = input(prompt + " (y/n): ")
    
            if answer == "y":
                return True
            elif answer == "n":
                return False
    
            print("Please only enter y or n.")
    
    
    def ask_integer(prompt):
        while True:
            try:
                answer = int(input(prompt + ": "))
                break
            except ValueError:
                print("Please only enter integers (e.g. 1, 2 or 3).")
        return answer
    
    
    def scan_empty(index):
        print("Scanning for empty cells...")
        scan, fill = False, False
        min_col = outgoing_sheet.min_column
        max_col = outgoing_sheet.max_column
        cols = range(min_col, max_col+1)
        break_loop = False
        count = 0
    
        if not scan:
            search_index = index
            for row in outgoing_sheet.iter_rows():
                for n in cols:
                    cell = outgoing_sheet.cell(row=search_index, column=n).value
                    if cell:
                        pass
                    else:
                        choice = ask_yes_no("\n[!] Empty cells found, would you like to fill them? (recommended)")
                        if choice:
                            fill = input("Fill with: ")
                            scan = True
                            break_loop = True
                            break
                        else:
                            print("[!] Attention: This can lead to mismatches in the sorting algorithm.")
                            confirm = ask_yes_no("[>] Are you sure you don't want to fill them?\n[+] Hint: You can also enter spaces.\n(n)o I really don't want to\noka(y) I'll enter something, just let me sort already.\n")
                            if confirm:
                                fill = input("Fill with: ")
                                scan = True
                                break_loop = True
                                break
                            else:
                                print("You have chosen not to fill the empty cells.")
                                scan = True
                                break_loop = True
                                break
                if break_loop:
                    break
                search_index += 1
    
        if fill:
            search_index = index
            for row in outgoing_sheet.iter_rows(max_row=outgoing_sheet.max_row-1):
                for n in cols:
                    cell = outgoing_sheet.cell(row=search_index, column=n).value
                    if cell:
                        pass
                    elif cell != int(0):
                        count += 1
                        outgoing_sheet.cell(row=search_index, column=n).value = fill
    
                search_index += 1
    
            print("Filled " + str(count) + " cells with: " + fill)
    
        return fill, count
    
    
    # Declaring basic variables
    first_value = ask_yes_no("Is the first row containing values the 2nd in both tables?")
    if first_value:
        current_row, current_line = 2, 2
    else:
        current_row = ask_integer("Sorting table first row")
        current_line = ask_integer("Receiving table first row")
    verbose = ask_yes_no("Verbose output?")
    reset = current_line
    rec_max = receiving_sheet.max_row
    scan_empty(current_row)
    count = 0
    
    print("\nSorting: " + str(outgoing_sheet.max_row - 1) + " rows...")
    for row in outgoing_sheet.iter_rows():
        # loop through first table - Table you want to sort
        id_A = outgoing_sheet.cell(row=current_row, column=1).value
    
        if verbose:
            print("\nCurrently at: " + str(current_row - 1) + "/" + str(outgoing_sheet.max_row - 1) + "")
            try:
                print("Sorting now: " + id_A)
            except TypeError:
                # Handling None type exceptions
                pass
    
        for line in receiving_sheet.iter_rows():
            # loop through second table - The receiving table
            id_B = receiving_sheet.cell(row=current_line, column=1).value
    
            if id_A == id_B:
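                try:
                    # calculate the offset: first column after the last non-empty cell of the receiving row
                    offset = max((cell.column for cell in receiving_sheet[current_line] if cell.value is not None)) + 1
                except ValueError:
                    # max() raises ValueError when the receiving row has no non-empty cells
                    # (e.g. both IDs are None); fall back to the first column after the sheet's data
                    offset = receiving_sheet.max_column + 1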
    
                start_paste_from = receiving_sheet.cell(row=current_line, column=offset).column_letter + str(current_line)
                copy_Range = ((outgoing_sheet.cell(row=current_row, column=2)).column_letter + str(current_row) + ":" +
                              (outgoing_sheet.cell(row=current_row, column=outgoing_sheet.max_column)).column_letter + str(current_row))
                #  Don't copy the ID, alternatively set damage.min_column for the first and damage.max_column for the second
    
                copy_row(copy_Range, start_paste_from, outgoing_sheet, receiving_sheet)
                count += 1
                current_row += 1
    
                if verbose:
                    print("Copied " + copy_Range + " to: " + str(start_paste_from))
    
                break
    
            if not current_line > rec_max:
                # prevent looping out of range
                current_line += 1
            else:
                current_line = reset
    
    wb.save(file)
    print("\nSorted: " + str(count) + " rows.")
    print("Saving the file to: " + os.getcwd())
    print("Done.")
    

    Note: the rows of Table B ("Damage") were sorted by ID beforehand, although this is not required. If you choose to do so, it can be done with pandas.

    import pandas as pd
    
    # open the correct worksheet
    df = pd.read_excel("excel/separated.xlsx", sheet_name="Damage")
    
    # sort_values returns a new DataFrame, so keep the result
    df = df.sort_values(by="Identification")
    df.to_excel("sorted.xlsx")
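
A small in-memory sketch of the sorting step, with made-up values and the column name "Identification" taken from the snippet above. It shows that sort_values returns a new DataFrame and leaves the original untouched, so the result has to be assigned; kind="mergesort" is used here only to keep rows with the same ID in their original order:

```python
import pandas as pd

# Hypothetical sample of the "Damage" sheet
df = pd.DataFrame({"Identification": ["ID4", "ID1", "ID1"], "Value": [3, 1, 2]})

# A stable sort keeps rows that share an ID in their original relative order
result = df.sort_values(by="Identification", kind="mergesort")

print(result)
print(df)  # the original frame is still in its unsorted order
```

Keep in mind that string IDs sort lexicographically, so an ID such as "ID10" would come before "ID2".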
    

    【Comments】:
