How to merge multiple .xls files with hyperlinks in Python? Answers

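【Question title】: How to merge multiple .xls files with hyperlinks in Python?
【Posted】: 2022-01-25 02:02:01
【Question】:

I am trying to merge multiple .xls files that have many columns, one of which contains hyperlinks. I tried to do this in Python but keep running into errors I cannot resolve.

To keep things short, the hyperlinks are hidden behind the cell text. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).

For reproducibility, I have added samples of .xls1 and .xls2 below.

.xls1: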

Title	Publication_Number
P_A	ES2866911 (T3)
P_B	EP3887362 (A1)

.xls2:

Title	Publication_Number
P_C	AR118706 (A2)
P_D	ES2867600 (T3)

Desired result:

Title	Publication_Number
P_A	ES2866911 (T3)
P_B	EP3887362 (A1)
P_C	AR118706 (A2)
P_D	ES2867600 (T3)

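I cannot import the .xls files into Python without losing either the formatting or the hyperlinks. I also cannot convert the .xls files to .xlsx, and the files are not available to me in .xlsx format. Here is a brief summary of what I tried:

1.) Reading with pandas was my first attempt. It is easy to do, but all hyperlinks are lost in the DataFrame, along with all formatting from the original file.

2.) Reading the .xls files with openpyxl.load_workbook:

InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

3.) Converting the .xls files to .xlsx: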

from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX("input_file.xls")
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element

import pyexcel as p
p.save_book_as(file_name="input_file.xls", dest_file_name="export_file.xlsx")
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration

4.) Even if we read the .xls file with xlrd (which means we could never save the file as .xlsx), I cannot even see the hyperlink:

import xlrd
wb = xlrd.open_workbook(file)  # `file` is the path to the test .xls file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)'  # this is the display text, not the hyperlink

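5.) I tried installing an older version, openpyxl==3.0.1, to get around the TypeError, without success. I tried opening the .xls file with openpyxl using the xlrd engine and got a similar TypeError mentioning xml.etree.ElementTree.Element. I tried many ways of batch-converting the .xls files to .xlsx, and all of them failed with similar errors.

Obviously I could open each file in Excel and save it as .xlsx, but that defeats the whole purpose, and I cannot do that for 100 files.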
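
A note on attempt 4: in xlrd, a cell's `value` holds only the display text. Hyperlinks are kept separately on the sheet in `hyperlink_map`, a dict keyed by `(row, column)` whose values expose `url_or_path` (the answers below rely on this). The lookup can be sketched as follows. It is written against a plain dict so it runs without a workbook; `link_for_cell` and the sample URL are hypothetical and not part of the original post.

```python
from types import SimpleNamespace


def link_for_cell(hyperlink_map, row, col):
    """Return the URL attached to cell (row, col), or None if it has no link.

    `hyperlink_map` is expected to look like xlrd's `sheet.hyperlink_map`:
    a dict keyed by (row, col) whose values carry a `url_or_path` attribute.
    """
    link = hyperlink_map.get((row, col))
    return link.url_or_path if link is not None else None


# Stand-in for sheet.hyperlink_map; the URL is a made-up placeholder.
fake_map = {(5, 1): SimpleNamespace(url_or_path="https://example.com/AR118706")}

print(link_for_cell(fake_map, 5, 1))  # the URL behind the cell text
print(link_for_cell(fake_map, 0, 0))  # None, because that cell has no link
```

With a real workbook the call would presumably be `link_for_cell(ws.hyperlink_map, 5, 1)`, using the same `ws` as in attempt 4.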

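【Comments】:

I would revisit pandas. It lets you switch between "engines": xlrd can read the older .xls files and openpyxl can write the newer .xlsx files. read_excel also has a handy skiprows parameter: pandas.pydata.org/docs/reference/api/pandas.read_excel.html Also make sure you have the latest version of pandas, as it is constantly being extended.

Tags: python excel pandas openpyxl xlsx

【Solution 1】: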

Inspired by @Kunal, I managed to write code that avoids the pandas library. The .xls files are read with xlrd and written to a new Excel file with xlwt. The hyperlinks are preserved. Keep in mind that xlwt always writes the legacy .xls format, whatever extension the output file name has, so the output is named .xls here:

import os
import xlwt
from xlrd import open_workbook

# Read and combine data
directory = "random_directory"
output_name = "output_file.xls"  # xlwt writes the legacy .xls format, so use a matching extension
# Only pick up .xls inputs, and skip the output of a previous run
required_files = sorted(
    f for f in os.listdir(directory)
    if f.lower().endswith(".xls") and f != output_name
)

# Define the new file and sheet that will receive the rows
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)

# Initialize the header row; any input file can be used for this
old_file = open_workbook(os.path.join(directory, required_files[0]), formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)

# Add the rows from every input file in the folder
next_row = 1  # row 0 holds the header
for file in required_files:
    old_file = open_workbook(os.path.join(directory, file), formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)
    hyperlink_map = old_sheet.hyperlink_map  # {(row, col): Hyperlink}
    for row in range(1, old_sheet.nrows):  # all rows except the header row
        for col in range(old_sheet.ncols):
            value = old_sheet.cell(row, col).value
            link = hyperlink_map.get((row, col))  # look the link up by cell position
            if link is not None:  # hyperlinked cell: rebuild it as a HYPERLINK formula
                new_sheet.write(next_row, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(link.url_or_path, value)))
            else:  # plain cell
                new_sheet.write(next_row, col, value)
        next_row += 1

new_file.save(os.path.join(directory, output_name))
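
The `HYPERLINK` formula above is built with plain string formatting, so a double quote in the URL or in the cell text would end the string literal early. In Excel formula syntax, a literal double quote inside a string is written as two double quotes. Below is a small sketch of a helper that applies that escaping; `hyperlink_formula` is a hypothetical name and not part of xlwt, and the URL is a placeholder.

```python
def hyperlink_formula(url, text):
    """Build the text of an Excel HYPERLINK formula with quotes escaped."""
    def quote(value):
        # Excel string literals escape '"' by doubling it.
        return '"{}"'.format(str(value).replace('"', '""'))

    return "HYPERLINK({}, {})".format(quote(url), quote(text))


print(hyperlink_formula("https://example.com/ES2866911", "ES2866911 (T3)"))
# HYPERLINK("https://example.com/ES2866911", "ES2866911 (T3)")
```

The returned string should be usable as the argument to `xlwt.Formula(...)` in place of the inline `format` call.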

【Comments】:

【Solution 2】:

I am making the same assumptions about the Excel files as daedalus. I use openpyxl instead of pandas to read the files and create a new Excel file.

import openpyxl
from openpyxl import Workbook

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']

wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']

hyperlink_dict = {}

# Go through both sheets to collect the keys, hyperlinks and cell text.
for ws in (ws1, ws2):
    for row in range(2, ws.max_row + 1):  # row 1 is the header
        cell = ws.cell(row=row, column=2)
        target = cell.hyperlink.target if cell.hyperlink else None
        hyperlink_dict[ws.cell(row=row, column=1).value] = [target, cell.value]

Now you have all the data, so you can create a new workbook and save the values from the dict into it with openpyxl.

wb = Workbook(write_only=True)
ws = wb.create_sheet()

ws.append(['Title', 'Publication_Number'])  # header names taken from the question's sample
for title, (target, text) in hyperlink_dict.items():
    # Use ws.append() to add one row per entry. Writing the link as a
    # HYPERLINK formula is one option (an assumption, not from the original answer).
    link_cell = '=HYPERLINK("{}", "{}")'.format(target, text) if target else text
    ws.append([title, link_cell])

wb.save('new_big_file.xlsx')

https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode

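【Comments】:

As stated in the original question, and in my reply to daedalus's answer, I cannot use .xlsx files, so I am limited to .xls files. The following happens when these files are read with openpyxl

【Solution 3】:

You need to use the xlrd library to read the hyperlinks correctly, pandas to merge all the data together, and xlsxwriter to write the data correctly. Assuming all input files have the same format, you can use the following code.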

# imports
import os
import xlrd
import xlsxwriter
import pandas as pd

# required functions
def load_excel_to_df(filepath, hyperlink_col):
    book = xlrd.open_workbook(filepath)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map

    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)

    # Sheet row 0 is the header, so DataFrame row i corresponds to sheet row i + 1
    links = [hyperlink_map.get((i + 1, hyperlink_col_index)) for i in range(len(data))]
    data['hyperlinks'] = [link.url_or_path if link else None for link in links]
    return data

# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
frames = [load_excel_to_df(os.path.join(input_data_dir, file), hyperlink_col) for file in required_files]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
combined_data = pd.concat(frames, sort=False, ignore_index=True)
cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(os.path.join(output_data_dir, output_filename), engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False) # last column contains hyperlinks
workbook  = writer.book
worksheet = writer.sheets['Sheet1']  # default sheet name used by to_excel
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)
for i in range(m):
    url = combined_data.loc[i, cols[-1]]
    if pd.notna(url):  # rows without a link keep the plain text written by to_excel
        worksheet.write_url(i+1, hyperlink_col_index, url, string=combined_data.loc[i, hyperlink_col])
writer.close()  # ExcelWriter.save() was removed in pandas 2.0
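
`os.listdir` returns every entry in the input folder, so stray files (or the combined workbook itself, if it is written to the same folder) would also be passed to the loader. A minimal sketch of a stricter listing, assuming the inputs all end in `.xls`; `list_xls_files` is a hypothetical helper and not part of the original answer.

```python
from pathlib import Path


def list_xls_files(directory, exclude=()):
    """Return sorted paths of the .xls files in `directory`.

    Names listed in `exclude` (for example a previous output file) are skipped.
    """
    excluded = set(exclude)
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".xls"
        and path.name not in excluded
    )
```

The helper returns full `Path` objects, so each one could be passed to `load_excel_to_df` directly instead of being joined with the input directory first.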

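References:

Reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
Writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html

【Comments】:

【Solution 4】:

Without a clear reproducible example, the question is unclear. Suppose I have two files named tmp.xls and tmp2.xls containing dummy data as in the two screenshots below.

pandas can then easily load, concatenate and convert them to .xlsx format without losing the hyperlinks. Here is some demo code and the resulting file: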

import pandas as pd

f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')

f3 = pd.concat([f1, f2], ignore_index=True)

f3.to_excel('./f3.xlsx')

【Comments】:

Unfortunately, this solution does not work for my problem. My hyperlinks are ctrl-click links that redirect to a web page, for example ES2866911.