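Python Excel spreadsheet comparison: answers

【Question title】: Python Excel spreadsheet comparison
【Posted】: 2017-06-09 02:31:48
【Question】:

I am currently trying to write a script that compares the contents of two Excel files.

List 1 will have the following format: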

Broadcom Drivers and Management Applications  [version 17.0.8.2]
QLogic Drivers and Management Applications  [version 18.00.8.3]
NVIDIA 3D Vision Driver 306.97  [version 306.97]
Citrix online plug-in (Web)  [version 12.1.0.30]
Citrix online plug-in (HDX)  [version 12.1.0.30]
Google Update Helper  [version 1.3.32.7]
QfinitiPatches_20131211_Win7 [version 1.0.0.0]
Citrix online plug-in (Web)  [version 12.1.0.30]
Citrix online plug-in (HDX)  [version 12.1.0.30]
Citrix Receiver (HDX Flash Redirection)  [version 14.3.1.1]
Citrix Authentication Manager  [version 7.0.0.8243]
Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]
Microsoft Office Excel MUI (English) 2010  [version 14.0.6029.1000]
Microsoft Office PowerPoint MUI (English) 2010  [version 14.0.6029.1000]
Microsoft Office Publisher MUI (English) 2010  [version 14.0.6029.1000]

List 2 will have the following format:

Mcrosoft Word (All versions)
Microsoft Excel (All versions)
Microsoft Access (All versions)
Microsoft Project (All versions)
Microsoft PowerPoint (All versions)
Microsoft Infopath (All versions)
Microsoft Visio (All versions)
Microsoft SQL Server (All versions)
Microsoft SQL Client (All versions)
Microsoft explorer (version 6+)
Firefox (version 2+)
Oracle Database (All versions)

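What I need the script to do is use List 2 as the reference and find anything in List 1 that matches it. Because the two lists never match exactly, I need to make sure it picks up partial matches.

For example, List 1 contains Microsoft Office Access MUI (English) 2010 [version 14.0.6029.1000], while List 2 contains Microsoft Access (All versions). I need the script to treat this as a match and omit it from the output file.

So far I have the following: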

import pandas as pd
import numpy as np
df1 = pd.read_excel('/xls comparison project/xl files/Approved Software list.xls', 'Approved Software', parse_cols = 'd', index=False)
df2 = pd.read_excel('/xls comparison project/xl files/Software list.xlsx', 'Sheet1', parse_cols = 'a')
import csv
AS = df1["Software Title"].tolist()
S = df2["Software"].tolist()

I tried the following, but it only finds exact matches:

result = [ x for x in AS if x in S]

I have loaded the contents of the two spreadsheets as lists into variables named AS and S. Then:

results = result
resultfile = open("output1.xls",'wb')
wr = csv.writer(resultfile, delimiter=',')
for val in result:
    wr.writerow([val])
resultfile.close()

This gives me the output file I need.
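
The writing step above targets Python 2.7 (the question's tag), where the csv module expects a file opened with 'wb'. On Python 3 the csv module writes text, so the same file mode fails with a TypeError. The following is a minimal sketch of the equivalent step for Python 3; the function name `write_titles` is a hypothetical helper introduced here for illustration, not part of the original post.

```python
import csv


def write_titles(path, titles):
    """Write one title per row to a CSV file (Python 3 conventions)."""
    # Text mode with newline='' lets the csv module manage line endings itself.
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle, delimiter=',')
        for title in titles:
            # Each row is a one-element list so the title lands in a single cell.
            writer.writerow([title])
```

It could be called as `write_titles('output1.csv', result)`. Note that the file produced is plain CSV text, so a `.csv` extension describes it more accurately than `.xls`.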

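My only real problem is comparing the data, and I have run out of ideas.

I have searched Google extensively, and although I can find similar questions, I have not been able to build a solution from them. I am very new to Python, so I would appreciate any help you can give me.

Many thanks,

Lee

【Question comments】:

Tags: excel python-2.7 pandas

【Solution 1】: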

import pandas as pd 

df = pd.DataFrame(['Broadcom Drivers and Management Applications  [version 17.0.8.2]','QLogic Drivers and Management Applications  [version 18.00.8.3]','NVIDIA 3D Vision Driver 306.97  [version 306.97]','Citrix online plug-in (Web)  [version 12.1.0.30]','Citrix online plug-in (HDX)  [version 12.1.0.30]','Google Update Helper  [version 1.3.32.7]','QfinitiPatches_20131211_Win7 [version 1.0.0.0]','Citrix online plug-in (Web)  [version 12.1.0.30]','Citrix online plug-in (HDX)  [version 12.1.0.30]','Citrix Receiver (HDX Flash Redirection)  [version 14.3.1.1]','Citrix Authentication Manager  [version 7.0.0.8243]','Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]','Microsoft Office Excel MUI (English) 2010  [version 14.0.6029.1000]','Microsoft Office PowerPoint MUI (English) 2010  [version 14.0.6029.1000]','Microsoft Office Publisher MUI (English) 2010  [version 14.0.6029.1000]'], columns=['Software Title'])
df2 = pd.DataFrame(['Mcrosoft Word (All versions)','Microsoft Excel (All versions)','Microsoft Access (All versions)','Microsoft Project (All versions)','Microsoft PowerPoint (All versions)','Microsoft Infopath (All versions)','Microsoft Visio (All versions)','Microsoft SQL Server (All versions)','Microsoft SQL Client (All versions)','Microsoft explorer (version 6+)','Firefox (version 2+)','Oracle Database (All versions)'], columns=['Title'])

df2['TitleName'] = df2['Title'].str.split('(') #to remove version info 

df2 = pd.concat([df2['Title'], df2.TitleName.apply(pd.Series)], axis=1)
df2.columns=['Title','Software','Version']
df2['Software']=df2.Software.str.replace(' ','(.*)') #create search string in regex format


searchitems= df2["Software"].tolist()

for item in searchitems:
    print("searching for : " + item)
    # rows of df whose title matches the regex built from this reference entry
    print(df[df['Software Title'].str.contains(item)])

Output:

searching for : Mcrosoft(.*)Word(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)Excel(.*)
                                       Software Title
12  Microsoft Office Excel MUI (English) 2010  [ve...
searching for : Microsoft(.*)Access(.*)
                                       Software Title
11  Microsoft Office Access MUI (English) 2010  [v...
searching for : Microsoft(.*)Project(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)PowerPoint(.*)
                                       Software Title
13  Microsoft Office PowerPoint MUI (English) 2010...
searching for : Microsoft(.*)Infopath(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)Visio(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)SQL(.*)Server(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)SQL(.*)Client(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Microsoft(.*)explorer(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Firefox(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
searching for : Oracle(.*)Database(.*)
Empty DataFrame
Columns: [Software Title]
Index: []
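
The loop above prints the matches for each reference entry, whereas the question asks for the List 1 entries that have no match, so that matched software is left out of the output file. The following is a minimal sketch of that last step using only the standard `re` module on plain lists such as AS and S from the question. The helper names `build_pattern` and `unmatched` are hypothetical, and two choices here are assumptions that go beyond the answer: words are escaped with `re.escape`, and matching is case-insensitive.

```python
import re


def build_pattern(reference_title):
    """Turn 'Microsoft Access (All versions)' into a regex of its words in order."""
    name = reference_title.split('(')[0]           # drop the '(All versions)' part
    words = [re.escape(w) for w in name.split()]   # escape regex metacharacters
    # Assumption: ignore case so 'Microsoft explorer' can match 'Explorer'.
    return re.compile('.*'.join(words), re.IGNORECASE)


def unmatched(installed, reference):
    """Return the installed titles that match none of the reference entries."""
    # Skip reference entries with no name, since an empty regex matches everything.
    patterns = [build_pattern(r) for r in reference if r.split('(')[0].strip()]
    return [title for title in installed
            if not any(p.search(title) for p in patterns)]


installed = [
    'Google Update Helper  [version 1.3.32.7]',
    'Microsoft Office Access MUI (English) 2010  [version 14.0.6029.1000]',
    'Microsoft Office Excel MUI (English) 2010  [version 14.0.6029.1000]',
]
reference = ['Microsoft Excel (All versions)', 'Microsoft Access (All versions)']

remaining = unmatched(installed, reference)
print(remaining)
```

With this sample data only the Google Update Helper entry remains, because the two Microsoft Office titles contain the reference words in order. Matching on ordered words is deliberately loose, so short reference names may match more titles than intended and the results are worth reviewing by hand.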

【Comments】: