【Question title】: How to Transfer the "Common" Columns between Two CSV Files
【Posted】: 2016-11-05 10:55:52
【Question】:

I am very new to programming and want to write a program that transfers the common columns between file1.csv and file2.csv.

Input:

file1.csv looks like this:

ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName
1,J.,M,Dr.,Jason,,,Allan
2,B.,M,Mr.,Brian,,,Welch

file2.csv looks like this:

nickname,gender,city,id,prefix_name,first_name,Whatever1B,last_name,Whatever2B,Whatever3B,Whatever4B

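Question:

How can I compare the headers of file1.csv and file2.csv to identify and transfer the "common" columns between them? The "common" columns are those with similar names (e.g. ID and id, Nickname and nickname), or those that do not share a naming convention but store the same data (e.g. SubjectPrefix and prefix_name, SubjectFirstName and first_name).

Output:

The output should look like this.

  • Note: the transferred columns "id", "nickname" and "gender" are the similarly named columns in the file1.csv and file2.csv headers, while the "prefix_name" and "first_name" columns correspond to "SubjectPrefix" and "SubjectFirstName" respectively.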

    id,nickname,gender,prefix_name,first_name,last_name  
    1,J.,M,Dr.,Jason,Allan
    2,B.,M,Mr.,Brian,Welch
    

I tried this code:

import csv
import collections

csv_file1 = "file1.csv"
csv_file2 = "file2.csv"

# open() replaces the Python 2-only file() builtin
data1 = list(csv.reader(open(csv_file1, 'r')))
data2 = list(csv.reader(open(csv_file2, 'r')))

file1_header = data1[0][:] #get the header from file1
file2_header = data2[0][:] #get the header from file2
lowered_file1_header = [item.lower() for item in file1_header] #lowercase file1 header
lowered_file2_header = [item.lower() for item in file2_header] #lowercase file2 header anyways
col_index_dict = {}

for column in lowered_file1_header:
    if column == "subjectprefix":  # identify "subjectprefix" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)

    elif column == "subjectfirstname": # identify "subjectfirstname" column in file1.csv
        col_index_dict[column] = lowered_file1_header.index(column)

    elif column in file2_header: # identify the columns with same naming
        col_index_dict[column] = lowered_file1_header.index(column)

    else:
        col_index_dict[column] = -1 # mark the not matching columns

# Build header
output = [col_index_dict.keys()]
is_header = True

for row in data1:
    if is_header is False:
        rowData = []
        for column in col_index_dict:
            column_index = col_index_dict[column]
            if column_index != -1:
                rowData.append(row[column_index])
            else:
                rowData.append('')
        output.append(rowData)
    else:
        is_header = False

print(output)

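Any idea how to solve this?

【Question comments】:

    Tags: python csv for-loop pandas dictionary


    【Solution 1】:

    Thanks to Wboy for the contribution; your input was very helpful.

    I was able to solve the problem using the Pandas library. The code is as follows: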

    import pandas as pd
    
    # read the csv files
    df = pd.read_csv('file1.csv')
    df2 = pd.read_csv('file2.csv')
    
    # lowercase the headers
    df.columns = df.columns.str.lower()
    df2.columns = df2.columns.str.lower()
    
    df_columns = set(list(df.columns))
    df2_columns = set(list(df2.columns))
    

    Identify and transfer the "common" columns:

    # names of the df2 columns that received data from df;
    # created up front because set iteration order is not guaranteed
    df3 = []

    for col in df_columns:
        for col2 in df2_columns:
            if col == "subjectprefix" and col2 == "prefix_name":
                # copy the data from df["subjectprefix"] to df2["prefix_name"]
                df2["prefix_name"] = df["subjectprefix"]
                df3.append(col2)
            elif col == "subjectfirstname" and col2 == "first_name":
                # copy the data from "subjectfirstname" column to "first_name" column
                df2["first_name"] = df["subjectfirstname"]
                df3.append(col2)
            elif col == "subjectlastname" and col2 == "last_name":
                # copy the data from "subjectlastname" column to "last_name" column
                df2["last_name"] = df["subjectlastname"]
                df3.append(col2)
            elif col == col2:
                # copy the exactly matching columns to df2
                df2[col2] = df[col]
                df3.append(col2)
    
    

    Remove the "uncommon" columns from the dataframe df2:

    for col2 in list(df2_columns):
        if col2 not in df3:
            del df2[col2]
    
    # print the output
    df2.set_index("id", inplace=True)
    print(df2)
    

    Save the output as a .csv file:

    df2.to_csv('output.csv')
    

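    I am sure this is not the best solution, and I hope the code can be improved in how it identifies and transfers the "common" columns. It is full of if/elif statements, and I believe there must be a better way to do this.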
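
    One way to avoid the if/elif chain is to keep the differently named pairs in a dictionary and look each header up in it. The following is only a minimal sketch, not the answerer's code: it uses the standard csv module instead of pandas so it runs without dependencies, the names `ALIASES` and `transfer_common_columns` are hypothetical, and it assumes the shared columns in file2.csv are already lowercase. The sample data is inlined from the question.

```python
import csv
import io

# Hypothetical alias table: file1 header (lowercased) -> file2 header.
# Only the three pairs named in this thread are listed; extend as needed.
ALIASES = {
    "subjectprefix": "prefix_name",
    "subjectfirstname": "first_name",
    "subjectlastname": "last_name",
}


def transfer_common_columns(file1, file2):
    """Return file1's rows restricted to, and renamed as, file2's columns."""
    reader1 = csv.DictReader(file1)
    header2 = next(csv.reader(file2))

    # Map each file1 column to its file2 name: alias first, else lowercase.
    to_file2 = {}
    for name in reader1.fieldnames:
        lowered = name.lower()
        to_file2[name] = ALIASES.get(lowered, lowered)

    # Keep file2's column order, but only columns that file1 can supply.
    source_for = {target: source for source, target in to_file2.items()}
    common = [col for col in header2 if col in source_for]

    rows = [common]
    for record in reader1:
        rows.append([record[source_for[col]] for col in common])
    return rows


# Inline sample data from the question, so the sketch runs without files.
FILE1 = """ID,Nickname,Gender,SubjectPrefix,SubjectFirstName,Whatever1A,Whaterver2A,SubjectLastName
1,J.,M,Dr.,Jason,,,Allan
2,B.,M,Mr.,Brian,,,Welch
"""
FILE2 = ("nickname,gender,city,id,prefix_name,first_name,Whatever1B,"
         "last_name,Whatever2B,Whatever3B,Whatever4B\n")

result = transfer_common_columns(io.StringIO(FILE1), io.StringIO(FILE2))
for row in result:
    print(",".join(row))
```

    The columns come out in file2.csv's order (nickname, gender, id, ...), not in the order shown in the question's expected output. With pandas, the same dictionary could presumably be passed to `df.rename(columns=...)` after lowercasing the headers.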

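    【Comments】:

    【Solution 2】:

    Welcome to programming. Let me introduce you to the wonderful pandas library.

    Off the top of my head, here is something that should solve your problem. (I am not saying it is efficient, so it might be a problem for large datasets!)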

    import pandas as pd
    
    df = pd.read_csv('file1.csv')
    df2 = pd.read_csv('file2.csv')
    
    df_columns = set(list(df.columns))
    df2_columns = set(list(df2.columns))
    
    common_columns = list(df_columns.intersection(df2_columns))
    
    common_df = df[common_columns]
    common_df2 = df2[common_columns]
    
    ## At this point you have the common columns for both CSV's. if you want
    ## to make them into one, just use df concatenate / append. else, you can save both of them like this:
    
    common_df.to_csv('common1.csv')
    common_df2.to_csv('common2.csv')
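
    Note that a plain set intersection is case-sensitive, so with the headers from the question (ID vs id, Nickname vs nickname) it would come back empty. A small sketch of a case-insensitive match follows; the helper name `common_columns` is hypothetical, and it works on plain lists of names, so it could be fed `df.columns` and `df2.columns` as well.

```python
def common_columns(columns1, columns2):
    """Return (name_in_1, name_in_2) pairs whose lowercase forms match."""
    lowered2 = {name.lower(): name for name in columns2}
    return [(name, lowered2[name.lower()])
            for name in columns1 if name.lower() in lowered2]


# Headers copied from the question.
header1 = ["ID", "Nickname", "Gender", "SubjectPrefix", "SubjectFirstName",
           "Whatever1A", "Whaterver2A", "SubjectLastName"]
header2 = ["nickname", "gender", "city", "id", "prefix_name", "first_name",
           "Whatever1B", "last_name", "Whatever2B", "Whatever3B", "Whatever4B"]

pairs = common_columns(header1, header2)
print(pairs)  # [('ID', 'id'), ('Nickname', 'nickname'), ('Gender', 'gender')]
```

    This only finds names that differ in case; pairs such as SubjectPrefix and prefix_name still need an explicit mapping.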
    

    【Comments】:
