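How to merge 200 csv files in Python: answers

【Question title】: How to merge 200 csv files in Python
【Posted】: 2011-01-31 12:39:36
【Question】:

Guys, I have 200 separate csv files here, named from SH (1) to SH (200). I want to merge them into a single csv file. How can I do it?

【Question comments】:

In what way would you merge them? (Concatenating lines, ...)
How do you want them merged? Every line in a CSV file is a row, so one simple option is to concatenate all the files together.
Each file has two columns. I want to merge them into a single file with two columns, one file after another.
@Chuck: How about taking all the replies in your comments (to the question and the answers) and updating your question?
This question should be titled "How to concat..." rather than "how to merge..."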

【Solution 1】:

I'll just throw another code example into the basket:

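from glob import glob

with open('singleDataFile.csv', 'a') as singleFile:
    for csvFile in glob('*.csv'):
        if csvFile == 'singleDataFile.csv':
            continue  # the output file also matches *.csv; never read it back
        with open(csvFile, 'r') as source:
            for line in source:
                singleFile.write(line)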

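【Discussion】:

@Andy I don't see the difference between Stack Overflow reminding me to upvote answers and me reminding people to share their appreciation (by upvoting) if they found my answer useful. I know this isn't Facebook, and I'm not chasing likes.
This has been discussed previously, and each time it was deemed unacceptable.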

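【Solution 2】:

I did this by implementing a function that takes an output file and the paths of the input files. The function copies the contents of the first file into the output file, then does the same for the remaining input files, but without the header line.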

def concat_files_with_header(output_file, *paths):
    for i, path in enumerate(paths):
        with open(path) as input_file:
            if i > 0:
                next(input_file)  # Skip header
            output_file.writelines(input_file)

Example usage of the function:

if __name__ == "__main__":
    paths = [f"sh{i}.csv" for i in range(1, 201)]
    with open("output.csv", "w") as output_file:
        concat_files_with_header(output_file, *paths)

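【Discussion】:

【Solution 3】:

You can simply use the built-in csv library. Unlike the other top-voted answers, this solution works even if some of your CSV files have slightly different column names or headers.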

import csv
import glob


filenames = glob.glob("SH*.csv")
header_keys = []
merged_rows = []

for filename in filenames:
    # newline="" lets the csv module handle line endings itself
    with open(filename, newline="") as f:
        reader = csv.DictReader(f)
        merged_rows.extend(reader)
        # collect column names in first-seen order, without duplicates
        header_keys.extend([key for key in reader.fieldnames if key not in header_keys])

with open("combined.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=header_keys)
    w.writeheader()
    w.writerows(merged_rows)

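The merged file will contain every column (header_keys) found across the input files. Any column that a given file lacks is left blank for that file's rows, while the rest of that file's data is preserved.

Notes:

This will not work if your CSV files have no headers. In that case you can still use the csv library, but instead of DictReader and DictWriter you have to use the basic reader and writer.
This may run into problems with very large amounts of data, because the entire content is held in memory (the merged_rows list).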
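
The memory concern in the second note can be sidestepped by reading the files twice: once for the headers only, then once more to stream the rows. The following is a minimal sketch of that idea, not part of the original answer; the function name `merge_csv_streaming` and its parameters are made up for illustration, and it assumes every file has a header row.

```python
import csv


def merge_csv_streaming(paths, out_path):
    """Merge CSV files with possibly different headers, one row at a time."""
    # Pass 1: read only the header of each file to collect the union of columns.
    header_keys = []
    for path in paths:
        with open(path, newline="") as f:
            for key in csv.DictReader(f).fieldnames or []:
                if key not in header_keys:
                    header_keys.append(key)

    # Pass 2: stream rows straight into the output; missing columns stay empty.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=header_keys, restval="")
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as f:
                writer.writerows(csv.DictReader(f))
```

Calling it with something like `merge_csv_streaming(sorted(glob.glob("SH*.csv")), "combined.csv")` should give the same file as the code above, at the cost of opening each input twice.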

【Discussion】:

【Solution 4】:

import os
import pandas as pd

# Directory that holds the CSV files to combine
data_dir = "e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data"

# Keep only the CSV files found in that directory
files = [file for file in os.listdir(data_dir) if file.endswith(".csv")]
for file in files:
    print(file)

# Read every file, then concatenate once instead of growing a DataFrame in the loop
frames = [pd.read_csv(os.path.join(data_dir, file)) for file in files]
all_data = pd.concat(frames, ignore_index=True)
print(all_data.head())

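【Discussion】:

【Solution 5】:

Building on the solution by @Adders, later improved by @varun, I made a few small improvements so that the merged CSV keeps only a single main header: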

from glob import glob

filename = 'main.csv'

with open(filename, 'a') as single_file:
    first_csv = True
    for csv_path in glob('*.csv'):
        if csv_path == filename:
            continue  # never read the output file itself
        with open(csv_path, 'r') as csv_file:
            header = next(csv_file, None)  # first line of this file, if any
            if header is None:
                continue  # empty file, nothing to copy
            if first_csv:
                single_file.write(header)  # keep the header of the first file only
                first_csv = False
            single_file.writelines(csv_file)  # copy the remaining data lines

Best regards!!!

【Discussion】:

【Solution 6】:

An easy-to-use function:

def csv_merge(destination_path, *source_paths):
    '''
    Merges all csv files on source_paths to destination_path.
    :param destination_path: Path of a single csv file, doesn't need to exist
    :param source_paths: Paths of csv files to be merged into, needs to exist
    :return: None
    '''
    with open(destination_path, "a") as dest_file:
        # Copy the first file in full, header included
        with open(source_paths[0]) as src_file:
            dest_file.writelines(src_file)
        # Copy the remaining files without their header line
        for source_path in source_paths[1:]:
            with open(source_path) as src_file:
                next(src_file, None)  # skip the header
                dest_file.writelines(src_file)

【Discussion】:

【Solution 7】:

Alternatively, you can just do this:

cat sh*.csv > merged.csv

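【Discussion】:

This will also copy the header line of every file.

【Solution 8】:

If the files are not numbered in order, take the hassle-free approach below (Python 3.6 on a Windows machine):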

import pandas as pd
from glob import glob

interesting_files = glob("C:/temp/*.csv") # it grabs all the csv files from the directory you mention here

df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)

# save the final file in same/different directory:
full_df.to_csv("C:/temp/merged_pandas.csv", index=False)

【Discussion】:

【Solution 9】:

Suppose you have 2 csv files like these:

csv1.csv:

id,name
1,Armin
2,Sven

csv2.csv:

id,place,year
1,Reykjavik,2017
2,Amsterdam,2018
3,Berlin,2019

And you want the result to look like this csv3.csv:

id,name,place,year
1,Armin,Reykjavik,2017
2,Sven,Amsterdam,2018
3,,Berlin,2019

Then you can use the following snippet to do that:

import csv
import pandas as pd

# the file names
f1 = "csv1.csv"
f2 = "csv2.csv"
out_f = "csv3.csv"

# read the files
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)

# get the keys
keys1 = list(df1)
keys2 = list(df2)

# merge both files
for idx, row in df2.iterrows():
    data = df1[df1['id'] == row['id']]

    # if row with such id does not exist, add the whole row
    if data.empty:
        next_idx = len(df1)
        for key in keys2:
            df1.at[next_idx, key] = df2.at[idx, key]

    # if row with such id exists, add only the missing keys with their values
    else:
        i = int(data.index[0])
        for key in keys2:
            if key not in keys1:
                df1.at[i, key] = df2.at[idx, key]

# save the merged files (QUOTE_NONE writes fields unquoted, so values must not contain commas)
df1.to_csv(out_f, index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)

With the help of a loop, you can achieve the same result for multiple files, as in your case (200 csv files).
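
To make the "loop over many files" idea concrete, here is a rough sketch that applies the same rule (match rows on id, add only the missing columns) to any number of files. It swaps pandas for the standard csv module, so it is a different technique from the snippet above; the name `merge_on_key` and its parameters are hypothetical, and it assumes every file has a header containing the key column and that everything fits in memory.

```python
import csv


def merge_on_key(paths, out_path, key="id"):
    """Combine rows that share the same key value across several CSV files."""
    columns = []  # union of column names, in first-seen order
    merged = {}   # key value -> combined row (dicts keep insertion order)
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            for row in reader:
                # Values from earlier files win when a column appears twice.
                target = merged.setdefault(row[key], {})
                for name, value in row.items():
                    target.setdefault(name, value)

    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(merged.values())
```

With the two sample files, `merge_on_key(["csv1.csv", "csv2.csv"], "csv3.csv")` yields the rows shown for csv3.csv, with the name field left empty for id 3.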

【Discussion】:

【Solution 10】:

If you are working on Linux/Mac, you can do this:

from subprocess import call
script="cat *.csv>merge.csv"
call(script,shell=True)

【Discussion】:

【Solution 11】:

An update of wisty's answer for Python 3:

fout=open("out.csv","a")
# first file:
for line in open("sh1.csv"):
    fout.write(line)
# now the rest:    
for num in range(2,201):
    f = open("sh"+str(num)+".csv")
    next(f) # skip the header
    for line in f:
         fout.write(line)
    f.close() # not really needed
fout.close()

【Discussion】:

【Solution 12】:

It is quite easy to combine all the files in a directory:

import glob
import csv


# Open result file
with open('output.txt', 'w', newline='') as fout:
    wout = csv.writer(fout, delimiter=',')
    interesting_files = glob.glob("*.csv")
    h = True
    for filename in interesting_files:
        print('Processing', filename)
        # Open and process file
        with open(filename, newline='') as fin:
            if h:
                h = False  # keep the header of the first file only
            else:
                next(fin)  # skip header
            for line in csv.reader(fin, delimiter=','):
                wout.writerow(line)

【Discussion】:

【Solution 13】:

Here is a script that:

concatenates the csv files named SH1.csv through SH200.csv
keeps the header

import glob
import re

# Looking for filenames like 'SH1.csv' ... 'SH200.csv'
pattern = re.compile(r"^SH([1-9]|[1-9][0-9]|1[0-9][0-9]|200)\.csv$")
file_parts = [name for name in glob.glob('*.csv') if pattern.match(name)]

with open("file_merged.csv","wb") as file_merged:
    for (i, name) in enumerate(file_parts):
        with open(name, "rb") as file_part:
            if i != 0:
                next(file_part) # skip headers if not first file
            file_merged.write(file_part.read())
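
One thing to watch with scripts like this: glob does not promise any particular order, and a plain alphabetical sort puts SH10.csv before SH2.csv. If the row order in the merged file matters, the names can be sorted by their number first. The helper names below are made up for this sketch, which assumes the first run of digits in each name is the file number.

```python
import re


def numeric_sort_key(name):
    """Return the first run of digits in a file name as an int (-1 if none)."""
    match = re.search(r"\d+", name)
    return int(match.group()) if match else -1


def sort_numerically(names):
    """Order names such as SH1.csv, SH2.csv, ..., SH10.csv by their number."""
    return sorted(names, key=numeric_sort_key)
```

In the script above this would mean looping over `sort_numerically(file_parts)` instead of `file_parts`; the same key also works for names like SH (3).csv.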

【Discussion】:

【Solution 14】:

Use the accepted StackOverflow answer to create a list of the csv files you want to append, then run this code:

import pandas as pd
combined_csv = pd.concat( [ pd.read_csv(f) for f in filenames ] )

If you want to export it to a single csv file, use:

combined_csv.to_csv( "combined_csv.csv", index=False )

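【Discussion】:

@wisty, @Andy, suppose the files do not all share the same header (some have different headers), and the 2 columns in each file have no headers at all. How can I merge them so that each file adds only one column?
Where is the file exported to?
@dirtysocks45, I changed the answer to make it more explicit.
Add sort: combined_csv = pd.concat( [pd.read_csv(f) for f in filenames ], sort=False)

【Solution 15】:

I adapted what @wisty said so that it works on Python 3.x, for those who have encoding problems. I also use the os module to avoid hard-coding.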

import os


def merge_all():
    data_dir = 'C:\\python\\data\\'
    os.chdir(data_dir)
    # Count the input files before the output file is created in the same directory
    number_files = len([name for name in os.listdir(data_dir)
                        if name.startswith("file_") and name.endswith(".csv")])
    with open("merged_files.csv", "ab") as fout:
        # first file:
        with open("file_1.csv", 'rb') as f:
            fout.writelines(f)
        # now the rest:
        for num in range(2, number_files + 1):
            with open("file_" + str(num) + ".csv", 'rb') as f:
                next(f)  # skip the header
                fout.writelines(f)

【Discussion】:

【Solution 16】:

A slight change to the code above, because it does not actually work correctly.

It should be as follows...

from glob import glob

with open('main.csv', 'a') as singleFile:
    for csv in glob('*.csv'):
        if csv == 'main.csv':
            pass
        else:
            for line in open(csv, 'r'):
                singleFile.write(line)

【Discussion】:

【Solution 17】:

Why can't you just sed 1d sh*.csv > merged.csv?

Sometimes you don't even have to use Python!

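【Discussion】:

On Windows: C:\> copy *.csv merge.csv
Copy the header from one file: sed -n 1p some_file.csv > merge_file.csv. Then copy everything except the first line from all the other files: sed 1d *.csv >> merge_file.csv
@blinsay It also adds the header of each CSV file to the merged file.
How do I use this command without copying the header of every file after the first? The header information seems to keep popping up repeatedly.
If you don't need to remove the headers, that's great!

【Solution 18】:

As ghostdog74 said, but this time with headers: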

fout=open("out.csv","a")
# first file:
for line in open("sh1.csv"):
    fout.write(line)
# now the rest:    
for num in range(2,201):
    f = open("sh"+str(num)+".csv")
    f.next() # skip the header
    for line in f:
         fout.write(line)
    f.close() # not really needed
fout.close()

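【Discussion】:

If you are on Python 3.x, you can use f.__next__() instead of f.next().
Just a note: you can use the with open syntax and avoid manually .close()ing the files.
What is the difference between f.next() and f.__next__()? When I use the former, I get '_io.TextIOWrapper' object has no attribute 'next'
Before fout.write(line) I would do this: if line[-1] != '\n': line += '\n'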
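
Pulling these comments together, a Python 3 variant might look like the sketch below. It uses the built-in next() (which works where f.next() no longer exists), with blocks instead of manual close() calls, and the newline guard from the last comment. The function name and parameters are invented for illustration, and it assumes every file starts with a header line.

```python
def concat_with_newline_guard(paths, out_path):
    """Concatenate CSV files, keeping only the first header and repairing
    a missing newline at the end of any file."""
    with open(out_path, "w", newline="") as fout:
        for index, path in enumerate(paths):
            with open(path, newline="") as f:
                if index > 0:
                    next(f, None)  # skip the header of every file but the first
                for line in f:
                    if not line.endswith("\n"):
                        line += "\n"  # last line of a file without trailing newline
                    fout.write(line)
```

Without the guard, a file whose last line lacks a trailing newline would have that line glued to the first data row of the next file.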

【Solution 19】:

You can import csv, then loop through all the CSV files and read them into a list. Then write the list back out to disk.

import csv

rows = []

# file1, file2, ... are placeholders for the paths of your CSV files
for f in (file1, file2, ...):
    reader = csv.reader(open(f, newline=""))

    for row in reader:
        rows.append(row)

writer = csv.writer(open("some.csv", "w", newline=""))
writer.writerows(rows)

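The above is not very robust, as it has no error handling and does not close any of the files it opens. It should work whether the individual files contain one or many rows of CSV data. I also have not run this code, but it should give you an idea of what to do.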

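【Discussion】:

【Solution 20】:

It depends on what you mean by "merging": do they have the same columns? Do they have headers? For example, if they all have the same columns and no headers, simple concatenation is sufficient (open the destination file for writing, loop over the sources opening each for reading, use shutil.copyfileobj from the open-for-reading source into the open-for-writing destination, close the source, keep looping; use the with statement to do the closing on your behalf). If they have the same columns but also headers, you will need a readline on each source file except the first, after you open it for reading and before you copy it into the destination, to skip the header line.

If the CSV files do not all have the same columns, then you need to define in what sense you are "merging" them (like a SQL JOIN? or "horizontally", if they all have the same number of rows? and so on). In that case it is hard for us to guess what you mean.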
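
For the same-columns case described above, the recipe translates fairly directly into code. This is only a sketch of that description; the function name `concat_files` and the `has_header` flag are assumptions added for illustration.

```python
import shutil


def concat_files(out_path, paths, has_header=True):
    """Concatenate files with identical columns using shutil.copyfileobj."""
    with open(out_path, "wb") as destination:
        for index, path in enumerate(paths):
            with open(path, "rb") as source:
                if has_header and index > 0:
                    source.readline()  # consume the header line, do not copy it
                shutil.copyfileobj(source, destination)
```

Binary mode is used so the bytes are copied untouched; pass `has_header=False` when the files carry no header row.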

【Discussion】:

Each file has two columns with headers. I want to merge them into a single file with two columns.

【Solution 21】:

fout=open("out.csv","a")
for num in range(1,201):
    for line in open("sh"+str(num)+".csv"):
         fout.write(line)    
fout.close()

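【Discussion】:

【Solution 22】:

If the merged CSV is going to be used in Python anyway, just use glob to get the list of files, pass it to fileinput.input() via the files argument, and then use the csv module to read it all in one go.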
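
A minimal sketch of this suggestion follows. The function name `read_all_rows` and the header handling are additions for illustration (the answer itself does not say what to do about headers), and the header check assumes no field contains an embedded line break.

```python
import csv
import fileinput


def read_all_rows(paths, has_header=True):
    """Read several CSV files as one stream; return (header, rows)."""
    paths = list(paths)
    if not paths:
        # fileinput would fall back to sys.argv or stdin for an empty list.
        return None, []
    header = None
    rows = []
    with fileinput.input(files=paths) as stream:
        for row in csv.reader(stream):
            # isfirstline() is True right after the first line of a file was read.
            if has_header and stream.isfirstline():
                if header is None:
                    header = row
                continue  # drop the repeated header of later files
            rows.append(row)
    return header, rows
```

Something like `header, rows = read_all_rows(sorted(glob.glob("SH*.csv")))` then gives the combined data without writing a merged file at all.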

【Discussion】: