How to calculate average of numbers from multiple csv files? Answers

【Question title】: How to calculate average of numbers from multiple csv files?
【Posted】: 2014-09-06 01:13:15
【Question description】:

I have files like the following, produced as replicates of a simulation experiment I have been running:

generation, ratio_of_player_A, ratio_of_player_B, ratio_of_player_C

So the data looks like this:

0, 0.33, 0.33, 0.33

1, 0.40, 0.40, 0.20

2, 0.50, 0.40, 0.10

etc

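Since I run each experiment many times, I have around 1000 files per experiment, each containing numbers like these. My problem is how to average them across one set of experiments.

So I would like a single file that contains the average ratio after each generation (averaged over the replicates, i.e. the files).

The replicate output files that need to be averaged are named output1.csv, output2.csv, output3.csv ... output1000.csv.

I would be grateful if someone could help me with a shell script or a Python script.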

【Question discussion】:

Try pandas: pandas.pydata.org
Are all the files in the same directory?

Tags: python bash shell csv file-io

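【Solution 1】:

You can load each of the 1000 experiments into a data frame, add them all together, and then divide by the number of files to get the mean.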

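import tkinter.filedialog
import pandas as pd

filepaths = tkinter.filedialog.askopenfilenames(filetypes=[('CSV', '*.csv')])  # select your files
dfs = []
for file in filepaths:
    # The question's files use "," followed by a space; change sep/decimal if yours differ
    df = pd.read_csv(file, skipinitialspace=True)
    dfs.append(df)

temp = dfs[0]  # running sum, starting with the first data frame
for i in range(1, len(dfs)):  # starts from 1 because 0 is already in temp
    temp = temp + dfs[i]
result = temp / len(dfs)  # element-wise mean over all files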
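
For a batch of 1000 files, picking them in a dialog may be impractical. The following is a minimal sketch of the same sum-then-divide idea as a function that takes a list of paths. The function name `average_replicates` is made up for illustration, and the sketch assumes every file is comma-separated, starts with the header row shown in the question, and has the generation number in its first column.

```python
import pandas as pd


def average_replicates(paths):
    """Element-wise mean of identically shaped CSV files (illustrative sketch)."""
    total = None
    for path in paths:
        # Assumption: header row present, generation in the first column
        df = pd.read_csv(path, skipinitialspace=True, index_col=0)
        # Addition aligns on the generation index and the column names
        total = df if total is None else total + df
    return total / len(paths)
```

With the question's naming scheme, a call such as `average_replicates(sorted(glob.glob("output*.csv"))).to_csv("average.csv")` (after `import glob`; the output name `average.csv` is only an example) would write the averaged ratios to a single file.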

【Discussion】:

【Solution 2】:

If I understand correctly, suppose you have 2 files like this:

$ cat file1
0, 0.33, 0.33, 0.33
1, 0.40, 0.40, 0.20
2, 0.50, 0.40, 0.10

$ cat file2
0, 0.99, 1, 0.02
1, 0.10, 0.90, 0.90
2, 0.30, 0.10, 0.30

And you want to average the columns across the two files. Here is one way to do it for the first column:

Edit: I found a better way, using pd.concat:

all_files = pd.concat([file1,file2]) # you can easily put your 1000 files here
result = {}
for i in range(3): # 3 being number of generations
    result[i] = all_files[i::3].mean()
result_df = pd.DataFrame(result)
result_df
                       0     1     2
ratio_of_player_A  0.660  0.25  0.40
ratio_of_player_B  0.665  0.65  0.25
ratio_of_player_C  0.175  0.55  0.20
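
The stride-based slicing above depends on every file having exactly three generations in the same order. As a hedged alternative, not part of the original answer, the concatenated frame can be grouped on its generation index, which also keeps one row per generation. The in-memory strings `text1` and `text2` below are stand-ins for file1 and file2.

```python
import io

import pandas as pd

names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]

# In-memory stand-ins for file1 and file2 (same contents as in the answer)
text1 = "0, 0.33, 0.33, 0.33\n1, 0.40, 0.40, 0.20\n2, 0.50, 0.40, 0.10\n"
text2 = "0, 0.99, 1, 0.02\n1, 0.10, 0.90, 0.90\n2, 0.30, 0.10, 0.30\n"
frames = [pd.read_csv(io.StringIO(t), index_col=0, names=names) for t in (text1, text2)]

# Stack all replicates, then average the rows that share a generation value
all_files = pd.concat(frames)
result_df = all_files.groupby(level=0).mean()
print(result_df)
```

Because the grouping is by generation value rather than by row position, this sketch does not need to know the number of generations in advance.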

Another way uses merge, but it requires performing several merges:

In [1]: import pandas as pd

In [2]: names = ["generation", "ratio_of_player_A", "ratio_of_player_B", "ratio_of_player_C"]
In [3]: file1 = pd.read_csv("file1", index_col=0, names=names)
In [4]: file2 = pd.read_csv("file2", index_col=0, names=names)
In [5]: file1
Out[5]: 
            ratio_of_player_A  ratio_of_player_B  ratio_of_player_C
generation                                                         
0                        0.33               0.33               0.33
1                        0.40               0.40               0.20
2                        0.50               0.40               0.10

In [6]: file2
Out[6]: 
            ratio_of_player_A  ratio_of_player_B  ratio_of_player_C
generation                                                         
0                        0.99                1.0               0.02
1                        0.10                0.9               0.90
2                        0.30                0.1               0.30

In [7]: merged_file = file1.merge(file2, right_index=True, left_index=True, suffixes=["_1","_2"])
In [8]: merged_file.filter(regex="ratio_of_player_A_*").mean(axis=1)
Out[8]: 
generation
0             0.66
1             0.25
2             0.40
dtype: float64

Or like this (a bit faster, I guess):

merged_file.iloc[:, ::3].mean(axis=1)  # player A (.ix was removed in pandas 1.0, so .iloc is used here)

If you have more than two files, you can merge them one after another before applying the mean() method.
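
One way to sketch that repeated merge is with `functools.reduce`. The helper names `merge_all` and `column_mean` are invented for this illustration, and the sketch assumes every frame is indexed by generation and uses the same column names, as file1 and file2 do above. Each frame's columns get a numeric suffix first so the names stay unique after merging.

```python
from functools import reduce

import pandas as pd


def merge_all(frames):
    """Merge frames side by side on their shared generation index (sketch)."""
    # Suffix columns with the replicate number: ratio_of_player_A_1, ..._2, ...
    renamed = [df.add_suffix(f"_{k}") for k, df in enumerate(frames, start=1)]
    return reduce(
        lambda left, right: left.merge(right, left_index=True, right_index=True),
        renamed,
    )


def column_mean(merged, column):
    """Mean across replicates of one original column, e.g. 'ratio_of_player_A'."""
    return merged.filter(regex=rf"^{column}_\d+$").mean(axis=1)
```

For example, `column_mean(merge_all([file1, file2]), "ratio_of_player_A")` should give the same per-generation means as the two-file merge above. With 1000 replicates the merged frame becomes very wide, so a concatenation-based approach may be more convenient.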

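If I have misunderstood the question, please tell us what result you expect for file1 and file2.

Ask if anything is unclear.

Hope this helps!

【Discussion】:

Thank you so much! I had been banging my head against the wall for a while until I found this. The only slightly odd thing is how the columns become rows and the rows become columns when you print the data frame.

【Solution 3】:

The following should work: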

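from numpy import genfromtxt

files = ["file1", "file2", ...]  # names of all the files to average

# Start from the first file, then add the others element by element
data = genfromtxt(files[0], delimiter=',')
for f in files[1:]:
    data += genfromtxt(f, delimiter=',')

data /= len(files)  # element-wise mean over all files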
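
As a hedged variant of the running sum above, the same steps can be wrapped in a function. The name `average_csv_files` and its `skip_header` parameter are assumptions made for this sketch; `skip_header` is simply passed through to `genfromtxt` for files that begin with a header row. All files are assumed to have the same shape.

```python
import numpy as np


def average_csv_files(paths, skip_header=0):
    """Running-sum mean of equally shaped numeric CSV files (sketch)."""
    total = None
    for path in paths:
        # autostrip drops the space that follows each comma in the question's data
        data = np.genfromtxt(
            path, delimiter=",", autostrip=True, skip_header=skip_header
        )
        total = data if total is None else total + data
    return total / len(paths)
```

The result can be written back out with `np.savetxt("average.csv", result, delimiter=",")`, where `average.csv` is only an example name. Because the generation column is identical in every file, its average is the generation number itself.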

【Discussion】:

This is probably the most efficient approach!

【Solution 4】:

Your question is not very clear, but if I understand it correctly:

> temp                  # create (or empty) the combined file
for i in *.csv; do
    cat "$i" >> temp    # append each csv file to it
done

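Then you have all the data from the different files in one big file. Try loading it into an SQLite database (1. create a table, 2. insert the data). After that you can query your data, for example: select sum(columns)/count(columns) from yourtablehavetempdata, and so on. Have a look at SQLite: since your data is tabular, SQLite seems to me a better fit.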
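
The create, insert and query steps described above could be sketched with Python's built-in `sqlite3` module. The table name `ratios`, the column names `a`, `b`, `c`, and the function name are all hypothetical. The sketch assumes the files contain only data rows (no header line) and groups by generation, so that each generation gets its own average.

```python
import csv
import sqlite3


def average_with_sqlite(paths):
    """Load all rows into an in-memory SQLite table, then average per generation."""
    conn = sqlite3.connect(":memory:")
    # 1. create the table (hypothetical table and column names)
    conn.execute("CREATE TABLE ratios (generation INTEGER, a REAL, b REAL, c REAL)")
    # 2. insert the data from every file
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle, skipinitialspace=True)
            rows = [tuple(float(cell) for cell in row) for row in reader if row]
        conn.executemany("INSERT INTO ratios VALUES (?, ?, ?, ?)", rows)
    # 3. query: AVG(x) is SUM(x) / COUNT(x) over the non-NULL values
    query = (
        "SELECT generation, AVG(a), AVG(b), AVG(c) "
        "FROM ratios GROUP BY generation ORDER BY generation"
    )
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

Each returned tuple holds a generation number followed by the three averaged ratios. Reading the files directly in Python also makes the intermediate `temp` file unnecessary.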

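【Discussion】:

This is not what the OP asked for.
"So I would like a single file that contains the average ratio after each generation (averaged over the replicates, i.e. the files)." I wrote this answer with that line in mind. If it did not help, sorry for wasting your time :(
No need to apologize! I did not mean to be harsh. I just think that adding an SQLite database in order to do a Python calculation on csv files is a bit overkill. :)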