【Question title】: How to get averages per column, not per row, from a CSV file?
【Posted】: 2016-04-12 17:08:09
【Question description】:

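I have 13 columns and 303 rows. I have already split the 303 rows into healthy patients and ill patients, and I am now trying to get the average of each column in the CSV file for both groups so that I can compare and contrast them. The final output for the problem should look like the example below. The numbers in the CSV file are similar to the averages in this example, except that missing data is marked with a `?`.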

Please enter a training file name: train.csv
Total Lines Processed: 303
Total Healthy Count: 164
Total Ill Count: 139
Averages of Healthy Patients:
[52.59, 0.56, 2.79, 129.25, 242.64, 0.14, 0.84, 158.38, 0.14, 0.59, 1.41, 0.27, 3.77, 0.00]
Averages of Ill Patients:
[56.63, 0.82, 3.59, 134.57, 251.47, 0.16, 1.17, 139.26, 0.55, 1.57, 1.83, 1.13, 5.80, 2.04]
Seperation Values are:
[54.61, 0.69, 3.19, 131.91, 247.06, 0.15, 1.00, 148.82, 0.34, 1.08, 1.62, 0.70, 4.79, 1.02]
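
The separation values in the sample output appear to be the midpoint of the healthy and ill averages for each column (for example, (52.59 + 56.63) / 2 = 54.61). The question does not state this, so the sketch below treats it as an assumption and only recomputes the midpoints from the rounded averages shown above:

```python
# Averages copied from the sample output above.
healthy = [52.59, 0.56, 2.79, 129.25, 242.64, 0.14, 0.84,
           158.38, 0.14, 0.59, 1.41, 0.27, 3.77, 0.00]
ill = [56.63, 0.82, 3.59, 134.57, 251.47, 0.16, 1.17,
       139.26, 0.55, 1.57, 1.83, 1.13, 5.80, 2.04]

# Assumption: each separation value is the midpoint of the two class averages.
separation = [(h + i) / 2 for h, i in zip(healthy, ill)]
print(", ".join(format(value, ".2f") for value in separation))
```

Because the inputs here are already rounded to two decimals, the last digit of a few midpoints may differ slightly from the sample output.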

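My code still has a long way to go; I am just looking for a simple way to get the averages for the patients. My current approach only handles column 13, but I need all of the columns shown above. Any help with how I should approach this would be appreciated.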

import csv
#turn csv files into a list of lists
with open('train.csv') as csvfile:
     reader = csv.reader(csvfile, delimiter=',')
     csv_data = list(reader)

i_list = []
for row in csv_data:
    if (row and int(row[13]) > 0):
        i_list.append(int(row[13]))
H_list = []
for row in csv_data:
    if (row and int(row[13]) <= 0):
        H_list.append(int(row[13]))

Icount = len(i_list)
IPavg = sum(i_list)/len(i_list)
Hcount = len(H_list)
HPavg = sum(H_list)/len(H_list)
with open("train.csv") as file:
    numline = len(file.readlines())

print(numline)
print("Total amount of healthy patients " + str(Hcount))
print("Total amount of ill patients " + str(Icount))
print("Averages of healthy patients " + str(HPavg))
print("Averages of ill patients " + str(IPavg))

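My only idea is to repeat what I did to get the average of column 13 for the other columns, but I don't know how to separate the healthy patients from the ill patients.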

【Comments】:


Tags: python csv average


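【Solution 1】:

If you want the average of each column, it is easiest to process them all at once as you read the file, and it is not hard to do. You did not say which version of Python you are using, but the following should work in both Python 2 and Python 3 (although it could be optimized for either one).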

import csv

NUMCOLS = 13   # number of attribute columns to average
DIAG_COL = 13  # index of the diagnosis column (row[13] in the question): > 0 means ill

with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # initialize totals
    Icount = 0
    Hcount = 0
    H_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    I_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    # read and process file
    for row in reader:
        if row:  # non-blank line?
            # classify the whole row once by its diagnosis value
            if int(row[DIAG_COL]) > 0:
                Icount += 1
                totals = I_col_totals
            else:
                Hcount += 1
                totals = H_col_totals
            # update running total for each column
            for col in range(NUMCOLS):
                # assumption: a '?' (missing value) is skipped, so it counts as 0
                if row[col] != '?':
                    totals[col] += float(row[col])

# compute average of data in each column
if Hcount < 1:  # avoid dividing by zero
    HPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    HPavgs = [H_col_totals[col]/Hcount for col in range(NUMCOLS)]

if Icount < 1:  # avoid dividing by zero
    IPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    IPavgs = [I_col_totals[col]/Icount for col in range(NUMCOLS)]

print("Total number of healthy patients: {}".format(Hcount))
print("Total number of ill patients: {}".format(Icount))
print("Averages of healthy patients: " +
      ", ".join(format(HPavgs[col], ".2f") for col in range(NUMCOLS)))
print("Averages of ill patients: " +
      ", ".join(format(IPavgs[col], ".2f") for col in range(NUMCOLS)))
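
If the `?` entries should be left out of the averages entirely rather than counted as zero, one option is to split the rows first and then average each column over its valid values only. The following is a minimal standard-library sketch, not the method from the question's assignment: `DIAG_COL`, `column_averages`, `split_and_average`, and the sample rows are all made up for illustration, and it assumes the diagnosis column itself never contains `?`.

```python
import csv
import io

DIAG_COL = 13  # assumed index of the diagnosis column, as in row[13] from the question


def column_averages(rows):
    """Average each column, leaving '?' fields out of both the sum and the count."""
    averages = []
    for column in zip(*rows):  # zip(*rows) turns a list of rows into columns
        values = [float(field) for field in column if field != '?']
        averages.append(sum(values) / len(values) if values else 0.0)
    return averages


def split_and_average(lines):
    """Return (healthy_averages, ill_averages) for an iterable of CSV lines."""
    rows = [row for row in csv.reader(lines) if row]  # drop blank lines
    healthy = [row for row in rows if int(row[DIAG_COL]) <= 0]
    ill = [row for row in rows if int(row[DIAG_COL]) > 0]
    return column_averages(healthy), column_averages(ill)


# Tiny made-up sample: 13 attribute columns plus the diagnosis column.
sample = io.StringIO(
    "50,1,?,120,200,0,0,150,0,1.0,1,0,3,0\n"
    "60,0,2,130,220,0,1,160,0,2.0,1,0,3,0\n"
    "70,1,4,140,240,1,2,130,1,3.0,2,1,7,2\n"
)
healthy_avgs, ill_avgs = split_and_average(sample)
print(", ".join(format(avg, ".2f") for avg in healthy_avgs))
print(", ".join(format(avg, ".2f") for avg in ill_avgs))
```

With a real file, the open file object can be passed directly, for example `with open('train.csv') as f: split_and_average(f)`. This version reads all rows into memory, which is fine for 303 lines.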

【Comments】:

    【Solution 2】:

    Why not use the pandas module?

    It would make what you want to do much easier.

    In [42]: import pandas as pd
    
    In [43]: import numpy as np
    
    In [44]: df = pd.DataFrame(np.random.randn(10, 4))
    
    In [45]: df
    Out[45]:
              0         1         2         3
    0  1.290657 -0.376132 -0.482188  1.117486
    1 -0.620332 -0.247143  0.214548 -0.975472
    2  1.803212 -0.073028  0.224965  0.069488
    3 -0.249340  0.491075  0.083451  0.282813
    4 -0.477317  0.059482  0.867047 -0.656830
    5  0.117523  0.089099 -0.561758  0.459426
    6 -0.173780 -0.066054 -0.943881 -0.301504
    7  1.250235 -0.949350 -1.119425  1.054016
    8  1.031764 -1.470245 -0.976696  0.579424
    9  0.300025  1.141415  1.503518  1.418005
    
    In [46]: df.mean()
    Out[46]:
    0    0.427265
    1   -0.140088
    2   -0.119042
    3    0.304685
    dtype: float64
    

    You can try:

    In [47]: df = pd.read_csv('yourfile.csv')
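
To connect this to the healthy/ill split in the question, here is a hedged sketch of how the per-group column averages might look in pandas. The sample rows and the column names (`age`, `sex`, `cp`, `diag`) are hypothetical and cut down to four columns; the real file is assumed to have no header row and to use `?` for missing data.

```python
import io

import pandas as pd

# Made-up rows standing in for train.csv: three attribute columns plus the
# diagnosis column. The column names are hypothetical.
sample = io.StringIO(
    "50,1,?,0\n"
    "60,0,2,0\n"
    "70,1,4,2\n"
)
columns = ["age", "sex", "cp", "diag"]

# header=None: the first line is data; na_values="?": read '?' as missing (NaN).
df = pd.read_csv(sample, header=None, names=columns, na_values="?")

# Label each row by the diagnosis column, then average every column per group.
# mean() skips NaN values by default.
group = df["diag"].gt(0).map({False: "healthy", True: "ill"}).rename("group")
averages = df.groupby(group).mean()
print(averages.round(2))
```

For the real data, `sample` would be replaced by the file name and `columns` by 14 names of your choosing.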
    

    【Comments】:

    • I haven't learned numpy or pandas in class yet, so I don't think I can use that approach yet.