合并大量netCDF文件答案

【问题标题】：Combining a large amount of netCDF files合并大量netCDF文件
【发布时间】：2015-05-11 06:30:25
【问题描述】：

我有一个包含 netCDF (.nc) 文件的大文件夹，每个文件的名称相似。数据文件包含时间、经度、纬度和月降水量的变量。目标是获得每个月 X 年的平均月降水量。因此，最后我将有 12 个值代表每个纬度和经度 X 年的平均月降水量。多年来，每个文件都位于同一位置。每个文件都以相同的名称开头并以“date.sub.nc”结尾，例如：

'data1.somthing.somthing1.avg_2d_Ind_Nx.200109.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.200509.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201104.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201004.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201003.SUB.nc'
'data2.somthing.somthing1.avg_2d_Ind_Nx.201103.SUB.nc'
'data1.somthing.somthing1.avg_2d_Ind_Nx.201203.SUB.nc'

结尾是 YearMonth.SUB.nc 到目前为止我所拥有的是：

array=[]
f = nc.MFDataset('data*.nc')
precp = f.variables['prectot']
time = f.variables['time']
array = f.variables['time','longitude','latitude','prectot']

我得到一个 KeyError: ('time', 'longitude', 'latitude', 'prectot')。有没有办法组合所有这些数据，以便我能够操纵它？

【问题讨论】：

“组合”数据是什么意思？感谢您的f = nc.MFDataset... 行，它已经全部在一个 MFDataset 对象中。换句话说，f.variables['prectot'][:] 数组已经是一个 3D 数组，其维度（时间、纬度、经度）包含每个维度（时间、纬度、经度）的 prectot 值。
另外，回复：您的 KeyError，f.variables 是一个字典，对于任何字典，您一次只能访问它的一个值，即 f.variables['time'] 或 f.variables['longitude']，而不是 @987654329 @。但正如我之前的评论所说，无论如何你只需要f.variables['prectot']（只要我理解正确）。
我明白了，我不确定 MFDataset 实际上做了什么。我尝试了 glob.glob 函数，但这只是列出了我所有的文件。谢谢。

标签： python netcdf cdo-climate

【解决方案1】：

NCO 这样做

ncra *.01.SUB.nc pcp_avg_01.nc
ncra *.02.SUB.nc pcp_avg_02.nc
...
ncra *.12.SUB.nc pcp_avg_12.nc
ncrcat pcp_avg_??.nc pcp_avg.nc

当然，前十二个命令可以使用 Bash 循环来完成，从而将总行数减少到少于五行。如果你更喜欢用 python 编写脚本，你可以用这个来检查你的答案。 ncra 文档here。

【讨论】：

我正在使用 Enthought Canopy for python，我找不到或下载包 NCO 来获取 ncra 功能。不过我想用它。
我想我已经安装了 NCO。当我运行 ncra *01.SUB.nc pcp_avg_01.nc 时，它输出 {SyntaxError: invalid syntax ncra *01.SUB.nc pcp_avg_01.nc} 插入符号指向 SUB.nc 中的 S 不确定如何解决这个问题。
您在 Windows 上运行，因此 shell globbing 不起作用。将 *01.SUB.nc 替换为要输入的实际文件列表。该手册可能会有所帮助。见nco.sourceforge.net/nco.html#Specifying-Input-Files

【解决方案2】：

正如@CharlieZender 提到的，ncra 是这里的方法，我将提供一些有关将该函数集成到 Python 脚本中的更多详细信息。（PS - 您可以使用 Homebrew 轻松安装 NCO，例如 http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/）

import subprocess
import netCDF4
import glob
import numpy as np

for month in range(1,13):
    # Gather all the files for this month
    month_files = glob.glob('/path/to/files/*{0:0>2d}.SUB.nc'.format(month))


    # Using NCO functions ---------------
    avg_file = './precip_avg_{0:0>2d}.nc'.format(month)

    # Concatenate the files using ncrcat
    subprocess.call(['ncrcat'] + month_files + ['-O', avg_file])

    # Take the time (record) average using ncra 
    subprocess.call(['ncra', avg_file, '-O', avg_file])

    # Read in the monthly precip climatology file and do whatever now
    ncfile = netCDF4.Dataset(avg_file, 'r')
    pr = ncfile.variables['prectot'][:,:,:]
    ....

    # Using only Python -------------
    # Initialize an array to store monthly-mean precip for all years
    # let's presume we know the lat and lon dimensions (nlat, nlon)
    nyears = len(month_files)
    pr_arr = np.zeros([nyears,nlat,nlon], dtype='f4')

    # Populate pr_arr with each file's monthly-mean precip
    for idx, filename in enumerate(month_files):
        ncfile = netCDF4.Dataset(filename, 'r')
        pr = ncfile.variable['prectot'][:,:,:]  
        pr_arr[idx,:,:] = np.mean(pr, axis=0)
        ncfile.close()

    # Take the average along all years for a monthly climatology
    pr_clim = np.mean(pr_arr, axis=0)  # 2D now [lat,lon]

【讨论】：

你可以使用纯 Python 来达到同样的效果，NCO 只是极大地帮助压缩了时间/内存需求。我将编辑我的答案，向您展示如何仅使用 Python。
两件事：一，for循环“for month in range(1,13):”只取12月份的文件。将范围从 1,13 更改为 1,12 时，只需要 11 月的月份文件。第一个上面的另一个for循环会解决这个问题吗？其次，一旦我进入“填充 pr_arr”for 循环，它就不喜欢 pr_arr[idx,:,:] =np.mean(pr,axis=0) 说“ValueError：无法从形状广播输入数组（5 ,5) 变成形状 (0,0)"。另一个问题是“{0:0>2d}”实际上做了什么？
奇怪的是 for 循环不起作用，它确实在我这边。它应该首先挑选出所有 1 月 (month=1) 的文件，然后转到下个月。尝试在 for 循环之后打印信息以帮助诊断问题。 prectot 的开头是什么形状？确保将nlat 和nlon 的尺寸大小插入到pr_arr 的初始化中。 '{0:0>2d} 用前导 0 填充整数，例如1 变成 01，这是文件名的结构。查看有关字符串格式的更多信息以获取更多详细信息。
我修复了循环的问题。我添加了“追加”功能，所以现在看起来像这样。 month_files.append(glob.glob(*{0:0>2d}.SUB.nc'.format(month)))。我在打印时看到它会打印所有文件月 1 到 12，但似乎只将最后一组（12 月 12 日）输入到 month_files。当您说假设 lat 和 lon 是否意味着我需要在我的 .nc 文件中创建一个具有相同 lat 和 lon 的新数组？ prectot 的形状为 (396L, 5L, 5L)。
您只需要在nlat 和nlon 在定义pr_arr 的行期间插入5。顺便说一句，如果文件是按月存储的，那么prectot 的形状应该是（1 x lat x lon）......我不确定 396 的来源。在这种情况下，您不需要像我上面显示的那样计算每月平均值。只需使用每个月文件中的月平均沉淀填充pr_arr[idx,:,;]，即pr_arr[idx,:,:] = pr[0,:,:]

【解决方案3】：

ymonmean 命令计算 CDO 中日历月的平均值。因此，该任务可以分两行完成：

cdo mergetime data*.SUB.nc  merged.nc  # put files together into one series
cdo ymonmean merged.nc annual_cycle.nc # mean of all Jan,Feb etc.

cdo 还可以计算其他统计数据的年周期，ymonstd、ymonmax 等……时间单位可以是天或五天，也可以是月。（例如 ydaymean）。

【讨论】：