[Question Title]: Split array vertically, add rows of data, sort, and then apply regression for rainfall data
[Posted]: 2013-08-17 12:51:21
[Question]:

I have meteorological data from many stations, aggregated into a single array; in the example below the columns are Station, Year, Month and Rainfall (the rows of nan nan nan nan appear because the .csv file has string headers separating each station's dataset).

The full dataset I'm working with covers about 111 stations, each with 40+ years of rainfall data; this is just a subset I'm experimenting with -

 [[             nan              nan              nan              nan]
 [  1.47130000e+04   1.96800000e+03   1.00000000e+00   2.79000000e+01]
 [  1.47130000e+04   1.96800000e+03   2.00000000e+00   1.30700000e+02]
 [  1.47130000e+04   1.96800000e+03   3.00000000e+00   8.49000000e+01]
 [  1.47130000e+04   1.96800000e+03   4.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   1.96800000e+03   5.00000000e+00   2.41000000e+01]
 [  1.47130000e+04   1.96800000e+03   6.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   1.96800000e+03   7.00000000e+00   3.45000000e+01]
 [  1.47130000e+04   2.00900000e+03   3.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   4.00000000e+00   5.65000000e+01]
 [  1.47130000e+04   2.00900000e+03   5.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   6.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   7.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   8.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   9.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   1.00000000e+01   0.00000000e+00]
 [  1.47130000e+04   2.00900000e+03   1.10000000e+01   6.20000000e+00]
 [  1.47130000e+04   2.01000000e+03   1.00000000e+00   2.33300000e+02]
 [  1.47130000e+04   2.01000000e+03   2.00000000e+00   8.71000000e+01]
 [  1.47130000e+04   2.01000000e+03   3.00000000e+00   4.08000000e+01]
 [  1.47130000e+04   2.01000000e+03   4.00000000e+00   9.62000000e+01]
 [  1.47130000e+04   2.01000000e+03   5.00000000e+00   2.21000000e+01]
 [  1.47130000e+04   2.01000000e+03   6.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.01000000e+03   7.00000000e+00   2.20000000e+00]
 [  1.47130000e+04   2.01000000e+03   8.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.01000000e+03   9.00000000e+00   0.00000000e+00]
 [  1.47130000e+04   2.01000000e+03   1.00000000e+01   8.60000000e+00]
 [  1.47130000e+04   2.01000000e+03   1.10000000e+01   1.63000000e+01]
 [  1.47130000e+04   2.01100000e+03   1.00000000e+00   1.10800000e+02]
 [  1.47130000e+04   2.01100000e+03   2.00000000e+00   6.76000000e+01]
 [  1.47130000e+04   2.01100000e+03   3.00000000e+00   1.98000000e+02]
 [  1.47130000e+04   2.01100000e+03   6.00000000e+00   4.10000000e+00]
 [  1.47130000e+04   2.01100000e+03   1.00000000e+01   2.52000000e+01]
 [  1.47130000e+04   2.01100000e+03   1.10000000e+01   4.17000000e+01]
 [  1.47130000e+04   2.01200000e+03   1.00000000e+00   2.13600000e+02]
 [  1.47130000e+04   2.01200000e+03   2.00000000e+00   7.44000000e+01]
 [  1.47130000e+04   2.01200000e+03   3.00000000e+00   9.14000000e+01]
 [  1.47130000e+04   2.01200000e+03   4.00000000e+00   1.70000000e+01]
 [  1.47130000e+04   2.01200000e+03   5.00000000e+00   1.56000000e+01]
 [  1.47130000e+04   2.01200000e+03   7.00000000e+00   4.20000000e+00]
 [  1.47130000e+04   2.01200000e+03   1.00000000e+01   3.40000000e+00]
 [  1.47130000e+04   2.01200000e+03   1.10000000e+01   7.70000000e+00]
 [             nan              nan              nan              nan]
 [  1.47320000e+04   2.00000000e+03   9.00000000e+00   0.00000000e+00]
 [  1.47320000e+04   2.00000000e+03   1.00000000e+01   8.34000000e+01]
 [  1.47320000e+04   2.00000000e+03   1.10000000e+01   1.17000000e+02]
 [  1.47320000e+04   2.00000000e+03   1.20000000e+01   4.90800000e+02]
 [  1.47320000e+04   2.00100000e+03   1.00000000e+00   1.64200000e+02]
 [  1.47320000e+04   2.00100000e+03   2.00000000e+00   6.51600000e+02]
 [  1.47320000e+04   2.00100000e+03   3.00000000e+00   1.36800000e+02]
 [  1.47320000e+04   2.00100000e+03   4.00000000e+00   1.64400000e+02]
 [  1.47320000e+04   2.01000000e+03   9.00000000e+00   0.00000000e+00]
 [  1.47320000e+04   2.01100000e+03   1.00000000e+00   1.82400000e+02]
 [  1.47320000e+04   2.01100000e+03   2.00000000e+00   3.81000000e+02]
 [  1.47320000e+04   2.01100000e+03   3.00000000e+00   4.50800000e+02]
 [  1.47320000e+04   2.01100000e+03   4.00000000e+00   3.12800000e+02]
 [  1.47320000e+04   2.01100000e+03   5.00000000e+00   0.00000000e+00]
 [  1.47320000e+04   2.01100000e+03   6.00000000e+00   0.00000000e+00]
 [  1.47320000e+04   2.01100000e+03   7.00000000e+00   1.60000000e+00]
 [             nan              nan              nan              nan]
 [  1.55030000e+04   1.96600000e+03   1.00000000e+00   6.47000000e+01]
 [  1.55030000e+04   1.96600000e+03   2.00000000e+00   1.14000000e+01]
 [  1.55030000e+04   1.96600000e+03   3.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   1.96600000e+03   4.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   1.96600000e+03   5.00000000e+00   2.80000000e+00]
 [  1.55030000e+04   1.96600000e+03   6.00000000e+00   3.47000000e+01]
 [  1.55030000e+04   1.96600000e+03   7.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01100000e+03   2.00000000e+00   1.40500000e+02]
 [  1.55030000e+04   2.01100000e+03   3.00000000e+00   1.13700000e+02]
 [  1.55030000e+04   2.01100000e+03   4.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01100000e+03   5.00000000e+00   4.00000000e-01]
 [  1.55030000e+04   2.01100000e+03   6.00000000e+00   8.60000000e+00]
 [  1.55030000e+04   2.01100000e+03   7.00000000e+00   2.20000000e+00]
 [  1.55030000e+04   2.01100000e+03   8.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01100000e+03   9.00000000e+00   4.50000000e+00]
 [  1.55030000e+04   2.01100000e+03   1.00000000e+01   2.00000000e+00]
 [  1.55030000e+04   2.01100000e+03   1.10000000e+01   2.80000000e+01]
 [  1.55030000e+04   2.01100000e+03   1.20000000e+01   4.18000000e+01]
 [  1.55030000e+04   2.01200000e+03   1.00000000e+00   4.82000000e+01]
 [  1.55030000e+04   2.01200000e+03   2.00000000e+00   5.62000000e+01]
 [  1.55030000e+04   2.01200000e+03   3.00000000e+00   1.45600000e+02]
 [  1.55030000e+04   2.01200000e+03   4.00000000e+00   1.62000000e+01]
 [  1.55030000e+04   2.01200000e+03   5.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01200000e+03   6.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01200000e+03   7.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01200000e+03   8.00000000e+00   0.00000000e+00]
 [  1.55030000e+04   2.01200000e+03   9.00000000e+00   2.60000000e+00]
 [  1.55030000e+04   2.01200000e+03   1.10000000e+01   2.52000000e+01]
 [  1.55030000e+04   2.01200000e+03   1.20000000e+01   1.09900000e+02]]
  1. I need to split the data by station (the first column). I think I can do this with B = np.array_split(alldata, np.where(SN == 0)[0]) (where SN is the first column of data, and the 0s in the Station column were substituted beforehand in Excel). However, this causes the 0 nan nan nan rows to be included in each split array. I have also tried for key, items in groupby(alldata, itemgetter(0)): print(key), but I'm not sure how to manipulate the split datasets any further with the groupby() function.

  2. After splitting the data into separate arrays, I need to insert rows for the missing months and years, leaving the corresponding cell in the rainfall column blank in each new row. I understand how to join two lists, but I'm not sure how to insert data based on missing values in a sequence and apply that across the whole array. For example, for the data above, I would want every array to have year and month columns running from 1966 to 1981, so that I can relate the different months to one another.

  3. Once all the data are of equal length, I want to run regressions between the rainfall data from the different sites. For example, treating the first data block as the "station of interest", I want the r^2 values for the correlation between its rainfall and that of every other station in the dataset. Finally, I'd also like to convert the rainfall data to percentiles of each station's full rainfall record, and then correlate all the percentile-form rainfall values against the "station of interest".
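One way to sketch step 3 (the percentile transform followed by a correlation against the "station of interest"): the two series and all values below are made up, and the stations are assumed to be already aligned month-by-month as described in step 2.

```python
import numpy as np
from scipy import stats

# Hypothetical rainfall series for the "station of interest" and one other
# station, aligned month-by-month (nan marks a missing record)
rain_a = np.array([27.9, 130.7, 84.9, 0.0, 24.1, np.nan, 34.5])
rain_b = np.array([64.7, 11.4, 0.0, np.nan, 2.8, 34.7, 0.0])

# Keep only the months where both stations have data
valid = ~np.isnan(rain_a) & ~np.isnan(rain_b)

# Convert each series to percentile ranks of its own record
pct_a = stats.rankdata(rain_a[valid]) / valid.sum() * 100
pct_b = stats.rankdata(rain_b[valid]) / valid.sum() * 100

# r^2 for the correlation between the two percentile series
slope, intercept, r, p, stderr = stats.linregress(pct_a, pct_b)
print(r ** 2)
```

Ranking each station against its own record (rather than pooling all stations) matches the "percentiles of that station's rainfall" wording above.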

I'm not sure if this makes sense - please let me know what I should add to the question (this is my first question).
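To finish the groupby() idea from point 1: groupby() yields (key, rows) pairs, and each group of rows can be collected into its own array. A minimal sketch with made-up rows; note that groupby() only groups consecutive rows, so the array must already be ordered by station.

```python
import numpy as np
from itertools import groupby
from operator import itemgetter

# Hypothetical combined array: columns are Station, Year, Month, Rainfall
alldata = np.array([[14713, 1968, 1, 27.9],
                    [14713, 1968, 2, 130.7],
                    [14732, 2000, 9, 0.0],
                    [14732, 2000, 10, 83.4]])

# groupby yields (station_id, rows) pairs; collect each group into an array
dataSets = [np.array(list(rows))
            for key, rows in groupby(alldata, key=itemgetter(0))]

print([ds[0, 0] for ds in dataSets])
```

This avoids the separator-row problem entirely, since there are no nan rows to carry along into the split arrays.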

Updated code, based on comments and suggestions from user Brionius:

The code applies the regression fine. I'm less sure about applying the correlation with a mask, though.

I tried correlating without a mask using slope, intercept, r_value, p_value, std_err = stats.linregress(rotated[0][0],rotated[i][0]), but it returned nan for all the r_values, presumably because of the nan values in the dataset.
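A simple way around those nan-driven results, before reaching for masked arrays, is to drop every month that is missing from either series and run stats.linregress on what remains. A sketch with hypothetical values:

```python
import numpy as np
from scipy import stats

# Two aligned monthly rainfall series with gaps (hypothetical values);
# stats.linregress returns nan if either input contains a nan
x = np.array([27.9, np.nan, 84.9, 0.0, 24.1, 34.5])
y = np.array([48.2, 56.2, np.nan, 16.2, 0.0, 2.6])

# Keep only the months present in both series
ok = ~np.isnan(x) & ~np.isnan(y)
slope, intercept, r_value, p_value, std_err = stats.linregress(x[ok], y[ok])
print(r_value ** 2)
```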

import numpy as np

# Import data:
alldata = np.genfromtxt("combinedWdata.csv", delimiter=",")

# Split the array where the 'nan' separator rows are
dataSets = list(filter(lambda x: len(x) > 0,
                np.array_split(alldata, np.where(np.isnan(alldata[:,1]))[0])))

# Delete the separator rows containing 'nan'
dataSets = [np.delete(dataSet, np.where(np.isnan(dataSet).any(axis=1))[0], axis=0)
            for dataSet in dataSets]

# Assign variables to years and months
startYear = 1877
endYear = 2013
startMonth = 1
endMonth = 12
blank_rainfall_value = np.nan

# Insert rows of the form [station, year, month, nan] for all relevant
# years/months except where there is already a row for that year/month
extendedDataSets = []
for dataSet in dataSets:
    missingMonths = [[dataSet[0][0], year, month, blank_rainfall_value] \
                     for year in range(startYear, endYear+1)\
                     for month in range(startMonth, endMonth+1) \
                     if [year, month] not in dataSet[:,1:3].tolist()]
    if len(missingMonths) > 0:
        extendedDataSets.append(np.vstack((dataSet, missingMonths)))

# Sort each array by year, then month
finalDataSets = [np.array(sorted(dataSet, key=lambda row:row[1]+row[2]/12.0))\
                 for dataSet in extendedDataSets]

# Rotate each dataset so columns become rows; after np.rot90 the first
# row of each rotated dataset is the rainfall column
rotated = np.array([np.rot90(dataSet) for dataSet in finalDataSets])

# Keep only the rainfall row of each station (drop station, year and month)
rotatedDel = np.array([[dataSet[0]] for dataSet in rotated])

# Correlate the first station against every station, masking nan values;
# np.ma.corrcoef returns a 2x2 matrix, so take the off-diagonal entry
r_values = []
for m in range(len(rotatedDel)):
    x = np.ma.masked_invalid(rotatedDel[0][0])
    y = np.ma.masked_invalid(rotatedDel[m][0])
    r_value = np.ma.corrcoef(x, y, allow_masked=True)[0, 1]
    r_values.append(r_value)
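As a cross-check on the loop above, the full set of pairwise correlations can also be obtained from a single np.ma.corrcoef call by stacking the per-station rainfall rows and masking once (values here are hypothetical; by default corrcoef treats rows as variables):

```python
import numpy as np

# Hypothetical rainfall matrix: one row per station, one column per month
rain = np.array([[27.9, 130.7, np.nan, 0.0, 24.1],
                 [64.7, 11.4, 0.0, np.nan, 2.8],
                 [48.2, np.nan, 145.6, 16.2, 0.0]])

masked = np.ma.masked_invalid(rain)
corr = np.ma.corrcoef(masked)

# First row holds each station's correlation with the station of interest
print(corr[0])
```

Note that with masked entries the correlations are computed from the pairwise-complete observations of each station pair, so different entries of the matrix can be based on different months.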

[Comments]:

  • Is this an (intended) Python array or just a plain string - [1.47130000e+04 1.96800000e+03 1.00000000e+00 2.79000000e+01]?
  • Do you realize that the middle dataset starts in 1966, while you said you want the year range to be 1968-1981?
  • @GLES I'm pretty sure I intended the data to be a Python array (I imported it from excel using np.genfromtxt). Still getting used to Python terminology, but I believe I want an array for all the data manipulation I need?
  • @Brionius Thanks for pointing that out; I've updated the question

Tags: python arrays python-3.x numpy regression


[Solution 1]:

Here's a solution to your questions (1) and (2). I can't help with the regression without more information about what you have in mind.

# Split array based on where the 'nan' values are
dataSets = list(filter(lambda x: len(x) > 0, np.array_split(alldata, np.where(np.isnan(alldata[:,1]))[0])))

# Delete the rows with 'nan' in it.
dataSets = [np.delete(dataSet, np.where(np.isnan(dataSet).any(axis=1))[0], axis=0) for dataSet in dataSets]

startYear = 1966
endYear = 1981
startMonth = 1
endMonth = 12
blank_rainfall_value = np.nan

# Insert rows of the form [station, year, month, nan] for all relevant years/months except where there is already a row for that year/month
extendedDataSets = []
for dataSet in dataSets:
    missingMonths = [[dataSet[0][0], year, month, blank_rainfall_value] for year in range(startYear, endYear+1) for month in range(startMonth, endMonth+1) if [year, month] not in dataSet[:,1:3].tolist()]
    if len(missingMonths) > 0:
        extendedDataSets.append(np.vstack((dataSet, missingMonths)))

# Sort each array by year, then month
finalDataSets = [np.array(sorted(dataSet, key=lambda row:row[1]+row[2]/12.0)) for dataSet in extendedDataSets]

for dataSet in finalDataSets:
    print(dataSet)
    print()

[Discussion]:

  • Thanks Brionius. It works for the dataset I posted in the question. However, when I apply it to my full dataset (i.e. 111 rainfall stations covering the years 1896 to 2012) it doesn't work, and gives the error "ValueError: all the input array dimensions except for the concatenation axis must match exactly".
  • I've experimented with adding and removing data in the original dataset, and the error only seems to appear when the first station (i.e. the first [nan,nan,nan,nan] row) contains rows with 12 in the month column. If I import the data with a [nan,nan,nan,nan] row above the first data block (without any 12th months anywhere in the dataset), I get this error instead: "IndexError: too many indices"
  • @Oregano: OK, I updated the solution above (added a filter to the first statement to remove zero-length datasets) to fix the IndexError problem, but I can't reproduce the ValueError problem - I tried adding a 12th-month row at the top. Can you post some data that causes the error?
  • I've added the data that gives me the ValueError problem to the question
  • I get the error with the new data subset. However, if I remove the rows containing the 12th-month data for the first and last stations (the 12th-month values in the middle dataset seem to be fine)