How to automate a repetitive calculation on the rows of a pandas dataframe? Answers

【Question title】: How to automate a repetitive calculation on the rows of a pandas dataframe?
【Posted】: 2020-11-14 06:29:37
【Question description】:

I have a large dataset - air pollution at 1-minute resolution for 2018-2020. A small snippet looks like this:

             datetime    date  time type  ... day minute second dayofyear
1 2017-12-19 17:08:30  171219  1708  air  ...  19      8     30       353
2 2018-01-05 15:22:30  180105  1522  air  ...   5     22     30         5
3 2018-01-05 15:23:30  180105  1523  air  ...   5     23     30         5
4 2018-01-05 15:24:30  180105  1524  air  ...   5     24     30         5
5 2018-01-05 15:25:30  180105  1525  air  ...   5     25     30         5

The datetime values are in the first two columns. There are about 50 columns in total, and the full data is here: https://drive.google.com/file/d/15yxPIoPEpQ3Gwb00nCLQMgQM5E_NHcW-/view?usp=sharing

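I am trying to create a plot with the 'd13C' column on the y-axis and the reciprocal of 'total_co2' on the x-axis, and then fit a regression line to these data. I do it like this: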

from numpy.polynomial.polynomial import polyfit
from scipy import stats

period = MyData[(MyData['year']==2019) & (MyData['month']==12) & (MyData['day']==31)]  # defining the time period I want from the data
p = period['total_co2']**-1  # defining the x axis data
q = period['d13C']  # defining the y axis data
c, m = polyfit(p, q, 1)  # creating a regression line, with y-intercept c and gradient m
slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)  # calculating some statistical properties of the regression line. I'm mainly interested in the R^2 value
print('R-squared: ', r_value**2)

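This code fits a regression line to the p, q data for all rows in the MyData dataframe that have 'year'=2019, 'month'=12, 'day'=31. I am trying to do this for the whole dataset and keep only the dates with R-squared >= 0.8. For the case above, 31 December 2019, I get R-squared = 0.554, so I want to ignore this date. At the moment I am just going through the data manually by changing the month, day and year and checking the R-squared value. This takes a while because there is so much data!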

Ultimately, my goal is to create a list, dataframe, or other collection containing all the dates with R-squared >= 0.8, like this:

  Accepted dates
0  23-11-2019
1  24-11-2019
2  29-11-2019

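Is there a way to automate this process? At the moment I am trying to write a for loop that iterates over every row in the air df and adds an if statement as a filter, but I am struggling with it. P.S. I am fairly new to Python and am learning as I go!

Any help would be greatly appreciated. Thanks.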

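【Question discussion】:

I'm pretty sure you can use apply with a UDF. Something like df.apply(function, axis=0), where function produces the R-squared. Or you could group the data by year, month and day and then run the UDF on each group, which would be easier? So you might generate x dfs for a year, then join on month/day, and then you have a wide table with a single row that contains all the data you need.
@JasonChia Hi, if I try to create a function to do this, what would the arguments be?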
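
To make the grouping idea from the comments concrete, here is a minimal sketch of one way it could look. It assumes the function receives a single day's rows as a DataFrame, which is what `groupby(...).apply` passes in. The name `daily_r_squared` and the synthetic `demo` frame are made up for illustration and are not the real measurements. The column names follow the question.

```python
import numpy as np
import pandas as pd
from scipy import stats


def daily_r_squared(group: pd.DataFrame) -> float:
    """Return R^2 of d13C regressed on 1/total_co2 for one day's rows."""
    clean = group[["total_co2", "d13C"]].dropna()
    x = 1.0 / clean["total_co2"]
    # linregress needs at least two points with distinct x values
    if x.nunique() < 2:
        return np.nan
    return stats.linregress(x, clean["d13C"]).rvalue ** 2


# Synthetic stand-in for MyData (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "year": 2019,
    "month": 12,
    "day": np.repeat([30, 31], 60),
    "total_co2": rng.uniform(400.0, 500.0, size=120),
})
# Day 30 follows an exact linear relation in 1/CO2; day 31 is pure noise
demo["d13C"] = np.where(
    demo["day"] == 30,
    -25.0 + 6000.0 / demo["total_co2"],
    rng.normal(-10.0, 1.0, size=120),
)

# One R^2 value per (year, month, day) group
r2 = demo.groupby(["year", "month", "day"])[["total_co2", "d13C"]].apply(daily_r_squared)

# Keep only the days that meet the threshold from the question
accepted = r2[r2 >= 0.8].rename("r_squared").reset_index()
print(accepted)
```

So the function's only argument is the group itself. The year, month and day never need to be passed in, because `groupby` handles the splitting.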

Tags: python pandas automation

【Solution 1】:

I have tested this code, and I believe it provides the output you are looking for:

import pandas as pd
import numpy as np
from numpy.polynomial.polynomial import polyfit
from scipy import stats

# Restricted the columns and set the dtypes to deal with memory issues when importing a large csv
MyData = pd.read_csv('.../MyData.txt', usecols=['total_co2', 'd13C', 'year', 'month', 'day', 'datetime'], dtype={'total_co2':np.float64, 'd13C':np.float64, 'year':str, 'month':str, 'day':str})

# Created a helper column that is used later to filter and report out the period
MyData['ymd'] = MyData['year'] +'-'+ MyData['month'] +'-'+ MyData['day']

# Empty list that will receive all of the periods with acceptable r-squareds
accepted_date_list = []

# for loop to filter the dataframe according to the unique periods (created with the helper column above)
for d in MyData['ymd'].unique():
    acceptable_date = {} # create a dictionary to populate and send to the list
    period = MyData[MyData.ymd == d] # filter the dataframe with the unique periods created above
    p=(period['total_co2'])**-1 
    q = period['d13C'] 
    c, m = polyfit(p,q,1) 
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)

    if r_value**2 >= 0.8: # the question asks for R-squared >= 0.8. If r2 is acceptable, populate the dictionary then send the dictionary to the list
        acceptable_date['period'] = d
        acceptable_date['r-squared'] = r_value**2
        accepted_date_list.append(acceptable_date)
   
accepted_dates = pd.DataFrame(accepted_date_list) # convert the list to a Pandas DataFrame (or whatever else you want to do with it)

print(accepted_dates)

Output:

        period  r-squared
0     2018-1-6   0.910516
1     2018-1-9   0.917216
2    2018-1-10   0.980263
3    2018-1-11   0.965971
4    2018-1-12   0.894795
5    2018-1-13   0.831683
6    2018-1-18   0.852207
7    2018-1-21   0.944162
8    2018-1-22   0.871262
9    2018-1-26   0.844020
10   2018-1-27   0.890742
11   2018-1-30   0.971747
...
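
The question asked for dates written as DD-MM-YYYY, while the `period` column above holds strings such as `2018-1-6`. A small sketch of one possible conversion follows. It assumes `period` always has the form year-month-day, and it uses a three-row stand-in frame copied from the output above rather than the full result.

```python
import pandas as pd

# Stand-in for the accepted_dates frame (first three rows of the output above)
accepted_dates = pd.DataFrame({
    "period": ["2018-1-6", "2018-1-9", "2018-1-10"],
    "r-squared": [0.910516, 0.917216, 0.980263],
})

# Split "year-month-day" into integer columns, then let pandas assemble dates
parts = accepted_dates["period"].str.split("-", expand=True).astype(int)
parts.columns = ["year", "month", "day"]
parsed = pd.to_datetime(parts)

# Format as zero-padded DD-MM-YYYY strings
accepted_dates["Accepted dates"] = parsed.dt.strftime("%d-%m-%Y")
print(accepted_dates[["Accepted dates"]])
```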

【Discussion】:

This is perfect, the output is exactly what I wanted. Thank you!