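【Question title】: How to iterate over rows of each column in a dataframe
【Posted】: 2021-10-01 20:42:37
【Question】:

My current code runs and produces a graph if there is only 1 sensor, i.e. if col2 and col3 are deleted from the sample data provided below, leaving a single column.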

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)

index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3

baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = DataFrame(baseline)
baseline_num = DataFrame(baseline_num)


v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)

sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count  = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000

for i in range(0, (rows)): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.

    v[i] = dfn.loc[i].to_numpy().sum() - sensors


for colname,colitems in df.iteritems():
    for rownum,rowitem in colitems.iteritems():

        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]

        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0

        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, (off_require)):
                    baseline_num[colname][rownum] += df[colname][rownum - x]

            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established

        baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)) #mean of the last "off_require" points



dfx = DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)

dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)

Basically, I want this loop alone to iterate over every column, instead of producing just 1 graph:

for colname,colitems in df.iteritems():
    for rownum,rowitem in colitems.iteritems():

        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]

        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0

        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, (off_require)):
                    baseline_num[colname][rownum] += df[colname][rownum - x]

            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established

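I have tried solutions from other questions, but none of them seem to solve this problem. So far I have tried converting to lists, tuples and the like many times, and then indexing them like this: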

baseline_num[i,column] += d[i - x,column]

and

baseline_num[i][column] += d[i - x][column]

inside a loop using

for column in columns

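However, no matter how I arrange the solution, there is always some key error expecting integer or slice indices, among other errors. See the pictures for the expected/possible output for one column of the actual data, with different input parameters (the sensitivity value and off_require vary from case to case). One solution that did not work was the looping method from this link: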

https://www.geeksforgeeks.org/iterating-over-rows-and-columns-in-pandas-dataframe/

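I also tried creating a loop with iteritems as the outer loop. That did not work either.

Below are links to possible graph outputs for various sensitivity values and windows on my actual dataset, with only one column (i.e. I manually deleted the other columns and plotted just that one with the current program).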

sensitivity 1000, window 8

sensitivity 800, window 5

sensitivity 1500, window 5

If I have left out anything that would help solve this, please let me know so I can correct it right away.

See this picture of my original df.head: df.head

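【Comments】:

  • There is too much code and too much logic here. Please provide a minimal reproducible example, otherwise you unfortunately won't get any help.
  • OK, thank you. I will try to do that; just looking at it, I'm not sure what I can leave out to make a smaller example. This is already a "small" functional excerpt of my original working code. Still, I will try to find a way to make a small example like you said.
  • This code does not run. NameError: name 'dfone' is not defined, line 37
  • Sorry. The dfone line is the one used to run it for a single graph. The commented line below it is the one for running several at once. I have edited the code to remove the dfone line.

Tags: python pandas dataframe numpy data-science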


【Solution 1】:

Have you tried,

for colname,colitems in df.iteritems():
    for rownum,rowitem in colitems.iteritems():
        print(df[colname][rownum])

The first loop iterates over all the columns, and the second loop iterates over all the rows of that column.
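One caveat for readers on newer pandas: `iteritems` was deprecated in pandas 1.5 and removed in pandas 2.0, and `items` is the equivalent method on both DataFrame and Series. Below is a minimal sketch of the same nested loop with `items`. The small frame and the `visited` list are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Small illustrative frame; the values are placeholders, not the asker's data.
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [4.0, 5.0, 6.0]})

visited = []
for colname, colitems in df.items():          # outer loop: one Series per column
    for rownum, rowitem in colitems.items():  # inner loop: (index label, value)
        # rowitem is the same value as df.loc[rownum, colname]
        visited.append((colname, rownum, rowitem))

print(visited)
```

Here `colname` is the column label (a string), not an integer position, so anything indexed with it must be keyed by the same labels.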

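Edit:

From our conversation below, I think your baseline and df dataframes do not have the same column names, because of the way you create them and the way you access their elements.

My suggestion is that you create the baseline dataframe as a copy of your df dataframe and edit the information in it from there.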
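A minimal sketch of that suggestion, assuming the baseline should start at zero (as the question's `[0]*(rows)` initialization implies) rather than at the sensor values. If it should start from the values instead, `df.copy()` gives an independent copy. Depending on the pandas version, `pd.DataFrame(df)` may share data with `df` rather than copy it, so an explicit construction is the safer choice. The small frame is illustrative only.

```python
import pandas as pd

# Illustrative data; the real frame would hold the sensor readings.
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [4.0, 5.0, 6.0]})

# Zero-filled frames that share df's index and column names.
baseline_num = pd.DataFrame(0.0, index=df.index, columns=df.columns)
baseline = pd.DataFrame(0.0, index=df.index, columns=df.columns)

# Write with a single .loc[row, column] call rather than chained [col][row].
baseline_num.loc[1, "col2"] += df.loc[1, "col2"]

print(baseline_num)
print(df)  # unchanged, because baseline_num owns its own data
```

Writing through `.loc[row, column]` is used here because chained assignment such as `baseline[colname][rownum] = ...` is not guaranteed to modify the original frame.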

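Edit:

I have managed to get your code working in one loop, but I ran into an index error. I'm not sure what your optimization step does, but I think it is what causes the error; take a look.

It is this part, baseline_num[colname][rownum - 1]. My guess is that on the second pass, because you compute rownum (0) - 1, you end up with index -1. You need to change it so that rownum is 1 on the first pass, or something similar; I'm not sure what you are trying to do there.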
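To illustrate why -1 fails: with the default RangeIndex, a label lookup treats -1 as a label, and no such label exists, so pandas raises KeyError. Positional access with `.iloc[-1]` does not raise, but it silently wraps around to the last row, which is probably not what a "previous row" calculation wants either. The sketch below shows both behaviours plus one possible guard; the series and the running-sum loop are illustrative, not the asker's baseline logic.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])  # default RangeIndex: labels 0, 1, 2

# Label-based lookup: -1 is not one of the labels, so this raises KeyError.
try:
    s.loc[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Position-based lookup: -1 means "last element", which silently wraps around.
last_value = s.iloc[-1]

# A guard that handles rows without a predecessor avoids both problems.
running = []
for rownum in s.index:
    if rownum == 0:
        running.append(s.loc[rownum])              # no previous row yet
    else:
        running.append(running[-1] + s.loc[rownum])

print(lookup_failed, last_value, running)
```

In the posted loop, the analogous guard would be to look back only when `rownum - off_require` is still a valid label, and to reset `off_count` at the start of each column so that counts do not carry over from the previous column. Both are assumptions about the intended behaviour.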

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)

index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3

baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = pd.DataFrame(df)
baseline_num = pd.DataFrame(df)
#print(baseline_num)


v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)

sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count  = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000

for i in range(0, (rows)): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.

    v[i] = dfn.loc[i].to_numpy().sum() - sensors


for colname,colitems in df.iteritems():
    #print(colname)
    for rownum,rowitem in colitems.iteritems():
        #print(rownum)
        #display(baseline[colname][rownum])
        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]

        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0

        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, (off_require)):
                    baseline_num[colname][rownum] += df[colname][rownum - x]

            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established

        baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)) #mean of the last "off_require" points

        print(baseline[colname][rownum])


dfx = pd.DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)

dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)

My output looks like this,

-324.0
-238.0
-314.0
-276.0
-264.0
-306.0
-371.0
-806.0
638.0
-412.0

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    354                 try:
--> 355                     return self._range.index(new_key)
    356                 except ValueError as err:

ValueError: -1 is not in range


The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

3 frames

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    355                     return self._range.index(new_key)
    356                 except ValueError as err:
--> 357                     raise KeyError(key) from err
    358             raise KeyError(key)
    359         return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: -1
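As a possible alternative to the hand-written running sum ("mean of the last off_require points" in the code above), pandas can compute a windowed sum or mean for every column at once with `rolling`. This is a swapped-in technique, not the code's own method: it does not include the `off_count` / `sensitivity` gating, and it uses true division where the posted code uses floor division (`//`). The frame and the `off_require` value below are illustrative.

```python
import pandas as pd

# Illustrative data with two columns standing in for two sensors.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col2": [10.0, 20.0, 30.0, 40.0, 50.0],
})
off_require = 3  # window length; illustrative value

# Rolling sum and mean over the last off_require rows, for every column at once.
# The first off_require - 1 rows are NaN because the window is not yet full.
window_sum = df.rolling(off_require).sum()
window_mean = df.rolling(off_require).mean()

print(window_mean)
```

Because no row ever looks back past the start of the frame, this form cannot produce the negative index that triggers the KeyError above.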

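【Comments】:

  • Yes, this seems like the best solution. However, the error I get when using it is that my arrays usually expect only one index, whereas I need both a row and a column index (i.e. d[row][column]). Likewise, they expect an int value for the column, while iteritems seems to return the column name. I'm not sure why that is, since the example you provided works perfectly and shouldn't differ in this respect.
  • You should do df[column][row], I think. What exactly are you trying to do?
  • Hey @ultruction, take a look at the solution I just posted. I think it may be what you are looking for; it accesses each item using the column name followed by the index number.
  • Hey, this seems to be getting somewhere. The only error I get when running it is: KeyError: 'S1(scaa)' from the line baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)), which suggests it may expect baseline and baseline_num to have the same column names in order to work? Not sure.
  • It's fixed now. See the important comment above mentioning that I have changed baseline and baseline_num to dataframes.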