【Question title】: pandas numpy: setting an array element with a sequence during a math operation
【Posted】: 2021-10-27 11:52:46
【Question description】:

I have a DataFrame named df4; you can build it with the following code:

import numpy as np
import pandas as pd
from io import StringIO

df4s = """
contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7

"""

# Raw string for the regex separator, so '\s' is not treated as a string escape.
df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"RB": int, "BeginDate": int, "EndDate": int, 'ValIssueDate': int, 'Valindex0': int})

The output will be:

contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7

I am trying to build a new column with the following logic; its values should be based on the values of 2 existing columns:

def test(RB):
    n=1
    for i in np.arange(RB,50):
        n = n * df4[str(i)].values
    return  n


vfunc=np.vectorize(test)
df4['n']=vfunc(df4['RB'].values)

Then I get this error:

    res = array(outputs, copy=False, subok=True, dtype=otypes[0])

ValueError: setting an array element with a sequence.

【Question comments】:

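  • df4[str(i)].values is an array, so the n you return (assuming RB is low enough for the loop to run at all) is an array, e.g. [6 6 6 6 6 6 6 6 6]. vectorize is trying to assign that back into a 1-D array. Are you trying to create a 2-D array here?
  • Yes, I think so. Thanks for the reply.
  • @HenryEcker, my answer shows that the error occurs inside vectorize, not in the assignment to the DataFrame column.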

Tags: python pandas dataframe numpy numpy-ndarray


【Solution 1】:

Reconstructing your DataFrame (thanks for using the StringIO approach):

In [82]: df4['RB'].values
Out[82]: array([46, 47, 47, 48, 48, 48, 50, 50, 50])
In [83]: test(46)
Out[83]: array([42, 42, 42, 42, 42, 42, 42, 42, 42])
In [84]: test(50)
Out[84]: 1
In [85]: [test(i) for i in df4['RB'].values]
Out[85]: 
[array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
 array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
 array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 1,
 1,
 1]
In [86]: vfunc=np.vectorize(test)
In [87]: vfunc(df4['RB'].values)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<ipython-input-87-8db8cd5dc5ab>", line 1, in <module>
    vfunc(df4['RB'].values)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
    res = asanyarray(outputs, dtype=otypes[0])
ValueError: setting an array element with a sequence.

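Note the full traceback. vectorize cannot build a return array from this mix of differently sized results: six arrays of length 9 and three scalars. Based on a trial calculation on the first input, it guessed that it should return an int dtype, and the array-valued results cannot be stored in the elements of an int array.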
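The same failure can be reproduced without the DataFrame. The sketch below is only an illustration: `ragged` is a hypothetical stand-in for `test`, returning an array for some inputs and a bare scalar for others. It records the dtype that a trial call on the first input would suggest, then shows that the call fails.

```python
import numpy as np

def ragged(k):
    # Hypothetical stand-in for test(): an array for k > 0, a bare scalar for k == 0.
    return np.full(3, k) if k > 0 else 1

# The dtype suggested by a trial call on the first input (an integer dtype).
trial_dtype = np.asarray(ragged(2)).dtype

vf = np.vectorize(ragged)  # no otypes given, so the dtype is guessed
try:
    vf(np.array([2, 0]))
    error_message = None
except ValueError as exc:
    # The array-valued result does not fit into one element of an int array.
    error_message = str(exc)

print(trial_dtype, "|", error_message)
```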

If we tell it to return an object dtype array, we get:

In [88]: vfunc=np.vectorize(test, otypes=['object'])
In [89]: vfunc(df4['RB'].values)
Out[89]: 
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
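In the comments the asker said a 2-D array was probably what they wanted. One possible way to get there from an object array like this one, sketched here under the assumption that every scalar result should simply be repeated across the row, is to broadcast each entry to a common width and stack them. The `results` array is a shortened, hand-built stand-in for `Out[89]`.

```python
import numpy as np

# Object array shaped like Out[89], shortened to three entries of width 3.
results = np.empty(3, dtype=object)
results[0] = np.array([42, 42, 42])
results[1] = np.array([7, 7, 7])
results[2] = 1  # the scalar returned when the loop body never runs

width = 3  # length of each array result (9 in the question's data)

# Broadcast scalars and arrays alike to shape (width,), then stack into 2-D.
stacked = np.stack([np.broadcast_to(r, (width,)) for r in results])
print(stacked.shape)
```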

We can assign that to a df column:

In [90]: df4['n']=_
In [91]: df4
Out[91]: 
   contract  RB  BeginDate  ...  49  50                                     n
2    A00118  46   19850100  ...   7   7  [42, 42, 42, 42, 42, 42, 42, 42, 42]
3    A00118  47   19000100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
5    A00118  47   19850100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
6    A00253  48   19000100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
7    A00253  48   19820100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
8    A00253  48   19850100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
9    A00253  50   19000100  ...   7   7                                     1
10   A00253  50   19790100  ...   7   7                                     1
11   A00253  50   19850100  ...   7   7                                     1

We can also assign the Out[85] list:

df4['n']=Out[85]
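Both assignments store a whole array in each cell. If the goal was instead one number per row, namely the product of that row's own cells in columns RB through 49, a row-wise `apply` is one way to write it. This reading of the question is an assumption, not something the asker confirmed, and the small frame below is a hand-made stand-in for df4 that reuses its column names and values.

```python
import pandas as pd

# Stand-in for df4: only the columns the calculation touches.
df = pd.DataFrame({
    "RB": [46, 47, 48, 50],
    "46": [2, 2, 2, 2],
    "47": [3, 3, 3, 3],
    "48": [1, 1, 1, 1],
    "49": [7, 7, 7, 7],
    "50": [7, 7, 7, 7],
})

def row_product(row):
    # Multiply this row's own cells for columns RB, RB+1, ..., 49.
    n = 1
    for i in range(int(row["RB"]), 50):
        n *= row[str(i)]
    return n

# axis=1 passes one row at a time, so each result is a scalar.
df["n"] = df.apply(row_product, axis=1)
print(df["n"].tolist())
```

Because each call sees a single row, the function returns a scalar and the mixed-size problem never arises.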

The timings are about the same:

In [94]: timeit vfunc(df4['RB'].values)
211 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: timeit [test(i) for i in df4['RB'].values]
217 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

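Usually vectorize is the slower option, but test itself is probably slow enough that the iteration method makes little difference here. Remember (reread the docs if necessary) that vectorize is not a performance tool. It does not "compile" your function or make it run faster.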
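A quick way to see this is to count how often the wrapped function actually runs. In the sketch below, `square` and the `calls` list are illustrative names only; `otypes` is given explicitly so that no extra trial call is made.

```python
import numpy as np

calls = []  # one entry is appended for every Python-level call

def square(x):
    calls.append(x)
    return x * x

data = np.arange(5)
vsquare = np.vectorize(square, otypes=[int])
out = vsquare(data)

# The function ran once per element, exactly as a plain loop would.
print(len(calls), out)
```

The per-element Python call is where the time goes, which is why the list comprehension and vectorize timings above are so close.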

An alternative that returns an object dtype array:

In [96]: vfunc=np.frompyfunc(test,1,1)
In [97]: vfunc(df4['RB'].values)
Out[97]: 
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
In [98]: timeit vfunc(df4['RB'].values)
202 µs ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
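The two integers passed to `np.frompyfunc` are the number of inputs and the number of outputs of the wrapped function. A small sketch with a hypothetical two-argument function, `add_pair`, shows that the result is always object dtype and has to be converted explicitly if a numeric dtype is wanted:

```python
import numpy as np

def add_pair(a, b):
    # Plain Python function of two scalars.
    return a + b

# frompyfunc(func, nin, nout): here 2 inputs and 1 output.
uadd = np.frompyfunc(add_pair, 2, 1)

summed = uadd(np.array([1, 2, 3]), 10)  # the scalar 10 is broadcast
as_int = summed.astype(int)             # explicit conversion from object dtype

print(summed.dtype, as_int.dtype)
```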

【Comments】:
