How to get a Pandas DataFrame to work with a user-defined numpy and scipy function?

【Question title】: How to get Pandas dataframe to work with numpy and scipy user-defined function?
【Posted】: 2018-12-02 22:32:38
【Question】:

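Any help would be greatly appreciated: I get an error when I use Pandas with the function shown further down.

Here is the example setup I am trying to use: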

import pandas as pd

example_data = {'age': [37,37,27,22,32,22,42,32,37,22], 'target': [0,0,2,0,0,0,0,0,2,0]}
example_df = pd.DataFrame(data=example_data)
example_df

I called the rdc function (named ldc in my session) as follows:

ldc(x=example_data['age'],y=example_data['target'])

However, I ran into this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-248-c4a80b9f6e55> in <module>()
----> 1 ldc(x=example_data['age'],y=example_data['target'])

<ipython-input-231-78a770305f24> in ldc(x, y, f, k, s, n)
     29         return np.median(values)
     30 
---> 31     if len(x.shape) == 1: x = x.values.reshape((-1, 1))
     32     if len(y.shape) == 1: y = y.values.reshape((-1, 1))
     33 

AttributeError: 'list' object has no attribute 'shape'

Here is the function itself:

"""
Implements the Randomized Dependence Coefficient
David Lopez-Paz, Philipp Hennig, Bernhard Schoelkopf
http://papers.nips.cc/paper/5138-the-randomized-dependence-coefficient.pdf
"""
import numpy as np
from scipy.stats import rankdata

def rdc(x, y, f=np.sin, k=20, s=1/6., n=1):
    """
    Computes the Randomized Dependence Coefficient
    x,y: numpy arrays 1-D or 2-D
         If 1-D, size (samples,)
         If 2-D, size (samples, variables)
    f:   function to use for random projection
    k:   number of random projections to use
    s:   scale parameter
    n:   number of times to compute the RDC and
         return the median (for stability)
    According to the paper, the coefficient should be relatively insensitive to
    the settings of the f, k, and s parameters.
    """
    if n > 1:
        values = []
        for i in range(n):
            try:
                values.append(rdc(x, y, f, k, s, 1))
            except np.linalg.LinAlgError:
                pass
        return np.median(values)

    if len(x.shape) == 1: x = x.values.reshape((-1, 1))
    if len(y.shape) == 1: y = y.values.reshape((-1, 1))

    # Copula Transformation
    cx = np.column_stack([rankdata(xc, method='ordinal') for xc in x.T])/float(x.size)
    cy = np.column_stack([rankdata(yc, method='ordinal') for yc in y.T])/float(y.size)

    # Add a vector of ones so that w.x + b is just a dot product
    O = np.ones(cx.shape[0])
    X = np.column_stack([cx, O])
    Y = np.column_stack([cy, O])

    # Random linear projections
    Rx = (s/X.shape[1])*np.random.randn(X.shape[1], k)
    Ry = (s/Y.shape[1])*np.random.randn(Y.shape[1], k)
    X = np.dot(X, Rx)
    Y = np.dot(Y, Ry)

    # Apply non-linear function to random projections
    fX = f(X)
    fY = f(Y)

    # Compute full covariance matrix
    C = np.cov(np.hstack([fX, fY]).T)

    # Due to numerical issues, if k is too large,
    # then rank(fX) < k or rank(fY) < k, so we need
    # to find the largest k such that the eigenvalues
    # (canonical correlations) are real-valued
    k0 = k
    lb = 1
    ub = k
    while True:

        # Compute canonical correlations
        Cxx = C[:k, :k]
        Cyy = C[k0:k0+k, k0:k0+k]
        Cxy = C[:k, k0:k0+k]
        Cyx = C[k0:k0+k, :k]

        eigs = np.linalg.eigvals(np.dot(np.dot(np.linalg.inv(Cxx), Cxy),
                                        np.dot(np.linalg.inv(Cyy), Cyx)))

        # Binary search if k is too large
        if not (np.all(np.isreal(eigs)) and
                0 <= np.min(eigs) and
                np.max(eigs) <= 1):
            ub -= 1
            k = (ub + lb) / 2
            continue
        if lb == ub: break
        lb = k
        if ub == lb + 1:
            k = ub
        else:
            k = (ub + lb) / 2

    return np.sqrt(np.max(eigs))

【Comments】:

You are indexing your dictionary (whose values are lists), not your DataFrame, whose columns are Series.
As @Jondiedoop said, you probably want ldc(x=example_df['age'], y=example_df['target']).
Unfortunately, I get the following error when using the suggested call ldc(x=example_df['age'], y=example_df['target']): TypeError: slice indices must be integers or None or have an __index__ method
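
The TypeError in the last comment most likely comes from the two `k = (ub + lb) / 2` lines in the function. Under Python 3, `/` always returns a float, and a float cannot be used as a slice bound in expressions such as `C[:k, :k]`. Below is a minimal sketch of the failure and a possible fix, assuming floor division (`//`) is what the binary search intends. The matrix `C` and the bounds are stand-in values, not taken from the question.

```python
import numpy as np

C = np.arange(16.0).reshape(4, 4)  # stand-in for the covariance matrix
lb, ub = 1, 4                      # stand-in binary-search bounds

k_float = (ub + lb) / 2            # Python 3 true division gives a float (2.5)
try:
    C[:k_float, :k_float]          # float slice bounds are rejected
    slicing_failed = False
except TypeError:
    slicing_failed = True

k_int = (ub + lb) // 2             # floor division keeps k an integer (2)
block = C[:k_int, :k_int]          # slicing works, result has shape (2, 2)
```

If this is indeed the cause, changing both occurrences of `/ 2` to `// 2` in the binary search should remove the error.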

Tags: python-3.x pandas numpy scipy

【Solution 1】:

On the following line you mistakenly passed example_data (the dict) instead of example_df (the DataFrame).

ldc(x=example_data['age'],y=example_data['target'])

Change the variable name as follows:

ldc(x=example_df['age'],y=example_df['target'])
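
To see why the change matters: `example_data` is a plain dict whose values are lists, while `example_df['age']` is a pandas Series, which has the `.shape` and `.values` attributes the function relies on. The sketch below checks this. It also shows a hypothetical helper, `as_2d` (not part of the original function), that would accept lists, Series, and arrays alike.

```python
import numpy as np
import pandas as pd

example_data = {'age': [37, 37, 27, 22, 32, 22, 42, 32, 37, 22],
                'target': [0, 0, 2, 0, 0, 0, 0, 0, 2, 0]}
example_df = pd.DataFrame(data=example_data)

list_has_shape = hasattr(example_data['age'], 'shape')  # plain list: no .shape
series_has_shape = hasattr(example_df['age'], 'shape')  # pandas Series: has .shape

def as_2d(a):
    """Hypothetical helper: coerce a list, Series, or ndarray
    to a 2-D array of shape (samples, variables)."""
    a = np.asarray(a)
    return a.reshape((-1, 1)) if a.ndim == 1 else a

x = as_2d(example_df['age'])        # Series input
y = as_2d(example_data['target'])   # list input is handled too
```

Replacing the two `x.values.reshape` lines with a helper like this would also keep the function working for plain NumPy arrays, which have no `.values` attribute.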
