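【Posted】: 2017-08-31 13:13:15
【Question】:
I am trying to use SciPy's scipy.optimize.minimize function to minimize a function I wrote. However, the function I want to optimize is itself built from other functions that perform calculations on a pandas DataFrame.
I understand that SciPy's minimize function can accept multiple arguments through a tuple (see, for example, "Structure of inputs to scipy minimize function"). However, I do not know how to pass in a function that depends on a pandas DataFrame.
I have created a reproducible example below.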
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
#################### Data ####################
# Initialize dataframe.
data = pd.DataFrame({'id_i': ['AA', 'BB', 'CC', 'XX', 'DD'],
                     'id_j': ['ZZ', 'YY', 'XX', 'BB', 'AA'],
                     'y': [0.30, 0.60, 0.70, 0.45, 0.65],
                     'num': [1000, 2000, 1500, 1200, 1700],
                     'bar': [-4.0, -6.5, 1.0, -3.0, -5.5],
                     'mu': [-4.261140, -5.929608, 1.546283, -1.810941, -3.186412]})
data['foo_1'] = data['bar'] - 11 * norm.ppf(1/1.9)
data['foo_2'] = data['bar'] - 11 * norm.ppf(1 - (1/1.9))
# Store list of ids.
id_list = sorted(pd.unique(pd.concat([data['id_i'], data['id_j']], axis=0)))
#################### Functions ####################
# Function 1: Intermediate calculation to calculate predicted values.
def calculate_y_pred(row, delta_params, sigma_param, id_list):
    # Extract the relevant values from delta_params.
    delta_i = delta_params[id_list.index(row['id_i'])]
    delta_j = delta_params[id_list.index(row['id_j'])]
    # Calculate adjusted version of mu.
    mu_adj = row['mu'] - delta_i + delta_j
    # Calculate predicted value of y.
    y_pred = norm.cdf(row['foo_1'], loc=mu_adj, scale=sigma_param) / \
        (norm.cdf(row['foo_1'], loc=mu_adj, scale=sigma_param) +
         (1 - norm.cdf(row['foo_2'], loc=mu_adj, scale=sigma_param)))
    return y_pred
# Function to calculate the log-likelihood (for a row of DataFrame data).
def loglik_row(row, delta_params, sigma_param, id_list):
    # Calculate the log-likelihood for this row.
    y_pred = calculate_y_pred(row, delta_params, sigma_param, id_list)
    y_obs = row['y']
    n = row['num']
    loglik_row = np.log(norm.pdf(((y_obs - y_pred) * np.sqrt(n)) / np.sqrt(y_pred * (1-y_pred))) /
                        np.sqrt(y_pred * (1-y_pred) / n))
    return loglik_row
# Function to calculate the sum of the negative log-likelihood.
# This function is called via SciPy's minimize function.
def loglik_total(data, id_list, params):
    # Extract parameters.
    delta_params = list(params[0:len(id_list)])
    sigma_param = init_params[-1]
    # Calculate the negative log-likelihood for every row in data and sum the values.
    loglik_total = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )
    return loglik_total
#################### Optimize ####################
# Provide initial parameter guesses.
delta_params = [0 for id in id_list]
sigma_param = 11
init_params = tuple(delta_params + [sigma_param])
# Maximize the log likelihood (minimize the negative log likelihood).
minimize(fun=loglik_total, x0=init_params,
         args=(data, id_list), method='nelder-mead')
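This produces the following error: AttributeError: 'numpy.ndarray' object has no attribute 'apply' (the full traceback is shown below). I believe the error occurs because minimize treats X as a numpy array, whereas I want to pass it as a pandas DataFrame.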
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
AttributeError                            Traceback (most recent call last)
<ipython-input-93-9a5866bd626e> in <module>()
      1 minimize(fun=loglik_total, x0=init_params,
----> 2          args=(data, id_list), method='nelder-mead')
/Users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/_minimize.pyc in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
    436                                       callback=callback, **options)
    437     elif meth == 'nelder-mead':
--> 438         return _minimize_neldermead(fun, x0, args, callback, **options)
    439     elif meth == 'powell':
    440         return _minimize_powell(fun, x0, args, callback, **options)
/Users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in _minimize_neldermead(func, x0, args, callback, maxiter, maxfev, disp, return_all, initial_simplex, xatol, fatol, **unknown_options)
    515
    516     for k in range(N + 1):
--> 517         fsim[k] = func(sim[k])
    518
    519     ind = numpy.argsort(fsim)
/Users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in function_wrapper(*wrapper_args)
    290     def function_wrapper(*wrapper_args):
    291         ncalls[0] += 1
--> 292         return function(*(wrapper_args + args))
    293
    294     return ncalls, function_wrapper
<ipython-input-69-546e169fc54e> in loglik_total(data, id_list, params)
      6
      7     # Calculate the negative log-likelihood for every row in data and sum the values.
----> 8     loglik_total = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )
      9
     10     return loglik_total
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
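For reference, the wrapper frame in the traceback runs function(*(wrapper_args + args)), so minimize passes the current parameter vector as the first positional argument and appends the entries of args after it. The following is a minimal sketch of that calling convention only; toy_objective, toy_data and seen_types are hypothetical names that do not appear in the original post.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Hypothetical toy data, used only to illustrate the calling convention.
toy_data = pd.DataFrame({'target': [1.0, 2.0, 3.0]})

# Records the types that minimize actually hands to the objective.
seen_types = []

def toy_objective(params, data):
    # The parameter vector arrives first; the entries of `args` follow it.
    seen_types.append((type(params).__name__, type(data).__name__))
    return float(np.sum((data['target'] - params[0]) ** 2))

result = minimize(fun=toy_objective, x0=[0.0], args=(toy_data,),
                  method='nelder-mead')

print(seen_types[0])  # types of (first argument, second argument)
print(result.x)       # minimizer of the sum of squared deviations
```

If the objective were instead declared as toy_objective(data, params), the name data would be bound to the parameter array, which is consistent with the AttributeError shown in the traceback.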
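What is the correct way to handle the DataFrame data and call my function loglik_total from SciPy's minimize function? Any suggestions are welcome and appreciated.
Possible solution:
Note that I think I could edit my functions to treat data as a numpy array instead of a pandas DataFrame. However, I would like to avoid that if possible, for two reasons: 1) in loglik_total, I use pandas' apply function to apply loglik_row to every row of data; 2) it is convenient to refer to the columns of data by name rather than by numeric index.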
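A DataFrame does not need to be converted to be usable here: it can travel through args unchanged and still be addressed by column name. The sketch below rests on two assumptions that are not in the original code: the parameter vector is the first argument of the objective, and sigma is read from params[-1] rather than from the global init_params. The name neg_loglik_vectorized and the shortened three-row frame are hypothetical, and the column-wise arithmetic is only one possible alternative to apply.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm
from scipy.optimize import minimize

# Shortened illustrative frame with the same column names as in the question.
df = pd.DataFrame({'id_i': ['AA', 'BB', 'CC'],
                   'id_j': ['BB', 'CC', 'AA'],
                   'y': [0.30, 0.60, 0.70],
                   'num': [1000, 2000, 1500],
                   'bar': [-4.0, -6.5, 1.0],
                   'mu': [-4.26, -5.93, 1.55]})
df['foo_1'] = df['bar'] - 11 * norm.ppf(1 / 1.9)
df['foo_2'] = df['bar'] - 11 * norm.ppf(1 - (1 / 1.9))
ids = sorted(pd.unique(pd.concat([df['id_i'], df['id_j']], axis=0)))

def neg_loglik_vectorized(params, data, id_list):
    # Assumption: params comes first, laid out as [delta_1, ..., delta_k, sigma].
    delta = pd.Series(np.asarray(params[:len(id_list)], dtype=float), index=id_list)
    sigma = params[-1]
    # Look up each row's deltas by id, then adjust mu as in calculate_y_pred.
    mu_adj = (data['mu'] - data['id_i'].map(delta) + data['id_j'].map(delta)).to_numpy()
    lower = norm.cdf(data['foo_1'].to_numpy(), loc=mu_adj, scale=sigma)
    upper = 1 - norm.cdf(data['foo_2'].to_numpy(), loc=mu_adj, scale=sigma)
    y_pred = lower / (lower + upper)
    # Same per-row expression as loglik_row, evaluated for all rows at once.
    sd = np.sqrt(y_pred * (1 - y_pred) / data['num'].to_numpy())
    z = (data['y'].to_numpy() - y_pred) / sd
    return float(-np.sum(np.log(norm.pdf(z) / sd)))

x0 = np.array([0.0] * len(ids) + [11.0])
result = minimize(fun=neg_loglik_vectorized, x0=x0, args=(df, ids),
                  method='nelder-mead')
print(result.fun)
```

Columns are still referenced by name throughout, so the convenience noted in point 2) is kept; whether the column-wise form or apply is preferable depends on how large data is.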
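【Discussion】:
- I cannot reproduce the error; I get KeyError: ('id_i', u'occurred at index 0')
- @Cleb Sorry, you got that error because I accidentally left an extra line, data = pd.DataFrame(data), in the loglik_total function (I was experimenting with converting data from a numpy array to a pandas DataFrame). I have removed that line, and you should now be able to reproduce the error shown in the original post.
- OK, I think I found the problem; please check the answer below.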
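The KeyError from the first comment is consistent with the stray data = pd.DataFrame(data) line: at that point data holds the parameter vector, so the resulting frame has an integer column label and no 'id_i' column. A small sketch of that effect, assuming an eight-element vector (seven deltas plus sigma) purely for illustration; wrapped and raised are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in for the parameter vector that minimize passes as the first argument.
params = np.zeros(8)

# Wrapping a 1-D array yields a single column labelled 0, with no named columns.
wrapped = pd.DataFrame(params)
print(list(wrapped.columns))

raised = False
try:
    # Row-wise lookup by a column name that does not exist.
    wrapped.apply(lambda row: row['id_i'], axis=1)
except KeyError as err:
    raised = True
    print('KeyError:', err)
```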