【Question title】: How to handle this linear optimization problem with missing values?
【Posted】: 2022-07-06 18:45:51
【Question】:

Consider this example code:

rng('default')

% creating fake data
data = randi([-1000 +1000],30,500);
yt = randi([-1000 1000],30,1);

% creating fake missing values
row = randi([1 15],1,500);
col = rand(1,500) < .5;

% inserting the fake missing values (leading NaNs in the selected columns)
for i = 1:500
    if col(i) == 1
        data(1:row(i),i) = nan;
    end
end

%% here starts my problem
wgts = ones(1,500); % optimal weights need to be binary (only zero or one)

% this would be easy with matrix formulas but I have missing values at the
% beginning of the series
for j = 1:30
    xt(j,:) = sum(data(j,:) .* wgts,2,'omitnan');
end


X = [xt(3:end) xt(2:end-1) xt(1:end-2)];
y = yt(3:end);

% from here I basically need to:
% maximize the Adjusted R squared of the regression fitlm(X,y)
% by changing wgts
% subject to wgts = 1 or wgts = 0
% and optionally to impose sum(wgts,'all') = some number;

% basically I need to select the data cols with the highest explanatory
% power, omitting missing data

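This is relatively easy to implement with the Excel Solver, but it can only handle 200 decision variables and it takes a lot of time. Thanks in advance.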

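【Comments】:

  • I think you want some version of intlinprog
  • Omitting the missing data is fairly easy, since you can simply set the NaN values to 0 and they will not interfere in any way. The rest I'm not sure I understand. Are you looking for the subset of columns whose linear fit with fitlm has the largest R squared? If so, the answer will always be the entire set of columns.
  • @BillBokeey Of course, R squared increases with the number of independent variables. What I want to maximize is not R squared but the adjusted R squared, which takes that into account.
  • Are you sure this is really the solution you are looking for? The classic way to extract a minimal subset of variables that explains the output is to run something like pca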
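The NaN handling suggested in the comments, and the objective described in the question, can be sketched as follows. This is a minimal sketch, not a solver: the handles `lagX`, `sse` and `adjR2`, the NaN pattern, and the random candidate `wgts` are illustrative assumptions, and the adjusted R squared is computed by hand from an ordinary least-squares fit with an intercept rather than through `fitlm`.

```matlab
rng('default')
data = randi([-1000 1000], 30, 500);
yt   = randi([-1000 1000], 30, 1);
data(1:5, 1:2:end) = NaN;            % illustrative leading missing values

% Zeroing the NaNs gives the same weighted sum as sum(...,'omitnan')
dataZ = data;
dataZ(isnan(dataZ)) = 0;

wgts = double(rand(1, 500) < 0.5);   % one candidate binary weight vector
xtA  = sum(data .* wgts, 2, 'omitnan');
xtB  = dataZ * wgts';                % plain matrix product, no loop needed

% Adjusted R squared of y regressed on the three lags of xt (hypothetical helpers)
y     = yt(3:end);
n     = numel(y);                    % 28 observations
p     = 3;                           % three lagged regressors
lagX  = @(xt) [ones(n,1), xt(3:end), xt(2:end-1), xt(1:end-2)];
sse   = @(A) sum((y - A*(A\y)).^2);  % residual sum of squares of the OLS fit
sst   = sum((y - mean(y)).^2);       % total sum of squares
adjR2 = @(w) 1 - (sse(lagX(dataZ*w(:))) / (n-p-1)) / (sst / (n-1));

r2 = adjR2(wgts);
fprintf('adjusted R squared for this candidate: %.4f\n', r2);
```

Since `adjR2` is not a linear function of `wgts`, it does not fit `intlinprog` directly; the negated handle could instead be handed to an integer-capable solver, a step that is not shown here.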

Tags: matlab optimization linear-regression linear-programming


【Answer 1】:

lasso seems to give interesting results:

% creating fake data (but having an actual relationship between `yt` and the predictors)
rng('default')
data = randi([-1000 +1000],30,500);
alphas = rand(1,500);
yt = sum(alphas.*data,2) + 10*randn(30,1);
plot(yt)

% Use the lasso algorithm with no intercept term and
% keep the column of coefficients that minimizes the MSE.
% The L1 penalty in lasso drives many coefficients to exactly zero.

[B,FitInfo] = lasso(data,yt,'Intercept',false);
[~,idxMinMSE] = min(FitInfo.MSE); % index of the Lambda with the lowest MSE
coef = B(:,idxMinMSE);
y_verif = data*coef;
hold on;plot(y_verif)

sum(coef~=0)

ans =

29

The output is explained by only 29 columns, even though all the values in alphas are nonzero.
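One possible variation is to let `lasso` choose `Lambda` by cross-validation and then read `FitInfo.IndexMinMSE` or `FitInfo.Index1SE`, which `lasso` fills in when the `'CV'` option is used. The sketch below assumes the Statistics and Machine Learning Toolbox; the fold count of 5 and the conversion of the nonzero coefficients into a binary `wgts` vector are illustrative choices, not part of the answer above.

```matlab
rng('default')
data   = randi([-1000 1000], 30, 500);
alphas = rand(1, 500);
yt     = sum(alphas .* data, 2) + 10*randn(30, 1);

% 5-fold cross-validated lasso, no intercept (as in the answer above)
[B, FitInfo] = lasso(data, yt, 'Intercept', false, 'CV', 5);

coefMin = B(:, FitInfo.IndexMinMSE); % Lambda with the lowest cross-validated MSE
coef1SE = B(:, FitInfo.Index1SE);    % largest Lambda within one standard error of it

% Binary selection vector in the question's notation (illustrative)
wgts = double(coef1SE ~= 0)';

fprintf('selected columns: %d (min MSE), %d (1SE)\n', nnz(coefMin), nnz(coef1SE));
```

The resulting `wgts` only marks which columns lasso kept; it is not guaranteed to maximize the adjusted R squared of the lagged regression in the question.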

【Comments】:
