Prevent overfitting in Logistic Regression using Scikit Learn: Answers

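【Question title】: Prevent overfitting in Logistic Regression using Sci-Kit Learn
【Posted】: 2026-01-18 10:20:03
【Question description】:

I trained a model using logistic regression to predict whether a name field and a description field belong to the profile of a male, a female, or a brand. My training accuracy is around 99%, while my test accuracy is around 83%. I tried to add regularization by tuning the C parameter, but barely noticed any improvement. I have about 5,000 examples in my training set. Is this a case where I simply need more data, or is there something else I can do in Sci-Kit Learn to improve my test accuracy?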

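【Question comments】:

Be more precise (train/test split; how the C tuning was done; what "barely noticed" means; preprocessing; optimization algorithm; which regularization; multiclass strategy) and maybe add some formatting. Even then, this still seems like a very broad question.

Tags: python machine-learning scikit-learn logistic-regression data-science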

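【Solution 1】:

Overfitting is a multifaceted problem. It could be your train/test/validation split (anything from 50/40/10 to 90/9/1 could change things). You might need to shuffle your input. Try an ensemble method, or reduce the number of features. You might also have outliers throwing things off.

Then again, it could be none of these, or all of these, or some combination of them.

For starters, try plotting the test score as a function of the test split size and see what you get.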
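As a rough sketch of that last suggestion, the snippet below refits a logistic regression at several test split sizes and records train and test accuracy for each. The data is a synthetic stand-in generated with `make_classification` (an assumption, since the asker's data is not shown), and the helper name `score_by_test_size` is made up for illustration. The splits are shuffled and stratified, which also covers the "shuffle your input" point.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 3 classes, 5,000 rows.
X, y = make_classification(
    n_samples=5000, n_features=50, n_informative=10,
    n_classes=3, random_state=0,
)

def score_by_test_size(X, y, test_sizes, seed=0):
    """Return (test_size, train_accuracy, test_accuracy) for each split size."""
    results = []
    for test_size in test_sizes:
        # shuffle=True and stratify=y keep class proportions similar in both parts
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, shuffle=True,
            stratify=y, random_state=seed,
        )
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        results.append((test_size, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))
    return results

results = score_by_test_size(X, y, [0.1, 0.2, 0.3, 0.4, 0.5])
for test_size, train_acc, test_acc in results:
    print(f"test_size={test_size:.1f}  train={train_acc:.3f}  test={test_acc:.3f}")
```

The resulting rows can be passed to any plotting library. A gap between the train and test columns that stays large at every split size points toward the model or features rather than the split itself.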

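【Comments】:

Thanks for the guidance. The features I am using come only from a TfidfVectorizer, so I'm not sure how I would reduce the number of features. Any ideas for a newbie like me? Thanks!
Lots. Plot score vs. test size. Try predicting just male or not male, and make sure your data is scaled to zero mean and unit norm.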
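One possible way to reduce the number of TF-IDF features, sketched under assumptions: the tiny corpus and labels below are invented for illustration, and the values `min_df=2` and `k=8` are arbitrary. `min_df` drops terms that appear in too few documents, and `SelectKBest` with `chi2` keeps the terms most associated with the labels. `TfidfVectorizer` already L2-normalizes each row by default, which addresses the unit-norm part of the comment. The last lines show the "male or not male" simplification.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up profile texts and labels (assumption, for illustration only).
texts = [
    "john father husband loves football",
    "mike husband dad football fan",
    "david father engineer coffee fan",
    "mary mother wife loves yoga",
    "anna wife mom yoga fan",
    "lisa mother designer coffee fan",
    "official store free shipping deals",
    "official brand account shop deals",
    "shop online store free shipping",
]
labels = ["male"] * 3 + ["female"] * 3 + ["brand"] * 3

pipeline = make_pipeline(
    TfidfVectorizer(min_df=2),          # drop terms seen in fewer than 2 documents
    SelectKBest(chi2, k=8),             # keep the 8 terms most associated with the labels
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)

# Compare vocabulary sizes before and after each reduction step.
n_full = len(TfidfVectorizer().fit(texts).vocabulary_)
n_pruned = len(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
n_selected = int(pipeline.named_steps["selectkbest"].get_support().sum())
print(f"all terms: {n_full}, after min_df: {n_pruned}, after SelectKBest: {n_selected}")

# Simplified target: male vs. not male, using the same pipeline settings.
binary_labels = ["male" if label == "male" else "not_male" for label in labels]
binary_pipeline = clone(pipeline).fit(texts, binary_labels)
binary_classes = list(binary_pipeline.named_steps["logisticregression"].classes_)
print(binary_classes)
```

`TfidfVectorizer` also accepts `max_features` to cap the vocabulary at the most frequent terms, which may be a simpler first thing to try on a real corpus.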

【Solution 2】:

# The 'C' value in Logistic Regression works very similarly to the one in the
# Support Vector Machine (SVM) algorithm. When I use SVM I like to use GridSearchCV
# to find the best possible values for 'C' and 'gamma'.
# Maybe this can shed some light:

# Full grid for SVC; to tune only 'C' you can remove the gamma and kernel keys
# param_grid = {'C': [0.1, 1, 10, 100, 1000],
#               'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
#               'kernel': ['rbf']}

param_grid = {'C': [0.1, 1, 10, 100, 1000]}

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

# df_feat and df_target are assumed to be your own feature and target DataFrames.
# Train and fit a baseline model to see initial values
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
model = SVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Find the best 'C' value (the grid must be fit before best_params_ exists)
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
c_val = grid.best_estimator_.C

# Then you can run predictions on this grid object just like you would with a normal model.
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))

# Use the best 'C' value found by GridSearchCV in your LogisticRegression model.
# Note: a 'C' tuned for SVC is only a starting point and may not be the best 'C' here.
logmodel = LogisticRegression(C=c_val)
logmodel.fit(X_train, y_train)
log_predictions = logmodel.predict(X_test)

print(confusion_matrix(y_test, log_predictions))
print(classification_report(y_test, log_predictions))
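As an alternative sketch, the same grid search can be run on `LogisticRegression` directly, so the chosen `C` is tuned for the model that is actually used. This is not the answer's method, only a variation on it. The data below is synthetic (an assumption), and the grid values are arbitrary. In scikit-learn, `C` is the inverse of regularization strength, so smaller values mean stronger regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real features and labels (assumption).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=101,
)

# Smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation on the training set only; the test set stays untouched.
grid = GridSearchCV(
    LogisticRegression(max_iter=2000), param_grid, cv=5, scoring="accuracy",
)
grid.fit(X_train, y_train)

best_c = grid.best_params_["C"]
cv_acc = grid.best_score_            # mean cross-validated accuracy for best_c
test_acc = grid.score(X_test, y_test)  # accuracy of the refit best model on held-out data
print(f"best C: {best_c}, CV accuracy: {cv_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Comparing the cross-validated accuracy with the held-out test accuracy gives a less optimistic picture than the training accuracy alone.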

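【Comments】:

This is not copy and paste. I used this SVM module a few days ago, and this is how I found the best C and gamma for SVC, which is a machine learning algorithm. It may be hard for others to follow, but I believe the user who asked the question will get it.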