How to generate synthetic data within a given range using sklearn.datasets.make_classification?

【Question title】: How to generate synthetic data within a given range using sklearn.datasets.make_classification?
【Posted】: 2020-02-25 19:34:02
【Question】:

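I want to create synthetic data for a classification problem. I am using the make_classification function from sklearn.datasets. I want the data to lie within a specific range, say [80, 155], but the function generates negative numbers.

I tried many combinations of the scale and class_sep parameters, but did not get the desired output.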

import pandas as pd
from sklearn.datasets import make_classification
weight = [0.2, 0.37, 0.21, 0.04, 0.11, 0.05, 0.02]

X, y = make_classification(n_samples=100, n_features=3,
                           n_informative=3, n_redundant=0, n_repeated=0,
                           n_classes=7, n_clusters_per_class=1, weights=weight,
                           class_sep=1, shuffle=True, random_state=41, scale=1)

pd.DataFrame(X).describe()

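The values in the output should fall within a specific range, but instead the generated features are random values with a standard deviation of about 1.33.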
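
To illustrate why tuning scale does not bound the values, here is a minimal sketch. It assumes that make_classification adds shift first and then multiplies by scale, which matches the documented parameter descriptions but is worth checking against your installed version. The shift and scale values are hypothetical, and the weights from the question are omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification

# Settings shared by both calls; only shift and scale differ.
params = dict(n_samples=100, n_features=3, n_informative=3, n_redundant=0,
              n_repeated=0, n_classes=7, n_clusters_per_class=1,
              class_sep=1, shuffle=True, random_state=41)

# Baseline: no shift, unit scale.
X_base, _ = make_classification(shift=0.0, scale=1.0, **params)

# Hypothetical values that move the data roughly toward the middle of [80, 155].
shift, scale = 11.75, 10.0
X_moved, _ = make_classification(shift=shift, scale=scale, **params)

# Under the stated assumption, the two outputs differ only by an affine map,
# so the spread and centre change but no hard lower or upper bound is imposed.
print(np.allclose(X_moved, (X_base + shift) * scale))
print(X_moved.min(axis=0), X_moved.max(axis=0))
```

Because the underlying samples are drawn from unbounded normal clusters, any fixed shift and scale can still leave values outside the target interval; a hard range needs a rescaling step that looks at the observed minimum and maximum.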

【Comments】:

Tags: python machine-learning scikit-learn data-science

【Solution 1】:

You can use MinMaxScaler (see the scikit-learn docs).

Just run:

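from sklearn.preprocessing import MinMaxScaler

# Rescale every feature column of X linearly into [80, 155].
scaler = MinMaxScaler(feature_range=(80, 155))
X = scaler.fit_transform(X)
# y holds class labels, so it is left unscaled.

Note that only X is scaled. y contains the class labels (0 to 6 here); passing it to the scaler would fail on a 1-D array, and rescaling it would turn the labels into meaningless floats. MinMaxScaler rescales each column independently, so every feature ends up with a minimum of 80 and a maximum of 155.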
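
The steps above can be put together into a self-contained sketch that reuses the settings from the question and checks the resulting range. The variable name X_scaled is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler

# Same generator settings as in the question.
weights = [0.2, 0.37, 0.21, 0.04, 0.11, 0.05, 0.02]
X, y = make_classification(n_samples=100, n_features=3,
                           n_informative=3, n_redundant=0, n_repeated=0,
                           n_classes=7, n_clusters_per_class=1,
                           weights=weights, class_sep=1, shuffle=True,
                           random_state=41, scale=1)

# Map each feature column linearly so its minimum becomes 80 and its maximum 155.
scaler = MinMaxScaler(feature_range=(80, 155))
X_scaled = scaler.fit_transform(X)

# y holds integer class labels, so it is used as generated.
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
print(np.unique(y))
```

Since the mapping is linear per column, the relative positions of the samples within each feature are preserved; only the units change. If the data is later split into training and test sets, it is usually safer to fit the scaler on the training part only and call transform on the rest, in which case test values may fall slightly outside [80, 155].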

【Comments】:

Saved my day :)