Creating a dataset function using scikit-learn

[Question title]: Creating a dataset function using scikit-learn
[Posted]: 2021-05-20 11:52:36
[Question description]:

I'm new to Python, and I'm trying to load a dataset from my computer using scikit-learn. This is what my code looks like:

**whatever.py**

import numpy as np
import csv
from sklearn.datasets.base import Bunch

class Cortex_nuc:
    def cortex_nuclear():
        with open('C:/Users/User/Desktop/Data_Cortex_Nuclear4.csv') as csv_file:
            data_file = csv.reader(csv_file)
            temp = next(data_file)
            n_samples = int(float(temp[0]))
            n_features = int(float(temp[1]))
            data = np.empty((n_samples, n_features))
            target = np.empty((n_samples,), dtype=np.float64)

            for i, sample in enumerate(data_file):
                data[i] = np.asarray(sample[:-1], dtype=np.float64)
                target[i] = np.asarray(sample[-1], dtype=np.float64)

        return Bunch(data=data, target=target)

Then I import it into my project:

from whatever import Cortex_nuc

Then I try to save the result to df:

df = Cortex_nuc.cortex_nuclear()

By the way, this is what the dataset looks like:

That is only part of the dataset; the full file has 77 columns and about a thousand rows.

But I get an error message, and I can't figure out why it happens. This is the error message:

IndexError                                Traceback (most recent call last)
<ipython-input-5-a4935f2c187f> in <module>
----> 1 df = Cortex_nuc.cortex_nuclear()

~\whatever.py in cortex_nuclear()
     20 
     21             for i, sample in enumerate(data_file):
---> 22                 data[i] = np.asarray(sample[:-1], dtype=np.float64)
     23                 target[i] = np.asarray(sample[-1], dtype=np.float64)
     24 

IndexError: index 0 is out of bounds for axis 0 with size 0

Can someone help me? Thanks!

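[Comments]:

First, you can use print() to see what is in your variables. (n_samples, n_features) probably has the value (0, 0), which creates an array with no room for any data. You should build a normal list, append() to it inside the for loop, and convert the list to an array after the loop. In short: first learn to debug code with print(), or learn how to use a real debugger.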
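The comment's two points can be sketched as follows. This is a minimal illustration, not the asker's actual file: `SAMPLE_CSV` and `read_rows` are hypothetical names, and the sketch assumes a header row followed by purely numeric columns with the target in the last column. The first part reproduces the IndexError with a (0, 0) array; the second part collects rows in lists and converts them once the loop finishes, so no sample count has to be read from the file.

```python
import csv
from io import StringIO

import numpy as np

# Hypothetical stand-in for the CSV on disk: a header row, then feature
# columns with the target in the last column.
SAMPLE_CSV = """A,B,target
0,0,0
0,1,1
1,0,1
1,1,0
"""

# Reproduce the error: a (0, 0) array has no row 0 to assign into.
empty = np.empty((0, 0))
try:
    empty[0] = [0.0, 0.0]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)


def read_rows(csv_file):
    """Collect rows in plain lists, then convert to arrays once at the end."""
    reader = csv.reader(csv_file)
    header = next(reader)  # column names, not sample counts
    rows, targets = [], []
    for sample in reader:
        if not sample:  # skip blank lines
            continue
        rows.append([float(value) for value in sample[:-1]])
        targets.append(float(sample[-1]))
    return np.asarray(rows), np.asarray(targets), header[:-1]


data, target, feature_names = read_rows(StringIO(SAMPLE_CSV))
print(data.shape, target.shape, feature_names)
```

If the real file contains non-numeric columns, `float()` will raise a ValueError on them, so those columns would need to be dropped or encoded first.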

Tags: python scikit-learn

[Solution 1]:

If you want to create a "sklearn-like" dataset in a Bunch object, you probably need something like this:

import pandas as pd
import numpy as np
from sklearn.utils import Bunch

# In-memory CSV so the example is reproducible without a file on disk
from io import StringIO
csv_file = StringIO("""
target,A,B
0,0,0
1,0,1
1,1,0
0,1,1
""")

def load_xor(*, return_X_y=False):
    """Describe your data here."""
    csv_file.seek(0)  # rewind so the function can be called more than once
    _data_file = pd.read_csv(csv_file)
    _data = Bunch()

    _data["DESCR"] = load_xor.__doc__
    _data["data"] = _data_file[["A", "B"]].to_numpy(dtype=np.float64)
    _data["target"] = _data_file["target"].to_numpy(dtype=np.float64)
    _data["target_names"] = np.array(["false", "true"])
    _data["feature_names"] = np.array(list(_data_file.drop(["target"], axis=1)))

    if return_X_y:
        return _data.data, _data.target
    return _data

if __name__ == "__main__":
    # Return and unpack the `X`, `y` tuple
    X, y = load_xor(return_X_y=True)
    print(X, y)

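This is because the sklearn.datasets loaders usually return a Bunch object with specific attributes/keys (for details, see the "Returns" section of the load_iris documentation):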

>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> dir(data)
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
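To make the "attributes/keys" point concrete, here is a small sketch of how a Bunch behaves. A Bunch is a dict subclass whose keys are also readable and writable as attributes. The array contents and the names `bunch` and `keys` are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.utils import Bunch

# Placeholder arrays, only to show the access patterns.
bunch = Bunch(data=np.zeros((2, 2)), target=np.array([0.0, 1.0]))

# Key access and attribute access return the same object.
same_object = bunch["data"] is bunch.data

# Setting an attribute also creates the corresponding key.
bunch.feature_names = ["A", "B"]
keys = sorted(bunch.keys())
print(same_object, keys)
```

Because a Bunch is still a dict, code that expects `data["target"]` and code that expects `data.target` both work with the same object.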

[Comments]: