【Question title】: Split into training and testing set in R?
【Posted】: 2018-04-22 22:58:06
【Question description】:

How can I write the following Python code in R?

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, random_state=42)   

It splits the data into a training set and a test set in an 80/20 ratio.

【Question discussion】:

    Tags: python r machine-learning train-test-split


    【Solution 1】:

    You can do this with the createDataPartition function from the caret package:

    library(caret)
    
    # Make example data
    X = data.frame(matrix(rnorm(200), nrow = 100)) 
    y = rnorm(100) 
    
    # Extract a random sample of indices for the test data
    set.seed(42) # plays the role of Python's random_state: makes the split reproducible (it will not reproduce scikit-learn's exact split)
    test_inds = createDataPartition(y = seq_along(y), p = 0.2, list = FALSE)
    
    # Split data into test/train using indices
    X_test = X[test_inds, ]; y_test = y[test_inds] 
    X_train = X[-test_inds, ]; y_train = y[-test_inds]
    

    You can also create test_inds "from scratch" with test_inds = sample(1:length(y), ceiling(length(y) * 0.2)).
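The "from scratch" variant can be sketched in base R as follows. This is a minimal illustration that reuses the example data from the answer above; the variable name `n_test` is introduced here for readability and is not part of the original answer.

```r
# Example data, mirroring the answer above
set.seed(42)
X <- data.frame(matrix(rnorm(200), nrow = 100))
y <- rnorm(100)

# Draw 20% of the row indices, without replacement, for the test set
n_test <- ceiling(length(y) * 0.2)   # n_test is an illustrative name
test_inds <- sample(seq_along(y), n_test)

# Use the same index vector for X and y so rows and labels stay aligned
X_test  <- X[test_inds, ]
y_test  <- y[test_inds]
X_train <- X[-test_inds, ]
y_train <- y[-test_inds]

dim(X_train)    # 80 rows, 2 columns
length(y_test)  # 20
```

Unlike createDataPartition, plain sample() does no stratification; it simply picks row indices uniformly at random.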

    【Discussion】:

      【Solution 2】:

      This may be a simpler approach:

      #read in iris dataset 
       data(iris)  
       library(caret) #this package has the createDataPartition function
          
       set.seed(123) # fix the seed so the random split is reproducible
          
       #creating indices
       trainIndex <- createDataPartition(iris$Species,p=0.75,list=FALSE)
          
       #splitting data into training/testing data using the trainIndex object
       IRIS_TRAIN <- iris[trainIndex,] #training data (75% of data)
          
       IRIS_TEST <- iris[-trainIndex,] #testing data (25% of data)
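
When the outcome is a factor such as iris$Species, a stratified split samples within each class so that class proportions are preserved in both sets. The base R sketch below illustrates that idea without caret; it is an approximation for intuition only, not caret's actual implementation, and the names `idx_by_class` and `train_idx` are made up for this example.

```r
set.seed(123)
p <- 0.75  # fraction of each class that goes into the training set

# Group row indices by class, then sample a fraction p within each class
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  # sample.int avoids the surprising behavior of sample() on a length-1 vector
  idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
}), use.names = FALSE)

iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

table(iris_train$Species)  # 38 rows per species, i.e. ceiling(0.75 * 50)
table(iris_test$Species)   # 12 rows per species
```

Because each species contributes the same number of rows, both sets keep the balanced 1:1:1 class ratio of the full dataset.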
      

      【Discussion】:

        【Solution 3】:

        Using base R, you can do the following:

        set.seed(12345)
        # sample 20 of the 100 indices for the training set (use 80 instead for an 80/20 split);
        # a single index vector keeps x and y aligned
        train<-sample(1:100, 20)
        
        #simulating random data
        x<-rnorm(100)
        y<-rnorm(100)
        
        #sub-setting the x data
        training.x.data<-x[train]
        testing.x.data<-x[-train]
        
        #sub-setting the y data
        training.y.data<-y[train]
        testing.y.data<-y[-train]
        

        【Discussion】:

          【Solution 4】:

          Let's take the iris dataset as an example:

          # in case you want to use a seed
          set.seed(5)
          ## 75% of the sample size
          train_size <- floor(0.75 * nrow(iris))
          
          in_rows <- sample(seq_len(nrow(iris)), size = train_size, replace = FALSE)
          
          train <- iris[in_rows, ]
          test <- iris[-in_rows, ]
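
For code that needs the four-way return shape of the Python call in the question, the same base R approach can be wrapped in a small helper. The function below is a hypothetical sketch: `train_test_split` is not an existing R function, and its name and arguments are only borrowed from scikit-learn for familiarity. A given seed will not reproduce the split that scikit-learn produces with the same random_state.

```r
# Hypothetical helper; the name and arguments mimic scikit-learn for familiarity only
train_test_split <- function(X, y, test_size = 0.2, seed = NULL) {
  stopifnot(nrow(X) == length(y), test_size > 0, test_size < 1)
  if (!is.null(seed)) set.seed(seed)

  n <- nrow(X)
  # One index vector is shared by X and y so rows and labels stay aligned
  test_idx <- sample.int(n, size = ceiling(n * test_size))

  list(
    X_train = X[-test_idx, , drop = FALSE],
    X_test  = X[test_idx, , drop = FALSE],
    y_train = y[-test_idx],
    y_test  = y[test_idx]
  )
}

# Usage: features in columns 1 to 4 of iris, labels in Species
parts <- train_test_split(iris[, 1:4], iris$Species, test_size = 0.2, seed = 42)
nrow(parts$X_train)   # 120
length(parts$y_test)  # 30
```

Returning a named list is a design choice made here because R has no tuple unpacking; the pieces are accessed as parts$X_train, parts$y_train, and so on.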
          

          【Discussion】:
