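[Posted]: 2018-11-15 14:03:30
[Question]:
I am working on a project at my workplace and have run into some problems with my decision tree analysis. This is not homework. Sample dataset: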
PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR
SUNDRY                  SMALL EQUIP           NORTH EAST REGION
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION
SUNDRY                  SMALL EQUIP           NORTH EAST REGION
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION
SUNDRY                  PREVENTIVE            SOUTH EAST REGION
SUNDRY                  PREVENTIVE            SOUTH EAST REGION
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION
SUNDRY                  COMPOSITE             OHIO VALLEY REGION
SUNDRY                  COMPOSITE             NORTH EAST REGION
Sales    QtySold  MFGCOST  MarginDollars  new_ProductName
209.97   3        134.55   72.72          no
-76.15   -1       -44.85   -30.4          no
275.6    2        162.5    109.84         no
138.7    1        81.25    55.82          no
226      2        136      87.28          no
115      1        68       45.64          no
210.7    2        136      71.98          no
29       1        18.85    9.77           no
29       1        18.85    9.77           no
46.32    2        37.7     7.86           no
159.86   1        132.4    24.81          no
441.3    2        264.8    171.2          no
209.62   1        132.4    74.57          no
209.62   1        132.4    74.57          no
1) My tree has only two nodes, as the summary below shows:
>summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes: 2
Residual mean deviance: 0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146
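2) I created a new data frame that keeps only the factors with fewer than 22 levels. One factor has 25 levels, but tree() did not raise an error, so I assume the algorithm accepts 25 levels.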
>str(new_Dataset)
'data.frame': 51433 obs. of 7 variables:
$ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
$ MAJOR_CATEGORY_DESCR : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
$ CUST_REGION_DESCR : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
$ Sales : num 210 -76.2 275.6 138.7 226 ...
$ QtySold : int 3 -1 2 1 2 1 2 1 1 2 ...
$ MFGCOST : num 134.6 -44.9 162.5 81.2 136 ...
$ MarginDollars : num 72.7 -30.4 109.8 55.8 87.3 ...
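As a side note on the level question in 2): to my knowledge the documentation of tree() says factor predictors can have up to 32 levels, which would explain why 25 levels passes without an error. A minimal sketch for checking level counts before fitting follows. The data frame `toy`, its columns, and the `max_levels` value are assumptions for illustration, so please verify the limit against your installed version of the package.

```r
# Toy data frame; column names and values are illustrative only.
toy <- data.frame(
  category = factor(sprintf("cat_%02d", 1:40)),
  region   = factor(rep(c("NORTH", "SOUTH", "EAST", "WEST"), 10)),
  sales    = seq(10, 400, length.out = 40)
)

# Count the levels of every factor column.
is_fac   <- sapply(toy, is.factor)
n_levels <- sapply(toy[is_fac], nlevels)
print(n_levels)

# Assumed limit for tree(); check ?tree for your installed version.
max_levels <- 32
too_many <- names(n_levels)[n_levels > max_levels]
print(too_many)  # factor columns that would need regrouping or dropping
```

Columns listed in `too_many` could be dropped or have rare levels merged before calling tree().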
3) This is how I set up the analysis:
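library(tree)
# I chose product name as my main attribute (maybe that is why it appears at the root node?)
new_ProductName = ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no")
# stringsAsFactors = TRUE keeps the response a factor on R >= 4.0
data = data.frame(new_Dataset, new_ProductName, stringsAsFactors = TRUE)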
set.seed(100)
train = sample(1:nrow(data), 0.8*nrow(data)) # training row indices
training_data = data[train,] # training data
testing_data = data[-train,] # testing data
#fit the tree model using training data
tree_model = tree(new_ProductName ~.,data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)
out = predict(tree_model) # predict the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted
pred.newproduct = colnames(out)[max.col(out,ties.method = c("first"))]
mean (input.newproduct != pred.newproduct) # misclassification %
# Cross Validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass) # run cross validation
plot(cv_Tree) # plot the CV
plot(cv_Tree$size, cv_Tree$dev, type = "b") # use the components directly instead of attach()
# set size corresponding to lowest value in the plot above.
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod) # fit the pruned tree
# Predicted
pred.newproduct = colnames(out)[max.col(out,ties.method = c("random"))]
# calculate Mis-classification error
mean(training_data$new_ProductName != pred.newproduct)
# Predict testData with Pruned tree
out = predict(treePruneMod, testing_data, type = "class")
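The script stops after predicting on the test set, so the test error is never computed. A minimal sketch of that last step, using made-up label vectors in place of testing_data$new_ProductName and the output of predict(..., type = "class"), might look like this:

```r
# Made-up actual and predicted labels; in the real script these would be
# testing_data$new_ProductName and the class predictions stored in `out`.
actual    <- factor(c("no", "no", "yes", "yes", "no", "yes"), levels = c("no", "yes"))
predicted <- factor(c("no", "yes", "yes", "yes", "no", "no"), levels = c("no", "yes"))

# Confusion matrix: rows are the true classes, columns the predicted ones.
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Share of test rows that were classified incorrectly.
test_error <- mean(actual != predicted)
print(test_error)
```

Comparing this number with the training error gives a rough idea of whether the tree overfits.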
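4) I have never done this before. I watched a few YouTube videos and got started. I welcome good advice, explanations, and criticism, so please help me through this process. It has been a challenge for me.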
> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)
                   no   yes
  Handpieces      164     0
  PRIVATE LABEL     0 14802
  SUNDRY        36467     0
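Every row of this cross-tabulation has exactly one non-zero cell, which means the predictor alone fixes the class. That check can be automated for each candidate predictor. The sketch below uses base R only; the helper name `fully_determines` and the toy data are hypothetical and not part of the original script.

```r
# Toy stand-in for the real data; all values are made up.
toy <- data.frame(
  sub_line = c("Handpieces", "PRIVATE LABEL", "SUNDRY", "SUNDRY", "PRIVATE LABEL", "SUNDRY"),
  region   = c("NORTH", "NORTH", "SOUTH", "NORTH", "SOUTH", "SOUTH"),
  label    = c("no", "yes", "no", "no", "yes", "no")
)

# Hypothetical helper: TRUE when each value of x maps to a single class of y,
# i.e. every row of the cross-tabulation has at most one non-zero cell.
fully_determines <- function(x, y) {
  tab <- table(x, y)
  all(rowSums(tab > 0) <= 1)
}

# Run the check for every predictor column.
predictors <- setdiff(names(toy), "label")
leaks <- sapply(toy[predictors], fully_determines, y = toy$label)
print(leaks)
```

Any column flagged TRUE is a candidate for removal before fitting, because the tree can reach zero error with a single split on it.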
[Comments]:
-
Can you add a plot of your tree?
-
@G5W Sure, but don't laugh!
-
Could you show the result of table(new_Dataset$PRODUCT_SUB_LINE_DESCR, new_Dataset$new_ProductName)?
-
@G5W May I invite you to a chat discussion? I really need some help, please!
-
Sorry, I had logged off before your invitation arrived. If we can sync up, I am willing to help. But the table shows that PRODUCT_SUB_LINE_DESCR completely determines new_ProductName.
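The observation in the last comment would explain the two-node tree with zero error: the response was built from PRODUCT_SUB_LINE_DESCR, so one split on that column separates the classes perfectly. A rough sketch on synthetic data, assuming the tree package is installed, shows the effect and one possible remedy (dropping the source column before fitting). The data frame `toy`, its columns, and the helper `split_vars` are invented for illustration.

```r
library(tree)  # same package the question uses

set.seed(1)
n <- 600
# Synthetic stand-in for the real data; all names and values are made up.
toy <- data.frame(
  sub_line = factor(sample(c("Handpieces", "PRIVATE LABEL", "SUNDRY"), n, replace = TRUE)),
  region   = factor(sample(c("NORTH", "SOUTH", "WEST"), n, replace = TRUE))
)
# Give sales some real signal: private-label rows tend to be cheaper.
toy$sales <- ifelse(toy$sub_line == "PRIVATE LABEL",
                    runif(n, 20, 200), runif(n, 100, 450))
# The label is derived from sub_line, as in the question.
toy$label <- factor(ifelse(toy$sub_line == "PRIVATE LABEL", "yes", "no"))

# Hypothetical helper: names of the variables a fitted tree splits on.
split_vars <- function(fit) {
  v <- as.character(fit$frame$var)
  unique(v[v != "<leaf>"])
}

# With the source column present, the fit should split only on sub_line.
leaky <- tree(label ~ ., data = toy)
print(split_vars(leaky))

# Drop the source column so the tree has to use the remaining predictors.
keep  <- setdiff(names(toy), "sub_line")
clean <- tree(label ~ ., data = toy[, keep])
print(split_vars(clean))
```

Whether dropping the column is the right choice depends on what the model is meant to predict. If the goal really is to tell private-label products apart, the column they are defined by cannot also serve as a predictor.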
Tags: r decision-tree