【Question title】: Problem with lm and output of summary for regression analysis
【Posted】: 2018-11-12 19:50:43
【Question】:

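I am trying to run a regression on NHL statistics using the variables goals, assists and points. However, the output is not what we want. Instead of one coefficient for each predictor we specified, we get a separate coefficient for every distinct value of each predictor. See below: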

library(rvest)  # provides read_html(), html_nodes() and html_table()

urlname <- "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
scraped_data <- read_html(urlname)
table.nhl <- html_nodes(scraped_data, "table")

scraped.nhl.data <- as.data.frame(html_table(table.nhl, header = TRUE))
colnames(scraped.nhl.data) = scraped.nhl.data[1, ] # the first row will be the header
scraped.nhl.data = scraped.nhl.data[-1, ]          # removing the first row.
# drop the repeated header rows in one step; deleting rows inside a
# for loop shifts the indices and runs past the end of the data frame
scraped.nhl.data <- scraped.nhl.data[scraped.nhl.data[, 1] != "Rk", ]

pittsburgh <- scraped.nhl.data[scraped.nhl.data$Tm == "PIT", ]
pittsburgmodel <- pittsburgh[, c( "G", "A", "PTS")]
pittsburgmodel <- pittsburgmodel[complete.cases(pittsburgmodel), ]
View(pittsburgmodel)
names(pittsburgmodel) <- c("goals", "assists", "points")
attach(pittsburgmodel)
fit = lm(goals ~ ., data = pittsburgmodel)
summary(fit)

Output

Coefficients: (18 not defined because of singularities)
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -3.719e-15  2.835e-15 -1.312e+00    0.247    
assists1     2.000e+00  6.945e-15  2.880e+14   <2e-16 ***
assists10    4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists12    1.800e+01  6.945e-15  2.592e+15   <2e-16 ***
assists13    5.000e+00  6.945e-15  7.199e+14   <2e-16 ***
assists2     4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists20    2.900e+01  6.945e-15  4.175e+15   <2e-16 ***
assists21    1.100e+01  6.945e-15  1.584e+15   <2e-16 ***
assists22    7.000e+00  6.945e-15  1.008e+15   <2e-16 ***
assists23    4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists25    1.300e+01  6.945e-15  1.872e+15   <2e-16 ***
assists26    2.200e+01  6.945e-15  3.168e+15   <2e-16 ***
assists3     2.000e+00  5.305e-15  3.770e+14   <2e-16 ***
assists4     4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
assists42    9.000e+00  6.945e-15  1.296e+15   <2e-16 ***
assists5     3.000e+00  6.945e-15  4.319e+14   <2e-16 ***
assists56    4.200e+01  6.945e-15  6.047e+15   <2e-16 ***
assists58    3.400e+01  6.945e-15  4.895e+15   <2e-16 ***
assists6     2.000e+00  6.945e-15  2.880e+14   <2e-16 ***
assists60    2.900e+01  6.945e-15  4.175e+15   <2e-16 ***
assists8     4.000e+00  6.945e-15  5.759e+14   <2e-16 ***
points1      1.000e+00  6.945e-15  1.440e+14   <2e-16 ***
points10     2.000e+00  8.967e-15  2.231e+14   <2e-16 ***
points12            NA         NA         NA       NA    
points13    -1.000e+00  8.967e-15 -1.115e+14   <2e-16 ***
points14            NA         NA         NA       NA    
points18            NA         NA         NA       NA    
points27            NA         NA         NA       NA    
points29            NA         NA         NA       NA    
points3             NA         NA         NA       NA    
points30            NA         NA         NA       NA    
points31    -1.000e+00  8.967e-15 -1.115e+14   <2e-16 ***
points32            NA         NA         NA       NA    
points38            NA         NA         NA       NA    
points4     -2.000e+00  8.967e-15 -2.231e+14   <2e-16 ***
points48            NA         NA         NA       NA    
points49            NA         NA         NA       NA    
points5             NA         NA         NA       NA    
points51            NA         NA         NA       NA    
points6             NA         NA         NA       NA    
points8             NA         NA         NA       NA    
points89            NA         NA         NA       NA    
points92            NA         NA         NA       NA    
points98            NA         NA         NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.34e-15 on 5 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 3.72e+30 on 25 and 5 DF,  p-value: < 2.2e-16

Desired output

                 Estimate     Std. Error      t value   Pr(>|t|)
(Intercept)        value         value          value     value
Goals              value         value          value     value 
Assists            value         value          value     value

【Comments】:

  • Use str(pittsburgmodel) to check the data type of each column. Values that look like numbers are not actually stored as numeric.
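
The effect described in this comment can be reproduced with a minimal sketch. The data frame `toy` and its values are made up for illustration and do not come from the scraped table; the only assumption is that the scraped columns arrive as character strings.

```r
# Hypothetical toy data: the predictor is stored as character strings
toy <- data.frame(
  goals   = c(1, 4, 9, 16, 25, 36),
  assists = c("1", "2", "3", "4", "5", "6"),
  stringsAsFactors = FALSE
)
str(toy)  # assists is reported as chr, not num

# lm() treats a character predictor as categorical, so it estimates
# one dummy coefficient per distinct value (minus the baseline level)
fit_chr <- lm(goals ~ assists, data = toy)
names(coef(fit_chr))

# After conversion to numeric there is a single slope for assists
toy$assists <- as.numeric(toy$assists)
fit_num <- lm(goals ~ assists, data = toy)
names(coef(fit_num))
```

The first fit produces terms named like `assists2`, `assists3`, which is the same pattern as `assists10`, `points13` in the output above.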

Tags: r


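【Solution 1】:

It is best to spend a little more time upstream, fixing the information in the table as it is read in. This example uses the XML package because, as this blog post points out, the XML::readHTMLTable function has a skip argument, which html_table apparently does not ...

Read the raw HTML: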

urlname <- "https://www.hockey-reference.com/leagues/NHL_2018_skaters.html"
rr <- readLines(urlname)

First attempt at reading: use a header and skip row 1

library(XML)
h1 <- readHTMLTable(rr, header=TRUE,skip=1)$stats

The data are interspersed with bad (non-numeric) rows, which are evidently extra internal "header" rows. Define a function to find them:

br  <- function(i,x=h1) { 
    suppressWarnings(which(is.na(as.numeric(as.character(x[[i]])))))
}
badrows <- br(1)

Try again, skipping the "bad" rows:

h2 <- readHTMLTable(rr, header=TRUE,skip=c(1,badrows+1))$stats

Define the numeric columns as all columns except these four:

numcols <- setdiff(names(h2),c("Player", "Tm", "Pos", "ATOI"))

Convert the columns that should be numeric:

for (i in numcols) {
    h2[[i]] <- as.numeric(as.character(h2[[i]]))
}
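
The same idea (find the stray header rows, drop them, then convert) can be tried without network access. This is only a sketch on a made-up data frame `h` that stands in for the scraped table; the column values are hypothetical.

```r
# Hypothetical stand-in for the scraped table: row 3 is a repeated header row
h <- data.frame(
  Rk     = c("1", "2", "Rk", "3"),
  Player = c("A", "B", "Player", "C"),
  G      = c("10", "3", "G", "7"),
  stringsAsFactors = FALSE
)

# Rows whose Rk value cannot be parsed as a number are the stray header rows
is_bad <- is.na(suppressWarnings(as.numeric(h$Rk)))
which(is_bad)

# Logical indexing stays correct even when no bad rows are found
h_clean <- h[!is_bad, ]

# Convert every column except the text column to numeric
num_cols <- setdiff(names(h_clean), "Player")
h_clean[num_cols] <- lapply(h_clean[num_cols], as.numeric)
str(h_clean)
```

Filtering with a logical vector rather than `h[-which(is_bad), ]` avoids the case where an empty index would drop every row.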

【Comments】:

    【Solution 2】:

    Before calling lm, convert the columns to numeric:

    pittsburgmodel$points <- as.numeric(as.character(pittsburgmodel$points))
    pittsburgmodel$assists <- as.numeric(as.character(pittsburgmodel$assists))
    

    Also, do not use the attach command, and improve your use of terminology: avoid calling a dataset a "model".
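
As a rough sketch of fitting without attach, the columns can be resolved through the `data` argument alone. The data frame `pit_stats`, its `games` column and all of its values are invented for illustration and are not the real Pittsburgh numbers.

```r
# Hypothetical numeric data standing in for the cleaned Pittsburgh rows
pit_stats <- data.frame(
  games   = c(82, 80, 75, 60, 41, 20),
  goals   = c(30, 22, 18, 9, 5, 1),
  assists = c(45, 30, 28, 12, 6, 3)
)

# No attach(): lm() looks the variables up in pit_stats via data =
fit <- lm(games ~ goals + assists, data = pit_stats)

# With numeric predictors the coefficient table has one row per predictor
rownames(coef(summary(fit)))
```

The coefficient table then has the shape of the desired output in the question: an intercept row plus one row each for goals and assists.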

    【Comments】:

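    • This may be the solution, but could you explain it in more detail? (I.e., you are converting factors (categorical variables) to numeric; you could link to previous questions ... as well as the R FAQs.)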
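
The reason for the as.character step can be shown with a small sketch; the factor `f` below is a made-up example, not data from the question.

```r
# A factor whose labels look like numbers
f <- factor(c("10", "2", "33", "2"))

# as.numeric() on a factor returns the internal level codes,
# not the numbers printed as labels
codes <- as.numeric(f)

# Going through character recovers the labelled values
values <- as.numeric(as.character(f))

# Equivalent form that converts the levels once and then indexes by the factor
values2 <- as.numeric(levels(f))[f]

codes
values
```

Here the levels sort as strings ("10", "2", "33"), so the codes are 1, 2, 3, 2 while the recovered values are 10, 2, 33, 2.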