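The code from the widyr section of Chapter 4 of Text Mining with R generates deprecation messages for the distinct_() and tbl_df() functions. Since Chapter 4 of the book contains more than 100 lines of code, we reduce it to the relevant section and the minimum set of packages needed to reproduce the warning messages.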
library(dplyr)
library(janeaustenr)
library(tidytext)
austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)
austen_section_words
library(widyr)
# count words co-occurring within sections
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE)
word_pairs
...which produces the following:
> # count words co-occuring within sections
> word_pairs <- austen_section_words %>%
+ pairwise_count(word, section, sort = TRUE)
Warning messages:
1: `distinct_()` is deprecated as of dplyr 0.7.0.
Please use `distinct()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `tbl_df()` is deprecated as of dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
>
> word_pairs
# A tibble: 796,008 x 3
   item1     item2         n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth   144
 2 elizabeth darcy       144
 3 miss      elizabeth   110
 4 elizabeth miss        110
 5 elizabeth jane        106
 6 jane      elizabeth   106
 7 miss      darcy        92
 8 darcy     miss         92
 9 elizabeth bingley      91
10 bingley   elizabeth    91
# … with 795,998 more rows
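The messages are generated because widyr::pairwise_count() calls dplyr::distinct_() (inside pairwise_count_(), shown below) and, further down the call chain, dplyr::tbl_df().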
#' @rdname pairwise_count
#' @export
pairwise_count_ <- function(tbl, item, feature, wt = NULL, ...) {
  if (is.null(wt)) {
    func <- squarely_(function(m) m %*% t(m), sparse = TRUE, ...)
    wt <- "..value"
  } else {
    func <- squarely_(function(m) m %*% t(m > 0), sparse = TRUE, ...)
  }
  tbl %>%
    distinct_(.dots = c(item, feature), .keep_all = TRUE) %>%
    mutate(..value = 1) %>%
    func(item, feature, wt) %>%
    rename(n = value)
}
When we print the warnings with lifecycle::last_warnings(), we can see where each warning originated.
<deprecated>
message: `tbl_df()` is deprecated as of dplyr 1.0.0.
Please use `tibble::as_tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
backtrace:
9. widyr::pairwise_count(., word, section, sort = TRUE)
10. widyr::pairwise_count_(...)
3. dplyr::distinct_(., .dots = c(item, feature), .keep_all = TRUE)
3. dplyr::mutate(., ..value = 1)
10. widyr:::func(., item, feature, wt)
19. widyr:::new_f(tbl, item, feature, value, ...)
7. widyr:::custom_melt(.)
15. dplyr::tbl_df(.)
>
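Because each warning is "displayed once every 8 hours", re-running the example in the same session may show nothing. A minimal sketch for making the warnings repeat while investigating, assuming the installed lifecycle version honors the lifecycle_verbosity option (this was not checked against lifecycle 0.2.0 here):

```r
# Assumption: lifecycle reads the `lifecycle_verbosity` option, and the value
# "warning" makes deprecated functions warn on every call instead of being
# throttled to once every 8 hours.
old <- options(lifecycle_verbosity = "warning")
current <- getOption("lifecycle_verbosity")

# ... re-run the pairwise_count() call here, then inspect
# lifecycle::last_warnings() as shown above ...

# Restore the previous setting so the rest of the session is unaffected.
options(old)
```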
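Version 0.1.3 is the current version of widyr. To resolve these warning messages, the calls to dplyr::distinct_() and dplyr::tbl_df() inside widyr must be replaced. Since widyr is a currently supported R package, the way to start this process is to report the problem on the widyr GitHub Issues page.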
As noted in the warning messages, distinct_() has been replaced by dplyr::distinct(), and tbl_df() has been replaced by tibble::as_tibble().
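To illustrate what those replacements look like, here is a minimal sketch on a toy data frame. It is not the widyr maintainers' fix. The toy data and the variable names deduped and result are made up for illustration, and converting the column-name strings with rlang::syms() is only one possible way to pass them to distinct().

```r
library(dplyr)
library(rlang)

# Toy data (hypothetical) with one duplicated (word, section) pair
toy <- data.frame(
  word = c("darcy", "darcy", "elizabeth"),
  section = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Column names arrive as strings, as in pairwise_count_()
item <- "word"
feature <- "section"

# Deprecated form:
#   distinct_(toy, .dots = c(item, feature), .keep_all = TRUE)
# Possible replacement: turn the strings into symbols and splice them in
deduped <- distinct(toy, !!!syms(c(item, feature)), .keep_all = TRUE)

# Deprecated form: tbl_df(deduped)
result <- tibble::as_tibble(deduped)
result
```

The duplicated ("darcy", 1) row is dropped, and the result is a tibble, with no deprecation warning from either call.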
Suppressing the warnings
The warnings produced by pairwise_count() can be suppressed by wrapping the call in suppressWarnings().
library(widyr)
suppressWarnings(
  # count words co-occurring within sections
  word_pairs <- austen_section_words %>%
    pairwise_count(word, section, sort = TRUE))
...and the output:
> suppressWarnings(
+ # count words co-occuring within sections
+ word_pairs <- austen_section_words %>%
+ pairwise_count(word, section, sort = TRUE))
>
> word_pairs
# A tibble: 796,008 x 3
   item1     item2         n
   <chr>     <chr>     <dbl>
 1 darcy     elizabeth   144
 2 elizabeth darcy       144
 3 miss      elizabeth   110
 4 elizabeth miss        110
 5 elizabeth jane        106
 6 jane      elizabeth   106
 7 miss      darcy        92
 8 darcy     miss         92
 9 elizabeth bingley      91
10 bingley   elizabeth    91
# … with 795,998 more rows
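Note that suppressWarnings() hides every warning raised by the wrapped expression, not only the deprecation messages. A narrower alternative, sketched below with base R only, muffles just the warnings whose message contains "is deprecated". The helper name muffle_deprecations() and the demo function noisy() are hypothetical, and matching on the message text is an assumption about how the warnings are worded.

```r
# Hypothetical helper: evaluate `expr`, muffling only warnings whose message
# contains "is deprecated"; every other warning is passed through unchanged.
muffle_deprecations <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl("is deprecated", conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Demo function (hypothetical) that raises one deprecation-style warning
# and one unrelated warning, then returns a value.
noisy <- function() {
  warning("`old_fun()` is deprecated as of pkg 1.0.0.")
  warning("something else went wrong")
  42
}

# Only the unrelated warning is reported; the return value is unaffected.
value <- muffle_deprecations(noisy())

# With the packages above loaded, the same wrapper could be applied as:
# word_pairs <- muffle_deprecations(
#   austen_section_words %>%
#     pairwise_count(word, section, sort = TRUE))
```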
Appendix
This code was run on R version 4.0.2 with the following packages, as reported by sessionInfo():
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tidytext_0.2.5 janeaustenr_0.1.5 widyr_0.1.3 tidyr_1.1.1
[5] dplyr_1.0.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 rstudioapi_0.11 magrittr_1.5 tidyselect_1.1.0
[5] lattice_0.20-41 R6_2.4.1 rlang_0.4.7 fansi_0.4.1
[9] stringr_1.4.0 tools_4.0.2 grid_4.0.2 packrat_0.5.0
[13] broom_0.7.0 utf8_1.1.4 cli_2.0.2 ellipsis_0.3.1
[17] assertthat_0.2.1 tibble_3.0.3 lifecycle_0.2.0 crayon_1.3.4
[21] Matrix_1.2-18 purrr_0.3.4 vctrs_0.3.2 tokenizers_0.2.1
[25] SnowballC_0.7.0 glue_1.4.1 stringi_1.4.6 compiler_4.0.2
[29] pillar_1.4.6 generics_0.0.2 backports_1.1.8 pkgconfig_2.0.3
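To compare a local setup against the versions listed above, a short sketch like the following prints the installed versions of the packages involved. The vector names pkgs, installed, and versions are illustrative, and packages that are not installed are skipped.

```r
# Packages relevant to reproducing the warnings
pkgs <- c("dplyr", "widyr", "tidytext", "janeaustenr", "lifecycle")

# Keep only those that are installed, without attaching them
installed <- pkgs[vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Look up each installed package's version as a string
versions <- vapply(
  installed,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)
print(versions)
```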