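I think you just need to spread and gather the data again, as is done in the example you provided. Also note that a reprex would help here, so that I don't have to build one from the example, which may not match your actual situation.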
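# creating fake data
library(gutenbergr)
library(tidytext)
library(dplyr)
library(janeaustenr)
library(stringr)  # str_detect() and str_extract() come from stringr, not stringi
library(tidyr)
library(ggplot2)  # needed for the plot at the end
library(scales)   # percent_format()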
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))
tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books <- tidy_books %>%
  anti_join(stop_words)
frequency <- bind_rows(mutate(tidy_bronte, author = "Hillary Clinton"),
                       mutate(tidy_hgwells, author = "Barack Obama"),
                       mutate(tidy_books, author = "Donald Trump")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  # one column per author
  spread(author, proportion) %>%
  # stack Clinton and Obama back into author/proportion; Trump stays as its own column
  gather(author, proportion, `Hillary Clinton`, `Barack Obama`)
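The last two lines of the pipeline are the ones you would apply to your own data frame. They spread the data into one column per author and then gather the Clinton and Obama columns back into long form, while keeping a separate column that holds only Trump's proportions.
Here is an example of what the resulting data frame looks like: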
> head(frequency)
# A tibble: 6 x 4
  word       `Donald Trump` author          proportion
  <chr>               <dbl> <chr>                <dbl>
1 a              0.00000919 Hillary Clinton 0.0000319
2 aback         NA          Hillary Clinton 0.00000398
3 abaht         NA          Hillary Clinton 0.00000398
4 abandon       NA          Hillary Clinton 0.0000319
5 abandoned      0.00000460 Hillary Clinton 0.0000916
6 abandoning    NA          Hillary Clinton 0.00000398
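To make the reshaping step easier to see, here is a minimal sketch on a toy table. The data, the author labels `A`, `B`, `C`, and the object names `toy`, `wide`, and `long` are all made up for illustration; `C` plays the role of the reference author that keeps its own column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per (author, word) with an invented proportion.
toy <- tribble(
  ~author, ~word,  ~proportion,
  "A",     "tax",  0.02,
  "A",     "jobs", 0.01,
  "B",     "tax",  0.03,
  "C",     "jobs", 0.04,
  "C",     "tax",  0.01
)

# spread(): one column per author; (author, word) pairs that never occur become NA.
wide <- toy %>% spread(author, proportion)

# gather(): stack A and B back into author/proportion, leaving C as its own column.
long <- wide %>% gather(author, proportion, A, B)

print(long)
```

With tidyr 1.0.0 or later, `pivot_wider(names_from = author, values_from = proportion)` followed by `pivot_longer(c(A, B), names_to = "author", values_to = "proportion")` should give an equivalent result (row order may differ); `spread()` and `gather()` still work but are superseded.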
Now it can be plotted as usual.
ggplot(frequency, aes(x = proportion, y = `Donald Trump`,
                      color = abs(`Donald Trump` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Donald Trump", x = NULL)