Plot scatter plot when values are wrongly paired: answers

【Question title】: Plot scatter plot when values are wrongly paired
【Posted】: 2019-02-15 03:07:37
【Question】:

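I'm trying to create some correlation plots from a data frame that I built with the spread() function (which comes from tidyr, not dplyr). When I used spread(), it created NAs in the new data frame. This makes sense, because the data frame holds concentration values for different parameters at different time periods.

Here is a sample screenshot of the original data frame:

When I use spread(), it gives me a data frame like this (sample data):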

structure(list(orgid = c("11NPSWRD", "11NPSWRD", "11NPSWRD", 
"11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", 
"11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", 
"11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD", "11NPSWRD"), 
    locid = c("11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", 
    "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2", "11NPSWRD-MORR_NPS_PR2"
    ), stdate = structure(c(9891, 9891, 9891, 9920, 9920, 9920, 
    9949, 9949, 9949, 9978, 9978, 9978, 10011, 10011, 10011, 
    10067, 10067, 10073, 10073, 10073), class = "Date"), sttime = structure(c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), class = c("hms", 
    "difftime"), units = "secs"), valunit = c("uS/cm", "mg/l", 
    "mg/l", "uS/cm", "mg/l", "mg/l", "uS/cm", "mg/l", "mg/l", 
    "uS/cm", "mg/l", "mg/l", "uS/cm", "mg/l", "mg/l", "uS/cm", 
    "mg/l", "uS/cm", "mg/l", "mg/l"), swqs = c("FW2-TP", "FW2-TP", 
    "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", 
    "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", 
    "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP", "FW2-TP"
    ), WMA = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), year = c(1997L, 1997L, 1997L, 
    1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 
    1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 1997L), 
    Chloride = c(NA, 35, NA, NA, 45, NA, NA, 30, NA, NA, 30, 
    NA, NA, 30, NA, NA, NA, NA, 35, NA), `Specific conductance` = c(224, 
    NA, NA, 248, NA, NA, 204, NA, NA, 166, NA, NA, 189, NA, NA, 
    119, NA, 194, NA, NA), `Total dissolved solids` = c(NA, NA, 
    101, NA, NA, 115, NA, NA, 96, NA, NA, 79, NA, NA, 89, NA, 
    56, NA, NA, 92)), .Names = c("orgid", "locid", "stdate", 
"sttime", "valunit", "swqs", "WMA", "year", "Chloride", "Specific conductance", 
"Total dissolved solids"), row.names = c(NA, 20L), class = "data.frame")

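The problem is that when I try to create the correlation plot, I get a plot with only one point. I'm guessing this is because of the NAs in the data frame. But when I try to filter out the NAs, I get a data frame with 0 observations. Any help would be greatly appreciated!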
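A minimal sketch of why filtering out the NAs leaves 0 observations, using a hypothetical three-row toy data frame (`toy` and its values are made up for illustration, shaped like the sample data above): after spreading, each row carries a value for only one parameter, so every row contains at least one NA and none of them counts as complete.

```r
# Hypothetical toy data shaped like the spread() output:
# one sampling date, three rows, one measured parameter per row
toy <- data.frame(
  stdate = as.Date(c("1997-01-30", "1997-01-30", "1997-01-30")),
  Chloride = c(NA, 35, NA),
  `Specific conductance` = c(224, NA, NA),
  `Total dissolved solids` = c(NA, NA, 101),
  check.names = FALSE
)

# Every row has an NA in at least one column, so no row is complete
n_complete <- nrow(toy[complete.cases(toy), ])
n_after_omit <- nrow(na.omit(toy))

n_complete    # 0
n_after_omit  # 0
```

The values are all present, just spread over different rows, which is why the rows have to be combined per sampling event before plotting.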

Sample code for creating the correlation plot:

plot1<-ggplot(data=df,aes(x="Specific conductance",y="Chloride"))+
  geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x)+
  geom_point()

I would like to create a plot like this:

【Comments】:

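Remove the quotes from aes(x="Specific conductance",y="Chloride"). Since the column name contains a space, use backticks instead: aes(x=`Specific conductance`,y=Chloride)
@PoGibas When I do that, I get this -> Error: Cannot add ggproto objects together. Did you forget to add this object to a ggplot object?
As you mentioned, your data is in an odd format, because each numeric value is paired only with NAs.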
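The first comment can be illustrated with a small sketch. The data frame `toy` and its values are hypothetical; `layer_data()` is ggplot2's helper for inspecting what a layer would draw. A quoted string inside aes() is treated as a constant, so every row lands on the same position, which would explain the single point in the question.

```r
library(ggplot2)

# Hypothetical toy data with a column name that contains a space
toy <- data.frame(
  `Specific conductance` = c(224, 248, 204),
  Chloride = c(35, 45, 30),
  check.names = FALSE
)

# Quoted strings are constants: all rows share one x and one y
p_quoted <- ggplot(toy, aes(x = "Specific conductance", y = "Chloride")) +
  geom_point()

# Backticks refer to the column itself
p_backtick <- ggplot(toy, aes(x = `Specific conductance`, y = Chloride)) +
  geom_point()

# Count the distinct point positions each plot would draw
n_quoted <- nrow(unique(layer_data(p_quoted)[, c("x", "y")]))
n_backtick <- nrow(unique(layer_data(p_backtick)[, c("x", "y")]))

n_quoted    # 1
n_backtick  # 3
```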

Tags: r ggplot2 regression lm ggpmisc

【Solution 1】:

You need to remove the NAs and collapse the rows that have the same date.

library(tidyverse)

# clean up column names by replacing every space with an underscore
# (str_replace() would only replace the first space in each name)
df <- df %>% 
  rename_with(~ str_replace_all(.x, " ", "_"))

# remove NAs & collapse rows which have the same date
library(data.table)
DT <- data.table(df)
DT2 <- unique(DT[, lapply(.SD, na.omit), by = stdate], by = "stdate")

library(ggpmisc)
formula1 <- y ~ x

# after_stat() needs ggplot2 >= 3.3.0; older versions use ..eq.label.. instead
ggplot(data = DT2, aes(x = Specific_conductance, y = Chloride)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = formula1) +
  stat_poly_eq(aes(label = paste(after_stat(eq.label), after_stat(rr.label), sep = "~~~~")), 
               label.x.npc = "left", label.y.npc = "top",
               formula = formula1, parse = TRUE, size = 6) +
  theme_bw(base_size = 14)

Created on 2018-09-10 by the reprex package (v0.2.0.9000).
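To see what the collapse step does, here is a small sketch on hypothetical toy data (the table `toy` and its values are invented for illustration). Within each date, `na.omit` drops the missing entries of every column, so the values that were spread over several rows end up side by side in one row.

```r
library(data.table)

# Hypothetical toy data: two dates, each parameter on its own row
toy <- data.table(
  stdate = as.Date(c("1997-01-30", "1997-01-30", "1997-02-28", "1997-02-28")),
  Chloride = c(NA, 35, NA, 45),
  Specific_conductance = c(224, NA, 248, NA)
)

# Per date, keep only the non-missing values of every column,
# then keep one row per date
collapsed <- unique(toy[, lapply(.SD, na.omit), by = stdate], by = "stdate")

nrow(collapsed)        # 2
anyNA(collapsed)       # FALSE
```

In this toy case every date has exactly one value per parameter. When a parameter is missing entirely for a date, as for Chloride on one date in the sample data, that row keeps an NA and ggplot2 drops it with a warning.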

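【Comments】:

Thanks for the answer, Tung! You rock!
By the way, if you want to put the equation and R2 on separate lines, use this solution
Hey Tung, quick question: should I be getting a warning message saying that rows have been removed?
Yes, that is normal and you can ignore the warning. If in doubt, double-check a few rows against the original data frame.
I know this is an old post... but I just realized that I don't want to collapse by date, because doing so removes values I want in my analysis. The stdate column has repeated dates whose values differ by sttime. Is there a way around this?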
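One possible way to approach the last comment, offered only as a sketch: group by sttime as well as stdate, so that samples taken at different times on the same date stay separate. This assumes that stdate plus sttime identifies one sampling event, which the thread does not confirm. The data frame `toy`, its values, and the helper `first_non_na` are hypothetical, and `across()` needs dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: two sampling events on the same date,
# distinguished only by sttime (seconds since midnight)
toy <- data.frame(
  stdate = as.Date(rep("1997-01-30", 4)),
  sttime = c(0, 0, 3600, 3600),
  Chloride = c(NA, 35, NA, 45),
  Specific_conductance = c(224, NA, 248, NA)
)

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse within each (date, time) pair instead of within each date
paired <- toy %>%
  group_by(stdate, sttime) %>%
  summarise(across(c(Chloride, Specific_conductance), first_non_na),
            .groups = "drop")

nrow(paired)  # 2
```

If a group holds several non-missing values for one parameter, this keeps only the first, so it is worth checking for such groups before relying on the result.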

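【Solution 2】:

A quick and dirty solution is to modify the data you already have: merge it with itself on specific columns, and keep only the matches where both values are not NA.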

library(ggplot2)

# Merge data with itself
# Here I'm only guessing columns that need to match between
# Conductance and Chloride
df2 <- merge(df, df, by = c("orgid", "locid", "stdate"))
# This will give table with multiple duplicate rows (all possible combinations)
# Non-key columns get the suffixes .x (first copy) and .y (second copy)

# Select only those combinations where both values are not NA
df2 <- subset(df2, !is.na(Chloride.x) & !is.na(`Specific conductance.y`))

# Plot
ggplot(df2, aes(`Specific conductance.y`, Chloride.x)) +
    geom_smooth(method = "lm", se = FALSE, color = "black", formula = y ~ x) +
    geom_point()
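A small sketch of what the self-merge produces, on hypothetical toy data (the data frame `toy` and its values are invented for illustration). Each date with two rows yields four combinations after the merge, and only the combination that pairs a Chloride value with a Specific conductance value survives the filter.

```r
# Hypothetical toy data: two dates, each parameter on its own row
toy <- data.frame(
  stdate = as.Date(c("1997-01-30", "1997-01-30", "1997-02-28", "1997-02-28")),
  Chloride = c(NA, 35, NA, 45),
  `Specific conductance` = c(224, NA, 248, NA),
  check.names = FALSE
)

# Self-merge on the date: every row is combined with every row of the same date
merged <- merge(toy, toy, by = "stdate")
n_merged <- nrow(merged)

# Keep only combinations where both values of interest are present
paired <- subset(merged, !is.na(Chloride.x) & !is.na(`Specific conductance.y`))
n_paired <- nrow(paired)

n_merged  # 8
n_paired  # 2
```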

【Comments】:

@KWANGER Bind the data by these columns so that you get numeric values paired with numeric values.