How to apply an equation to one column while taking other columns of a data frame into account in R? (Answers)

【Question title】: How to apply an equation to one column having in consideration other columns of a dataframe in R?
【Posted】: 2021-01-18 15:27:53
【Question description】:

My data looks like this:

tibble [1,702,551 x 4] (S3: tbl_df/tbl/data.frame)
$ date   : Date[1:1702551], format: "2011-04-12" "2011-04-12" ...
$ wlength: num [1:1702551] 350 351 352 353 354 355 356 357 358 359 ...
$ ID     : chr [1:1702551] "c01" "c01" "c01" "c01" ...
$ R      : num [1:1702551] 0.009 0.009 0.009 0.009 0.009 0.009 0.009 0.009 0.009 0.009 ...

head(fdata)
A tibble: 6 x 4
date       wlength ID        R
<date>       <dbl> <chr> <dbl>
1 2011-04-12     350 c01   0.009
2 2011-04-12     351 c01   0.009
3 2011-04-12     352 c01   0.009
4 2011-04-12     353 c01   0.009
5 2011-04-12     354 c01   0.009
6 2011-04-12     355 c01   0.009

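Quick explanation of the data: over 9 years, reflectance was recorded on a series of dates (date), at a range of wavelengths (wlength), for different kinds of vegetation (ID), for example "c01", "h07", and so on. The associated reflectance values are in R.

I want to apply this Normalized Difference Vegetation Index (NDVI) equation: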

(R800-R670)/(R800+R670)

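The number after R is the wavelength (wlength). Basically, for each "date" and each "ID", I want to extract the values of R where the wavelength equals 800 and 670 and apply the equation to them.

How can I handle all these variables so that I can apply this equation to my data?

Any help would be much appreciated. Thanks.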

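【Comments】:

How is R calculated? Does the result (of the equation) need to be a vector or a scalar?
R is the mean reflectance over the replicates of a given vegetation code (e.g. h01) at a given wavelength. The result needs to be a scalar.

Tags: r multiple-columns numeric equation categorical-data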

【Solution 1】:

Not very pretty, but it should work:

library(dplyr)

data <- tibble(
  date = c("2020-01-01", "2020-01-01", "2020-01-02"),
  wlength = c(800, 670, 800),
  ID = c('c01', 'c01', 'c01'),
  R = c(1, 2, 3))

data

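# Keep only the two bands NDVI needs, put each band in its own column,
# then collapse every date/ID group to a single row.
reduced <- data %>%
  filter(wlength %in% c(800, 670)) %>%
  mutate(
    R800 = ifelse(wlength == 800, R, NA),
    R670 = ifelse(wlength == 670, R, NA)) %>%
  group_by(date, ID) %>%
  summarise(
    R800 = max(R800, na.rm = TRUE),
    R670 = max(R670, na.rm = TRUE),
    # R800 and R670 here are the per-group values computed just above
    NDVI = (R800 - R670) / (R800 + R670),
    .groups = "drop")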

reduced
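
One caveat with the max(..., na.rm = TRUE) approach: when a date/ID group is missing one of the two wavelengths (as the 2020-01-02 row in the toy data is), max() over an all-NA vector returns -Inf with a warning instead of NA, and the index computed from it is meaningless. A minimal base-R sketch of that behaviour, using made-up values and a hypothetical helper named safe_max:

```r
# Hypothetical group with a reading at 800 nm but none at 670 nm
R800 <- c(3, NA)
R670 <- c(NA_real_, NA_real_)

r800 <- max(R800, na.rm = TRUE)                    # 3
r670 <- suppressWarnings(max(R670, na.rm = TRUE))  # -Inf: nothing is left once NAs are dropped

ndvi <- (r800 - r670) / (r800 + r670)              # Inf / -Inf, which is NaN

# safe_max is an illustrative helper (not part of dplyr): it returns NA
# when every value is missing, so the gap stays visible downstream.
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

ndvi_safe <- (safe_max(R800) - safe_max(R670)) / (safe_max(R800) + safe_max(R670))  # NA
```

Swapping safe_max() in for max(..., na.rm = TRUE) inside summarise() would be one way to make incomplete groups show up as NA.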

【Discussion】:

@CláudioSiva You might consider marking one of the answers as accepted.

【Solution 2】:

Here is one possibility using the tidyverse:

library(tidyverse)

fdata <-
  tribble(
          ~date , ~wlength , ~ID , ~R,
          "2011-04-12", 354 , "c01" , 0.022 ,
          "2011-04-12", 800 , "c01" , 0.014,
          "2011-04-12", 670 , "c01" , 0.009,
          "2011-04-15", 355 , "h07" , 0.012,
          "2011-04-15", 800 , "h07" , 0.003,
          "2011-04-15", 670 , "h07" , 0.077
  )

est_ndvi <-
  fdata %>%
  group_by(date, ID) %>%
  filter(wlength %in% c(670, 800)) %>%
  pivot_wider(names_from = wlength, names_prefix = "R", values_from = R) %>%
  mutate(ndvi = (R800 - R670)/(R800 + R670))
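
As a cross-check on the pivot_wider() result, the same numbers can be reproduced in base R. This is only a sketch: it swaps merge() in for the pivot step, rebuilds the sample data as a plain data.frame, and relies on the wavelengths being exactly 800 and 670.

```r
# Same sample values as above, as a plain data.frame
fdata <- data.frame(
  date    = c("2011-04-12", "2011-04-12", "2011-04-12",
              "2011-04-15", "2011-04-15", "2011-04-15"),
  wlength = c(354, 800, 670, 355, 800, 670),
  ID      = c("c01", "c01", "c01", "h07", "h07", "h07"),
  R       = c(0.022, 0.014, 0.009, 0.012, 0.003, 0.077)
)

# One table per band, then join them on date and ID;
# the suffixes turn the two R columns into R800 and R670.
r800 <- fdata[fdata$wlength == 800, c("date", "ID", "R")]
r670 <- fdata[fdata$wlength == 670, c("date", "ID", "R")]
wide <- merge(r800, r670, by = c("date", "ID"), suffixes = c("800", "670"))

wide$ndvi <- (wide$R800 - wide$R670) / (wide$R800 + wide$R670)
wide
# c01: (0.014 - 0.009) / (0.014 + 0.009), about 0.217
# h07: (0.003 - 0.077) / (0.003 + 0.077), about -0.925
```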

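【Discussion】:

Thank you very much for your help.

【Solution 3】:

First, see the note on floating-point equality below. It may not bite you with this data, but one problem with filtering on floating-point equality is that you may not realize it is happening, and your calculations will be silently wrong.

Two alternative solutions:

tidyverse, take 1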

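library(dplyr)
fdata %>%
  # Sort descending so the longer wavelength comes first within each group
  arrange(-wlength) %>%
  filter(wlength %in% c(352L, 350L)) %>%
  group_by(date, ID) %>%
  # Keep only the groups that have both wavelengths
  filter(n() == 2L) %>%
  summarize(
    # With this ordering diff(R) is R350 - R352 (shorter minus longer);
    # for (R800 - R670) / (R800 + R670), sort ascending or negate the result
    quux = diff(R) / sum(R),
    .groups = "drop"
  )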
# # A tibble: 4 x 3
#   date       ID      quux
#   <chr>      <chr>  <dbl>
# 1 2011-04-12 c01   -0.223
# 2 2011-04-12 c02   -0.152
# 3 2011-04-13 c01   -0.120
# 4 2011-04-13 c02    0.745

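tidyverse, take 2

# For each target wavelength, find the first reading within 1 unit of it,
# then return diff/sum of the matched R values (NA if a wavelength is missing).
func <- function(wl, r, wavelengths = c(800, 670)) {
  inds <- sapply(wavelengths, function(w) {
    diffs <- abs(wl - w)
    which(diffs < 1)[1]
  })
  diff(r[inds]) / sum(r[inds])
}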
fdata %>%
  group_by(date, ID) %>%
  summarize(
    quux = func(wlength, R, c(352, 350)),
    .groups = "drop"
  )
# # A tibble: 4 x 3
#   date       ID      quux
#   <chr>      <chr>  <dbl>
# 1 2011-04-12 c01   -0.223
# 2 2011-04-12 c02   -0.152
# 3 2011-04-13 c01   -0.120
# 4 2011-04-13 c02    0.745
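
To make the role of the wavelengths argument concrete, here is a small standalone sketch with made-up, non-integer wavelengths and reflectances. It repeats the definition of func so it runs on its own. Because diff() subtracts the first matched value from the second, the order of wavelengths decides the sign of the result.

```r
func <- function(wl, r, wavelengths = c(800, 670)) {
  inds <- sapply(wavelengths, function(w) which(abs(wl - w) < 1)[1])
  diff(r[inds]) / sum(r[inds])
}

wl <- c(669.6, 735.0, 800.3)  # hypothetical wavelengths, not exact integers
r  <- c(0.25, 0.50, 0.75)     # hypothetical reflectances

# Default order c(800, 670): (R670 - R800) / (R670 + R800)
default_order <- func(wl, r)              # -0.5

# Order c(670, 800): (R800 - R670) / (R800 + R670), the NDVI form from the question
ndvi_order <- func(wl, r, c(670, 800))    # 0.5

# A wavelength with no reading within 1 unit yields NA rather than an error
missing_band <- func(wl, r, c(670, 900))  # NA
```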

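Floating-point equality

Your wlength is a numeric field, and testing floating-point numbers for strict equality carries an occasional risk. Computers have limitations when it comes to floating-point numbers (a.k.a. double, numeric, float). This is a fundamental limitation of how computers in general handle non-integers, and it is not specific to any one programming language. There are add-on libraries or packages that are much better at arbitrary-precision math, but I believe most mainstream languages (which is relative/subjective, I admit) do not use them by default. References: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754.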
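
A short illustration of the problem in base R. The values are the usual textbook ones rather than anything from the wavelength data:

```r
x <- 0.1 + 0.2

exact_equal <- x == 0.3                    # FALSE: x is stored as 0.30000000000000004
in_set      <- x %in% c(0.3, 0.5)          # FALSE: %in% also compares exactly
near_equal  <- isTRUE(all.equal(x, 0.3))   # TRUE: all.equal() allows a small tolerance
within_tol  <- abs(x - 0.3) < 1e-9         # TRUE: explicit tolerance

# Whole numbers stored as doubles are represented exactly, so this is safe
whole_equal <- (799 + 1) == 800            # TRUE
```

The last line is why exact filtering happens to work when the wavelengths really are whole numbers; the risk appears once they come out of arithmetic or are read from a file with decimals.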

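Strict equality is not a problem for integers, and in my sample data the wavelengths are integers. You have a few options for dealing with this, typically by injecting or replacing components of the %>% pipe.

Convert to integer (rounding first, because as.integer() on its own truncates toward zero),
```
mutate(wlength = as.integer(round(wlength)))
```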

Filter with a specific tolerance, perhaps

filter(abs(wlength - 800) < 0.1 | abs(wlength - 670) < 0.1)

Convert on the fly,
```
filter(sprintf("%0.0f", wlength) %in% c("800", "670"))
```
(Not the most efficient, but it works and can support non-integer wavelengths.)
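
The three options behave differently on values that are only approximately whole numbers. A small base-R comparison on made-up wavelengths (the vector wl is purely illustrative):

```r
# Hypothetical wavelengths carrying a little numeric noise
wl <- c(669.9999999, 670.0000001, 800, 801)

truncated <- as.integer(wl)           # 669 670 800 801: as.integer() truncates toward zero
rounded   <- as.integer(round(wl))    # 670 670 800 801: rounding first recovers 670

# Tolerance filter, as in the filter(abs(wlength - 800) < 0.1 | ...) option
keep_tol <- abs(wl - 800) < 0.1 | abs(wl - 670) < 0.1        # TRUE TRUE TRUE FALSE

# String conversion: "%0.0f" rounds to the nearest whole number before comparing
keep_str <- sprintf("%0.0f", wl) %in% c("800", "670")        # TRUE TRUE TRUE FALSE
```

On this input the tolerance and sprintf() filters agree, while plain as.integer() would silently drop the first reading from a filter on 670.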

Data

fdata <- read.table(header = TRUE, text = "
date       wlength ID
2011-04-12     350 c01
2011-04-12     351 c01
2011-04-12     352 c01
2011-04-12     353 c01
2011-04-12     354 c01
2011-04-12     355 c01
2011-04-13     350 c01
2011-04-13     351 c01
2011-04-13     352 c01
2011-04-13     353 c01
2011-04-13     354 c01
2011-04-13     355 c01
2011-04-12     350 c02
2011-04-12     351 c02
2011-04-12     352 c02
2011-04-12     353 c02
2011-04-12     354 c02
2011-04-12     355 c02
2011-04-13     350 c02
2011-04-13     351 c02
2011-04-13     352 c02
2011-04-13     353 c02
2011-04-13     354 c02
2011-04-13     355 c02
")
set.seed(2021)
fdata$R <- round(runif(nrow(fdata)), 3)

【Discussion】:

Thanks for your help. A whole new insight into how to tackle this problem.