Sparklyr Error in as.character(call[[1]]): cannot coerce type 'closure' to vector of type 'character'

【Question title】: Sparklyr Error in as.character(call[[1]]) : cannot coerce type 'closure' to vector of type 'character'
【Posted】: 2021-01-20 06:31:10
【Question description】:

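I am trying to figure out what this error means and how to fix it. I am using sparklyr with Spark 3.0 to solve a multiclass classification problem with a random forest. Before feature engineering, my data looks like this:

The data has about 1 million rows: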


 Source: spark<?> [?? x 8]
   label_detail duration orig_bytes resp_bytes proto history time_diff_from_last_connection resp_class
   <chr>           <dbl>      <int>      <int> <chr> <chr>                            <dbl> <chr>     
 1 okiru               0          0          0 tcp   S                             0.000250 A         
 2 okiru               0          0          0 tcp   S                             0.000250 B         
 3 okiru               0          0          0 tcp   S                             0.000250 C         
 4 okiru               0          0          0 tcp   S                             0.000250 A         
 5 okiru               0          0          0 tcp   S                             0.000250 B

I then build an ML pipeline as follows:

pipline <- ml_pipeline(sc) %>% 
  ft_string_indexer("label_detail", "label_detail_idx") %>% 
  ft_string_indexer("proto", "proto_idx") %>% 
  ft_string_indexer("resp_class", "resp_class_idx") %>% 
  ft_one_hot_encoder(
    input_cols = c( "proto_idx", "resp_class_idx"),
    output_cols = c( "proto_encode", "resp_class_encode")) %>%
  ft_regex_tokenizer("history", "history_token", pattern = "") %>% 
  ft_count_vectorizer(input_col = "history_token", output_col = "history_vector") %>% 
  ft_vector_assembler(
    input_cols = c("duration", "orig_bytes", 
                   "resp_bytes", "proto_encode", "time_diff_from_last_connection", "resp_class_encode", "history_vector"), 
    output_col = "features") %>% 
  ml_random_forest_classifier(label_col="label_detail_idx",
                              features_col="features",
                              seed=222)

model_rf<-ml_fit(pipline,zeek_train)

Running ml_fit produces the following error:

> model_rf<-ml_fit(pipline,zeek_train)
Error in as.character(call[[1]]) : 
  cannot coerce type 'closure' to vector of type 'character'

I get the same error when I use the data and example from Mastering Spark with R (https://therinspark.com/pipelines.html):

okc_train <- spark_read_parquet(sc, "data/okc-train.parquet")

okc_train <- okc_train %>% 
  select(not_working, age, sex, drinks, drugs, essay1:essay9, essay_length)

pipeline <- ml_pipeline(sc) %>%
  ft_string_indexer(input_col = "sex", output_col = "sex_indexed") %>%
  ft_string_indexer(input_col = "drinks", output_col = "drinks_indexed") %>%
  ft_string_indexer(input_col = "drugs", output_col = "drugs_indexed") %>%
  ft_one_hot_encoder(
    input_cols = c("sex_indexed", "drinks_indexed", "drugs_indexed"),
    output_cols = c("sex_encoded", "drinks_encoded", "drugs_encoded")
  ) %>%
  ft_vector_assembler(
    input_cols = c("age", "sex_encoded", "drinks_encoded", 
                   "drugs_encoded", "essay_length"), 
    output_col = "features"
  ) %>%
  ft_standard_scaler(input_col = "features", output_col = "features_scaled", 
                     with_mean = TRUE) %>%
  ml_logistic_regression(features_col = "features_scaled", 
                         label_col = "not_working")


 ml_fit(pipeline, okc_train)

【Discussion】:

Tags: r sparklyr

【Solution 1】:

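I ran into the same error with the book example, so I called traceback() right after the error message to get more detail. It appears that this functionality requires Spark 3.0.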
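As for what the message itself means: a minimal base R sketch, independent of sparklyr, that produces the same text. It assumes the failing code calls as.character() on the first element of a call object, and that this element is a function object (a "closure") rather than a function name. The names f and cl are made up for illustration, and this does not show where inside sparklyr the call is built.

```r
# Hypothetical function used only to build a call for illustration.
f <- function(x) x

# A call whose first element is the function object itself, not its name.
cl <- as.call(list(f, 1))

# as.character() cannot convert a function object, so this raises the error.
# tryCatch() captures the message instead of stopping the script.
msg <- tryCatch(
  as.character(cl[[1]]),
  error = function(e) conditionMessage(e)
)
print(msg)

# In an interactive session, calling traceback() immediately after the
# failed ml_fit() call lists the stack of calls that led to the error.
```

The traceback output is what shows which internal function received the closure, which is the detail the error message alone does not give.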

【Comments】:

I believe I am also using Spark 3.0; I will check which Spark engine is actually being invoked.
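One way to check which versions are actually in play, as a rough sketch. report_versions is a hypothetical helper written for this thread, not part of sparklyr; it assumes sc is the connection returned by spark_connect() and falls back to NA when sparklyr is not installed or no connection is passed.

```r
# Hypothetical helper (not a sparklyr function): gather version information.
report_versions <- function(sc = NULL) {
  has_sparklyr <- requireNamespace("sparklyr", quietly = TRUE)
  list(
    # Version of the running R session.
    r = as.character(getRversion()),
    # Installed sparklyr package version, if the package is available.
    sparklyr = if (has_sparklyr) {
      as.character(utils::packageVersion("sparklyr"))
    } else {
      NA_character_
    },
    # Spark version reported by the live connection, if one is supplied.
    spark = if (has_sparklyr && !is.null(sc)) {
      as.character(sparklyr::spark_version(sc))
    } else {
      NA_character_
    }
  )
}

# Without a connection, only the R and sparklyr versions are filled in.
str(report_versions())
```

With an open connection, report_versions(sc) also fills in the Spark version that the session is really talking to, which can differ from the version you intended to install.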