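【Posted】: 2021-01-20 06:31:10
【Question】:
I'm trying to figure out what this error means and how to fix it. I'm using sparklyr with Spark 3.0 to solve a multiclass classification problem with a random forest. Before feature engineering, my data looks like this:
The data has about 1 million rows: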
Source: spark<?> [?? x 8]
  label_detail duration orig_bytes resp_bytes proto history time_diff_from_last_connection resp_class
  <chr>           <dbl>      <int>      <int> <chr> <chr>                            <dbl> <chr>
1 okiru               0          0          0 tcp   S                             0.000250 A
2 okiru               0          0          0 tcp   S                             0.000250 B
3 okiru               0          0          0 tcp   S                             0.000250 C
4 okiru               0          0          0 tcp   S                             0.000250 A
5 okiru               0          0          0 tcp   S                             0.000250 B
I then build an ML pipeline as follows:
pipline <- ml_pipeline(sc) %>%
  ft_string_indexer("label_detail", "label_detail_idx") %>%
  ft_string_indexer("proto", "proto_idx") %>%
  ft_string_indexer("resp_class", "resp_class_idx") %>%
  ft_one_hot_encoder(
    input_cols = c("proto_idx", "resp_class_idx"),
    output_cols = c("proto_encode", "resp_class_encode")) %>%
  ft_regex_tokenizer("history", "history_token", pattern = "") %>%
  ft_count_vectorizer(input_col = "history_token", output_col = "history_vector") %>%
  ft_vector_assembler(
    input_cols = c("duration", "orig_bytes",
                   "resp_bytes", "proto_encode", "time_diff_from_last_connection", "resp_class_encode", "history_vector"),
    output_col = "features") %>%
  ml_random_forest_classifier(label_col = "label_detail_idx",
                              features_col = "features",
                              seed = 222)
model_rf <- ml_fit(pipline, zeek_train)
Running ml_fit produces the following error:
> model_rf<-ml_fit(pipline,zeek_train)
Error in as.character(call[[1]]) :
cannot coerce type 'closure' to vector of type 'character'
I get the same error when using the data and example from Mastering Spark with R (https://therinspark.com/pipelines.html):
okc_train <- spark_read_parquet(sc, "data/okc-train.parquet")
okc_train <- okc_train %>%
  select(not_working, age, sex, drinks, drugs, essay1:essay9, essay_length)
pipeline <- ml_pipeline(sc) %>%
  ft_string_indexer(input_col = "sex", output_col = "sex_indexed") %>%
  ft_string_indexer(input_col = "drinks", output_col = "drinks_indexed") %>%
  ft_string_indexer(input_col = "drugs", output_col = "drugs_indexed") %>%
  ft_one_hot_encoder(
    input_cols = c("sex_indexed", "drinks_indexed", "drugs_indexed"),
    output_cols = c("sex_encoded", "drinks_encoded", "drugs_encoded")
  ) %>%
  ft_vector_assembler(
    input_cols = c("age", "sex_encoded", "drinks_encoded",
                   "drugs_encoded", "essay_length"),
    output_col = "features"
  ) %>%
  ft_standard_scaler(input_col = "features", output_col = "features_scaled",
                     with_mean = TRUE) %>%
  ml_logistic_regression(features_col = "features_scaled",
                         label_col = "not_working")
ml_fit(pipeline, okc_train)
【Discussion】: