【Posted】: 2018-12-18 12:15:03
【Question】:
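I built a pipeline from three pyspark.ml.feature stages (Tokenizer, CountVectorizer, IDF). The first time everything ran fine, but on the second attempt it fails with Py4JJavaError: An error occurred while calling o175.fit. Does anyone know what causes this error? Thanks.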
import findspark
findspark.init()
import pyspark.sql.types as typ
import pyspark as ps
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import warnings
from pyspark.sql import SQLContext
sparkSession = SparkSession.builder \
    .master("local[2]") \
    .appName("Pyspark Sentiment") \
    .getOrCreate()
df = sparkSession.read.load('data/Microblog_Trialdata.csv',
                            format='com.databricks.spark.csv',
                            header='true',
                            inferSchema='true')
df=df.select("sentiment score","spans")
(train_set, val_set, test_set) = df.randomSplit([0.6, 0.2, 0.2], seed = 42)
from pyspark.ml.feature import HashingTF, IDF, Tokenizer ,CountVectorizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
tokenizer = Tokenizer(inputCol="spans", outputCol="words")
CV = CountVectorizer(vocabSize=2**11, inputCol="words", outputCol='cv_')
idf = IDF(inputCol='cv_', outputCol="features", minDocFreq=5)  # minDocFreq: remove sparse terms
#model=CV.fit(data)
#vo=model.vocabulary
#print(type(vo))
pipeline = Pipeline(stages=[tokenizer, CV, idf])
pipelineFit = pipeline.fit(train_set)
train_df = pipelineFit.transform(train_set)
val_df = pipelineFit.transform(val_set)
train_df.select("cv_").show(5,truncate=False)
train_df.show(5)
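For readers unfamiliar with the three stages, here is a rough pure-Python sketch of what Tokenizer, CountVectorizer and IDF compute. It is a simplification and not Spark's implementation: the helper names (`tokenize`, `fit_vocabulary`, `count_vector`, `fit_idf`) and the three sample sentences are made up for illustration, the vocabulary tie-breaking is an assumption chosen for determinism, and dense lists stand in for Spark's sparse vectors. The IDF weight follows the formula documented for Spark ML, log((m + 1) / (df + 1)), with terms below `minDocFreq` set to 0.

```python
import math
from collections import Counter


def tokenize(text):
    # Roughly what Tokenizer does: lowercase, then split on whitespace.
    return text.lower().split()


def fit_vocabulary(token_lists, vocab_size):
    # Keep the vocab_size most frequent terms across the corpus.
    # Ties are broken alphabetically here (an assumption for determinism).
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:vocab_size]


def count_vector(tokens, vocabulary):
    # Term counts per document, as a dense list aligned with the vocabulary.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def fit_idf(count_vectors, min_doc_freq=0):
    # idf(t) = log((m + 1) / (df(t) + 1)); terms seen in fewer than
    # min_doc_freq documents get weight 0.
    m = len(count_vectors)
    n_terms = len(count_vectors[0]) if count_vectors else 0
    weights = []
    for j in range(n_terms):
        df = sum(1 for vec in count_vectors if vec[j] > 0)
        weights.append(math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
    return weights


# Hypothetical sample data, not taken from Microblog_Trialdata.csv.
docs = ["Stocks rally today", "stocks fall", "markets rally"]
token_lists = [tokenize(d) for d in docs]
vocabulary = fit_vocabulary(token_lists, vocab_size=2**11)
vectors = [count_vector(t, vocabulary) for t in token_lists]
idf_weights = fit_idf(vectors, min_doc_freq=2)
features = [[c * w for c, w in zip(vec, idf_weights)] for vec in vectors]
```

In this sketch any term that appears in fewer than `min_doc_freq` documents ends up with a zero feature value, so with `minDocFreq=5` on a small training split many columns of `features` may be all zeros.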
【Comments】:
-
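You need to provide more details about the error. My guess, though, is that a categorical id that was not seen the first time could be causing it.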
-
Hi hamza, I have edited the question. Sorry, but I don't understand what you mean by a categorical id?
-
Hi 乔维尔. I added an answer that you can try.
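On the request for more details about the error: the one-line "An error occurred while calling o175.fit" message is only the header, and the useful part is usually further down in the Java stack trace, which as far as I know is included in the text of the Py4JJavaError. A minimal sketch for pulling out the "Caused by:" lines follows. The helper names `caused_by_lines` and `fit_or_explain` are hypothetical, not part of PySpark or Py4J, and the sketch only assumes that the estimator has a `fit` method.

```python
def caused_by_lines(error_text):
    """Return the 'Caused by:' lines of a stack trace, in order of appearance."""
    return [
        line.strip()
        for line in error_text.splitlines()
        if line.strip().startswith("Caused by:")
    ]


def fit_or_explain(estimator, dataset):
    """Call estimator.fit(dataset); on failure print likely root causes, then re-raise."""
    try:
        return estimator.fit(dataset)
    except Exception as exc:
        print("fit failed:", type(exc).__name__)
        for cause in caused_by_lines(str(exc)):
            print("  ", cause)
        raise
```

With the names from the question this would be called as `pipelineFit = fit_or_explain(pipeline, train_set)`. The last "Caused by:" line printed is typically the one worth adding to the question.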
Tags: python pyspark pipeline apache-spark-ml