[Posted]: 2018-01-16 09:56:29
[Question]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.getOrCreate()

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
If I run
tokenized.head()
I would like to get a result like this:
Row(id=0, sentence='Hi I heard about Spark',
words=['H', 'i', ' ', 'h', 'e', 'a', ...])
However, the current result is:
Row(id=0, sentence='Hi I heard about Spark',
words=['Hi','I','heard','about','spark'])
Is there any way to achieve this with Tokenizer or RegexTokenizer in PySpark?
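One approach, sketched below: RegexTokenizer with gaps=False treats its pattern as a token matcher rather than a delimiter, so pattern="." (any single character) emits one token per character. The snippet demonstrates the regex behavior with Python's re module; the equivalent RegexTokenizer call (which needs a live SparkSession, so it is shown as a comment) uses the pattern, gaps, and toLowercase parameters of pyspark.ml.feature.RegexTokenizer.

```python
import re

sentence = "Hi I heard about Spark"

# With gaps=False, RegexTokenizer keeps regex *matches* as tokens,
# so "." (match any one character) splits the string into characters.
chars = re.findall(r".", sentence)
print(chars[:6])  # ['H', 'i', ' ', 'I', ' ', 'h']

# Equivalent PySpark sketch (untested here; requires an active SparkSession):
#
# from pyspark.ml.feature import RegexTokenizer
# tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words",
#                            pattern=".", gaps=False, toLowercase=False)
# tokenized = tokenizer.transform(sentenceDataFrame)
```

Note that RegexTokenizer lowercases its input by default (that is also why the plain Tokenizer output shows 'spark' instead of 'Spark'), so toLowercase=False is needed to preserve case.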
[Discussion]:
Tags: python apache-spark pyspark spark-dataframe apache-spark-mllib