【Title】: MLlib regexTokenizer is ignoring accents
【Posted】: 2020-01-07 12:47:54
【Question】:

I'm testing the MLlib tokenizers with PySpark (Python 3) like this:

# -*- coding: utf-8 -*-

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.ml.feature import Tokenizer, RegexTokenizer

# Creating dataframe
sentenceData = spark.createDataFrame([
(["Eu acho que MLlib é incrível!"]),
(["Muito mais legal do que scikit-learn"])
], ["raw"])

# Putting sequential indexer on DataFrame
w = Window.orderBy('raw')
sentenceData = sentenceData.withColumn("id", row_number().over(w))

# Configuring regexTokenizer
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")

# Applying Tokenizer to dataset
sentenceData = regexTokenizer.transform(sentenceData)

sentenceData.select(
   'id','raw','words'
).show(truncate=False)

The result is:

+---+------------------------------------+--------------------------------------------+
|id |raw                                 |words                                       |
+---+------------------------------------+--------------------------------------------+
|1  |Eu acho que MLlib é incrível!       |[eu, acho, que, mllib, incr, vel]           |
|2  |Muito mais legal do que scikit-learn|[muito, mais, legal, do, que, scikit, learn]|
+---+------------------------------------+--------------------------------------------+

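As you can see, the word 'incrível' (Portuguese for 'amazing') was split into two "new words" because of the character 'í'. I couldn't find anything in the documentation that addresses this, so I'm lost!

I tried changing the 'pattern' setting of 'regexTokenizer' to include 'í' directly, and tried other patterns that put the '\w' character inside a character class (something like pattern="[\Wí\w ]+"), but nothing worked! Is there a way to set the language to Portuguese, or otherwise force Spark not to ignore accents?

Thanks!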

【Comments】:

    Tags: regex tokenize apache-spark-mllib


    【Solution 1】:

    Try this pattern:

    pattern="[\\p{L}\\w]+"
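A likely reason the original `\W` pattern breaks on 'í': Spark's RegexTokenizer runs on the JVM and uses Java regular expressions, where `\w` and `\W` cover only ASCII letters, digits and the underscore unless Unicode character classes are switched on, while `\p{L}` matches any Unicode letter. The sketch below only imitates that behaviour with Python's `re` module (`re.ASCII` stands in for Java's default; this is not Spark code), to show where the split comes from.

```python
import re

text = "Eu acho que MLlib é incrível!".lower()

# ASCII-only \W (imitating Java's default): accented letters count as separators.
ascii_tokens = [t for t in re.split(r"\W", text, flags=re.ASCII) if t]

# Unicode-aware \w (Python's default for str): accented letters are word characters.
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)    # ['eu', 'acho', 'que', 'mllib', 'incr', 'vel']
print(unicode_tokens)  # ['eu', 'acho', 'que', 'mllib', 'é', 'incrível']
```

The first list reproduces the broken tokens from the question, including the missing 'é', which is treated as a separator as well.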
    

    It worked for me with Scala code, as follows:

    val tokenizer = new RegexTokenizer().setGaps(false)
                    .setPattern("[\\p{L}\\w]+")
                    .setInputCol("raw")
                    .setOutputCol("words")
    

    【Comments】:
