【Posted on】: 2018-11-30 23:27:30
【Problem Description】:
I have a Spark DataFrame in which a given column contains some text. I am trying to clean the text and split it on commas, which should output a new column containing a list of words.
The problem I am running into is that some elements of that list contain trailing whitespace that I want to remove.
Code:
# Libraries
# Standard Libraries
from typing import Dict, List, Tuple

# Third Party Libraries
import pyspark
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace every run of characters that are not letters or digits with a
    # single space, keeping commas(,), since we still want to split on commas(,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), "[^a-zA-Z0-9,]+", " "))
    # Split the affiliation string on a comma
    sdf_temp = sdf_temp.withColumn(
        colName=output_col,
        col=s_function.split(sdf_temp[input_col], ", "))
    return sdf_temp
if __name__ == "__main__":
    # Sample data
    a_1 = "Department of Bone and Joint Surgery, Ehime University Graduate"\
          " School of Medicine, Shitsukawa, Toon 791-0295, Ehime, Japan."\
          " shinyama@m.ehime-u.ac.jp."
    a_2 = "Stroke Pharmacogenomics and Genetics, Fundació Docència i Recerca"\
          " Mútua Terrassa, Hospital Mútua de Terrassa, 08221 Terrassa, Spain."
    a_3 = "Neurovascular Research Laboratory, Vall d'Hebron Institute of Research,"\
          " Hospital Vall d'Hebron, 08035 Barcelona, Spain;catycarrerav@gmail.com"\
          " (C.C.). catycarrerav@gmail.com."
    data = [(1, a_1), (2, a_2), (3, a_3)]

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("My_test")\
        .config("spark.ui.port", "37822")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    af_data = spark.createDataFrame(data, ["index", "text"])
    sdf_tokens = tokenize(af_data)
    sdf_tokens.select("tokens").show(truncate=False)
Output:
|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon , Ehime, Japan ] |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain ] |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C ] |
Desired output:
|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon, Ehime, Japan] |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain] |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C] |
So that in:
- the first row: 'Toon ' -> 'Toon', 'Japan ' -> 'Japan'
- the second row: 'Spain ' -> 'Spain'
- the third row: 'Spain C C ' -> 'Spain C C'
Note:
The trailing whitespace does not appear only in the last element of the list; it can appear in any element.
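The trailing spaces come from the third regexp_replace: a run such as " 791-0295" or a final "." is collapsed to a single space before the split on ", ". One possible fix (a sketch of my own, not code from the question) is to strip the whole string and then split on a regex that also consumes the whitespace around each comma. It is illustrated below with Python's re module so the logic is easy to follow:

```python
import re

def split_clean(text: str) -> list:
    """Sketch of a fix: clean roughly as in the question, but split on a
    regex that swallows the whitespace around each comma."""
    no_digits = re.sub(r"\d", "", text)               # drop digits
    cleaned = re.sub(r"[^a-zA-Z,]+", " ", no_digits)  # keep letters and commas
    # Strip the ends of the whole string, then split on a comma plus any
    # surrounding whitespace, so no element keeps a trailing space
    return [tok for tok in re.split(r"\s*,\s*", cleaned.strip()) if tok]

print(split_clean("Shitsukawa, Toon 791-0295, Ehime, Japan."))
# -> ['Shitsukawa', 'Toon', 'Ehime', 'Japan']
```

In PySpark the last step of tokenize could do the same thing, along the lines of s_function.split(s_function.trim(s_function.col(input_col)), r"\s*,\s*"): both trim and split exist in pyspark.sql.functions, and split's pattern argument is interpreted as a Java regular expression.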
【Comments】:
- Your expected output is not valid Python - if those are supposed to be strings, quote them. It would be much clearer where the whitespace inside the strings is, as opposed to the whitespace next to the colons. Also, your af_data = spark.createDataFrame(data, ["index", ""text"]) line has one " too many at the end - so this code will not even run. Please fix. Thanks.
- @PatrickArtner When you display a PySpark DataFrame containing strings with .show(), the quotes are omitted.
- @pault Good to know - still, for the "desired output", quotes would add a lot of clarity - at least I think so.
Tags: python-3.x apache-spark pyspark apache-spark-sql