Rename columns with special characters in a Python or PySpark dataframe

【Question title】: Rename columns with special characters in python or Pyspark dataframe
【Posted】: 2017-03-12 21:44:12
【Question】:

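I have a dataframe in python/pyspark. The column names contain special characters such as dots (.), spaces, parentheses (()) and curly braces ({}).

Now I want to rename the columns: dots and spaces should be replaced with underscores, and () and {} should be removed from the column names.

This is what I have done so far: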

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))

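With this I can replace the dots and spaces with underscores, but I cannot do the second part, i.e. remove () and {} from the column names when they are present.

How can we achieve this?

【Comments】:

Replace them with the empty string "".
@Denziloe I tried this: df1 = mysql.toDF(*(re.sub(r'[\.\s]+ [\(){}\s]', '_','', c) for c in mysql.columns)) and got the following error: Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: toDF() argument after * must be a sequence, not generator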
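
The attempt in the comment above passes four positional arguments to re.sub, so the column name c ends up in the count parameter instead of being the string to process. That is probably the real cause of the TypeError, since Python 2 can report a TypeError raised inside a generator during * unpacking as "must be a sequence, not generator". A minimal sketch of one way around it, assuming two chained re.sub calls are acceptable; clean_name and the sample column names are hypothetical and not from the original post:

```python
import re

def clean_name(name):
    """Hypothetical helper: underscores for dots/spaces, drop (){}."""
    # Collapse each run of dots or whitespace into a single underscore.
    name = re.sub(r'[.\s]+', '_', name)
    # Delete parentheses and curly braces by substituting the empty string.
    return re.sub(r'[(){}]', '', name)

columns = ["col1.some val(in){x}", "plain", "a . b"]  # illustrative names only
new_columns = [clean_name(c) for c in columns]
print(new_columns)

# With a PySpark DataFrame `df` (assumed to exist), unpack a list rather than a generator:
# df1 = df.toDF(*[clean_name(c) for c in df.columns])
```

Building the list first also means that any exception raised while cleaning a name surfaces with its own message instead of being reported by the toDF call.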

Tags: python pandas dataframe pyspark spark-dataframe

【Solution 1】:

If you have a PySpark dataframe, you can use the withColumnRenamed function to rename the columns. I tried it as shown below; take a look and adapt it to your needs.

>>> from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> l=[('some value1','some value2','some value 3'),('some value4','some value5','some value 6')]
>>> l_schema = StructType([StructField("col1.some valwith(in)and{around}",StringType(),True),StructField("col2.some valwith()and{}",StringType(),True),StructField("col3 some()valwith.and{}",StringType(),True)])
>>> reps=('.','_'),(' ','_'),('(',''),(')',''),('{',''),('}','')
>>> rdd = sc.parallelize(l)
>>> df = sqlContext.createDataFrame(rdd,l_schema)
>>> df.printSchema()
root
 |-- col1.some valwith(in)and{around}: string (nullable = true)
 |-- col2.some valwith()and{}: string (nullable = true)
 |-- col3 some()valwith.and{}: string (nullable = true)

>>> df.show()
+--------------------------------+------------------------+------------------------+
|col1.some valwith(in)and{around}|col2.some valwith()and{}|col3 some()valwith.and{}|
+--------------------------------+------------------------+------------------------+
|                     some value1|             some value2|            some value 3|
|                     some value4|             some value5|            some value 6|
+--------------------------------+------------------------+------------------------+

>>> def colrename(x):
...    return reduce(lambda a,kv : a.replace(*kv),reps,x)
>>> for i in df.schema.names:
...    df = df.withColumnRenamed(i,colrename(i))
>>> df.printSchema()
root
 |-- col1_some_valwithinandaround: string (nullable = true)
 |-- col2_some_valwithand: string (nullable = true)
 |-- col3_somevalwith_and: string (nullable = true)

>>> df.show()
+----------------------------+--------------------+--------------------+
|col1_some_valwithinandaround|col2_some_valwithand|col3_somevalwith_and|
+----------------------------+--------------------+--------------------+
|                 some value1|         some value2|        some value 3|
|                 some value4|         some value5|        some value 6|
+----------------------------+--------------------+--------------------+

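【Comments】:

One column looks like this: col1.some.val{with} and val(abc). How can we get col1_some_valwithand_valabc?
Here we replaced spaces with underscores. Suppose I want to use some other special character that Hive supports instead of a space. How can we do that?
Change the mapping in reps.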
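
The suggestion to change the reps mapping can be sketched roughly as follows. This is only an illustration under assumptions: make_renamer is a hypothetical helper, and the '-' replacement is an arbitrary example; whether a given character is accepted in Hive column names is not checked here.

```python
from functools import reduce  # builtin on Python 2, must be imported on Python 3

def make_renamer(reps):
    """Hypothetical helper: return a function applying each (old, new) pair in order."""
    def rename(name):
        return reduce(lambda acc, pair: acc.replace(*pair), reps, name)
    return rename

# Same mapping as in the answer: dots and spaces become underscores, brackets are dropped.
underscore_reps = [('.', '_'), (' ', '_'), ('(', ''), (')', ''), ('{', ''), ('}', '')]
# Variant: a different replacement for spaces ('-' is purely an example).
dash_reps = [('.', '_'), (' ', '-'), ('(', ''), (')', ''), ('{', ''), ('}', '')]

name = "col1.some.val{with} and val(abc)"
print(make_renamer(underscore_reps)(name))
print(make_renamer(dash_reps)(name))

# Applying it to a PySpark DataFrame `df` (assumed to exist):
# rename = make_renamer(underscore_reps)
# for old in df.schema.names:
#     df = df.withColumnRenamed(old, rename(old))
```

Because the pairs are applied in order with plain str.replace, each space becomes its own replacement character; runs of spaces are not collapsed the way the regex in the question collapses them.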

【Solution 2】:

Python 3.x solution:

tran_tab = str.maketrans({x:None for x in list('{()}')})

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c).translate(tran_tab) for c in df.columns))

Python 2.x solution:

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c).translate(None, '(){}') for c in df.columns))

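【Comments】:

I get this error in pyspark: >>> tran_tab = str.maketrans({x:None for x in list('{()}')}) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: type object 'str' has no attribute 'maketrans'
Say I want to use a special character supported by Hive instead of a space. How can we do that?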
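
The AttributeError in the first comment above suggests that the PySpark shell was running Python 2, where str has no maketrans method, so the Python 2.x variant of the answer would be the one to use there. A minimal sketch that picks the branch at runtime, assuming column names are plain str objects (on Python 2, unicode names would need a different translate call); strip_brackets and clean are hypothetical names:

```python
import re
import sys

if sys.version_info[0] >= 3:
    # Python 3: the third argument of str.maketrans lists characters to delete.
    _DELETE_TABLE = str.maketrans('', '', '(){}')

    def strip_brackets(name):
        return name.translate(_DELETE_TABLE)
else:
    # Python 2 byte strings: translate(None, deletechars) removes the listed characters.
    def strip_brackets(name):
        return name.translate(None, '(){}')

def clean(name):
    """Underscores for runs of dots/whitespace, then drop (){}."""
    return strip_brackets(re.sub(r'[.\s]+', '_', name))

print(clean("col2.some valwith()and{}"))

# With a PySpark DataFrame `df` (assumed to exist):
# df1 = df.toDF(*[clean(c) for c in df.columns])
```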