【Title】: PySpark: remove all special characters from all column names
【Posted】: 2020-06-18 02:37:17
【Question】:

I'm trying to remove all special characters from all column names. I'm using the following commands:

import pyspark.sql.functions as F

# alias every column, replacing one special character at a time
df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in spark_df.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('%', '_')) for col in df_spark.columns])
df_spark = df_spark1.select([F.col(col).alias(col.replace(',', '_')) for col in df_spark1.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('(', '_')) for col in df_spark.columns])
df_spark2 = df_spark1.select([F.col(col).alias(col.replace(')', '_')) for col in df_spark1.columns])

Is there an easier way to replace all special characters (not just the five above) in a single command? I'm using PySpark on Databricks.

【Comments】:

    Tags: apache-spark pyspark apache-spark-sql special-characters str-replace


    【Solution 1】:

    You can strip out any character that is not 0-9, a-z, or A-Z (the character class below also keeps $):

    import pyspark.sql.functions as F
    import re

    # alias every column, removing any run of characters outside [0-9a-zA-Z$]
    df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])
    
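    A quick check of that one-liner on a throwaway DataFrame (the sample column names here are made up for illustration, assuming a live spark session as on Databricks):

    df = spark.createDataFrame([(1, 2, 3)], ["col one", "col%2", "(col3)"])
    df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])
    df.columns
    #['colone', 'col2', 'col3']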

    【Discussion】:

    • Strange, I'm getting an error: AttributeError: 'DataFrame' object has no attribute 'select'. Are you sure this is written in Python?
    【Solution 2】:

    Use a list comprehension together with Python's re (regular expression) module.

    Example:

    df = spark.createDataFrame([('a b', 'ac', 'ac', 'ac', 'ab')], ["i d", "id,", "i(d", "i)k", "i%j"])

    df.columns
    #['i d', 'id,', 'i(d', 'i)k', 'i%j']

    import re

    # strip space, comma, parentheses and % from every column name with a list comprehension
    [re.sub(r'[()\s,%]', '', x) for x in df.columns]
    #['id', 'id', 'id', 'ik', 'ij']

    # toDF renames all columns at once, positionally
    df.toDF(*[re.sub(r'[()\s,%]', '', x) for x in df.columns])
    #DataFrame[id: string, id: string, id: string, ik: string, ij: string]
    
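    One caveat worth adding (an aside, not from the original answer): cleaning can collapse distinct names into duplicates, as above, and selecting such a column by name afterwards fails:

    df2 = df.toDF(*[re.sub(r'[()\s,%]', '', x) for x in df.columns])
    df2.columns
    #['id', 'id', 'id', 'ik', 'ij']
    #df2.select('id') would now fail with an ambiguous-reference AnalysisException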

    【Discussion】:

      【Solution 3】:

      re.sub(r'[^\w]', '_', c) replaces punctuation and whitespace with an underscore _.

      Test result:

      from pyspark.sql import SparkSession
      import re

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 2, 3, 4)], [' 1', '%2', ',3', '(4)'])

      # replace every non-word character in each column name with '_'
      df = df.toDF(*[re.sub(r'[^\w]', '_', c) for c in df.columns])
      df.show()

      #  +---+---+---+---+
      #  | _1| _2| _3|_4_|
      #  +---+---+---+---+
      #  |  1|  2|  3|  4|
      #  +---+---+---+---+
      

      To remove the punctuation entirely and replace only the spaces with _:

      re.sub(r'[^\w ]', '', c).replace(' ', '_')
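
      Applied to the same raw column names as above, that variant keeps an underscore only where a space was (a minimal check):

      cols = [' 1', '%2', ',3', '(4)']
      [re.sub(r'[^\w ]', '', c).replace(' ', '_') for c in cols]
      #['_1', '2', '3', '4']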

      【Discussion】:

        【Solution 4】:

        Perhaps this will be useful:

        // [^0-9a-zA-Z]+ => this will remove all special chars
        import org.apache.spark.sql.functions._
        import spark.implicits._

        spark.range(2).withColumn("str", lit("abc%xyz_12$q"))
          .withColumn("replace", regexp_replace($"str", "[^0-9a-zA-Z]+", "_"))
          .show(false)

        /**
          * +---+------------+------------+
          * |id |str         |replace     |
          * +---+------------+------------+
          * |0  |abc%xyz_12$q|abc_xyz_12_q|
          * |1  |abc%xyz_12$q|abc_xyz_12_q|
          * +---+------------+------------+
          */

        // if you don't want to remove some special chars such as $, include them in the class: [^0-9a-zA-Z$]+
        spark.range(2).withColumn("str", lit("abc%xyz_12$q"))
          .withColumn("replace", regexp_replace($"str", "[^0-9a-zA-Z$]+", "_"))
          .show(false)

        /**
          * +---+------------+------------+
          * |id |str         |replace     |
          * +---+------------+------------+
          * |0  |abc%xyz_12$q|abc_xyz_12$q|
          * |1  |abc%xyz_12$q|abc_xyz_12$q|
          * +---+------------+------------+
          */
        
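        Since the question is about PySpark, here is a rough Python equivalent of the snippet above (a sketch, not from the original answer; note that regexp_replace cleans column values, so for column names you would still pair it with one of the rename approaches shown earlier):

        import pyspark.sql.functions as F

        (spark.range(2)
            .withColumn("str", F.lit("abc%xyz_12$q"))
            .withColumn("replace", F.regexp_replace(F.col("str"), "[^0-9a-zA-Z]+", "_"))
            .show(truncate=False))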

        【Discussion】:
