[Question title]: How to compare data types and columns in 2 DataFrames in PySpark
[Posted]: 2021-05-25 12:17:42
[Question]:

I have two dataframes in PySpark, df1 and df2. The schemas look like this:

>>> df1.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- Zip: decimal(18,2) (nullable = true)


>>> df2.printSchema()
root 
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- Zip: decimal(9,2) (nullable = true)
 |-- nation: string (nullable = true)

Now I want to compare the two dataframes and report the differences in their columns and data types.

How can we achieve this in PySpark?

Expected output:

Columns:

ID  Col_Name DataFrame
1    Nation  df2

Data types:

ID Col_Name DF1            DF2
1  id       None          None 
2  name     None          None 
3  address  None          None 
4  Zip      Decimal(18,2) Decimal(9,2)
5  nation   None          None 

type1.printSchema()
root
 |-- col_name: string (nullable = true)
 |-- dtype: string (nullable = true)
 |-- dataframe: string (nullable = false)

type2.printSchema()
root
 |-- col_name: string (nullable = true)
 |-- dtype: string (nullable = true)
 |-- dataframe: string (nullable = false)

result2.show()
+--------+----+----+
|col_name| df1| df2|
+--------+----+----+
| movieId|null|null|
|   title|null|null|
|     zip|null|null|
|  genres|null|null|
+--------+----+----+

type1.show()
+--------+------+---------+
|col_name| dtype|dataframe|
+--------+------+---------+
| movieId|   int|      df1|
|   title|string|      df1|
|  genres|string|      df1|
|     zip|string|      df1|
+--------+------+---------+

type2.show()
+--------+------+---------+
|col_name| dtype|dataframe|
+--------+------+---------+
| movieId|   int|      df2|
|   title|string|      df2|
|  genres|string|      df2|
|     zip|   int|      df2|
+--------+------+---------+

[Question discussion]:

    标签: python dataframe apache-spark pyspark apache-spark-sql


    [Solution 1]:

    You can build a dataframe of each schema's column data types and manipulate it to get the desired result. I used Spark dataframes here, but I imagine pandas would work as well.

    import pyspark.sql.functions as F
    
    type1 = spark.createDataFrame(
        df1.dtypes, 'col_name string, dtype string'
    ).withColumn('dataframe', F.lit('df1'))
    
    type2 = spark.createDataFrame(
        df2.dtypes, 'col_name string, dtype string'
    ).withColumn('dataframe', F.lit('df2'))
    
    result1 = type1.join(type2, 'col_name', 'left_anti').unionAll(
        type2.join(type1, 'col_name', 'left_anti')
    ).drop('dtype')
    
    result1.show()
    +--------+---------+
    |col_name|dataframe|
    +--------+---------+
    |  nation|      df2|
    +--------+---------+
    
    result2 = type1.join(type2, 'col_name', 'full').select(
        'col_name', 
        F.when(type1.dtype != type2.dtype, type1.dtype).alias('df1'), 
        F.when(type1.dtype != type2.dtype, type2.dtype).alias('df2')
    )
    
    result2.show()
    +--------+-------------+------------+
    |col_name|          df1|         df2|
    +--------+-------------+------------+
    |    name|         null|        null|
    |  nation|         null|        null|
    |     Zip|decimal(18,2)|decimal(9,2)|
    |      id|         null|        null|
    | address|         null|        null|
    +--------+-------------+------------+
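
    If you only need the comparison itself and not a Spark dataframe of the result, `df.dtypes` (a list of `(column, dtype)` tuples) can also be compared directly in plain Python with dictionaries. A minimal sketch; the `dtypes1`/`dtypes2` lists below are hard-coded stand-ins for `df1.dtypes`/`df2.dtypes`, mirroring the schemas in the question:

```python
# Stand-ins for df1.dtypes / df2.dtypes (lists of (col_name, dtype) tuples),
# hard-coded here so no Spark session is required.
dtypes1 = [('id', 'int'), ('name', 'string'), ('address', 'string'),
           ('Zip', 'decimal(18,2)')]
dtypes2 = [('id', 'int'), ('name', 'string'), ('address', 'string'),
           ('Zip', 'decimal(9,2)'), ('nation', 'string')]

d1, d2 = dict(dtypes1), dict(dtypes2)

# Columns present in only one of the dataframes
only_in_df1 = sorted(d1.keys() - d2.keys())
only_in_df2 = sorted(d2.keys() - d1.keys())

# Common columns whose data types differ
type_diffs = {c: (d1[c], d2[c])
              for c in d1.keys() & d2.keys() if d1[c] != d2[c]}

print(only_in_df2)  # ['nation']
print(type_diffs)   # {'Zip': ('decimal(18,2)', 'decimal(9,2)')}
```

    This avoids the join overhead entirely, which is reasonable since schemas are small driver-side metadata.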
    

    [Discussion]:
