【Posted】:2021-05-25 12:17:42
【Question】:
I have two DataFrames in PySpark, df1 and df2. Their schemas are shown below.
>>> df1.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- Zip: decimal(18,2) (nullable = true)
>>> df2.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- Zip: decimal(9,2) (nullable = true)
|-- nation: string (nullable = true)
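Now I want to compare the two DataFrames and report the differences in both their columns and their data types.
How can this be done in PySpark?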
Expected output:
Columns:
ID Col_Name DataFrame
1 nation df2
Data types:
ID Col_Name DF1 DF2
1 id None None
2 name None None
3 address None None
4 Zip Decimal(18,2) Decimal(9,2)
5 nation None None
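A minimal sketch of one way to produce both tables above. It assumes the input is the list of (name, type) tuples that `DataFrame.dtypes` returns in PySpark; the helper names `column_diff` and `dtype_diff` are hypothetical, and the lists below are typed in by hand from the two schemas so the example runs without a Spark session. Column names are compared case-sensitively here, which is also an assumption.

```python
from typing import List, Optional, Tuple

# Same shape as PySpark's DataFrame.dtypes: [(column_name, type_string), ...]
Dtypes = List[Tuple[str, str]]


def column_diff(dtypes1: Dtypes, dtypes2: Dtypes) -> List[Tuple[int, str, str]]:
    """Columns that exist in only one DataFrame, as (ID, Col_Name, DataFrame)."""
    names1 = {name for name, _ in dtypes1}
    names2 = {name for name, _ in dtypes2}
    only = [(name, "df1") for name, _ in dtypes1 if name not in names2]
    only += [(name, "df2") for name, _ in dtypes2 if name not in names1]
    return [(i, name, owner) for i, (name, owner) in enumerate(only, start=1)]


def dtype_diff(
    dtypes1: Dtypes, dtypes2: Dtypes
) -> List[Tuple[int, str, Optional[str], Optional[str]]]:
    """One row per column, as (ID, Col_Name, DF1, DF2).

    The two type fields are filled only when the column exists in both
    DataFrames and the types differ; otherwise both are None.
    """
    types1, types2 = dict(dtypes1), dict(dtypes2)
    # Keep df1's column order, then append columns that only df2 has.
    ordered = list(types1) + [name for name in types2 if name not in types1]
    rows = []
    for i, name in enumerate(ordered, start=1):
        t1, t2 = types1.get(name), types2.get(name)
        if t1 is not None and t2 is not None and t1 != t2:
            rows.append((i, name, t1, t2))
        else:
            rows.append((i, name, None, None))
    return rows


# Hand-written stand-ins for df1.dtypes and df2.dtypes (assumed values).
df1_dtypes = [("id", "int"), ("name", "string"),
              ("address", "string"), ("Zip", "decimal(18,2)")]
df2_dtypes = [("id", "int"), ("name", "string"), ("address", "string"),
              ("Zip", "decimal(9,2)"), ("nation", "string")]

for row in column_diff(df1_dtypes, df2_dtypes):
    print(row)
for row in dtype_diff(df1_dtypes, df2_dtypes):
    print(row)
```

With real DataFrames the calls would be `column_diff(df1.dtypes, df2.dtypes)` and `dtype_diff(df1.dtypes, df2.dtypes)`. If the result is needed as a DataFrame, the returned rows could be passed to `spark.createDataFrame` with an explicit schema, since the all-None columns give Spark nothing to infer a type from.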
type1.printSchema()
root
|-- col_name: string (nullable = true)
|-- dtype: string (nullable = true)
|-- dataframe: string (nullable = false)
type2.printSchema()
root
|-- col_name: string (nullable = true)
|-- dtype: string (nullable = true)
|-- dataframe: string (nullable = false)
result2.show()
+--------+----+----+
|col_name| df1| df2|
+--------+----+----+
| movieId|null|null|
| title|null|null|
| zip|null|null|
| genres|null|null|
+--------+----+----+
type1.show()
+--------+------+---------+
|col_name| dtype|dataframe|
+--------+------+---------+
| movidId| int| df1|
| string|string| df1|
| genres|string| df1|
| zip|string| df1|
+--------+------+---------+
type2.show()
+--------+------+---------+
|col_name| dtype|dataframe|
+--------+------+---------+
| movidId| int| df2|
| string|string| df2|
| genres|string| df2|
| zip| int| df2|
+--------+------+---------+
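In the result2 output above, both df1 and df2 are null for every column, even though type1 and type2 disagree on zip. A small sketch of how long-form rows shaped like type1 and type2 could be pivoted into one wide row per column, in plain Python. The rows are copied from the type1.show() and type2.show() output above; the function name `pivot_dtypes` is hypothetical, and treating each row as a (col_name, dtype, dataframe) tuple is an assumption.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, str, str]  # (col_name, dtype, dataframe)


def pivot_dtypes(rows: Iterable[Row]) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Turn long-form rows into (col_name, df1, df2).

    The type fields are filled only where both DataFrames have the column
    and the types differ; otherwise both are None.
    """
    wide: Dict[str, Dict[str, str]] = {}
    for col_name, dtype, frame in rows:
        # Group the dtype of each column by the DataFrame it came from.
        wide.setdefault(col_name, {})[frame] = dtype
    result = []
    for col_name, by_frame in wide.items():
        t1, t2 = by_frame.get("df1"), by_frame.get("df2")
        if t1 is not None and t2 is not None and t1 != t2:
            result.append((col_name, t1, t2))
        else:
            result.append((col_name, None, None))
    return result


# Rows as printed by type1.show() and type2.show().
type1_rows = [("movidId", "int", "df1"), ("string", "string", "df1"),
              ("genres", "string", "df1"), ("zip", "string", "df1")]
type2_rows = [("movidId", "int", "df2"), ("string", "string", "df2"),
              ("genres", "string", "df2"), ("zip", "int", "df2")]

for row in pivot_dtypes(type1_rows + type2_rows):
    print(row)
```

In PySpark itself, one possible equivalent of the grouping step is `type1.union(type2).groupBy("col_name").pivot("dataframe").agg(F.first("dtype"))` with `pyspark.sql.functions` imported as `F`; the comparison of the resulting df1 and df2 columns would then be applied on top of that.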
【Discussion】:
Tags: python dataframe apache-spark pyspark apache-spark-sql