[Posted]: 2021-07-07 02:18:24
[Question]:
This question is similar to one already asked for Pandas here. I am running on a Google Cloud Dataproc cluster, so I cannot convert the DataFrame to pandas.
I want to convert the following:
+----+----------------------------------+-----+---------+------+--------------------+-------------+
| key| value|topic|partition|offset| timestamp|timestampType|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
|null|["sepal_length","sepal_width",...]| iris| 0| 289|2021-04-11 22:32:...| 0|
|null|["5.0","3.5","1.3","0.3","setosa"]| iris| 0| 290|2021-04-11 22:32:...| 0|
|null|["4.5","2.3","1.3","0.3","setosa"]| iris| 0| 291|2021-04-11 22:32:...| 0|
|null|["4.4","3.2","1.3","0.2","setosa"]| iris| 0| 292|2021-04-11 22:32:...| 0|
|null|["5.0","3.5","1.6","0.6","setosa"]| iris| 0| 293|2021-04-11 22:32:...| 0|
|null|["5.1","3.8","1.9","0.4","setosa"]| iris| 0| 294|2021-04-11 22:32:...| 0|
|null|["4.8","3.0","1.4","0.3","setosa"]| iris| 0| 295|2021-04-11 22:32:...| 0|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
into this:
+--------------+-------------+--------------+-------------+-------+
| sepal_length | sepal_width | petal_length | petal_width | class |
+--------------+-------------+--------------+-------------+-------+
| 5.0 | 3.5 | 1.3 | 0.3 | setosa|
| 4.5 | 2.3 | 1.3 | 0.3 | setosa|
| 4.4 | 3.2 | 1.3 | 0.2 | setosa|
| 5.0 | 3.5 | 1.6 | 0.6 | setosa|
| 5.1 | 3.8 | 1.9 | 0.4 | setosa|
| 4.8 | 3.0 | 1.4 | 0.3 | setosa|
+--------------+-------------+--------------+-------------+-------+
How can I do this? Any help would be greatly appreciated!
[Comments]:
- Look for the explode function, just like in Pandas.
- Unfortunately, explode splits the list into separate rows. sparkbyexamples.com/pyspark/…
Tags: python pandas google-cloud-platform pyspark