What is the Scala case class equivalent in PySpark? Answers

【Question title】: What is the Scala case class equivalent in PySpark?
【Posted】: 2016-05-10 19:35:19
【Question description】:

How would you go about using and/or implementing a case class equivalent in PySpark?

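【Question comments】:

Python's collections.namedtuple is very similar.
@AlexHall So ultimately you are saying that you can use some generic Python class... and there is no Spark-optimized case class equivalent that ships with PySpark, right?
I don't know much about PySpark; this is just a general Python recommendation.
@conner.xyz No, there isn't, because without static typing, case classes (or Product types in general) are not that useful. Usually plain Python tuples are enough. Named tuples are great, but require distributing over the workers.

Tags: python apache-spark pyspark case-class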

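【Solution 1】:

As mentioned by Alex Hall, the real equivalent of a named product type is a namedtuple.

Unlike Row, suggested in the other answer, it has a number of useful properties:

It has a well-defined shape and can be reliably used for structural pattern matching: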

>>> from collections import namedtuple
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42

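In contrast, Rows are not reliable when used with keyword arguments (in the output shown here, the fields come back sorted by name rather than in the order given):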

>>> from pyspark.sql import Row
>>>
>>> foobar = Row(foo=42, bar=-42)
>>> foo, bar = foobar
>>> foo
-42
>>> bar
42

although if defined with positional arguments:

>>> FooBar = Row("foo", "bar")
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42

the order is preserved.
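For readers coming from Scala, the following is a minimal sketch of how a namedtuple covers a few other things a case class usually provides (field introspection, a copy-with-changes operation, value equality). It uses only the standard library; the variable names are illustrative and not taken from the answer.

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])
foobar = FooBar(42, -42)

# Field names are stored once on the class, in declaration order.
fields = FooBar._fields

# _replace returns a modified copy and leaves the original untouched,
# roughly comparable to copy() on a Scala case class.
updated = foobar._replace(bar=0)

# _asdict maps field names to values, in field order.
as_dict = foobar._asdict()

# Equality and hashing are inherited from tuple, so they are value-based.
same = foobar == FooBar(42, -42)
```

Because equality comes from tuple, a namedtuple also compares equal to a plain tuple with the same values, which is something to keep in mind when mixing the two.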

It defines proper types:

>>> from functools import singledispatch
>>> 
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> type(FooBar)
<class 'type'>
>>> isinstance(FooBar(42, -42), FooBar)
True

and can be used whenever type handling is required, especially with single:

>>> import math
>>> Circle = namedtuple("Circle", ["x", "y", "r"])
>>> Rectangle = namedtuple("Rectangle", ["x1", "y1", "x2", "y2"])
>>>
>>> @singledispatch
... def area(x):
...     raise NotImplementedError
... 
... 
>>> @area.register(Rectangle)
... def _(x):
...     return abs(x.x1 - x.x2) * abs(x.y1 - x.y2)
... 
... 
>>> @area.register(Circle)
... def _(x):
...     return math.pi * x.r ** 2
... 
... 
>>>
>>> area(Rectangle(0, 0, 4, 4))
16
>>> area(Circle(0, 0, 4))
50.26548245743669

and multiple dispatch:

>>> from multipledispatch import dispatch
>>> from numbers import Rational
>>>
>>> @dispatch(Rectangle, Rational)
... def scale(x, y):
...     return Rectangle(x.x1, x.y1, x.x2 * y, x.y2 * y)
... 
... 
>>> @dispatch(Circle, Rational)
... def scale(x, y):
...     return Circle(x.x, x.y, x.r * y)
...
...
>>> scale(Rectangle(0, 0, 4, 4), 2)
Rectangle(x1=0, y1=0, x2=8, y2=8)
>>> scale(Circle(0, 0, 11), 2)
Circle(x=0, y=0, r=22)

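and, combined with the first property, they can be used in a wide range of pattern matching scenarios. namedtuples also support standard inheritance and type hints.

Rows don't: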

>>> FooBar = Row("foo", "bar")
>>> type(FooBar)
<class 'pyspark.sql.types.Row'>
>>> isinstance(FooBar(42, -42), FooBar)  # Expected failure
Traceback (most recent call last):
...
TypeError: isinstance() arg 2 must be a type or tuple of types
>>> BarFoo = Row("bar", "foo")
>>> isinstance(FooBar(42, -42), type(BarFoo))
True
>>> isinstance(BarFoo(42, -42), type(FooBar))
True
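The inheritance and type-hint support mentioned above can be sketched with typing.NamedTuple from the standard library. This is only an illustration: the class names and the describe method are made up for the example and do not come from the answer.

```python
import math
from typing import NamedTuple


class Circle(NamedTuple):
    # Annotated fields; the result is still an ordinary namedtuple class.
    x: float
    y: float
    r: float

    def area(self) -> float:
        return math.pi * self.r ** 2


class DescribedCircle(Circle):
    """Ordinary subclass: adds behaviour, keeps the same fields."""

    # Empty __slots__ avoids adding a per-instance __dict__.
    __slots__ = ()

    def describe(self) -> str:
        return f"circle at ({self.x}, {self.y}) with r={self.r}"


c = DescribedCircle(0, 0, 1)
is_circle = isinstance(c, Circle)   # the subclass is a proper type
x, y, r = c                         # unpacking still works positionally
```

Since DescribedCircle is a real class, it can be registered with singledispatch in the same way as the Circle and Rectangle types above.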

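It provides a highly optimized representation. Unlike Row objects, tuples don't use __dict__ and don't carry the field names with each instance. As a result, they can be an order of magnitude faster to initialize: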

>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> %timeit FooBar(42, -42)
587 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

compared to the different Row constructors:

>>> %timeit Row(foo=42, bar=-42)
3.91 µs ± 7.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> FooBar = Row("foo", "bar")
>>> %timeit FooBar(42, -42)
2 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and are significantly more memory efficient (a very important property when working with large-scale data):

>>> import sys
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> sys.getsizeof(FooBar(42, -42))
64

compared to the equivalent Row:

>>> sys.getsizeof(Row(foo=42, bar=-42))
72

Finally, attribute access is an order of magnitude faster with namedtuple:

>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> %timeit foobar.foo
102 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

compared to the equivalent operation on a Row object:

>>> foobar = Row(foo=42, bar=-42)
>>> %timeit foobar.foo
2.58 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
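The point about __dict__ can be checked directly on a recent CPython 3. The sketch below also shows a pitfall that follows from it: a subclass that does not declare empty __slots__ gets a per-instance __dict__ back. The Loose and Tight class names are hypothetical.

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])


class Loose(FooBar):
    # No __slots__ here, so every instance gets its own __dict__.
    pass


class Tight(FooBar):
    # Empty __slots__ keeps instances as compact as the base namedtuple.
    __slots__ = ()


has_dict_base = hasattr(FooBar(42, -42), "__dict__")
has_dict_loose = hasattr(Loose(42, -42), "__dict__")
has_dict_tight = hasattr(Tight(42, -42), "__dict__")
```

On such an interpreter only the Loose instance is expected to report a __dict__, which is why namedtuple subclasses that are meant to stay compact usually declare `__slots__ = ()`.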

Last but not least, namedtuples are properly supported in Spark SQL:

>>> Record = namedtuple("Record", ["id", "name", "value"])
>>> spark.createDataFrame([Record(1, "foo", 42)])
DataFrame[id: bigint, name: string, value: bigint]
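As a rough sketch of how such records might be prepared before they reach Spark, the following parses comma-separated text into Record instances using plain Python. The input lines and the parse_record helper are hypothetical and only serve the illustration.

```python
from collections import namedtuple

Record = namedtuple("Record", ["id", "name", "value"])

# Hypothetical raw input; in a real job these lines might come from a text file.
raw_lines = ["1,foo,42", "2,bar,7"]


def parse_record(line):
    """Split a comma-separated line and convert the numeric fields."""
    raw_id, name, raw_value = line.split(",")
    return Record(int(raw_id), name.strip(), int(raw_value))


records = [parse_record(line) for line in raw_lines]
```

A list like records could then be handed to spark.createDataFrame in the same way as in the snippet above; that step is not run here because it needs an active Spark session.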

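Summary:

It should be clear that Row is a very poor substitute for an actual product type, and should be avoided unless enforced by the Spark API.

It should also be clear that pyspark.sql.Row is not intended to be a replacement for a case class when you consider that it is the direct equivalent of org.apache.spark.sql.Row - a type which is pretty far from an actual product and behaves like Seq[Any] (depending on the subclass, with names added). Both the Python and Scala implementations were introduced as a useful, albeit awkward, interface between external code and the internal Spark SQL representation.

See also:

It would be a shame not to mention the awesome MacroPy developed by Li Haoyi and its port (MacroPy3) by Alberto Berti: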

>>> import macropy.console
0=[]=====> MacroPy Enabled <=====[]=0
>>> from macropy.case_classes import macros, case
>>> @case
... class FooBar(foo, bar): pass
... 
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42

which comes with a rich set of other features, including but not limited to advanced pattern matching and a neat lambda expression syntax.

Python dataclasses (Python 3.7+).
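A minimal sketch of the dataclasses option, assuming Python 3.7 or later. A frozen dataclass gives value equality, hashing and immutability, which is close to what a case class offers; the field types are chosen only for the example.

```python
from dataclasses import dataclass, replace, astuple


@dataclass(frozen=True)
class FooBar:
    foo: int
    bar: int


foobar = FooBar(42, -42)

# Dataclass instances are not iterable, so convert to a tuple to unpack.
foo, bar = astuple(foobar)

# replace builds a copy with selected fields changed.
updated = replace(foobar, bar=0)
```

Unlike a namedtuple, a dataclass instance is not a tuple, so positional unpacking needs the explicit astuple call shown here.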

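【Comments】:

This is a brilliant answer!

【Solution 2】:

If you go to the sql-programming-guide, in the Inferring the Schema Using Reflection section, you will see case class being defined as

The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays.

with an example as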

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()

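In the same section, if you switch to python, i.e. pyspark, you will see Row being used and defined as

Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by looking at the first row.

with an example as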

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
schemaPeople = sqlContext.createDataFrame(people)

So the conclusion of this explanation is that Row can be used as a case class in pyspark.

【Comments】: