-
Have a well-defined shape and can be reliably used for structural pattern matching:
>>> from collections import namedtuple
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
In contrast, Rows are not reliable when created with keyword arguments:
>>> from pyspark.sql import Row
>>>
>>> foobar = Row(foo=42, bar=-42)
>>> foo, bar = foobar
>>> foo
-42
>>> bar
42
although, if defined with positional arguments:
>>> FooBar = Row("foo", "bar")
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
the order is preserved.
-
Define proper types:
>>> import math
>>> from functools import singledispatch
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> type(FooBar)
<class 'type'>
>>> isinstance(FooBar(42, -42), FooBar)
True
and can be used whenever type handling is required, especially with single dispatch:
>>> Circle = namedtuple("Circle", ["x", "y", "r"])
>>> Rectangle = namedtuple("Rectangle", ["x1", "y1", "x2", "y2"])
>>>
>>> @singledispatch
... def area(x):
...     raise NotImplementedError
...
>>> @area.register(Rectangle)
... def _(x):
...     return abs(x.x1 - x.x2) * abs(x.y1 - x.y2)
...
>>> @area.register(Circle)
... def _(x):
...     return math.pi * x.r ** 2
...
>>> area(Rectangle(0, 0, 4, 4))
16
>>> area(Circle(0, 0, 4))
50.26548245743669
and multiple dispatch:
>>> from multipledispatch import dispatch
>>> from numbers import Rational
>>>
>>> @dispatch(Rectangle, Rational)
... def scale(x, y):
...     return Rectangle(x.x1, x.y1, x.x2 * y, x.y2 * y)
...
>>> @dispatch(Circle, Rational)
... def scale(x, y):
...     return Circle(x.x, x.y, x.r * y)
...
>>> scale(Rectangle(0, 0, 4, 4), 2)
Rectangle(x1=0, y1=0, x2=8, y2=8)
>>> scale(Circle(0, 0, 11), 2)
Circle(x=0, y=0, r=22)
Combined with the first property, this makes them usable in a wide range of pattern matching scenarios. namedtuples also support standard inheritance and type hints.
Rows don't:
>>> FooBar = Row("foo", "bar")
>>> type(FooBar)
<class 'pyspark.sql.types.Row'>
>>> isinstance(FooBar(42, -42), FooBar) # Expected failure
Traceback (most recent call last):
...
TypeError: isinstance() arg 2 must be a type or tuple of types
>>> BarFoo = Row("bar", "foo")
>>> isinstance(FooBar(42, -42), type(BarFoo))
True
>>> isinstance(BarFoo(42, -42), type(FooBar))
True
-
Provide a highly optimized representation. Unlike Row objects, tuples neither use __dict__ nor carry field names with each instance. As a result, initialization can be an order of magnitude faster:
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> %timeit FooBar(42, -42)
587 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
compared to different Row constructors:
>>> %timeit Row(foo=42, bar=-42)
3.91 µs ± 7.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> FooBar = Row("foo", "bar")
>>> %timeit FooBar(42, -42)
2 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and they are significantly more memory efficient (an extremely important property when working with data at scale):
>>> import sys
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> sys.getsizeof(FooBar(42, -42))
64
compared to the equivalent Row:
>>> sys.getsizeof(Row(foo=42, bar=-42))
72
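Both effects follow from `namedtuple` declaring empty `__slots__`: instances carry no per-instance `__dict__`, and field names live on the class, shared by all instances. A minimal check:

```python
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])
foobar = FooBar(42, -42)

# No per-instance __dict__ (namedtuple sets __slots__ = ()) ...
assert not hasattr(foobar, "__dict__")
# ... field names are stored once, on the class.
assert FooBar._fields == ("foo", "bar")
```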
Finally, attribute access is an order of magnitude faster with namedtuples:
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> %timeit foobar.foo
102 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
compared to the equivalent operation on a Row object:
>>> foobar = Row(foo=42, bar=-42)
>>> %timeit foobar.foo
2.58 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
-
Last but not least, namedtuples are properly supported by Spark SQL:
>>> Record = namedtuple("Record", ["id", "name", "value"])
>>> spark.createDataFrame([Record(1, "foo", 42)])
DataFrame[id: bigint, name: string, value: bigint]