rossiXYZ

[Source Code Analysis] Distributed Variables in TensorFlow

In TensorFlow, a distributed variable is a variable created across multiple devices; mirrored variables and SyncOnRead variables are two examples. This article analyzes distributed variables, guided by a series of questions:

  • How does variable creation end up calling into the Strategy?
  • How is a MirroredVariable generated?
  • How are tensors distributed onto each device?
  • How is a single unified view presented to the outside?
  • How do the variables stay consistent with each other?

As always, I recommend two great resources:

[TensorFlow Internals](https://github.com/horance-liu/tensorflow-internals): although it does not analyze the latest code, anyone interested in TF's internal implementation mechanisms should read it; you will gain a great deal.
https://home.cnblogs.com/u/deep-learning-stacks/ : 西门宇少 covers not only TensorFlow but also other areas at the leading edge of the industry.

Other articles in this series:

[Translation] TensorFlow distributed papers: "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems"

[Translation] TensorFlow distributed papers: "Implementation of Control Flow in TensorFlow"

[Source Code Analysis] TensorFlow distributed environment (1) --- Overall architecture

[Source Code Analysis] TensorFlow distributed environment (2) --- Master static logic

[Source Code Analysis] TensorFlow distributed environment (3) --- Worker static logic

[Source Code Analysis] TensorFlow distributed environment (4) --- WorkerCache

[Source Code Analysis] TensorFlow distributed environment (5) --- Session

[Source Code Analysis] TensorFlow distributed environment (7) --- Worker dynamic logic

[Source Code Analysis] TensorFlow distributed environment (8) --- Communication mechanisms

[Translation] Distributed training with TensorFlow

[Source Code Analysis] TensorFlow DistributedStrategy fundamentals

1. MirroredVariable

tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs of one machine. The strategy creates one replica per GPU device, and every variable in the model is mirrored across all replicas. Together these per-device variables form a single conceptual variable called a MirroredVariable, and they are kept in sync with each other by applying identical updates.

Figure 1 MirroredVariable

A concrete usage example follows:

strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
# Variable created inside scope:
with strategy.scope():
  mirrored_variable = tf.Variable(1.)

# Variable created outside scope:
regular_variable = tf.Variable(1.)

Printing them gives:

>>> mirrored_variable
  MirroredVariable:{
    0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
    1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
  }

>>> regular_variable
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>  

Alternatively, see the example in tensorflow/python/module/module_test.py:

def test_supports_distributed_variables(self):
  mirrored = distributed_values.MirroredVariable(
      None, [variables.Variable(1.)], variables.VariableAggregation.SUM)
  tpu = tpu_values.TPUMirroredVariable(
      strategy=None, values=[variables.Variable(42.)], aggregation=None)
  aggregating = ps_values.AggregatingVariable(
      strategy=None, v=variables.Variable(1.), aggregation=None)

  m = module.Module()
  m.a = mirrored

1.1 Definition

The MirroredVariable docstring states its purpose: it holds a map from replicas to variables whose values are kept in sync. It adds no new member variables; it only implements a few member functions.

class MirroredVariable(DistributedVariable, Mirrored):
  """Holds a map from replica to variables whose values are kept in sync."""

  def _update_replica(self, update_fn, value, **kwargs):
    return _on_write_update_replica(self, update_fn, value, **kwargs)

  def scatter_min(self, *args, **kwargs):
    if values_util.is_saving_non_distributed():
      return self._primary.scatter_min(*args, **kwargs)
    return super(MirroredVariable, self).scatter_min(*args, **kwargs)

  def scatter_max(self, *args, **kwargs):
    if values_util.is_saving_non_distributed():
      return self._primary.scatter_max(*args, **kwargs)
    return super(MirroredVariable, self).scatter_max(*args, **kwargs)

  def scatter_update(self, *args, **kwargs):
    if values_util.is_saving_non_distributed(): # Non-distributed case
      # Return the local value directly
      return self._primary.scatter_update(*args, **kwargs)
    # Otherwise take the distributed path
    return super(MirroredVariable, self).scatter_update(*args, **kwargs)

  def _get_cross_replica(self):
    # Return identity, to avoid directly exposing the variable to the user and
    # allowing it to be modified by mistake.
    return array_ops.identity(Mirrored._get_cross_replica(self))

Take scatter_update as an example: when saving in non-distributed mode it delegates directly to _primary; otherwise it falls through to the base-class implementation. In addition, _update_replica calls _on_write_update_replica to synchronize replicas on update, and _on_write_update_replica chooses an update path based on the current context. It is defined in tensorflow/python/distribute/values.py.

def _on_write_update_replica(var, update_fn, value, **kwargs):
  """Updates variables with ON_WRITE synchronization in replica context."""
  if var.aggregation == vs.VariableAggregation.NONE:
    return update_fn(var._get_on_device_or_primary(), value, **kwargs)

  if not ds_context.get_strategy().extended._use_merge_call():
    # Aggregate in replica context without a merge_call.
    aggregated_value = apply_aggregation_replica_context(
        value, var.aggregation, var)
    values_util.mark_as_unsaveable()

    return ds_context.get_replica_context()._update(
        var,
        update_fn,
        args=(aggregated_value,),
        kwargs=kwargs,
        group=True)

  else:

    def merge_fn(strategy, value, **kwargs):
      """Aggregate values and update all variables in cross replica context."""
      v = values_util.apply_aggregation(strategy, value, var.aggregation, var)
      return var._update_cross_replica(update_fn, v, **kwargs)

    return ds_context.get_replica_context().merge_call(
        merge_fn, args=(value,), kwargs=kwargs)
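The ON_WRITE idea itself is simple and can be mimicked in a small TensorFlow-free sketch (every name below is an illustrative stand-in, not the real TF internals): each replica proposes an update value, the values are aggregated according to the aggregation mode, and the same aggregated value is then written to every replica, which is exactly what keeps a MirroredVariable in sync.

```python
# Illustrative sketch of ON_WRITE aggregation; class and function names
# are hypothetical, not TF APIs.

def aggregate(values, aggregation):
    """Combine per-replica update values into one value."""
    if aggregation == "sum":
        return sum(values)
    if aggregation == "mean":
        return sum(values) / len(values)
    if aggregation == "only_first_replica":
        return values[0]
    raise ValueError(f"unknown aggregation: {aggregation}")

class ToyMirroredVariable:
    """One logical variable backed by one copy per replica."""
    def __init__(self, value, num_replicas, aggregation="mean"):
        self.values = [value] * num_replicas   # per-replica copies
        self.aggregation = aggregation

    def assign(self, per_replica_updates):
        # Cross-replica step: aggregate, then apply the SAME value everywhere.
        merged = aggregate(per_replica_updates, self.aggregation)
        self.values = [merged] * len(self.values)
        return merged

var = ToyMirroredVariable(1.0, num_replicas=2, aggregation="mean")
var.assign([2.0, 4.0])       # replicas propose different updates
print(var.values)            # both replicas now hold 3.0
```

Note that the sketch collapses TF's replica/cross-replica context machinery into a single method; the real code dispatches through merge_call or the replica context as shown above.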

Looking only at these member methods, it is hard to form a clear picture of MirroredVariable; we need to start from its class hierarchy.

1.2 Related Classes

1.2.1 Class Hierarchy

The MirroredVariable class hierarchy is shown below; we will analyze each class in turn and then put everything together.

Figure 2 MirroredVariable class hierarchy

1.2.2 DistributedValues

Let us first look at DistributedValues.

Figure 3 DistributedValues

Distributed values are represented by the base class tf.distribute.DistributedValues. This abstraction is suited to representing values on multiple devices; it holds a map from replica IDs to values.

tf.distribute.DistributedValues contains one value per replica. Depending on the subclass, those values may be synced on update, synced on demand, or never synced. A tf.distribute.DistributedValues can be reduced to obtain a single value across replicas as input to tf.distribute.Strategy.run, or its per-replica values can be inspected with tf.distribute.Strategy.experimental_local_results.

DistributedValues is a base class and should not be instantiated directly. Instead, subclass instances are created inside a distribution strategy, either by iterating a tf.distribute.DistributedDataset or through tf.distribute.Strategy.run.

Two representative kinds of tf.distribute.DistributedValues are "PerReplica" and "Mirrored" values.

  • "PerReplica" values live on worker devices, with a different value per replica. They are produced by iterating the distributed datasets returned by tf.distribute.Strategy.experimental_distribute_dataset and tf.distribute.Strategy.distribute_datasets_from_function, and they are also the typical result returned by tf.distribute.Strategy.run.

  • "Mirrored" values are like "PerReplica" values, except the value is the same on all replicas. A "Mirrored" value can therefore be read safely in cross-replica context by using the value on any one replica.

Definition

DistributedValues has two important members: _values and _primary. The initial per-replica variables are stored in the _values tuple, and the first element of that tuple serves as _primary.

Because the derived classes rely on them, we look at several member functions of DistributedValues:

  • _get_on_device_or_primary returns the value for the current replica or current device, falling back to the value of _primary.
  • _get_cross_replica: returns the cross-replica value; this is left to derived classes to implement.
  • _get: if the current replica ID cannot be determined, it calls _get_cross_replica to return a cross-replica value; otherwise it returns the local value for that replica.
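The dispatch among these three accessors can be sketched without TensorFlow. Here the replica-context lookup is simulated with a plain module-level variable, and although the method names deliberately mirror the real ones, this is an illustration rather than the actual implementation:

```python
# Sketch of DistributedValues' _get dispatch; the replica-context lookup
# is simulated, everything else follows the structure described above.

_current_replica_id = None  # stand-in for get_current_replica_id_as_int()

class ToyDistributedValues:
    def __init__(self, values):
        self._values = tuple(values)

    @property
    def _primary(self):
        return self._values[0]        # first element is the representative

    def _get_cross_replica(self):
        raise NotImplementedError     # subclasses decide cross-replica reads

    def _get(self):
        if _current_replica_id is None:
            return self._get_cross_replica()       # cross-replica context
        return self._values[_current_replica_id]   # replica-local value

class ToyMirrored(ToyDistributedValues):
    def _get_cross_replica(self):
        # Mirrored values are identical everywhere, so any copy will do.
        return self._primary

dv = ToyMirrored([10, 20])
print(dv._get())        # no replica id -> cross-replica read -> primary: 10
_current_replica_id = 1
print(dv._get())        # inside replica 1 -> local value: 20
```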

Conceptually:

Figure 4 DistributedValues

The concrete code of DistributedValues is as follows:

@tf_export("distribute.DistributedValues", v1=[])
class DistributedValues(object):
  """Base class for representing distributed values.

  A subclass instance of  tf.distribute.DistributedValues  is created when
  creating variables within a distribution strategy, iterating a
   tf.distribute.DistributedDataset  or through  tf.distribute.Strategy.run .
  This base class should never be instantiated directly.
   tf.distribute.DistributedValues  contains a value per replica. Depending on
  the subclass, the values could either be synced on update, synced on demand,
  or never synced.

   tf.distribute.DistributedValues  can be reduced to obtain single value across
  replicas, as input into  tf.distribute.Strategy.run  or the per-replica values
  inspected using  tf.distribute.Strategy.experimental_local_results .
  """

  def __init__(self, values):
    """Should only be called by subclass __init__."""
    self._values = tuple(values)

  def _get(self):
    """Returns the value for the current device or raises a ValueError."""
    replica_id = values_util.get_current_replica_id_as_int()
    if replica_id is None:
      return self._get_cross_replica() # Return the cross-replica value
    else:
      return self._values[replica_id] # Return the local value

  def _get_cross_replica(self):
    raise NotImplementedError(
        "DistributedValues._get_cross_replica should be implemented by "
        "sub-classes which support cross-replica accesses.")

  def _get_on_device_or_primary(self):
    """Returns value in same replica or device if possible, else the _primary."""
    # Get the current replica id
    replica_id = values_util.get_current_replica_id_as_int()
    if replica_id is None: # No replica id; look at the devices on this machine
      # Try to find a value on the current device.
      # current_device is a string
      current_device = device_util.canonicalize(device_util.current())
      for value in self._values: # Iterate over the per-replica values
        if device_util.canonicalize(value.device) == current_device:
          return value # Found a value placed on this device
      return self._primary # Fall back to _primary
    else:
      # Return the value belonging to this replica
      return self._values[replica_id]

  @property
  def _primary(self):
    """Returns a representative component."""
    return self._values[0]

  @property
  def _devices(self):
    return tuple(v.device for v in self._values)

The code above makes heavy use of get_current_replica_id_as_int. This function is defined in tensorflow/python/distribute/values_util.py and returns the current replica ID.

def get_current_replica_id_as_int():
  """Returns the current replica ID as an integer, or  None ."""
  replica_context = ds_context.get_replica_context()
  if replica_context:
    replica_id = replica_context._replica_id
    if not isinstance(replica_id, int):
      replica_id = tensor_util.constant_value(replica_id)
  else:
    replica_id = distribute_lib.get_update_replica_id()
  return replica_id

Usage

Here are some usage examples drawn from the source code; all of them use MirroredStrategy to obtain DistributedValues.

# 1. Created from a  tf.distribute.DistributedDataset :
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
dataset = tf.data.Dataset.from_tensor_slices([5., 6., 7., 8.]).batch(2)
dataset_iterator = iter(strategy.experimental_distribute_dataset(dataset))
distributed_values = next(dataset_iterator)

# 2. Returned by  run :
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
@tf.function
def run():
   ctx = tf.distribute.get_replica_context()
   return ctx.replica_id_in_sync_group
distributed_values = strategy.run(run)

# 3. As input into  run :
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
dataset = tf.data.Dataset.from_tensor_slices([5., 6., 7., 8.]).batch(2)
dataset_iterator = iter(strategy.experimental_distribute_dataset(dataset))
distributed_values = next(dataset_iterator)
@tf.function
def run(input):
   return input + 1.0
updated_value = strategy.run(run, args=(distributed_values,))

# 4. Reduce value:
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
dataset = tf.data.Dataset.from_tensor_slices([5., 6., 7., 8.]).batch(2)
dataset_iterator = iter(strategy.experimental_distribute_dataset(dataset))
distributed_values = next(dataset_iterator)
reduced_value = strategy.reduce(tf.distribute.ReduceOp.SUM,
                                 distributed_values,
                                 axis = 0)

# 5. Inspect local replica values:
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
dataset = tf.data.Dataset.from_tensor_slices([5., 6., 7., 8.]).batch(2)
dataset_iterator = iter(strategy.experimental_distribute_dataset(dataset))
distributed_values = next(dataset_iterator)
per_replica_values = strategy.experimental_local_results(distributed_values)
print(per_replica_values)

# Output:
#  (<tf.Tensor: shape=(1,), dtype=float32, numpy=array([5.], dtype=float32)>,
#   <tf.Tensor: shape=(1,), dtype=float32, numpy=array([6.], dtype=float32)>)

1.2.3 DistributedDelegate

Next comes DistributedDelegate.

Figure 5 DistributedDelegate

DistributedDelegate adds computation on top of DistributedValues. Concretely, _get_as_operand calls the base class DistributedValues' _get method to obtain the value, and the arithmetic operators then work on that value.

Figure 6 How computation is performed

DistributedDelegate is defined as follows (most code omitted).

class DistributedDelegate(DistributedValues):
  """A map from device to values; acts as the same type as the values."""

  def __getattr__(self, name):
    # The '_use_resource_variables' and the attrs starts with '_self' are used
    # for restoring the saved_model proto, and '_attribute_sentinel' is used for
    # Layer tracking. At the point these attrs are queried, the variable has not
    # been initialized. Thus it should not query those of the underlying
    # components.
    if name.startswith("_self_") or name in ("_use_resource_variables",
                                             "_attribute_sentinel",
                                             "_distributed_container"):
      return super(DistributedDelegate, self).__getattr__(name)

    # This allows copy.copy(DistributedDelegate). When copying an object,
    # copy.copy doesn't invoke its __init__ method, instead it makes a new
    # empty object, then copies the attributes over. copy.copy looks for
    # attributes like "__getstate__" in case the object implements its custom
    # copying. Since DistributedDelegate doesn't have those attributes defined,
    # __getattr__ will be invoked, which tries to access "_values" attributes,
    # but that doesn't exist either because this is an empty object, and again
    # __getattr__ is invoked, leading to an infinite recursion.
    if name == "_values":
      raise AttributeError()

    # TODO(priyag): This needs to be made robust against pitfalls from mix use
    # __getattr__ and @property. See b/120402273.
    return getattr(self._get(), name)

  @property
  def values(self):
    """Returns the per replica values."""
    return self._values

  def _get_as_operand(self):
    """Returns the value for operations for the current device.

    Some implementations, e.g.  TPUMirroredVariable , are not able to return the
    value type within a replica context. They can, however, return a value that
    can be used by the operations below.
    """
    return self._get()

  def __add__(self, o):
    return self._get_as_operand() + o

  def __radd__(self, o):
    return o + self._get_as_operand()

  def __sub__(self, o):
    return self._get_as_operand() - o

  def __rsub__(self, o):
    return o - self._get_as_operand()

  # Most code omitted
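The delegation idea is easy to demonstrate without TensorFlow: a wrapper resolves to an underlying value (via the same kind of _get dispatch) and forwards Python's arithmetic protocol to it. The class below is a hypothetical miniature, not the real DistributedDelegate:

```python
class ToyDelegate:
    """Wraps per-replica values and behaves like the current value in arithmetic."""
    def __init__(self, values, current_replica=0):
        self._values = tuple(values)
        self._current_replica = current_replica

    def _get_as_operand(self):
        # In TF this consults the replica context; here we simply index.
        return self._values[self._current_replica]

    def __add__(self, o):
        return self._get_as_operand() + o

    def __radd__(self, o):
        return o + self._get_as_operand()

    def __sub__(self, o):
        return self._get_as_operand() - o

d = ToyDelegate([3.0, 5.0], current_replica=1)
print(d + 1.0)   # forwards to 5.0 + 1.0 -> 6.0
print(10.0 + d)  # __radd__ -> 15.0
```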

1.2.4 PerReplica

PerReplica holds a map from replicas to values that are not synchronized.

class PerReplica(DistributedValues, composite_tensor.CompositeTensor):
  """Holds a map from replica to unsynchronized values."""

  @property
  def _type_spec(self):
    return PerReplicaSpec(
        *(type_spec.type_spec_from_value(v) for v in self._values))

  @property
  def values(self):
    """Returns the per replica values."""
    return self._values

1.2.5 Mirrored

Next we come to Mirrored.

Figure 7 Mirrored

Mirrored represents a variable created on multiple devices that is kept in sync by applying identical updates to every replica. Mirrored variables are created with tf.Variable(...synchronization=tf.VariableSynchronization.ON_WRITE...). They are normally used only for synchronous training.

Recall that DistributedValues holds a map from replicas to values that are to be kept in sync, but leaves _get_cross_replica unimplemented. Mirrored is meant to be usable directly in cross-replica mode, so it implements _get_cross_replica, which simply calls the base class DistributedValues' _get_on_device_or_primary method (see the corresponding section above): return the value for this replica or device, falling back to the value of _primary.

Conceptually:

Figure 8 How Mirrored computes

Mirrored is defined as follows:

# Note that unlike PerReplica, Mirrored values inherit from
# DistributedDelegate and so can be used directly in cross-replica mode.
class Mirrored(DistributedDelegate):
  """Holds a map from replica to values which are kept in sync."""

  def _get_cross_replica(self):
    return self._get_on_device_or_primary() # Call the base class DistributedValues method

  def _as_graph_element(self):
    obj = self._get() # Call the base class DistributedValues method
    conv_fn = getattr(obj, "_as_graph_element", None)
    if conv_fn and callable(conv_fn):
      return conv_fn()
    return obj

1.2.6 Policy

Next we look at the variable policies.

Figure 9 Variable policies

VariablePolicy

VariablePolicy is the base class for these policies; it defines how a distributed variable is synchronized and aggregated. When a variable is created inside a tf.distribute scope, given the synchronization and aggregation parameters set on tf.Variable, tf.distribute creates an appropriate policy object and assigns it to the distributed variable. All variable operations are delegated to that policy object.

class VariablePolicy(object):
  """Policy defining synchronization and aggregation of a distributed variable.

  Given  synchronization  and  aggregation  parameters set on a  tf.Variable 
  during variable creation within  tf.distribute  scope,  tf.distribute  creates
  an appropriate policy object and assigns it to the distributed variable. All
  variable operations are delegated to the respective policy object.
  """

  def __init__(self, aggregation):
    self._aggregation = aggregation

  def value(self):
    raise NotImplementedError(
        "VariablePolicy.value should be overriden by sub-classes.")

  def _is_mirrored(self):
    raise NotImplementedError(
        "VariablePolicy._is_mirrored should be overriden by sub-classes.")

  def _as_graph_element(self, _):
    raise NotImplementedError(
        "VariablePolicy._as_graph_element should be overriden by sub-classes.")

  def _get_cross_replica(self, var):
    raise NotImplementedError(
        "VariablePolicy._get_cross_replica should be overriden by sub-classes.")

  def _update_replica(self, var, update_fn, value, **kwargs):
    raise NotImplementedError(
        "VariablePolicy._update_replica should be overriden by sub-classes.")
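How the synchronization and aggregation arguments map to a policy can be condensed into a small selection function. The policy class names below echo TF's, but the selection logic is a simplified assumption based on the docstrings, not the exact source:

```python
# Simplified sketch of policy selection; the real logic lives inside
# tf.distribute and handles more cases.

ON_WRITE, ON_READ, AUTO = "on_write", "on_read", "auto"

class OnWritePolicySketch:
    def __init__(self, aggregation): self.aggregation = aggregation
    def _is_mirrored(self): return True

class OnReadPolicySketch:
    def __init__(self, aggregation): self.aggregation = aggregation
    def _is_mirrored(self): return False

def select_policy(synchronization, aggregation):
    if synchronization == ON_READ:
        return OnReadPolicySketch(aggregation)
    if synchronization in (ON_WRITE, AUTO):
        return OnWritePolicySketch(aggregation)
    raise ValueError(f"unknown synchronization: {synchronization}")

p = select_policy(ON_READ, "sum")
print(type(p).__name__, p._is_mirrored())
```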

OnReadPolicy

OnReadPolicy implements the ON_READ behavior; for example, its _get_cross_replica method calls var.distribute_strategy.reduce to perform the cross-replica read.
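Before reading the class, the ON_READ semantics can be condensed into a TensorFlow-free sketch: each replica accumulates into its own copy, and only a read in cross-replica context aggregates them (names below are illustrative, not TF APIs):

```python
# Sketch of ON_READ semantics: per-replica writes, aggregate-on-read.

class ToySyncOnReadVariable:
    def __init__(self, value, num_replicas, aggregation="sum"):
        self.values = [value] * num_replicas  # independent per-replica copies
        self.aggregation = aggregation

    def assign_add(self, replica_id, delta):
        # In replica context each replica updates only its own copy.
        self.values[replica_id] += delta

    def read_cross_replica(self):
        # Reading outside a replica context triggers the reduction.
        if self.aggregation == "sum":
            return sum(self.values)
        if self.aggregation == "mean":
            return sum(self.values) / len(self.values)
        if self.aggregation == "only_first_replica":
            return self.values[0]
        raise ValueError(self.aggregation)

v = ToySyncOnReadVariable(0.0, num_replicas=2, aggregation="sum")
v.assign_add(0, 1.0)
v.assign_add(1, 2.0)
print(v.values)                 # [1.0, 2.0] -- replicas diverge
print(v.read_cross_replica())   # 3.0 -- aggregated on read
```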

class OnReadPolicy(VariablePolicy):
  """Policy defined for  tf.VariableSynchronization.ON_READ  synchronization.

  This policy is created when  synchronization  is set to
   tf.VariableSynchronization.ON_READ  and  aggregation  is set to any of the
  values allowed by the  tf.VariableAggregation  enum such as  NONE ,  SUM ,
   MEAN  or  ONLY_FIRST_REPLICA when creating a  tf.Variable  in  tf.distribute 
  scope.
  """

  def _is_mirrored(self):
    return False

  def value(self, var):
    with ds_context.enter_or_assert_strategy(var.distribute_strategy):
      if (ds_context.in_cross_replica_context() and
          not values_util.in_replica_update_context()):
        if self._aggregation == vs.VariableAggregation.ONLY_FIRST_REPLICA:
          return var._get_replica(0).value()  
        return var._get_cross_replica()  
      else:
        return var._get_on_device_or_primary().value()  

  def _as_graph_element(self, var):
    with ds_context.enter_or_assert_strategy(var.distribute_strategy):
      if ds_context.in_cross_replica_context():
        return ops.convert_to_tensor(var._get_cross_replica())  
    return var._get()._as_graph_element()  

  def _get_cross_replica(self, var):
    if self._aggregation == vs.VariableAggregation.ONLY_FIRST_REPLICA:
      return var._get_replica(0)  # Read from the first replica
    if self._aggregation == vs.VariableAggregation.SUM:
      values_util.mark_as_unsaveable() # Cannot be saved in this state
    with ds_context.enter_or_assert_strategy(var.distribute_strategy):
      # Delegate the reduction to distribute_strategy
      return var.distribute_strategy.reduce(
          reduce_util.ReduceOp.from_variable_aggregation(self._aggregation),
          var,
          axis=None)

  def _update_replica(self, var, update_fn, value, **kwargs):
    return update_fn(var._get_on_device_or_primary(), value, **kwargs)  

  def assign_add(self,
                 var,
                 value,
                 use_locking=False,
                 name=None,
                 read_value=True):
    """Adds a value to this variable."""
    with ds_context.enter_or_assert_strategy(var.distribute_strategy):
      if (ds_context.in_cross_replica_context() and
          not values_util.in_replica_update_context()):
        values_util.mark_as_unsaveable()
        return values_util.on_read_assign_add_cross_replica(
            var, value, read_value=read_value)
      else:
        return values_util.on_write_assign_add(
            var,
            value,
            use_locking=use_locking,
            name=name,
            read_value=read_value)

  def assign(self, var, value, use_locking=False, name=None, read_value=True):
    with ds_context.enter_or_assert_strategy(var.distribute_strategy):
      if (ds_context.in_cross_replica_context() and
          not values_util.in_replica_update_context()):
        values_util.mark_as_unsaveable()
        return values_util.on_read_assign_cross_replica(
            var, value, read_value=read_value)
      else:
        return values_util.on_write_assign(
            var,
            value,
            use_locking=use_locking,
            name=name,
            read_value=read_value)
    
  # Most code omitted

OnWritePolicy

OnWritePolicy implements the ON_WRITE behavior. It mostly works through var._get_on_device_or_primary(); for example, _get_cross_replica calls var._get_on_device_or_primary() to do its job. It also relies on the basic operations in values_util.

class OnWritePolicy(VariablePolicy):
  """Policy defined for  tf.VariableSynchronization.ON_WRITE  synchronization.

  This policy is created when the following  synchronization  and  aggregation 
  parameters are specified when creating a  tf.Variable  in  tf.distribute 
  scope and  synchronization  is equal to  tf.VariableSynchronization.ON_WRITE 
  or  tf.VariableSynchronization.AUTO .
  """

  def _is_mirrored(self):
    return True

  def value(self, var):
    return var._get_on_device_or_primary().value()  

  def _as_graph_element(self, var):
    return var._get_on_device_or_primary()._as_graph_element()  

  def _get_cross_replica(self, var):
    # Return identity, to avoid directly exposing the variable to the user and
    # allowing it to be modified by mistake.
    return array_ops.identity(var._get_on_device_or_primary())  

  # Calls update_fn or _on_write_update_replica to perform the update
  def _update_replica(self, var, update_fn, value, **kwargs):
    if var.aggregation == variables_lib.VariableAggregation.NONE:
      return update_fn(var._get_on_device_or_primary(), value, **kwargs)  
    return _on_write_update_replica(var, update_fn, value, **kwargs)

  def assign(self, var, value, use_locking=False, name=None, read_value=True):
    return values_util.on_write_assign(
        var, value, use_locking=use_locking, name=name, read_value=read_value)

  def assign_add(self,
                 var,
                 value,
                 use_locking=False,
                 name=None,
                 read_value=True):
    # Delegates the work to values_util
    return values_util.on_write_assign_add(
        var, value, use_locking=use_locking, name=name, read_value=read_value)

  # Discussed again below
  def scatter_update(self, var, sparse_delta, use_locking=False, name=None):
    return values_util.scatter_update(
        var, sparse_delta, use_locking=use_locking, name=name)

  def get_saveable(self, var, primary_var, name):
    """Saveable ops for AUTO variables."""
    return values_util.get_on_write_saveable(var, primary_var, name)

  def get_restore_ops(self, var, tensor):
    return values_util.get_on_write_restore_ops(var, tensor)

  # Most code omitted

values_util

Both policies use on_write_assign_add, which is defined in tensorflow/python/distribute/values_util.py.

def on_write_assign_add(var, value, use_locking=False, name=None,
                        read_value=True):
  assign_add_fn = lambda var, *a, **kw: var.assign_add(*a, **kw)
  return var._update(  
      update_fn=assign_add_fn,
      value=value,
      use_locking=use_locking,
      name=name,
      read_value=read_value)

OnWritePolicy also uses scatter_update from values_util, which, as it turns out, calls back into var._update.

def scatter_update(var, sparse_delta, use_locking=False, name=None):
  scatter_update_fn = lambda var, *a, **kw: var.scatter_update(*a, **kw)
  return var._update( 
      update_fn=scatter_update_fn,
      value=sparse_delta,
      use_locking=use_locking,
      name=name)

1.2.7 DistributedVariable

Following the class hierarchy, we finally reach DistributedVariable, which is where most of MirroredVariable's functionality actually lives.

Figure 10 DistributedVariable

DistributedVariable holds the map from replicas to variables. For MirroredVariable, self._policy is an OnWritePolicy, and variable updates are carried out through that policy.

class DistributedVariable(DistributedDelegate, variables_lib.Variable,
                          core.Tensor):
  """Holds a map from replica to variables."""

  def __init__(self, strategy, values, aggregation, var_policy=None):
    if (aggregation == variables_lib.VariableAggregation.MEAN and
        not values[0].dtype.is_floating):
      raise ValueError(
          "creating distributed tf.Variable with aggregation=MEAN and a "
          "non-floating dtype is not supported, please use a different "
          "aggregation or dtype")
    self._distribute_strategy = strategy
    self._aggregation = aggregation
    super(DistributedVariable, self).__init__(values)
    self._common_name = self._primary.name.split(":")[0]
    # Use a weakref to make it easy to map from the contained values
    # to the container without introducing a reference cycle.
    for v in values:
      v._distributed_container = weakref.ref(self)  # pylint: disable=protected-access

    # Packed variable is used to reduce the overhead of function execution.
    # For a DistributedVariable, only one variable handle is captured into a
    # function graph. It's only supported in eager mode.
    if ops.executing_eagerly_outside_functions() and getattr(
        strategy, "_enable_packed_variable_in_eager_mode", False):
      name = "%s/packed/" % self._common_name
      self._packed_var = packed.PackedDistributedVariable(values, name=name)
    else:
      self._packed_var = None

    # tf.keras keeps track of variables initialized using this attribute. When
    # tf.keras gets the default session, it initializes all uninitialized vars.
    # We need to make _keras_initialized a member of DistributedVariable because
    # without this it will use  __getattr__  which will delegate to a component
    # variable.
    self._keras_initialized = False
    # Typically, a  DistributedVariable 's initializer is composed of the
    # initializers of the components variables. However, in some cases, such as
    # when restoring from a checkpoint, we may set the _initializer_op
    # property on the entire  DistributedVariable .
    self._initializer_op = None
    # Set a VariablePolicy which decides how we replicate/aggregate the given
    # variable.
    self._policy = var_policy

How a given operation is handled depends on the situation, but ultimately everything comes down to the strategy or strategy.extended.

Reading

On a read, _get_cross_replica is called; internally it delegates to the Policy, and the Policy calls distribute_strategy to perform the reduction.

def _get_cross_replica(self):
  if values_util.is_saving_non_distributed(): 
    return self._primary # Non-distributed saving: return the primary directly
  if self._policy:
    # Return the cross-replica value
    return self._policy._get_cross_replica(self)  

  raise NotImplementedError(
      "DistributedVariable._get_cross_replica requires a valid "
      "VariablePolicy. Please set the policy via the  var_policy  argument "
      "in the constructor, or override this method in sub-classes which "
      "support cross-replica accesses.")

Concretely:

Figure 11 DistributedVariable read

scatter_update

scatter_update likewise delegates the update to _policy.

def scatter_update(self, sparse_delta, use_locking=False, name=None):
  if values_util.is_saving_non_distributed():
    return self._primary.scatter_update(sparse_delta, use_locking, name)
  if self._policy:
    return self._policy.scatter_update(
        self, sparse_delta, use_locking=use_locking, name=name)
  return values_util.scatter_update(
      self, sparse_delta, use_locking=use_locking, name=name)

As discussed for OnWritePolicy above, scatter_update eventually calls back into DistributedVariable's own _update method.

def scatter_update(var, sparse_delta, use_locking=False, name=None):
  scatter_update_fn = lambda var, *a, **kw: var.scatter_update(*a, **kw)
  return var._update(  
      update_fn=scatter_update_fn,
      value=sparse_delta,
      use_locking=use_locking,
      name=name)

var._update has several execution paths; we analyze only some of them.

def _update(self, update_fn, value, **kwargs):
  """Applies updates depending on the context.

  The method calls  _update_replica  in replica context,
   _update_cross_replica  in cross replica context, and  update_fn  in update
  context.

  If  read_value  is True, the method returns the updated Variable. If
   read_value  is False, the method returns the update  tf.Operation .

  Args:
    update_fn: A callable to pass to  strategy.extended.update  to update the
      variable. It should have the same signature as  Variable.assign() .
    value: value to be passed to  update_fn .
    **kwargs: keyword arguments to  update_fn .

  Returns:
    Updated variable or  tf.Operation .

  """
  if values_util.is_saving_non_distributed():
    return update_fn(self._primary, value, **kwargs) # Non-distributed case

  with ds_context.enter_or_assert_strategy(self.distribute_strategy):
    if ds_context.in_cross_replica_context():
      update_replica_id = distribute_lib.get_update_replica_id()
      if update_replica_id is not None:
        replica_value = self._get_replica(update_replica_id)
        return update_fn(replica_value, value, **kwargs)
      return self._update_cross_replica(update_fn, value, **kwargs) # Cross-replica update
    else:
      values_util.assert_replica_context(self.distribute_strategy)
      return self._update_replica(update_fn, value, **kwargs)
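The three-way dispatch in _update (update context, cross-replica context, replica context) can be sketched as follows; the context flags are simulated with plain parameters rather than TF's thread-local state, and all names are illustrative:

```python
# Sketch of _update's dispatch; contexts are plain parameters here,
# while TF tracks them in thread-local state.

def toy_update(var, update_fn, value,
               saving_non_distributed=False,
               in_cross_replica_context=False,
               update_replica_id=None):
    if saving_non_distributed:
        return ("primary", update_fn(var["primary"], value))
    if in_cross_replica_context:
        if update_replica_id is not None:
            # Update context: apply directly to the chosen replica.
            return ("replica",
                    update_fn(var["replicas"][update_replica_id], value))
        # Cross-replica context: fan out to all replicas.
        return ("cross_replica",
                [update_fn(v, value) for v in var["replicas"]])
    # Replica context: the aggregate-then-update path (_update_replica).
    return ("replica_context", update_fn(var["replicas"][0], value))

var = {"primary": 1.0, "replicas": [1.0, 1.0]}
add = lambda component, v: component + v
print(toy_update(var, add, 2.0, in_cross_replica_context=True))
# ('cross_replica', [3.0, 3.0])
```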

_update_cross_replica is then called for the cross-replica update.

def _update_cross_replica(self, update_fn, value, **kwargs):
  """Applies updates across replicas.

  Args:
    update_fn: A callable to pass to  strategy.extended.update  to update the
      variable. It should has the same signature as  Variable.assign() .
    value: value to be passed to  update_fn .
    **kwargs: remaining arguments to  update_fn .

  Returns:
    Updated variable or  tf.Operation .
  """
  values_util.mark_as_unsaveable()
  return self.distribute_strategy.extended.update(
      self, update_fn, args=(value,), kwargs=kwargs, group=True)

In pictures:

Figure 12 DistributedVariable update

1.2.8 Saving

Next we look at how a MirroredVariable is saved. In _gather_saveables_for_checkpoint, the _saveable_factory builds a _MirroredSaveable to do the saving.

class MirroredVariable(DistributedVariable, Mirrored):

  def _gather_saveables_for_checkpoint(self):
    """Overrides Trackable method.

    This allows both name-based and object-based save and restore of
    MirroredVariables.

    Returns:
      A dictionary mapping attribute names to  SaveableObject  factories.
    """

    def _saveable_factory(name=self._common_name):
      return _MirroredSaveable(self, self._primary, name)

    return {trackable.VARIABLE_VALUE_KEY: _saveable_factory}

_MirroredSaveable defines how to save and restore a MirroredVariable.

class _MirroredSaveable(saveable_object.SaveableObject):
  """Class for defining how to restore a MirroredVariable."""

  def __init__(self, mirrored_variable, primary_variable, name):
    self._mirrored_variable = mirrored_variable
    # This ends up calling get_on_write_saveable
    tensor, spec = values_util.get_on_write_saveable(self._mirrored_variable,
                                                     primary_variable, name)
    super(_MirroredSaveable, self).__init__(tensor, spec, name)

  def restore(self, restored_tensors, restored_shapes):
    """Restore the same value into all variables."""
    tensor, = restored_tensors
    return values_util.get_on_write_restore_ops(self._mirrored_variable, tensor)

The code of get_on_write_saveable:

def get_on_write_saveable(var, primary_var, name):
  """Return saveable spec for AUTO and ON_WRITE variables."""
  # We use a callable so that we don't have to evaluate this expression
  # in the case where we are trying to restore instead of save.
  def tensor():
    if context.executing_eagerly() and not primary_var.is_initialized():
      # A SaveSpec tensor value of  None  indicates that the variable is
      # uninitialized.
      return None
    strategy = var.distribute_strategy
    return strategy.extended.read_var(var) # Read the tensor

  spec = saveable_object.SaveSpec(
      tensor=tensor,
      slice_spec="",
      name=name,
      dtype=var.dtype,
      device=primary_var.device)

  return tensor, [spec]

In tensorflow/python/distribute/mirrored_strategy.py, read_var fetches the value across replicas.

def read_var(self, replica_local_var):
  """Read the aggregate value of a replica-local variable."""
  if distribute_utils.is_sync_on_read(replica_local_var):
    return replica_local_var._get_cross_replica()
  return array_ops.identity(replica_local_var._get())
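The key idea, then, is that only one tensor is saved per logical variable, and on restore that one tensor is written back to every replica. A hypothetical miniature (not the real saveable machinery):

```python
# Sketch of MirroredVariable checkpointing: save one copy, restore to all.

class ToyMirrored:
    def __init__(self, values):
        self.values = list(values)   # one entry per replica, kept identical

def save(var):
    # Mirrored replicas are identical, so the primary stands for all of them.
    return var.values[0]

def restore(var, saved_tensor):
    # Write the same restored value into every replica.
    var.values = [saved_tensor] * len(var.values)
    return var

v = ToyMirrored([7.0, 7.0])
ckpt = save(v)
v.values = [0.0, 0.0]           # simulate a fresh, uninitialized variable
restore(v, ckpt)
print(v.values)                 # [7.0, 7.0]
```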

1.2.9 Summary

After the analysis above, we arrive at the annotated MirroredVariable class hierarchy below; much of its functionality ultimately rests on tf.distribute.Strategy.

Figure 13 Annotated MirroredVariable class hierarchy

1.3 Building Variables

A variable created under MirroredStrategy is a MirroredVariable. If no devices are specified in the strategy's constructor arguments, it uses all available GPUs; if no GPUs are found, it uses the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device and uses threads internally for parallelism. Next, let us see how a MirroredVariable is built.

1.3.1 StrategyBase

First, the following code in tensorflow/python/distribute/distribute_lib.py shows that for scope, it is again _extended that does the real work.

def scope(self):
  """Returns a context manager selecting this Strategy as current.

  Inside a  with strategy.scope():  code block, this thread
  will use a variable creator set by  strategy , and will
  enter its "cross-replica context".

  Returns:
    A context manager.
  """
  return self._extended._scope(self)  

1.3.2 StrategyExtendedV2

So we come to StrategyExtendedV2. StrategyExtendedV2._scope defines creator_with_resource_vars, which provides the mechanism for creating variables; internally, creator_with_resource_vars calls the derived class's _create_variable to build the variable.

def _scope(self, strategy):
  """Implementation of tf.distribute.Strategy.scope()."""

  def creator_with_resource_vars(next_creator, **kwargs):
    """Variable creator to use in  _CurrentDistributionContext ."""
    _require_strategy_scope_extended(self)
    kwargs["use_resource"] = True
    kwargs["distribute_strategy"] = strategy

    # Unwrap  initial_value  if it is a  CheckpointInitialValue  to avoid
    # dereferencing a  Tensor  that is without a  name . We still need to
    # propagate the metadata it's holding.
    if isinstance(kwargs["initial_value"], trackable.CheckpointInitialValue):
      checkpoint_restore_uid = kwargs[
          "initial_value"].checkpoint_position.restore_uid
      kwargs["initial_value"] = kwargs["initial_value"].wrapped_value
    elif isinstance(kwargs["initial_value"],
                    trackable.CheckpointInitialValueCallable):
      checkpoint_restore_uid = kwargs[
          "initial_value"].checkpoint_position.restore_uid
    elif (isinstance(kwargs["initial_value"], functools.partial) and
          isinstance(kwargs["initial_value"].func,
                     trackable.CheckpointInitialValueCallable)):
      # Some libraries (e.g, Keras) create partial function out of initializer
      # to bind shape/dtype, for example:
      #  initial_val = functools.partial(initializer, shape, dtype=dtype)
      # Therefore to get the restore_uid we need to examine the "func" of
      # the partial function.
      checkpoint_restore_uid = kwargs[
          "initial_value"].func.checkpoint_position.restore_uid
    else:
      checkpoint_restore_uid = None

    created = self._create_variable(next_creator, **kwargs)

    if checkpoint_restore_uid is not None:
      # Let the checkpointing infrastructure know that the variable was
      # already restored so it doesn't waste memory loading the value again.
      # In this case of CheckpointInitialValueCallable this may already be
      # done by the final variable creator, but it doesn't hurt to do it
      # again.
      created._maybe_initialize_trackable()
      created._update_uid = checkpoint_restore_uid
    return created

  def distributed_getter(getter, *args, **kwargs):
    return getter(*args, **kwargs)

  # creator_with_resource_vars is used here
  return _CurrentDistributionContext(
      strategy,
      variable_scope.variable_creator_scope(creator_with_resource_vars), # configure how variables are created
      variable_scope.variable_scope(
          variable_scope.get_variable_scope(),
          custom_getter=distributed_getter), self._default_device)

The logic is as follows: entering the scope triggers a series of operations and returns a _CurrentDistributionContext, which in turn performs its own series of operations. Let us keep looking.

Figure 14 How variables are created
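To make this chain concrete, here is a minimal pure-Python sketch of how layered variable creators delegate to one another through a next_creator argument. Everything below is illustrative: the names _creator_stack, default_creator and the placeholder strategy tag are assumptions introduced here; the real mechanism lives in variable_scope.variable_creator_scope.

```python
# Illustrative sketch of variable-creator chaining (not TF's real code).
_creator_stack = []  # innermost (most recently entered) creator is last

def default_creator(**kwargs):
    # Stands in for the base variable constructor: just record the kwargs.
    return dict(kwargs)

def make_next_creator(index):
    """Build the next_creator passed to the creator at position index."""
    def next_creator(**kwargs):
        if index == 0:
            return default_creator(**kwargs)
        return _creator_stack[index - 1](make_next_creator(index - 1), **kwargs)
    return next_creator

def create_variable(**kwargs):
    """Invoke the innermost creator, which may delegate outward."""
    if not _creator_stack:
        return default_creator(**kwargs)
    top = len(_creator_stack) - 1
    return _creator_stack[top](make_next_creator(top), **kwargs)

def creator_with_resource_vars(next_creator, **kwargs):
    # Mimics the shape of the creator in StrategyExtendedV2._scope:
    # tag the kwargs, then delegate to the next creator in the chain.
    kwargs["use_resource"] = True
    kwargs["distribute_strategy"] = "MirroredStrategy"  # placeholder tag
    return next_creator(**kwargs)

_creator_stack.append(creator_with_resource_vars)
v = create_variable(name="my_var", initial_value=1.0)
```

Entering a strategy scope corresponds to pushing creator_with_resource_vars onto the stack, so every variable created inside the scope passes through it first.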

1.3.3 _CurrentDistributionContext

_CurrentDistributionContext maintains strategy-related information, sets up the various scopes, and returns the strategy.

class _CurrentDistributionContext(object):
  """Context manager setting the current  tf.distribute.Strategy .

  Also: overrides the variable creator and optionally the current device.
  """

  def __init__(self,
               strategy,
               var_creator_scope,
               var_scope=None,
               resource_creator_scope=None,
               default_device=None):
    self._context = distribution_strategy_context._CrossReplicaThreadMode( 
        strategy)
    self._var_creator_scope = var_creator_scope
    self._var_scope = var_scope
    self._resource_creator_scope = resource_creator_scope
    if default_device:
      self._device_scope = ops.device(default_device)
    else:
      self._device_scope = None
    self._same_scope_again_count = 0

  def __enter__(self):
    # Allow this scope to be entered if this strategy is already in scope.
    if distribution_strategy_context.has_strategy():
      _require_cross_replica_or_default_context_extended(
          self._context.strategy.extended)
      self._same_scope_again_count += 1
    else:
      _push_per_thread_mode(self._context)
      if self._var_scope:
        self._var_scope.__enter__()
      self._var_creator_scope.__enter__()
      if self._resource_creator_scope:
        nest.map_structure(lambda scope: scope.__enter__(),
                           self._resource_creator_scope)
      if self._device_scope:
        self._device_scope.__enter__()
    return self._context.strategy

  def __exit__(self, exception_type, exception_value, traceback):
    if self._same_scope_again_count > 0:
      self._same_scope_again_count -= 1
      return
    if self._device_scope:
      try:
        self._device_scope.__exit__(exception_type, exception_value, traceback)
      except RuntimeError as e:
        six.raise_from(
            RuntimeError("Device scope nesting error: move call to "
                         "tf.distribute.set_strategy() out of  with  scope."),
            e)

    try:
      self._var_creator_scope.__exit__(
          exception_type, exception_value, traceback)
    except RuntimeError as e:
      six.raise_from(
          RuntimeError("Variable creator scope nesting error: move call to "
                       "tf.distribute.set_strategy() out of  with  scope."),
          e)

    if self._resource_creator_scope:
      try:
        if isinstance(self._resource_creator_scope, list):
          reversed_resource_creator_scope = self._resource_creator_scope[::-1]
          nest.map_structure(
              lambda scope: scope.__exit__(exception_type, exception_value,  
                                           traceback),
              reversed_resource_creator_scope)

        else:
          self._resource_creator_scope.__exit__(exception_type, exception_value,
                                                traceback)
      except RuntimeError as e:
        six.raise_from(
            RuntimeError("Resource creator scope nesting error: move call "
                         "to tf.distribute.set_strategy() out of  with  "
                         "scope."), e)

    if self._var_scope:
      try:
        self._var_scope.__exit__(exception_type, exception_value, traceback)
      except RuntimeError as e:
        six.raise_from(
            RuntimeError("Variable scope nesting error: move call to "
                         "tf.distribute.set_strategy() out of  with  scope."),
            e)
    _pop_per_thread_mode()

1.3.4 MirroredStrategy

From the above analysis we know that when a Strategy is in use, the Strategy's _create_variable is what ultimately produces the variable.

_create_variable carries out the concrete work. It uses self._devices and calls distribute_utils.create_mirrored_variable, which builds the variable from _real_mirrored_creator, VARIABLE_CLASS_MAPPING and VARIABLE_POLICY_MAPPING. _real_mirrored_creator configures the concrete variable names, which later calls use to decide which device each variable should be placed on. The first device keeps the original name, while every subsequent device appends /replica_<id>/ to the original variable name so it can be distinguished from the original variable. The original variable's value is then assigned to these replica variables.

def _create_variable(self, next_creator, **kwargs):
  """Create a mirrored variable. See  DistributionStrategy.scope ."""
  colocate_with = kwargs.pop("colocate_with", None)
  if colocate_with is None:
    devices = self._devices
  elif isinstance(colocate_with, numpy_dataset.SingleDevice):
    with ops.device(colocate_with.device):
      return next_creator(**kwargs)
  else:
    devices = colocate_with._devices  

  def _real_mirrored_creator(**kwargs):  
    value_list = []
    for i, d in enumerate(devices):
      with ops.device(d):
        kwargs["initial_value"] = self._get_variable_creator_initial_value(
            replica_id=i,
            device=d,
            primary_var=value_list[0] if value_list else None,
            **kwargs)
        if i > 0:
          # Give replicas meaningful distinct names:
          var0name = value_list[0].name.split(":")[0]
          # We append a / to variable names created on replicas with id > 0 to
          # ensure that we ignore the name scope and instead use the given
          # name as the absolute name of the variable.
          kwargs["name"] = "%s/replica_%d/" % (var0name, i)
        with context.device_policy(context.DEVICE_PLACEMENT_SILENT):
          # Don't record operations (e.g. other variable reads) during
          # variable creation.
          with tape.stop_recording():
            v = next_creator(**kwargs)
        assert not isinstance(v, values.DistributedVariable)
        value_list.append(v)
    return value_list

  return distribute_utils.create_mirrored_variable(
      self._container_strategy(), _real_mirrored_creator,
      distribute_utils.VARIABLE_CLASS_MAPPING,
      distribute_utils.VARIABLE_POLICY_MAPPING, **kwargs)
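The naming rule inside _real_mirrored_creator can be sketched as a small helper. replica_variable_names is a hypothetical function introduced here for illustration; it only reproduces the string manipulation shown above.

```python
def replica_variable_names(primary_name, num_devices):
    """Sketch of the naming rule in _real_mirrored_creator: replica 0 keeps
    the primary name; replica i > 0 gets '<name>/replica_<i>/'. The trailing
    slash makes TF treat it as an absolute name, ignoring name scopes."""
    var0name = primary_name.split(":")[0]  # drop the ':0' output suffix
    names = [primary_name]
    for i in range(1, num_devices):
        names.append("%s/replica_%d/" % (var0name, i))
    return names

names = replica_variable_names("Variable:0", 2)
```

This matches the printout at the beginning of the article, where the second component of the MirroredVariable is named Variable/replica_1:0.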

VARIABLE_CLASS_MAPPING determines which kind of variable to create. VARIABLE_POLICY_MAPPING determines which policy to use for read/write synchronization.

# The following mapping indicates the policy that you must use for a given
# variable  synchronization  and  aggregation  pair.
# OnWritePolicy is used for:
# (synchronization=Auto, aggregation=NONE,SUM,MEAN,ONLY_FIRST_REPLICA)
# (synchronization=ON_WRITE, aggregation=NONE,SUM,MEAN,ONLY_FIRST_REPLICA)
# OnReadPolicy is used for:
# (synchronization=ON_READ, aggregation=NONE,SUM,MEAN,ONLY_FIRST_REPLICA)
VARIABLE_POLICY_MAPPING = {
    vs.VariableSynchronization.ON_WRITE: values_lib.OnWritePolicy,
    vs.VariableSynchronization.ON_READ: values_lib.OnReadPolicy,
}

VARIABLE_CLASS_MAPPING = {
    "VariableClass": values_lib.DistributedVariable,
    vs.VariableSynchronization.ON_WRITE: values_lib.MirroredVariable, # this is our focus
    vs.VariableSynchronization.ON_READ: values_lib.SyncOnReadVariable,
}

1.3.5 distribute_utils

create_mirrored_variable in tensorflow/python/distribute/distribute_utils.py actually builds the variable. For our example, class_mapping is values_lib.MirroredVariable.

def create_mirrored_variable(strategy, real_mirrored_creator, class_mapping,
                             policy_mapping, **kwargs):
  """Create distributed variables with given synchronization and aggregation."""
  # Figure out what collections this variable should be added to.
  # We'll add the MirroredVariable to those collections instead.
  var_collections = kwargs.pop("collections", None)
  if var_collections is None:
    var_collections = [ops.GraphKeys.GLOBAL_VARIABLES]
  kwargs["collections"] = []

  synchronization = _validate_synchronization(kwargs)
  # Update synchronization in kwargs in case it's AUTO, which is converted to
  # ON_WRITE.
  kwargs["synchronization"] = synchronization
  aggregation = _validate_aggregation(kwargs)
  use_var_policy = getattr(strategy.extended, "_use_var_policy", False)

  # Ignore user-specified caching device, not needed for mirrored variables.
  kwargs.pop("caching_device", None)

  with tape.stop_recording():
    # Build the list of mirrored variables
    value_list = real_mirrored_creator(**kwargs)
    # MirroredVariable is recreated during saved_model loading, and its
    # component variables (value_list) will have None initializer. We
    # set their initializers to no_op so that consumer like
    #  global_variables_initializer  wouldn't complain, as it groups all
    # variables' initializers thus all variables have to have initializers.
    for v in value_list:
      if hasattr(v, "_initializer_op") and v._initializer_op is None:
        v._initializer_op = control_flow_ops.no_op()
    if use_var_policy:
      # Get the policy, obtain the class, and create the variable
      var_policy_cls = policy_mapping.get(synchronization)
      var_policy = var_policy_cls(aggregation=aggregation)
      var_cls = class_mapping.get("VariableClass")
      result = var_cls(strategy, value_list, aggregation, var_policy=var_policy)
    else:
      var_cls = class_mapping.get(synchronization)
      result = var_cls(strategy, value_list, aggregation)

  # Add the wrapped variable to the requested collections.
  # The handling of eager mode and the global step matches
  # ResourceVariable._init_from_args().
  if not context.executing_eagerly():
    g = ops.get_default_graph()
    # If "trainable" is True, next_creator() will add the member variables
    # to the TRAINABLE_VARIABLES collection, so we manually remove
    # them and replace with the MirroredVariable. We can't set
    # "trainable" to False for next_creator() since that causes functions
    # like implicit_gradients to skip those variables.
    if kwargs.get("trainable", True):
      var_collections.append(ops.GraphKeys.TRAINABLE_VARIABLES)
      l = g.get_collection_ref(ops.GraphKeys.TRAINABLE_VARIABLES)
      for value in value_list:
        for i, trainable_variable in enumerate(l):
          if value is trainable_variable:
            del l[i]
            break

    g.add_to_collections(var_collections, result)
  elif ops.GraphKeys.GLOBAL_STEP in var_collections:
    ops.add_to_collections(ops.GraphKeys.GLOBAL_STEP, result)

  return result

The final construction logic is as follows: the _CurrentDistributionContext member _var_creator_scope points to creator_with_resource_vars. When a variable is created, the call chain invokes creator_with_resource_vars layer by layer, finally producing a MirroredVariable.

Figure 15 Creating a variable

1.4 Summary

So far we can answer the earlier questions as follows:

  • How do calls reach the Strategy?
    • Variable reads and writes ultimately land on strategy or strategy.extended.
  • How is a Mirrored Variable created?
    • Inside the scope the user obtains a context; the context provides the method for creating variables, so the variables a user creates within the context are naturally Mirrored Variables.
  • How are tensors distributed to the devices?
    • When a Strategy is in use, variables are created with the Strategy's _create_variable, which eventually calls _real_mirrored_creator.
    • _real_mirrored_creator configures the concrete variable names, which later calls use to decide device placement. The first device keeps the original name; every subsequent device appends /replica_<id>/ to the original variable name to distinguish the replicas.
    • During placement, variables are assigned to devices according to these names.
  • How is a unified external view maintained?
    • Within the context the user sees a Mirrored Variable, which hides the internal variables and presents a unified view. For example, a read calls _get_cross_replica, which internally invokes the Policy, and the Policy uses distribute_strategy to perform the reduction.
  • How do the variables stay consistent?
    • As seen in the earlier scatter_update analysis, variable updates land on strategy.extended, where the variables are kept consistent with one another via, e.g., All-Reduce; we analyze this in detail in a later article.

Let us illustrate with a diagram. Suppose MirroredVariable A consists internally of three tensors. Each worker believes it is updating MirroredVariable A, but in fact each updates a different variable; the variables are kept consistent with one another via, e.g., All-Reduce.

Figure 16 How updates happen
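The consistency mechanism can be sketched in a single-process toy: each replica holds its own copy, an all-reduce (here just a plain sum) aggregates the per-replica gradients, and every copy applies the identical aggregated update, so the copies never diverge. ToyMirroredVariable is an illustrative stand-in, not TF's NCCL/collective-ops implementation.

```python
class ToyMirroredVariable:
    """Single-process sketch of a mirrored variable whose replicas are
    kept in sync by a (simulated) all-reduce. Not TF's implementation."""

    def __init__(self, value, num_replicas):
        self.values = [float(value)] * num_replicas

    def apply_gradients(self, per_replica_grads, lr=0.1):
        # All-reduce with SUM: every replica sees the same aggregated gradient.
        reduced = sum(per_replica_grads)
        # Each replica applies the identical update, so copies stay equal.
        self.values = [v - lr * reduced for v in self.values]

var = ToyMirroredVariable(1.0, num_replicas=3)
var.apply_gradients([0.5, 1.0, 1.5])  # aggregated gradient = 3.0
```

After the step, all three copies hold the same value (1.0 - 0.1 * 3.0), exactly the invariant MirroredVariable maintains.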

2. ShardedVariable

In machine-learning training, if a variable is too large to fit on a single device (for example a large embedding), it may need to be sharded across multiple devices. In TensorFlow, the concept corresponding to this idea is ShardedVariable.

Figure 17 ShardedVariable

Variable sharding means splitting a variable into multiple smaller variables, called shards. A ShardedVariable can be seen as a container whose contained "variables" are to be treated as shards. The ShardedVariable class maintains a list of smaller variables that can be stored independently on different devices (for example, on multiple parameter servers), and it handles saving and restoring these variables as if they were a single larger variable. Variable sharding is useful for spreading the network load of accessing the shards, and also for distributing the computation and storage of an ordinary variable across multiple parameter servers.

Figure 18 The ShardedVariable container

A ShardedVariable object can be saved with a given number of shards and then restored from a checkpoint with a different number of shards. The resulting SavedModel can be used by programs such as the TF serving APIs, but not by tf.saved_model.load. Since a ShardedVariable can be saved and then restored with a different number of shards depending on the restore environment (for example, the TF serving APIs restore it to a single shard for serving efficiency), when using ShardedVariable inside a tf.function one should generally not assume it has the same number of shards at save time and load time.

2.1 Questions

For ShardedVariable we again use a few questions to guide the analysis.

  • How are parameters stored on the parameter servers?
  • How is a parameter stored in shards?
  • How is the computation (the operations that update parameters with gradients) placed on the parameter servers? (analyzed in later chapters)
  • Does the coordinator assign computation randomly? (analyzed in later chapters)

2.2 Definition

The definition of ShardedVariable itself is thin; the essence lies in the base class ShardedVariableMixin, which we analyze shortly.

Figure 19 ShardedVariable definition

The concrete definition is as follows:

class ShardedVariable(ShardedVariableMixin, composite_tensor.CompositeTensor):
  """A container for  Variables  that should be treated as shards.
  """

  @property
  def _type_spec(self):
    return ShardedVariableSpec(
        *(resource_variable_ops.VariableSpec(v.shape, v.dtype)
          for v in self._variables))

  @classmethod
  def _overload_all_operators(cls):
    """Register overloads for all operators."""
    for operator in ops.Tensor.OVERLOADABLE_OPERATORS:
      if operator == '__getitem__':
        continue

      cls._overload_operator(operator)

  @classmethod
  def _overload_operator(cls, operator):
    """Delegate an operator overload to  ops.Tensor ."""
    tensor_operator = getattr(ops.Tensor, operator)

    def _operator(v, *args, **kwargs):
      return tensor_operator(_var_to_tensor(v), *args, **kwargs)

    setattr(cls, operator, _operator)

2.3 How to Partition

One essential part of ShardedVariable is partitioning, so let us explore its mechanics. Note that ShardedVariable only supports partitioning along the first dimension.

2.3.1 Base Class

The base class Partitioner is minimal; derived classes need to implement __call__ .

@tf_export('distribute.experimental.partitioners.Partitioner', v1=[])
class Partitioner(object):
  """Partitioner base class: all partitiners inherit from this class.

  Partitioners should implement a  __call__  method with the following
  signature:

  ```python
  def __call__(self, shape, dtype, axis=0):
    # Partitions the given  shape  and returns the partition results.
    # See docstring of  __call__  method for the format of partition results.
  ```
  """

  def __call__(self, shape, dtype, axis=0):
    """Partitions the given  shape  and returns the partition results.

    Examples of a partitioner that allocates a fixed number of shards:

    ```python
    partitioner = FixedShardsPartitioner(num_shards=2)
    partitions = partitioner(tf.TensorShape([10, 3]), tf.float32, axis=0)
    print(partitions)  # [2, 1]
    ```

    Args:
      shape: a  tf.TensorShape , the shape to partition.
      dtype: a  tf.dtypes.Dtype  indicating the type of the partition value.
      axis: The axis to partition along.  Default: outermost axis.

    Returns:
      A list of integers representing the number of partitions on each axis,
      where i-th value correponds to i-th axis.
    """
    raise NotImplementedError

2.3.2 Fixed Sharding

FixedShardsPartitioner splits the variable into a fixed number of shards. There is a usage example in the comments; for that example, with axis = 0, min(self._num_shards, shape.dims[axis].value) = min(2, 10) = 2, so the variable is split into two shards.

@tf_export('distribute.experimental.partitioners.FixedShardsPartitioner', v1=[])
class FixedShardsPartitioner(Partitioner):
  """Partitioner that allocates a fixed number of shards.

  Examples:

  >>> # standalone usage:
  >>> partitioner = FixedShardsPartitioner(num_shards=2)
  >>> partitions = partitioner(tf.TensorShape([10, 3]), tf.float32)
  >>> [2, 1]
  >>>
  >>> # use in ParameterServerStrategy
  >>> # strategy = tf.distribute.experimental.ParameterServerStrategy(
  >>> #   cluster_resolver=cluster_resolver, variable_partitioner=partitioner)

  """

  def __init__(self, num_shards):
    """Creates a new  FixedShardsPartitioner .

    Args:
      num_shards:  int , number of shards to partition.
    """
    self._num_shards = num_shards

  def __call__(self, shape, dtype, axis=0):
    del dtype
    result = [1] * len(shape)
    result[axis] = min(self._num_shards, shape.dims[axis].value)
    return result
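The arithmetic of FixedShardsPartitioner.__call__ can be restated in plain Python. fixed_shards_partition is a hypothetical helper introduced here; it takes shapes as plain lists instead of tf.TensorShape.

```python
def fixed_shards_partition(shape, num_shards, axis=0):
    """Pure-Python restatement of FixedShardsPartitioner.__call__:
    every axis gets 1 partition except `axis`, which is split into
    min(num_shards, shape[axis]) shards."""
    result = [1] * len(shape)
    result[axis] = min(num_shards, shape[axis])
    return result
```

For shape [10, 3] with num_shards=2 this yields [2, 1], matching the docstring example; note that a dimension can never be split into more shards than its size.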

2.3.3 Minimum-Size Sharding

MinSizePartitioner allocates a minimum size per shard. This partitioner ensures each shard has at least min_shard_bytes, and tries to allocate as many shards as possible, i.e., to keep each shard as small as possible. The maximum number of such shards (the upper bound) is given by max_shards.

@tf_export('distribute.experimental.partitioners.MinSizePartitioner', v1=[])
class MinSizePartitioner(Partitioner):
  """Partitioner that allocates a minimum size per shard.

  This partitioner ensures each shard has at least  min_shard_bytes , and tries
  to allocate as many shards as possible, i.e., keeping shard size as small as
  possible. The maximum number of such shards (upper bound) is given by
   max_shards .

  Examples:

  >>> partitioner = MinSizePartitioner(min_shard_bytes=4, max_shards=2)
  >>> partitions = partitioner(tf.TensorShape([6, 1]), tf.float32)
  >>> [2, 1]
  >>> partitioner = MinSizePartitioner(min_shard_bytes=4, max_shards=10)
  >>> partitions = partitioner(tf.TensorShape([6, 1]), tf.float32)
  >>> [6, 1]
  >>>
  >>> # use in ParameterServerStrategy
  >>> # strategy = tf.distribute.experimental.ParameterServerStrategy(
  >>> #   cluster_resolver=cluster_resolver, variable_partitioner=partitioner)
  """

  def __init__(self,
               min_shard_bytes=256 << 10,
               max_shards=1,
               bytes_per_string=16):
    """Creates a new  MinSizePartitioner .

    Args:
      min_shard_bytes: Minimum bytes of each shard. Defaults to 256K.
      max_shards: Upper bound on the number of shards. Defaults to 1.
      bytes_per_string: If the partition value is of type string, this provides
        an estimate of how large each string is.
    """
    self._min_shard_bytes = min_shard_bytes
    self._max_shards = max_shards
    self._bytes_per_string = bytes_per_string

  def __call__(self, shape, dtype, axis=0):
    return partitioned_variables.min_max_variable_partitioner(
        max_partitions=self._max_shards,
        axis=axis,
        min_slice_size=self._min_shard_bytes,
        bytes_per_string_element=self._bytes_per_string)(shape, dtype)

min_max_variable_partitioner implements the actual logic. It returns a partitioner that partitions a variable of the given shape and dtype such that each partition holds at least a slice of min_slice_size bytes. The maximum number of such partitions (the upper bound) is given by max_partitions.

@tf_export(v1=["min_max_variable_partitioner"])
def min_max_variable_partitioner(max_partitions=1, axis=0,
                                 min_slice_size=256 << 10,
                                 bytes_per_string_element=16):
  """Partitioner to allocate minimum size per slice.

  Returns a partitioner that partitions the variable of given shape and dtype
  such that each partition has a minimum of  min_slice_size  slice of the
  variable. The maximum number of such partitions (upper bound) is given by
   max_partitions .

  Args:
    max_partitions: Upper bound on the number of partitions. Defaults to 1.
    axis: Axis along which to partition the variable. Defaults to 0.
    min_slice_size: Minimum size of the variable slice per partition. Defaults
      to 256K.
    bytes_per_string_element: If the  Variable  is of type string, this provides
      an estimate of how large each scalar in the  Variable  is.

  Returns:
    A partition function usable as the  partitioner  argument to
     variable_scope  and  get_variable .

  """
  def _partitioner(shape, dtype):
    """Partitioner that partitions list for a variable of given shape and type.

    Ex: Consider partitioning a variable of type float32 with
      shape=[1024, 1024].
      If  max_partitions  >= 16, this function would return
        [(1024 * 1024 * 4) / (256 * 1024), 1] = [16, 1].
      If  max_partitions  < 16, this function would return
        [ max_partitions , 1].

    Args:
      shape: Shape of the variable.
      dtype: Type of the variable.

    Returns:
      List of partitions for each axis (currently only one axis can be
      partitioned).

    Raises:
      ValueError: If axis to partition along does not exist for the variable.
    """
    if axis >= len(shape):
      raise ValueError("Can not partition variable along axis %d when shape is "
                       "only %s" % (axis, shape))
    if dtype.base_dtype == dtypes.string:
      bytes_per_element = bytes_per_string_element
    else:
      bytes_per_element = dtype.size
    total_size_bytes = shape.num_elements() * bytes_per_element
    partitions = total_size_bytes / min_slice_size
    partitions_list = [1] * len(shape)
    # We can not partition the variable beyond what its shape or
    #  max_partitions  allows.
    partitions_list[axis] = max(1, min(shape.dims[axis].value,
                                       max_partitions,
                                       int(math.ceil(partitions))))
    return partitions_list
  return _partitioner
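The worked example in the docstring can be checked with a pure-Python restatement of _partitioner's arithmetic. min_max_partition is a hypothetical helper; it takes bytes_per_element directly instead of a dtype.

```python
import math

def min_max_partition(shape, bytes_per_element, max_partitions=1,
                      axis=0, min_slice_size=256 << 10):
    """Pure-Python restatement of the arithmetic inside
    min_max_variable_partitioner._partitioner."""
    num_elements = 1
    for d in shape:
        num_elements *= d
    total_size_bytes = num_elements * bytes_per_element
    partitions = total_size_bytes / min_slice_size
    partitions_list = [1] * len(shape)
    # Cannot partition beyond the axis size or max_partitions.
    partitions_list[axis] = max(
        1, min(shape[axis], max_partitions, int(math.ceil(partitions))))
    return partitions_list
```

For a float32 variable of shape [1024, 1024]: (1024 * 1024 * 4) / (256 * 1024) = 16, so with max_partitions >= 16 the result is [16, 1], and with a smaller max_partitions the cap wins.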

2.3.4 Maximum-Size Sharding

MaxSizePartitioner ensures each shard is at most max_shard_bytes, and tries to allocate as few shards as possible, i.e., to keep each shard as large as possible. If the partitioner hits the max_shards limit, each shard may end up larger than max_shard_bytes. By default max_shards is None, meaning the number of shards is not limited.

@tf_export('distribute.experimental.partitioners.MaxSizePartitioner', v1=[])
class MaxSizePartitioner(Partitioner):
  """Partitioner that keeps shards below  max_shard_bytes .

  This partitioner ensures each shard has at most  max_shard_bytes , and tries
  to allocate as few shards as possible, i.e., keeping shard size as large
  as possible.

  If the partitioner hits the  max_shards  limit, then each shard may end up
  larger than  max_shard_bytes . By default  max_shards  equals  None  and no
  limit on the number of shards is enforced.

  Examples:

  >>> partitioner = MaxSizePartitioner(max_shard_bytes=4)
  >>> partitions = partitioner(tf.TensorShape([6, 1]), tf.float32)
  >>> [6, 1]
  >>> partitioner = MaxSizePartitioner(max_shard_bytes=4, max_shards=2)
  >>> partitions = partitioner(tf.TensorShape([6, 1]), tf.float32)
  >>> [2, 1]
  >>> partitioner = MaxSizePartitioner(max_shard_bytes=1024)
  >>> partitions = partitioner(tf.TensorShape([6, 1]), tf.float32)
  >>> [1, 1]
  >>>
  >>> # use in ParameterServerStrategy
  >>> # strategy = tf.distribute.experimental.ParameterServerStrategy(
  >>> #   cluster_resolver=cluster_resolver, variable_partitioner=partitioner)
  """

  def __init__(self, max_shard_bytes, max_shards=None, bytes_per_string=16):
    """Creates a new  MaxSizePartitioner .

    Args:
      max_shard_bytes: The maximum size any given shard is allowed to be.
      max_shards: The maximum number of shards in  int  created taking
        precedence over  max_shard_bytes .
      bytes_per_string: If the partition value is of type string, this provides
        an estimate of how large each string is.
    """
    if max_shard_bytes < 1:
      raise ValueError('max_shard_bytes must be positive, got: %r' %
                       max_shard_bytes)
    if max_shards and max_shards < 1:
      raise ValueError('max_shards must be positive, got: %r' % max_shards)
    if bytes_per_string < 1:
      raise ValueError('bytes_per_string must be positive, got: %r' %
                       bytes_per_string)

    self._max_shard_bytes = max_shard_bytes
    self._max_shards = max_shards
    self._bytes_per_string = bytes_per_string

  def __call__(self, shape, dtype, axis=0):
    return partitioned_variables.variable_axis_size_partitioner(
        max_shard_bytes=self._max_shard_bytes,
        max_shards=self._max_shards,
        bytes_per_string_element=self._bytes_per_string,
        axis=axis)(shape, dtype)

variable_axis_size_partitioner implements the actual logic. This partitioner shards a variable along one axis, trying to keep the largest shard below max_shard_bytes. If the partitioner hits the max_shards limit, each shard may end up larger than max_shard_bytes. By default max_shards is None, meaning the number of shards is not limited.

A reasonable value for max_shard_bytes is (64 << 20) - 1, i.e., just under 64MB, which keeps shards below the protobuf byte limit.

@tf_export(v1=["variable_axis_size_partitioner"])
def variable_axis_size_partitioner(
    max_shard_bytes, axis=0, bytes_per_string_element=16, max_shards=None):
  """Get a partitioner for VariableScope to keep shards below  max_shard_bytes .

  This partitioner will shard a Variable along one axis, attempting to keep
  the maximum shard size below  max_shard_bytes .  In practice, this is not
  always possible when sharding along only one axis.  When this happens,
  this axis is sharded as much as possible (i.e., every dimension becomes
  a separate shard).

  If the partitioner hits the  max_shards  limit, then each shard may end up
  larger than  max_shard_bytes . By default  max_shards  equals  None  and no
  limit on the number of shards is enforced.

  One reasonable value for  max_shard_bytes  is  (64 << 20) - 1 , or almost
   64MB , to keep below the protobuf byte limit.

  Args:
    max_shard_bytes: The maximum size any given shard is allowed to be.
    axis: The axis to partition along.  Default: outermost axis.
    bytes_per_string_element: If the  Variable  is of type string, this provides
      an estimate of how large each scalar in the  Variable  is.
    max_shards: The maximum number of shards in int created taking precedence
      over  max_shard_bytes .

  Returns:
    A partition function usable as the  partitioner  argument to
     variable_scope  and  get_variable .

  Raises:
    ValueError: If any of the byte counts are non-positive.
  """

  def _partitioner(shape, dtype):
    """Partitioner that partitions shards to have max_shard_bytes total size.

    Args:
      shape: A  TensorShape .
      dtype: A  DType .

    Returns:
      A tuple representing how much to slice each axis in shape.

    Raises:
      ValueError: If shape is not a fully defined  TensorShape  or dtype is not
        a  DType .
    """
    if dtype.base_dtype == dtypes.string:
      element_size = bytes_per_string_element
    else:
      element_size = dtype.size

    partitions = [1] * shape.ndims
    bytes_per_slice = 1.0 * (
        shape.num_elements() / shape.dims[axis].value) * element_size
    # How many slices can we fit on one shard of size at most max_shard_bytes?
    # At least one slice is required.
    slices_per_shard = max(1, math.floor(max_shard_bytes / bytes_per_slice))
    # How many shards do we need for axis given that each shard fits
    # slices_per_shard slices from a total of shape[axis] slices?
    axis_shards = int(math.ceil(
        1.0 * shape.dims[axis].value / slices_per_shard))
    if max_shards:
      axis_shards = min(max_shards, axis_shards)

    partitions[axis] = axis_shards

    return partitions

  return _partitioner
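The docstring examples of MaxSizePartitioner can likewise be checked with a pure-Python restatement of this _partitioner's arithmetic. axis_size_partition is a hypothetical helper that takes bytes_per_element directly instead of a dtype.

```python
import math

def axis_size_partition(shape, bytes_per_element, max_shard_bytes,
                        axis=0, max_shards=None):
    """Pure-Python restatement of the arithmetic inside
    variable_axis_size_partitioner._partitioner."""
    num_elements = 1
    for d in shape:
        num_elements *= d
    partitions = [1] * len(shape)
    # Size of one slice along `axis` (all the other dimensions together).
    bytes_per_slice = 1.0 * (num_elements / shape[axis]) * bytes_per_element
    # How many slices fit in one shard of at most max_shard_bytes?
    slices_per_shard = max(1, math.floor(max_shard_bytes / bytes_per_slice))
    axis_shards = int(math.ceil(1.0 * shape[axis] / slices_per_shard))
    if max_shards:
        axis_shards = min(max_shards, axis_shards)
    partitions[axis] = axis_shards
    return partitions
```

For a float32 variable of shape [6, 1]: each slice is 4 bytes, so max_shard_bytes=4 gives 6 shards, adding max_shards=2 caps it at 2, and max_shard_bytes=1024 fits everything in a single shard.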

2.4 ShardedVariableMixin

As mentioned above, the core lies in ShardedVariableMixin, which we now analyze. Its main member variables are:

  • _variables : the shard variables.

  • _var_offsets : each shard variable's offset within the ShardedVariableMixin; i.e., _variables is viewed as one whole, and offsets are used to locate the corresponding data in it.

  • _shape : the shape of the ShardedVariableMixin.

  • _name : the name of the ShardedVariableMixin.

class ShardedVariableMixin(trackable.Trackable):
  """Mixin for ShardedVariable."""

  def __init__(self,
               variables: Sequence[variables_lib.Variable],
               name='ShardedVariable'):
    """Treats  variables  as shards of a larger Variable.

    Args:
      variables: A list of  ResourceVariable s that comprise this sharded
        variable. Variables should not be shared between different
         ShardedVariableMixin  objects.
      name: String. Name of this container. Defaults to "ShardedVariable".
    """
    super(ShardedVariableMixin, self).__init__()
    self._variables = variables
    self._name = name

    var_dtypes = {v.dtype for v in variables}
    first_var = variables[0]
    self._dtype = first_var.dtype

    # All variables must have the same shape for axes > 0.
    # Compute the overall shape
    higher_dim_shapes = {tuple(v.shape.as_list()[1:]) for v in variables}
    first_dim = sum(int(v.shape.as_list()[0]) for v in variables)
    self._shape = tensor_shape.TensorShape([first_dim] +
                                           first_var.shape.as_list()[1:])
    
    # Compute each shard's offset within the whole
    self._var_offsets = [
        [0 for _ in range(len(first_var.shape))] for _ in range(len(variables))
    ]
    for i in range(1, len(variables)):
      # Always partition on the first axis. Offsets on other axes are 0.
      self._var_offsets[i][0] += (
          self._var_offsets[i - 1][0] + variables[i - 1].shape.as_list()[0])

    save_slice_info = [v._get_save_slice_info() for v in variables]  

    # We create an uninitialized saving_variable with the full shape, which can
    # be later captured in signatures so that the signatures can treat this
    # ShardedVariable as one single variable.
    self._saving_variable = resource_variable_ops.UninitializedVariable(
        shape=self._shape, dtype=self._dtype, name=self._name)

2.4.1 Usage

Let us look at usage with the following example.

variables = [
  tf.Variable(np.array([[3, 2]]), shape=(1, 2), dtype=tf.float32,),
  tf.Variable(np.array([[3, 2], [0, 1]]),  shape=(2, 2), dtype=tf.float32),
  tf.Variable(np.array([[3, 2]]),  shape=(1, 2), dtype=tf.float32)
]
sharded_variable = ShardedVariableMixin(variables)

The internal members of sharded_variable print as follows. As we can see, _var_offsets treats all the parameter shards as one whole and uses offsets to locate each shard within it.

_shape = {TensorShape: 2} (4, 2)
_var_offsets = {list: 3} [[0, 0], [1, 0], [3, 0]]
first_dim = {int} 4

In the example above, the three variables packed together look like the following; users can use an offset to look up data in it.

[[3, 2], [3, 2], [0, 1], [3, 2]]
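The offset computation shown in __init__ can be sketched independently of TensorFlow. shard_offsets is a hypothetical helper that takes plain shape tuples; it reproduces the running-sum logic over the first axis.

```python
def shard_offsets(shard_shapes):
    """Sketch of the _var_offsets computation in ShardedVariableMixin:
    shards are concatenated along axis 0, so each shard's offset is the
    running sum of the first dimensions before it (other axes stay 0)."""
    offsets = [[0] * len(shard_shapes[0]) for _ in shard_shapes]
    for i in range(1, len(shard_shapes)):
        # Always partition on the first axis; offsets on other axes are 0.
        offsets[i][0] = offsets[i - 1][0] + shard_shapes[i - 1][0]
    return offsets

# The shapes from the usage example: (1, 2), (2, 2), (1, 2).
offsets = shard_offsets([(1, 2), (2, 2), (1, 2)])
```

This reproduces the printed value _var_offsets = [[0, 0], [1, 0], [3, 0]] from the example above.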

Let us look at another illustration. Suppose the parameter has four shards; then it looks as follows:

Figure 20 Sharding

If the variables are all placed on parameter servers, it looks like this:

Figure 21 Shards on parameter servers

2.4.2 Getting a Partition

Next let us see how to get a partition, i.e., how to extract a specified region of the sharded variable as a tensor. The logic is: parse the incoming spec, then process the sharded variable according to the spec's contents to obtain the parameter partition.

  def __getitem__(self, slice_spec):
    """Extracts the specified region as a Tensor from the sharded variable.

    The API contract is identical to  Tensor.__getitem__ . Assignment to the
    sliced range is not yet supported.

    Args:
      slice_spec: The arguments to __getitem__, specifying the global slicing of
        the sharded variable.

    Returns:
      The appropriate slice of tensor based on  slice_spec .

    Raises:
      IndexError: If a slice index is out of bound.
      TypeError: If  spec_spec  contains Tensor.
    """

    # Get the slicing spec
    if (isinstance(slice_spec, bool) or (isinstance(slice_spec, ops.Tensor) and
                                         slice_spec.dtype == dtypes.bool) or
        (isinstance(slice_spec, np.ndarray) and slice_spec.dtype == bool)):
      tensor = _var_to_tensor(self)
      return array_ops.boolean_mask(tensor=tensor, mask=slice_spec)

    if not isinstance(slice_spec, (list, tuple)):
      slice_spec = (slice_spec,)

    s = slice_spec[0]
    if isinstance(s, slice):
      # If it is a slice object, decompose it into per-variable specs
      first_dim_slice_specs = self._decompose_slice_spec(s)
      values = []
      for i, var in enumerate(self._variables):
        if first_dim_slice_specs[i] is not None:
          all_dim_slice_spec = (first_dim_slice_specs[i],) + slice_spec[1:]
          values.append(var[all_dim_slice_spec])
      if s.step is not None and s.step < 0:
        values.reverse()
      if not values:
        return constant_op.constant([],
                                    dtype=self._dtype,
                                    shape=((0,) + self._shape[1:]))
      return array_ops.concat(values, axis=0)
    elif s is Ellipsis:
      return array_ops.concat([var[slice_spec] for var in self._variables],
                              axis=0)
    elif s is array_ops.newaxis:
      return array_ops.concat([var[slice_spec[1:]] for var in self._variables],
                              axis=0)[array_ops.newaxis]
    else:
      if isinstance(s, ops.Tensor):
        raise TypeError(
            'ShardedVariable: using Tensor for indexing is not allowed.')
      if s < 0:
        s += self._shape[0]
        
      # Iterate over the parameter partitions and use the offsets to extract data
      for i in range(len(self._variables)):
        if i == len(self._variables) - 1 or (s > self._var_offsets[i][0] and
                                             s < self._var_offsets[i + 1][0]):
          return self._variables[i][(s - self._var_offsets[i][0],) +
                                    slice_spec[1:]]

What does a spec look like in general? The following examples from the docstring describe it clearly.

    For example, given component variables:
      v0 = [0, 1, 2]
      v1 = [3, 4, 5]
      v2 = [6, 7, 8, 9]

    If  slice_spec  is slice(start=None, stop=None, step=None), we will have:
      v0[returned[0]] = [0, 1, 2]
      v1[returned[1]] = [3, 4, 5]
      v2[returned[2]] = [6, 7, 8, 9]
    If  slice_spec  is slice(start=2, stop=8, step=3), we will have:
      v0[returned[0]] = [2]
      v1[returned[1]] = [5]
      returned[2] == None
    If  slice_spec  is slice(start=9, stop=3, step=-2), we will have:
      returned[0] == None
      v1[returned[1]] = [5]
      v2[returned[2]] = [9, 7]
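
These examples can be reproduced with a naive pure-Python sketch (illustrative only; the real _decompose_slice_spec computes local slice objects without materializing indices): enumerate the global indices selected by the slice, then bucket each one into its shard's local index range.

```python
def bucket_global_slice(shard_sizes, sl):
    """For each shard, return the local indices selected by global slice sl.

    A naive reference model of _decompose_slice_spec's behavior: materialize
    the selected global indices, then map each into its shard.
    """
    total = sum(shard_sizes)
    starts = [sum(shard_sizes[:i]) for i in range(len(shard_sizes))]
    per_shard = [[] for _ in shard_sizes]
    for g in range(total)[sl]:  # global indices picked by the slice
        # Find the last shard whose start offset is <= g.
        for i in reversed(range(len(starts))):
            if g >= starts[i]:
                per_shard[i].append(g - starts[i])
                break
    return [ix if ix else None for ix in per_shard]

# Component variables v0=[0,1,2], v1=[3,4,5], v2=[6,7,8,9].
print(bucket_global_slice([3, 3, 4], slice(None)))      # [[0, 1, 2], [0, 1, 2], [0, 1, 2, 3]]
print(bucket_global_slice([3, 3, 4], slice(2, 8, 3)))   # [[2], [2], None]
print(bucket_global_slice([3, 3, 4], slice(9, 3, -2)))  # [None, [2], [3, 1]]
```

The three calls match the three docstring examples: local index 2 in v1 is the value 5, and local indices [3, 1] in v2 are the values [9, 7].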

The code that decomposes the spec is as follows:

  def _decompose_slice_spec(self, slice_spec):
    """Decompose a global slice_spec into a list of per-variable slice_spec.

     ShardedVariable  only supports first dimension partitioning, thus
     slice_spec  must be for first dimension.

    Args:
      slice_spec: A python  slice  object that specifies the global slicing.

    Returns:
      A list of python  slice  objects or None specifying the local slicing for
      each component variable. None means no slicing.

    """
    result = []
    # Normalize start, end and stop.
    slice_step = slice_spec.step if slice_spec.step is not None else 1
    if slice_step == 0:
      raise ValueError('slice step cannot be zero')
    slice_start = slice_spec.start
    if slice_start is None:
      slice_start = 0 if slice_step > 0 else self._shape[0] - 1
    elif slice_start < 0:
      slice_start += self._shape[0]
    slice_end = slice_spec.stop
    if slice_end is None:
      # After the normalization, we no longer interpret negative index, thus
      # "-1" conceptually refers to the element before the first one, which
      # doesn't exist. This is to ease the decomposition code.
      slice_end = self._shape[0] if slice_step > 0 else -1
    elif slice_end < 0:
      slice_end += self._shape[0]

    # To find the local slice_spec of each component variable, we start from
    # the start of the global slice, and iterate through each variable.
    # When iterating on a variable, we move the cursor ( cur ) to the first
    # index that falls into the variable's range, which becomes the start of
    # the variable's local slice_spec. The end of the local_spec is determined
    # by using whatever is smaller between global slice end and variable range
    # end.
    cur = slice_start
    if slice_step > 0:
      for i in range(len(self._var_offsets)):
        var_start = self._var_offsets[i][0]
        var_end = (
            self._var_offsets[i + 1][0]
            if i < len(self._var_offsets) - 1 else self._shape[0])
        if cur < var_start:
          cur += slice_step * int(math.ceil((var_start - cur) / slice_step))
        if cur >= var_end or cur >= slice_end:
          result.append(None)
        else:
          start = cur - var_start
          end = min(slice_end, var_end) - var_start
          result.append(slice(start, end, slice_step))
    else:  # slice_step < 0
      for i in range(len(self._var_offsets) - 1, -1, -1):
        var_start = self._var_offsets[i][0]
        var_end = (
            self._var_offsets[i + 1][0]
            if i < len(self._var_offsets) - 1 else self._shape[0])
        if cur >= var_end:
          cur += slice_step * int(math.ceil((var_end - cur - 1) / slice_step))
        if cur < var_start or cur <= slice_end:
          result.append(None)
        else:
          start = cur - var_start
          if slice_end >= var_start:
            end = slice_end - var_start
          else:
            end = None  # no explicit end: slice until hitting the boundary.
          result.append(slice(start, end, slice_step))

      result.reverse()

    return result

2.4.3 Embedding

Next, let us look at embedding lookup. We can see that the dispatch simply adds the corresponding partition_strategy, name, validate_indices and max_norm arguments and forwards the call to embedding_ops.embedding_lookup. The partition strategy here is 'mod'.

# Override the behavior of embedding_lookup(sharded_variable, ...)
@dispatch.dispatch_for_types(embedding_ops.embedding_lookup, ShardedVariable)
def embedding_lookup(params,
                     ids,
                     partition_strategy='mod',
                     name=None,
                     validate_indices=True,
                     max_norm=None):
  if isinstance(params, list):
    params = params[0]
  return embedding_ops.embedding_lookup(params.variables, ids,
                                        partition_strategy, name,
                                        validate_indices, max_norm)

The flow arrives at embedding_lookup (tensorflow/python/ops/embedding_ops.py); we then need to follow _embedding_lookup_and_transform.

@tf_export(v1=["nn.embedding_lookup"])
@dispatch.add_dispatch_support
def embedding_lookup(
    params,
    ids,
    partition_strategy="mod",
    name=None,
    validate_indices=True,  # pylint: disable=unused-argument
    max_norm=None):
  """Looks up embeddings for the given  ids  from a list of tensors.

  This function is used to perform parallel lookups on the list of tensors in
   params .  It is a generalization of  tf.gather , where  params  is
  interpreted as a partitioning of a large embedding tensor.   params  may be
  a  PartitionedVariable  as returned by using  tf.compat.v1.get_variable() 
  with a partitioner.

  If  len(params) > 1 , each element  id  of  ids  is partitioned between
  the elements of  params  according to the  partition_strategy .
  In all strategies, if the id space does not evenly divide the number of
  partitions, each of the first  (max_id + 1) % len(params)  partitions will
  be assigned one more id.

  If the input ids are ragged tensors, partition variables are not supported and
  the partition strategy and the max_norm are ignored.
  The results of the lookup are concatenated into a dense
  tensor. The returned tensor has shape  shape(ids) + shape(params)[1:] .

  Args:
    params: A single tensor representing the complete embedding tensor, or a
      list of P tensors all of same shape except for the first dimension,
      representing sharded embedding tensors.  Alternatively, a
       PartitionedVariable , created by partitioning along dimension 0. Each
      element must be appropriately sized for the given  partition_strategy .
    ids: A  Tensor  or a 'RaggedTensor' with type  int32  or  int64  containing
      the ids to be looked up in  params .
    partition_strategy: A string specifying the partitioning strategy, relevant
      if  len(params) > 1 . Currently  "div"  and  "mod"  are supported. Default
      is  "mod" .
    name: A name for the operation (optional).
    validate_indices: DEPRECATED. If this operation is assigned to CPU, values
      in  indices  are always validated to be within range.  If assigned to GPU,
      out-of-bound indices result in safe but unspecified behavior, which may
      include raising an error.
    max_norm: If not  None , each embedding is clipped if its l2-norm is larger
      than this value.

  Returns:
    A  Tensor  or a 'RaggedTensor', depending on the input, with the same type
    as the tensors in  params .

  Raises:
    ValueError: If  params  is empty.
  """
  if isinstance(ids, ragged_tensor.RaggedTensor):
    return embedding_lookup_ragged(params, ids,
                                   partition_strategy=partition_strategy,
                                   max_norm=max_norm,
                                   name=name)

  return _embedding_lookup_and_transform(
      params=params,
      ids=ids,
      partition_strategy=partition_strategy,
      name=name,
      max_norm=max_norm,
      transform_fn=None)

_embedding_lookup_and_transform contains the concrete partitioning code. Let us first demonstrate with examples.

  • 如果 "partition_strategy "是 "mod",我们将每个id分配给分区 p = id % len(params) 。例如。
    13个ID被分割到5个分区中,结果如下: [[0, 5, 10], [1, 6, 11], [2, 7, 12], [3, 8], [4, 9]]
  • 如果 "partition_strategy "是 "div",我们会以连续的方式将ID分配给分区。在这个例子中,13个ID被分成5个分区,结果如下: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]] 。

The code is as follows:

def _embedding_lookup_and_transform(params,
                                    ids,
                                    partition_strategy="mod",
                                    name=None,
                                    max_norm=None,
                                    transform_fn=None):
  """Helper function for embedding_lookup and _compute_sampled_logits.

  This function is a generalization of embedding_lookup that optionally
  applies a caller-specified transformation to each embedding. This is
  done through the  transform_fn  argument. If provided, the function is
  applied to each partitioned tensor of retrieved embeddings, colocated
  with the embeddings. This function will be called with a single  Tensor 
  argument of the same type as the  params  tensor and should return a
   Tensor . The shape of the argument will be the same as  params  except
  for the size of the first dimension. The first dimension of the result's
  shape must be the same size as the argument's.

  Args:
    params: See embedding_lookup.
    ids: See embedding_lookup.
    partition_strategy: See embedding_lookup.
    name: See embedding_lookup.
    max_norm: See embedding_lookup.
    transform_fn: An optional function to apply to each retrieved embedding. If
      max_norm is provided, transform_fn is applied to the norm-limited
      embeddings.

  Returns:
    See embedding_lookup for details.
  Raises:
    ValueError: If  params  is empty.
  """

  with ops.name_scope(name, "embedding_lookup", params + [ids]) as name:
    # ... (the non-partitioned branch is omitted)
    else:
      # Flatten the ids. There are two cases where we need to do this.
      # - There is more than one params tensor.
      # - There is a transform_fn and ids is not statically known to be 1-D.
      #   We must flatten in this case because transform_fn expects a flat
      #   tensor of embeddings.
      flat_ids = array_ops.reshape(ids, [-1])
      original_indices = math_ops.range(array_ops.size(flat_ids))

      # Create p_assignments and set new_ids depending on the strategy.
      if partition_strategy == "mod":
        p_assignments = flat_ids % np
        new_ids = flat_ids // np
      elif partition_strategy == "div":
        # Compute num_total_ids as the sum of dim-0 of params, then assign to
        # partitions based on a constant number of ids per partition. Optimize
        # if we already know the full shape statically.
        dim_0_size = tensor_shape.Dimension(
            tensor_shape.dimension_value(params[0].get_shape()[0]))
        for p in xrange(1, np):
          dim_0_size += tensor_shape.Dimension(
              tensor_shape.dimension_value(params[p].get_shape()[0]))
        if dim_0_size.value:
          num_total_ids = constant_op.constant(dim_0_size.value, flat_ids.dtype)
        else:
          dim_0_sizes = []
          for p in xrange(np):
            param_p_dim = tensor_shape.dimension_value(params[p].get_shape()[0])
            if param_p_dim is not None:
              dim_0_sizes.append(param_p_dim)
            else:
              with ops.colocate_with(params[p]):
                dim_0_sizes.append(array_ops.shape(params[p])[0])
          num_total_ids = math_ops.reduce_sum(
              math_ops.cast(array_ops.stack(dim_0_sizes), flat_ids.dtype))
        ids_per_partition = num_total_ids // np
        extras = num_total_ids % np

        p_assignments = math_ops.maximum(flat_ids // (ids_per_partition + 1),
                                         (flat_ids - extras) //
                                         ids_per_partition)

        # Emulate a conditional using a boolean indicator tensor
        new_ids = array_ops.where(p_assignments < extras,
                                  flat_ids % (ids_per_partition + 1),
                                  (flat_ids - extras) % ids_per_partition)
      else:
        raise ValueError("Unrecognized partition strategy: " +
                         partition_strategy)

  # ... (other code omitted)

How is the embedding used? We extract the usage from the docstring: a ShardedVariable is built, and the model operates on this variable through embedding_lookup.

  >>> class Model(tf.Module):
  ...   def __init__(self):
  ...     self.sharded_variable = ShardedVariable([
  ...       tf.Variable([3.0], dtype=tf.float32),
  ...       tf.Variable([2.0], dtype=tf.float32)
  ...     ])
  ...
  ...   @tf.function(input_signature=[tf.TensorSpec([], dtype=tf.int32)])
  ...   def fn(self, x):
  ...     return tf.nn.embedding_lookup(self.sharded_variable.variables, x)
  ...
  ...   @tf.function(input_signature=[tf.TensorSpec([], dtype=tf.int32)])
  ...   def serve_fn(self, x):
  ...     return tf.nn.embedding_lookup(self.sharded_variable.variables, x)
  >>>
  >>> model = Model()
  >>> model.fn(1).numpy()
  2.0
  >>> tf.saved_model.save(model, export_dir='/tmp/saved_model',
  ...   signatures=model.serve_fn)

Depicted graphically, the worker below operates on two parameter servers in parallel to fetch embeddings.

Figure 22 Processing embeddings

2.5 Construction

For the construction of ShardedVariable, we look directly at the construction process inside ParameterServerStrategyV2.

2.5.1 Variable Sharding

To enable variable sharding, you can pass in a variable_partitioner when constructing the ParameterServerStrategy object. The variable_partitioner is invoked every time a variable is created, and it is expected to return the number of shards along each dimension of the variable. Some out-of-the-box variable_partitioners are provided, such as tf.distribute.experimental.partitioners.MinSizePartitioner. Size-based partitioners like tf.distribute.experimental.partitioners.MinSizePartitioner are recommended, to avoid partitioning small variables, which could negatively impact model training speed.

When a variable_partitioner is passed in and you create a variable directly under strategy.scope(), it becomes a container type with a variables property that provides access to the list of shards. In most cases, this container is automatically converted into a single tensor by concatenating all the shards, so it can be used like a normal variable. On the other hand, some TensorFlow methods, such as tf.nn.embedding_lookup, provide efficient implementations for this container type that avoid the automatic concatenation.
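
A variable_partitioner is just a callable that takes a shape and dtype and returns the shard count per dimension. A minimal sketch of a fixed-shards partitioner is below; fixed_shards_partitioner is a hypothetical stand-in written for illustration, not the implementation of tf.distribute.experimental.partitioners.FixedShardsPartitioner.

```python
def fixed_shards_partitioner(num_shards):
    """Return a partitioner callable that always asks for num_shards on axis 0."""
    def partitioner(shape, dtype):
        del dtype  # unused in this sketch
        # Shard only the first dimension; 1 means "do not shard" elsewhere.
        return [num_shards] + [1] * (len(shape) - 1)
    return partitioner

p = fixed_shards_partitioner(2)
print(p(shape=(100, 10), dtype="float32"))  # [2, 1]
```

The strategy calls this on every variable creation; returning 1 on every axis means the variable is not partitioned at all.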

2.5.2 Initialization

When ParameterServerStrategyV2Extended is initialized, the incoming variable_partitioner is stored in _variable_partitioner, and the numbers of parameter servers and workers are also configured.

class ParameterServerStrategyV2Extended(
    parameter_server_strategy.ParameterServerStrategyExtended):
  """Extended class for ParameterServerStrategyV2.

  Please see  tf.distribute.StrategyExtended  doc for more information.
  """

  def __init__(self, container_strategy, cluster_resolver,
               variable_partitioner):
    """Initialization of ParameterServerStrategyV2Extended."""
    super(ParameterServerStrategyV2Extended, self).__init__(container_strategy)
    self._num_ps = len(cluster_resolver.cluster_spec().as_dict().get("ps", []))
    self._num_workers = len(cluster_resolver.cluster_spec().as_dict().get(
        "worker", []))
    self._variable_count = 0

    self._variable_partitioner = variable_partitioner

2.5.3 Creation

Next, let us look at the creation process, i.e., how a variable is sharded onto different parameter servers. The idea is:

  • If no partitioner is configured, assign the variable to a parameter server with a round-robin policy (_create_variable_round_robin).
  • If a partitioner is configured, proceed as follows:
    • Do not partition rank-0 variables.
    • Obtain the number of partitions from _variable_partitioner.
    • If the number of partitions exceeds the first-dimension size, use the first-dimension size instead.
    • Compute the tensor offsets.
    • Generate the small shard tensors.
    • Place each shard with _create_variable_round_robin.
    • Build a ShardedVariable from the list of shards.

  def _create_variable(self, next_creator, **kwargs):
    """Implements StrategyExtendedV2._create_variable.

    Creates a  Variable  or a  ShardedVariable . A  ShardedVariable  will be
    created if satisfying all the following criteria:
      1.  self._variable_partitioner  results in more than one partition on the
         first axis.
      2. variable's rank is greater than 0.
      3. variable is not colocated with another variable.
    Otherwise a  Variable  will be created.

    Args:
      next_creator: See  variable_scope.variable_creator_scope ; the next
        creator in the chain.
      **kwargs: Passed through to the next creator.

    Returns:
      A  Variable  or  ShardedVariable .
    """

    var_creator = self._create_var_creator(next_creator, **kwargs)
    if "colocate_with" in kwargs:  # Never partition colocated_with variables.
      colocate_with = kwargs["colocate_with"]
      # Clear the variable scope to avoid possible conflicts between device
      # scope and colocation scope.
      with ops.device(None):
        with ops.colocate_with(colocate_with):
          var = var_creator(**kwargs)
          return var

    # If no partitioner is configured, assign the variable to a parameter
    # server with a round-robin policy
    if self._variable_partitioner is None:
      return self._create_variable_round_robin(var_creator, **kwargs)

    # Below handles the case where a partitioner is configured
    name = kwargs.get("name", None)
    initial_value = kwargs.get("initial_value", None)

    # Two cases where initial_value can be a callable:
    #   1. initial_value is passed as a callable, e.g, an  initializer  class.
    #   2. restoring from checkpoint, initial_value is a
    #     "CheckpointInitialValueCallable".
    init_from_fn = callable(initial_value)

    dtype = kwargs.get("dtype", None)
    shape = kwargs.get("shape", None)
    if init_from_fn and (shape is None or dtype is None):
      init_from_fn = False
      initial_value = initial_value()
    if not init_from_fn:
      # The initial_value is created on coordinator, it will need to be sent to
      # ps for variable initialization, which can be inefficient and can
      # potentially hit the 2GB limit on protobuf serialization.
      initial_value = ops.convert_to_tensor(initial_value, dtype=dtype)
      dtype = initial_value.dtype
      shape = initial_value.shape
    else:
      shape = tensor_shape.as_shape(shape)

    # Rank-0 variables are not partitioned
    if shape.rank == 0:  # Skip partitioning rank-0 variable.
      return self._create_variable_round_robin(var_creator, **kwargs)

    # Get the number of partitions
    num_partitions = self._variable_partitioner(shape=shape, dtype=dtype)
    if num_partitions[0] == 1:  # no partition
      return self._create_variable_round_robin(var_creator, **kwargs)

    # Cap the number of partitions at the first-dimension size
    # Use "div" partition strategy to partition the variable.
    num_partitions = min(num_partitions[0], shape[0])
    base = shape[0] // num_partitions
    
    # Compute the offsets
    extra = shape[0] % num_partitions
    # An example: num_partitions=4, shape[0]=10, partitions: [3, 3, 2, 2]
    # offsets: [0, 3, 6, 8, 10]
    offsets = []
    for i in range(num_partitions):
      if i == 0:
        offsets.append(0)
      else:
        prev_shard_size = base + (1 if i - 1 < extra else 0)
        offsets.append(offsets[i - 1] + prev_shard_size)
    offsets.append(shape[0])

    def init_shard_fn(shard_index):
      if not init_from_fn:
        return initial_value[offsets[shard_index]:offsets[shard_index + 1]]
    
      partition_shape = (offsets[shard_index + 1] -
                         offsets[shard_index],) + shape[1:]
      partition_offset = (offsets[shard_index],) + (0,) * len(shape[1:])
      arg_spec = tf_inspect.getfullargspec(initial_value)
      if ("shard_info" not in arg_spec.args and
          "shard_info" not in arg_spec.kwonlyargs):
        try:
          value = initial_value(
              partition_shape=partition_shape,
              partition_offset=partition_offset)
        except (TypeError, ValueError):
          # TypeError: Initializer doesn't accept kwargs
          # ValueError: Initializer doesn't accept partition kwargs
          # In both cases we go ahead creating the full value and then slice.
          value = initial_value()

        if value.shape == partition_shape:
          # Initializer supports partition: value is the partition value.
          return value
        else:
          # Initializer doesn't support partition: value is the full value
          # and needs to be sliced to get the partition value.
          return value[offsets[shard_index]:offsets[shard_index + 1]]
      else:
        # For compatibility with  CheckpointInitialValueCallable .
        return initial_value(
            shard_info=trackable.ShardInfo(
                shape=tensor_shape.as_shape(partition_shape),
                offset=partition_offset))

    # Generate the small shard tensors
    var_list = []
    for i in range(num_partitions):
      kwargs["shape"] = (offsets[i + 1] - offsets[i],) + shape[1:]
      kwargs["initial_value"] = lambda: init_shard_fn(i)  # initialization
      if name is not None:
        kwargs["name"] = "{}/part_{}".format(name, i)
      # Use _create_variable_round_robin to decide the placement
      var_list.append(self._create_variable_round_robin(var_creator, **kwargs))

    # Build a ShardedVariable from the list of shard tensors
    result = sharded_variable.ShardedVariable(var_list)
    return result
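
The shard-size and offset arithmetic above can be checked in isolation. The sketch below is a plain-Python restatement of the offsets loop, using the example from the code comment (num_partitions=4, shape[0]=10):

```python
def shard_offsets(first_dim, num_partitions):
    """Compute per-shard first-dimension offsets; extra rows go to the first shards."""
    base, extra = divmod(first_dim, num_partitions)
    offsets = [0]
    for i in range(num_partitions):
        # The first  extra  shards each take one extra row.
        offsets.append(offsets[-1] + base + (1 if i < extra else 0))
    return offsets

offs = shard_offsets(10, 4)
print(offs)                                       # [0, 3, 6, 8, 10]
print([offs[i + 1] - offs[i] for i in range(4)])  # [3, 3, 2, 2]
```

Shard i then owns rows offsets[i]:offsets[i+1] of the full variable, which is exactly what init_shard_fn slices out.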

In the logic above, both branches use _create_variable_round_robin, which decides the placement with a round-robin policy. In effect, it simply assigns a device name to the tensor; subsequent placement operations then follow the device name.

  def _create_variable_round_robin(self, next_creator, **kwargs):
    # Clear the colocation scope to avoid possible conflicts between device
    # scope and colocation scope.
    with ops.colocate_with(None, ignore_existing=True):
      # Explicitly set CPU:0 device for PS in case create variable is called
      # inside replica_fn and worker has with GPU:0 scope.
      with ops.device("/job:ps/task:%d/device:CPU:0" %
                      (self._variable_count % self._num_ps)):
        var = next_creator(**kwargs)
        logging.debug(
            "Creating variable (name:%s, shape:%r) on "
            "/job:ps/task:%d/device:CPU:0",
            var.name, var.shape, (self._variable_count % self._num_ps))
        self._variable_count += 1
        return var
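
The round-robin placement itself is just modular arithmetic over a running variable counter. A minimal sketch (the device-string format follows the code above; RoundRobinPlacer is a hypothetical helper written for illustration):

```python
class RoundRobinPlacer:
    """Assign each newly created variable to the next ps task in turn."""

    def __init__(self, num_ps):
        self._num_ps = num_ps
        self._variable_count = 0

    def next_device(self):
        # Same scheme as _create_variable_round_robin: counter modulo num_ps.
        device = "/job:ps/task:%d/device:CPU:0" % (
            self._variable_count % self._num_ps)
        self._variable_count += 1
        return device

placer = RoundRobinPlacer(num_ps=3)
print([placer.next_device() for _ in range(4)])
# ['/job:ps/task:0/device:CPU:0', '/job:ps/task:1/device:CPU:0',
#  '/job:ps/task:2/device:CPU:0', '/job:ps/task:0/device:CPU:0']
```

After num_ps variables, placement wraps around to task 0, spreading variables evenly across parameter servers.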

The next_creator argument of _create_variable_round_robin is generally the following method, which uses AggregatingVariable and CachingVariable to build the variable list var_list, from which the ShardedVariable is then constructed. We mainly introduce AggregatingVariable.

  def _create_var_creator(self, next_creator, **kwargs):
    aggregation = kwargs.pop("aggregation", vs.VariableAggregation.NONE)

    def var_creator(**kwargs):
      """Create an AggregatingVariable."""
      # Create and wrap the variable.
      v = next_creator(**kwargs)
      wrapped_v = ps_values.CachingVariable(v)
      wrapped = ps_values.AggregatingVariable(self._container_strategy(),
                                              wrapped_v, aggregation)
      return wrapped

    if self._num_replicas_in_sync > 1:
      if aggregation not in (
          vs.VariableAggregation.NONE,
          vs.VariableAggregation.SUM,
          vs.VariableAggregation.MEAN,
          vs.VariableAggregation.ONLY_FIRST_REPLICA
      ):
        raise ValueError("Invalid variable aggregation mode: " + aggregation +
                         " for variable: " + kwargs["name"])
      return var_creator
    else:
      def variable_creator_single_replica(**kwargs):
        v = next_creator(**kwargs)
        return ps_values.CachingVariable(v)
      return variable_creator_single_replica

2.5.4 AggregatingVariable

AggregatingVariable wraps a variable so that changes to it can be aggregated across replicas. Taking _assign_func as an example, we can see that it operates on the variable via _distribute_strategy.extended.update.

# Variable used in PSStrategy TF 1, TF2 and CentralStorageStrategy.
class AggregatingVariable(resource_variable_ops.BaseResourceVariable,
                          core.Tensor):
  """A wrapper around a variable that aggregates updates across replicas."""

  def __init__(self, strategy, v, aggregation):
    self._distribute_strategy = strategy
    self._v = v
    # NOTE: We don't use "_distributed_container" here because we don't want
    # to trigger that code path in regroup().
    v._aggregating_container = weakref.ref(self)  # pylint: disable=protected-access
    self._aggregation = aggregation

  def __deepcopy__(self, memo):
    """Perform a deepcopy of the  AggregatingVariable .

    Unlike the deepcopy of a regular tf.Variable, this keeps the original
    strategy and devices of the  AggregatingVariable .  To avoid confusion
    with the behavior of deepcopy on a regular  Variable  (which does
    copy into new devices), we only allow a deepcopy of a  AggregatingVariable 
    within its originating strategy scope.

    Args:
      memo: The memoization object for  deepcopy .

    Returns:
      A deep copy of the current  AggregatingVariable .

    Raises:
      RuntimeError: If trying to deepcopy into a different strategy.
    """
    with ds_context.enter_or_assert_strategy(self._distribute_strategy):
      v = copy.deepcopy(self._v, memo)

    copied_variable = type(self)(
        strategy=self._distribute_strategy,
        v=v,
        aggregation=self._aggregation)

    memo[id(self)] = copied_variable

    return copied_variable

  def get(self):
    return self._v

  @property
  def distribute_strategy(self):
    return self._distribute_strategy

  def __getattr__(self, name):
    return getattr(self._v, name)

  def _assign_func(self, *args, **kwargs):
    with ds_context.enter_or_assert_strategy(self._distribute_strategy):
      f = kwargs.pop("f")
      if ds_context.in_cross_replica_context():
        if distribute_lib.get_update_replica_id() is not None:
          # We are calling an assign function in an update context.
          return f(self._v, *args, **kwargs)

        # We are calling an assign function in cross replica context, wrap it in
        # an update call.
        return self._distribute_strategy.extended.update(
            self, f, args=args, kwargs=kwargs)
      else:
        replica_context = ds_context.get_replica_context()
        # We are calling an assign function in replica context.
        # We reduce the value we want to assign/add/sub. More details about how
        # we handle the different use cases can be found in the _reduce method.
        # We call the function with the reduced value.
        if self._aggregation == vs.VariableAggregation.NONE:
          raise ValueError(
              values_util.aggregation_error_msg.format(
                  variable_type="AggregatingVariable"))

        def merge_fn(strategy,
                     value,
                     use_locking=False,
                     name=None,
                     read_value=True):
          v = values_util.apply_aggregation(strategy, value, self._aggregation,
                                            self)
          if name and isinstance(name, values.PerReplica):
            name = name.values[0]
          return strategy.extended.update(
              self,
              f,
              args=(v,),
              kwargs={
                  "use_locking": use_locking,
                  "name": name,
                  "read_value": read_value
              })
        return replica_context.merge_call(merge_fn, args=args, kwargs=kwargs)

2.6 Usage

The following example shows how ShardedVariable is used. A ShardedVariable, self.w, is built inside Dense; its shape is [100, 10], and after partitioning it becomes two (50, 10) tensors.

  class Dense(tf.Module):
    def __init__(self, name=None):
      super().__init__(name=name)
      self.w = tf.Variable(tf.random.normal([100, 10]), name='w')

    def __call__(self, x):
      return x * self.w

  # Partition the dense layer into 2 shards.
  variable_partitioner = (
    tf.distribute.experimental.partitioners.FixedShardsPartitioner(
      num_shards = 2))
  strategy = tf.distribute.experimental.ParameterServerStrategy(
    cluster_resolver=...,
    variable_partitioner = variable_partitioner)
  with strategy.scope():
    dense = Dense()  # Created inside the strategy scope, so the variable is automatically split into 2 shards.
    
  assert len(dense.variables) == 2
  assert isinstance(dense.variables[0], tf.Variable)
  assert isinstance(dense.variables[1], tf.Variable)
  assert dense.variables[0].shape == (50, 10)
  assert dense.variables[1].shape == (50, 10)

ShardedVariable is also a form of model parallelism: for example, the matrix AB is split onto two parameter servers, each part is multiplied with C, and finally the partial results are aggregated on the worker and concatenated into the final result tensor.

Figure 23 Merging tensors

0xFF References

tensorflow源码解析之distributed_runtime

TensorFlow分布式训练

TensorFlow内核剖析

源代码

Tensorflow分布式原理理解

TensorFlow架构与设计:概述

Tensorflow 跨设备通信

TensorFlow 篇 | TensorFlow 2.x 分布式训练概览

《用TensorFlow 2.4 实现分布式训练》周玥枫 https://www.bilibili.com/video/BV1MT4y1M7Ym

深入 TensorFlow:参数服务器训练 https://www.bilibili.com/video/BV1u5411H798
