【Question Title】: Short circuit computation in mixture of experts model using tensorflow keras functional api
【Posted】: 2019-12-23 19:50:51
【Question Description】:

I am trying to swap between multiple different "expert" layers based on the output of a "gating" layer (as a mixture of experts). I created a custom layer that takes in the outputs of the expert and gating layers, but this ends up discarding some outputs rather than not computing them in the first place.
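In functional-API terms, the pattern described above might look like the following minimal sketch (the layer sizes and names are hypothetical, not from the question): both experts run on every example, and the gate only combines their outputs after the fact.

```python
import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(8,))
gate = tf.keras.layers.Dense(1, activation="sigmoid")(inputs)
expert_a = tf.keras.layers.Dense(4)(inputs)  # always computed
expert_b = tf.keras.layers.Dense(4)(inputs)  # always computed
# The gate only weighs the two finished outputs, so the work done
# by the less-favored expert is computed and then discarded.
mixed = gate * expert_a + (1.0 - gate) * expert_b
model = tf.keras.Model(inputs, mixed)
```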

How can I make the model "short circuit" so that only the gating layer and the selected expert layer are evaluated, to save computation time?

I am using TensorFlow 2.0 GPU and the Keras functional API.

【Question Discussion】:

    Tags: python python-3.x tensorflow tensorflow2.0 tf.keras


    【Solution 1】:

    A Keras model can be implemented fully dynamically to support the efficient routing you mentioned. The following example shows one way this can be done. It is written under the following premises:

    1. It assumes two experts (LayerA and LayerB).
    2. It assumes that the mixture-of-experts model (MixOfExpertsModel) switches dynamically between the two expert layer classes based on the per-example output of a Keras Dense layer.
    3. It satisfies the need to train the model in batches.

    See the comments in the code to understand how the switching is done.

    import numpy as np
    import tensorflow as tf
    
    
    # This is your Expert A class.
    class LayerA(tf.keras.layers.Layer):
    
      def build(self, input_shape):
        self.weight = self.add_weight("weight_a", shape=input_shape[1:])
    
      @tf.function
      def call(self, x):
        return x + self.weight
    
    
    # This is your Expert B class.
    class LayerB(tf.keras.layers.Layer):
    
      def build(self, input_shape):
        self.weight = self.add_weight("weight_b", shape=input_shape[1:])
    
      @tf.function
      def call(self, x):
        return x * self.weight
    
    
    class MixOfExpertsModel(tf.keras.models.Model):
    
      def __init__(self):
        super(MixOfExpertsModel, self).__init__()
        self._expert_a = LayerA()
        self._expert_b = LayerB()
        self._gating_layer = tf.keras.layers.Dense(1, activation="sigmoid")
    
      @tf.function
      def call(self, x):
        z = self._gating_layer(x)
        # The switching logic:
        #   - examples with gating output <= 0.5 are routed to expert A
        #   - examples with gating output > 0.5 are routed to expert B.
        mask_a = tf.squeeze(tf.less_equal(z, 0.5), axis=-1)
        mask_b = tf.squeeze(tf.greater(z, 0.5), axis=-1)
        # `input_a` is a subset of slices of the original input (`x`).
        # So is `input_b`. As such, no compute is wasted.
        input_a = tf.boolean_mask(x, mask_a, axis=0)
        input_b = tf.boolean_mask(x, mask_b, axis=0)
        if tf.size(input_a) > 0:
          output_a = self._expert_a(input_a)
        else:
          output_a = tf.zeros_like(input_a)
        if tf.size(input_b) > 0:
          output_b = self._expert_b(input_b)
        else:
          output_b = tf.zeros_like(input_b)
        # Return `mask_a` and `mask_b` so that the caller can know
        # which example is routed to which expert and whether its output
        # appears in `output_a` or `output_b`. This is necessary
        # for writing a (custom) loss function for this class.
        return output_a, output_b, mask_a, mask_b
    
    
    # Create an instance of the mix-of-experts model.
    mix_of_experts_model = MixOfExpertsModel()
    
    # Generate some dummy data.
    num_examples = 32
    xs = np.random.random([num_examples, 8]).astype(np.float32)
    
    # Call the model.
    print(mix_of_experts_model(xs))
    
    

    I did not write a custom loss function that would support the training of this class. But that is doable by using the return values of MixOfExpertsModel.call(), i.e., the outputs and the masks.
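    As a sketch of what such a loss could look like (assuming a regression target of the same shape as the expert outputs; the function name below is hypothetical), the targets can be routed with the same masks and compared against the corresponding routed outputs:

```python
import tensorflow as tf

def mix_of_experts_loss(y_true, output_a, output_b, mask_a, mask_b):
  # Route the targets the same way the inputs were routed in call().
  y_a = tf.boolean_mask(y_true, mask_a, axis=0)
  y_b = tf.boolean_mask(y_true, mask_b, axis=0)
  # Sum of squared errors over both experts, averaged over the batch.
  se = (tf.reduce_sum(tf.square(y_a - output_a)) +
        tf.reduce_sum(tf.square(y_b - output_b)))
  n = tf.cast(tf.shape(y_true)[0], tf.float32)
  return se / n
```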

    【Discussion】:

    • Thanks, this is helpful. How would you extend this model to an arbitrary number of experts / selected output layers?
    • If you want to have >2 experts, you can change the Dense layer of MixOfExpertsModel from sigmoid to softmax activation and change its output size from 1 to N, where N is the number of experts you have. Then use argmax on its output to determine which expert to use.
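    Following that suggestion, an N-expert version might be sketched as below. The class name and the use of plain Dense layers as stand-in experts are assumptions for illustration, not part of the answer above; the softmax gate and argmax routing follow the comment directly.

```python
import numpy as np
import tensorflow as tf

class MultiExpertModel(tf.keras.models.Model):

  def __init__(self, experts):
    super().__init__()
    self._experts = experts
    # Softmax gate with one logit per expert, as suggested above.
    self._gating_layer = tf.keras.layers.Dense(
        len(experts), activation="softmax")

  def call(self, x):
    z = self._gating_layer(x)
    chosen = tf.argmax(z, axis=-1)  # per-example expert index
    outputs, masks = [], []
    for i, expert in enumerate(self._experts):
      mask = tf.equal(chosen, i)
      # Only the examples routed to expert `i` are fed through it.
      outputs.append(expert(tf.boolean_mask(x, mask, axis=0)))
      masks.append(mask)
    return outputs, masks
```

    As in the two-expert version, the returned masks let a custom loss function match each routed output back to the examples it came from.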