Python Bigtable client spends ages on deepcopy

【Question title】: Python Bigtable client spends ages on deepcopy
【Posted】: 2019-01-06 16:23:49
【Question description】:

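I have a Python Kafka consumer that is sometimes very slow when reading from Bigtable. It reads a row from Bigtable, performs some calculations, occasionally writes some information back, and then moves on.

From a 1 vCPU VM in GCE, reads and writes are extremely fast, and the consumer chews through 100-150 messages/s. No problem there.

However, when deployed on the production Kubernetes cluster (GKE), which is multi-zonal (europe-west1-b/c/d), it gets through roughly 0.5 messages/s. Yes, that is 2 s per message.

Bigtable is in europe-west1-d, but pods scheduled on nodes in the same zone (d) perform the same as pods on nodes in the other zones, which is odd.

The pod constantly hits its CPU limit (1 vCPU). Profiling the program shows that most of the time (95%) is spent inside the PartialRowData.cells() function, in copy.py:132(deepcopy).

It uses the newest google-cloud-bigtable==0.29.0 package.

I understand that the package is in alpha, but what factor reduces performance so dramatically, by about 300x?

The code that reads the row data looks like this: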

def _row_to_dict(cls, row):
    if row is None:
        return {}

    item_dict = {}

    if COLUMN_FAMILY in row.cells:
        structured_cells = {}
        for field_name in STRUCTURED_STATIC_FIELDS:
            if field_name.encode() in row.cells[COLUMN_FAMILY]:
                structured_cells[field_name] = row.cells[COLUMN_FAMILY][field_name.encode()][
                    0].value.decode()
        item_dict[COLUMN_FAMILY] = structured_cells

    return item_dict

where the row passed in comes from

row = self.bt_table.read_row(row_key, filter_=filter_)

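There can be around 50 STRUCTURED_STATIC_FIELDS.

Does the deepcopy itself really take that long, or is it waiting for data to be transferred from Bigtable? Am I misusing the library somehow? Any pointers on how to improve performance?

Thanks a lot in advance.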

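【Comments】:

Your best bet is to create an issue here: github.com/GoogleCloudPlatform/google-cloud-python/issues
@SolomonDuskis A better method, such as PartialRowData.cell_value, has already been implemented on the master branch and avoids this. It is not available in version 0.29, though, which is the latest release on pip.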

Tags: google-cloud-platform google-kubernetes-engine bigtable google-cloud-bigtable

【Solution 1】:

It turns out the library defines the getter for row.cells as:

@property
def cells(self):
    """Property returning all the cells accumulated on this partial row.

    :rtype: dict
    :returns: Dictionary of the :class:`Cell` objects accumulated. This
              dictionary has two-levels of keys (first for column families
              and second for column names/qualifiers within a family). For
              a given column, a list of :class:`Cell` objects is stored.
    """
    return copy.deepcopy(self._cells)

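So every access to row.cells performed a deepcopy of the whole dictionary in addition to the lookup. With about 50 fields, and two accesses per field in the loop, that is on the order of 100 deep copies per row.

Adding a

row_cells = row.cells

and referring only to row_cells from then on fixed the issue.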
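For illustration, here is a minimal sketch of that fix applied to the function from the question. FakeRow is a hypothetical stand-in that mimics the deep-copying cells property quoted above, so the snippet runs without a Bigtable instance. The COLUMN_FAMILY value, the two field names, and the copies counter are assumptions made up for the example, not part of the library.

```python
import copy
from collections import namedtuple

COLUMN_FAMILY = "cf"                         # hypothetical family name
STRUCTURED_STATIC_FIELDS = ["name", "city"]  # hypothetical field names

Cell = namedtuple("Cell", ["value"])         # simplified stand-in for a Bigtable cell


class FakeRow:
    """Hypothetical stand-in for PartialRowData: `cells` deep-copies on every access."""

    def __init__(self, cells):
        self._cells = cells
        self.copies = 0  # counts how often the getter ran (illustration only)

    @property
    def cells(self):
        self.copies += 1
        return copy.deepcopy(self._cells)


def row_to_dict(row):
    if row is None:
        return {}

    # Read the property once; every further lookup uses the local copy.
    row_cells = row.cells

    item_dict = {}
    family = row_cells.get(COLUMN_FAMILY)
    if family is not None:
        structured_cells = {}
        for field_name in STRUCTURED_STATIC_FIELDS:
            cells = family.get(field_name.encode())
            if cells:
                # As in the question, take the first cell in the list.
                structured_cells[field_name] = cells[0].value.decode()
        item_dict[COLUMN_FAMILY] = structured_cells
    return item_dict


row = FakeRow({"cf": {b"name": [Cell(b"Ada")], b"city": [Cell(b"London")]}})
result = row_to_dict(row)
print(result, "deep copies:", row.copies)
```

With the original loop, the getter would run once for the family check and up to twice per field; here it runs once per row regardless of the number of fields.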

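The performance gap between the dev and prod environments was also exaggerated by the fact that the prod table already held many more timestamps/versions per cell, whereas the dev table had only a couple. This made the returned dictionaries, which had to be deep-copied, much larger.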
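A rough way to see this effect locally is to time copy.deepcopy on dictionaries shaped like row.cells with different numbers of versions per column. This is only a sketch: the Cell namedtuple, the "cf" family name, and the counts (50 columns, 2 versus 200 versions) are assumptions chosen for illustration, not measurements from the tables in question.

```python
import copy
import timeit
from collections import namedtuple

# Simplified stand-in for a Bigtable cell (assumption for this sketch).
Cell = namedtuple("Cell", ["value", "timestamp_micros"])


def build_cells(n_columns, n_versions):
    """Build a dict shaped like row.cells: family -> qualifier -> list of cells."""
    return {
        "cf": {
            f"field_{i}".encode(): [Cell(b"x" * 20, ts) for ts in range(n_versions)]
            for i in range(n_columns)
        }
    }


dev_like = build_cells(n_columns=50, n_versions=2)     # only a couple of versions
prod_like = build_cells(n_columns=50, n_versions=200)  # many accumulated versions

for label, cells in (("dev-like", dev_like), ("prod-like", prod_like)):
    runs = 5
    seconds = timeit.timeit(lambda: copy.deepcopy(cells), number=runs) / runs
    total = sum(len(versions) for versions in cells["cf"].values())
    print(f"{label}: {total} cells, {seconds * 1e3:.2f} ms per deepcopy")
```

The absolute timings depend on the machine, but the cost of each copy grows with the total number of cells, and that cost is paid again on every access to the property.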

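Chaining the existing filter with a CellsColumnLimitFilter helped even further, because only the most recent cell per column is returned and there is less to copy:

from google.cloud.bigtable.row_filters import CellsColumnLimitFilter, RowFilterChain

filter_ = RowFilterChain(filters=[filter_, CellsColumnLimitFilter(num_cells=1)])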

【Comments】:

I raised an issue to discuss this again: github.com/GoogleCloudPlatform/google-cloud-python/issues/5725