Google Cloud Bigtable vs Google Cloud Datastore: Answers

【Question Title】: Google Cloud Bigtable vs Google Cloud Datastore
【Posted】: 2015-07-17 02:19:30
【Question】:

What is the difference between Google Cloud Bigtable and Google Cloud Datastore / the App Engine datastore, and what are the main practical advantages and disadvantages? AFAIK Cloud Datastore is built on top of Bigtable.

【Comments】:

Please don't close this. There is currently no official documentation on this, and Google may comment here.

Tags: google-app-engine google-cloud-platform google-cloud-datastore google-cloud-bigtable

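【Answer 1】:

Based on experience with Datastore and on reading the Bigtable docs, the main differences are:

Bigtable was originally designed for HBase compatibility, but now has client libraries in multiple languages. Datastore was originally geared more towards Python/Java/Go web application developers (originally App Engine).
Bigtable is "a bit more IaaS" than Datastore: it isn't "just there" but requires a cluster to be configured.
Bigtable supports only one index, the "row key" (the entity key in Datastore)
- This means queries are on the key, unlike Datastore's indexed properties
Bigtable supports atomicity only on a single row; there are no transactions
Mutations and deletions that span multiple rows appear not to be atomic in Bigtable, whereas Datastore provides eventual or strong consistency, depending on the read/query method
The billing models are very different:
- Datastore charges for read/write operations, storage and bandwidth
- Bigtable charges for "nodes", storage and bandwidth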
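
To make the indexing difference above concrete, here is a minimal sketch in plain Python. It does not use either service's client library: `ToyRowKeyTable` and `ToyIndexedStore` are hypothetical in-memory stand-ins, and the `user#...` key layout is an assumption chosen for illustration. The first class can only be read by row key or key prefix, so filtering on a column value means examining every row; the second maintains a per-property index, which is roughly what allows queries on indexed properties.

```python
import bisect
from collections import defaultdict


class ToyRowKeyTable:
    """Hypothetical stand-in for a row-key-only store (not the Bigtable API)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> dict of column values

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = dict(columns)

    def scan_prefix(self, prefix):
        # Binary search to the first key >= prefix, then read contiguously.
        start = bisect.bisect_left(self._keys, prefix)
        matched = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            matched.append(key)
        return matched

    def filter_by_column(self, name, value):
        # No secondary index: every row has to be examined.
        return [k for k in self._keys if self._rows[k].get(name) == value]


class ToyIndexedStore:
    """Hypothetical stand-in for a store with per-property indexes."""

    def __init__(self):
        self._entities = {}
        self._index = defaultdict(set)   # (property, value) -> entity keys

    def put(self, key, props):
        # Remove stale index entries before writing the new version.
        for item in self._entities.get(key, {}).items():
            self._index[item].discard(key)
        self._entities[key] = dict(props)
        for item in props.items():
            self._index[item].add(key)

    def query(self, name, value):
        # Index lookup: cost depends on the number of matches, not total size.
        return sorted(self._index.get((name, value), ()))


rows = {
    "user#alice#2015-07-01": {"city": "Paris"},
    "user#alice#2015-07-02": {"city": "Oslo"},
    "user#bob#2015-07-01": {"city": "Paris"},
}
table, store = ToyRowKeyTable(), ToyIndexedStore()
for key, columns in rows.items():
    table.put(key, columns)
    store.put(key, columns)

alice_keys = table.scan_prefix("user#alice#")          # key-range read
paris_scan = table.filter_by_column("city", "Paris")   # full scan
paris_index = store.query("city", "Paris")             # index lookup
```

Both `paris_scan` and `paris_index` return the same keys; the difference is how much data each has to touch to get there, which is why row key design matters so much when the row key is the only index.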

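【Comments】:

【Answer 2】:

Bigtable is optimized for high volumes of data and analytics

Cloud Bigtable does not replicate data across zones or regions (data within a single cluster is replicated and durable), which means Bigtable is faster, more efficient and cheaper, though it is less durable and less available in the default configuration
It uses the HBase API, so there is no lock-in risk and no new paradigm to learn
It integrates with open-source big data tools, meaning you can analyze data stored in Bigtable with most of the analytics tools customers use (Hadoop, Spark, etc.)
Bigtable is indexed by a single Row Key
Bigtable lives in a single zone

Cloud Bigtable is designed for large companies and enterprises with large data volumes and complex backend workloads.

Datastore is optimized to serve high-value transactional data to applications

Cloud Datastore has extremely high availability through replication and data synchronization
Datastore is more expensive because of its versatility and high availability
Datastore is slower at writing data because of synchronous replication
Datastore has better functionality around transactions and queries (because secondary indexes exist)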

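【Comments】:

Bigtable can now replicate across regions to provide availability during a regional outage: cloudplatform.googleblog.com/2018/07/…
I don't think transactions are a strong selling point for Datastore. From its doc (cloud.google.com/datastore/docs/concepts/transactions): "A transaction is a set of Google Cloud Datastore operations on one or more entities in up to 25 entity groups." Also, Datastore is built on top of Bigtable, right?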

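【Answer 3】:

Bigtable and Datastore are extremely different. Yes, Datastore is built on top of Bigtable, but that does not make it anything like Bigtable. That is a bit like saying a car is built on top of wheels, so a car is not much different from wheels.

Bigtable and Datastore provide very different data models and very different semantics for how data is changed.

The main difference is that Datastore provides SQL-database-like ACID transactions on subsets of the data known as entity groups (though the query language, GQL, is much more restrictive than SQL). Bigtable is strictly NoSQL and comes with much weaker guarantees.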
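
As a rough illustration of "ACID within an entity group", here is a toy in-memory model in plain Python. It is not how Datastore or Megastore is implemented and it is not the Datastore client API: `ToyEntityGroupStore`, the `household-1` group and the account names are all hypothetical. It only shows the visible behaviour the answer describes: changes to several entities in one group either all commit or all roll back.

```python
import threading
from collections import defaultdict


class ToyEntityGroupStore:
    """Hypothetical model: each entity group is a small, separately locked map."""

    def __init__(self):
        self._groups = defaultdict(dict)            # group -> {key: value}
        self._locks = defaultdict(threading.Lock)   # one lock per group

    def get(self, group, key):
        return self._groups[group].get(key)

    def put(self, group, key, value):
        with self._locks[group]:
            self._groups[group][key] = value

    def transact(self, group, update):
        # Apply `update` to a private copy; publish it only if no error occurs.
        with self._locks[group]:
            working = dict(self._groups[group])
            update(working)                  # may raise: nothing is committed
            self._groups[group] = working


def transfer(amount):
    def update(accounts):
        accounts["bob"] += amount            # credit first ...
        if accounts["alice"] < amount:       # ... then discover the problem
            raise ValueError("insufficient funds")
        accounts["alice"] -= amount
    return update


store = ToyEntityGroupStore()
store.put("household-1", "alice", 100)
store.put("household-1", "bob", 0)

store.transact("household-1", transfer(30))       # commits both changes
try:
    store.transact("household-1", transfer(500))  # fails halfway through
except ValueError:
    pass                                          # the partial credit is discarded
```

In this sketch the guarantee stops at the group boundary: nothing coordinates updates across two different groups, which mirrors the point that transactions are scoped to entity groups rather than to the whole dataset.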

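【Comments】:

You were doing well until the last paragraph. Datastore offers transactions, but they are nothing like SQL and definitely not ACID.
@DanielRoseman Actually, they are. Here are quotes from the paper on Megastore (which Datastore is built on): "Each Megastore entity group functions as a mini-database that provides serializable ACID semantics." "We partition the datastore and replicate each partition separately, providing full ACID semantics within partitions". (research.google.com/pubs/pub36971.html)
I think calling it SQL-like is misleading. It is a subset at most. There is no efficient count/group by, all queries must use an index, etc.
Query language and transaction isolation are different things, and you seem to be mixing them up. I am making a claim about the latter (ACID transactions). In your comment you assume I am talking about the former. Perhaps some hyphens would make it clearer? I will mention the query language issue explicitly to remove any doubt.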

【Answer 4】:

I will try to summarize all the answers above, plus what is given in the Coursera course Google Cloud Platform Big Data and Machine Learning Fundamentals

+---------------------+-------------------------------------------+---------------------------------------+
| Category            | Bigtable                                  | Datastore                             |
+---------------------+-------------------------------------------+---------------------------------------+
| Technology          | HBase-compatible (exposes the HBase API)  | Built on Bigtable                     |
| Access metaphor     | Key/value (column families), like HBase   | Persistent hashmap                    |
| Read                | Scan rows                                 | Filter objects on property            |
| Write               | Put row                                   | Put object                            |
| Update granularity  | Row (write a new row, no in-place update) | Single attribute                      |
| Capacity            | Petabytes                                 | Terabytes                             |
| Index               | Row key only (design the key carefully)   | Any property of the object            |
| Usage and use cases | High-throughput, scalable, flattened data | Structured data for Google App Engine |
+---------------------+-------------------------------------------+---------------------------------------+

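Also take a look at this image:

【Comments】:

【Answer 5】:

If you read the papers, Bigtable is this and Datastore is Megastore. Datastore is Bigtable plus replication, transactions and indexes (and it is much more expensive).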

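【Comments】:

Is it really more expensive? The minimum for Bigtable is 3 nodes, which is $1,400/month with a 10 GB disk. Seems high, no?
@ben, in my past experience, yes. Datastore charges per operation rather than per hour. (If you don't use it much, then yes, you won't pay much for Datastore. But if you have high traffic, I think Bigtable is much cheaper.) I think Bigtable claims 10k operations per second? In practice I found it lower, around 1-2k, but 3 nodes still give > 5k/s. If you sustain that throughput for a month and map it to Datastore pricing, it could be far above $1.4k.
The Megastore link is broken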
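
The back-of-the-envelope comparison in the comments above can be worked through as follows. The $1,400/month flat cost and the 5,000 operations/second rate come from the comments; the per-operation price is an assumption (a placeholder of $0.06 per 100,000 operations) and should be replaced with the currently published price, since pricing has changed over time. The function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 30-day month

FLAT_MONTHLY_COST = 1_400.0             # USD, figure quoted in the comments
SUSTAINED_OPS_PER_SECOND = 5_000        # rate quoted in the comments
ASSUMED_PRICE_PER_100K_OPS = 0.06       # USD, ASSUMPTION: check current pricing


def monthly_operations(ops_per_second):
    """Total operations in a month at a constant rate."""
    return ops_per_second * SECONDS_PER_MONTH


def per_operation_cost(operations, price_per_100k):
    """Monthly bill under a pay-per-operation model."""
    return operations / 100_000 * price_per_100k


def break_even_ops_per_second(flat_cost, price_per_100k):
    """Sustained rate at which pay-per-operation equals the flat cost."""
    return flat_cost / price_per_100k * 100_000 / SECONDS_PER_MONTH


operations = monthly_operations(SUSTAINED_OPS_PER_SECOND)
metered_cost = per_operation_cost(operations, ASSUMED_PRICE_PER_100K_OPS)
break_even = break_even_ops_per_second(FLAT_MONTHLY_COST,
                                       ASSUMED_PRICE_PER_100K_OPS)

print(f"operations per month: {operations:,}")
print(f"metered cost:         ${metered_cost:,.0f}")
print(f"break-even rate:      {break_even:,.0f} ops/s")
```

Under the assumed price, the metered bill at a sustained 5,000 ops/s comes to several thousand dollars a month, and the break-even point is on the order of 900 ops/s. That is consistent with both comments: per-operation billing is cheaper at low traffic, and flat per-node billing is cheaper at sustained high traffic.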

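【Answer 6】:

This may be another set of key differences between Google Cloud Bigtable, Google Cloud Datastore and other services. What the image below shows can also help you choose the right service.

【Comments】:

【Answer 7】:

One relatively minor point to consider: as of November 2016, the Bigtable Python client library is still in Alpha, which means future changes may not be backward compatible. Also, the Bigtable Python library is not compatible with App Engine's standard environment. You have to use the flexible environment.

【Comments】:

As of November 2016, the same is true for Java

【Answer 8】: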

Cloud Datastore is a highly-scalable NoSQL database for your applications.
Like Cloud Bigtable, there is no need for you to provision database instances.
Cloud Datastore uses a distributed architecture to automatically manage
scaling. Your queries scale with the size of your result set, not the size of your
data set.
Cloud Datastore runs in Google data centers, which use redundancy to
minimize impact from points of failure. Your application can still use Cloud
Datastore when the service receives a planned upgrade.

Choose Bigtable if the data is:
Big
● Large quantities (>1 TB) of semi-structured or structured data
Fast
● Data is high throughput or rapidly changing
NoSQL
● Transactions, strong relational semantics not required
And especially if it is:
Time series
● Data is time-series or has natural semantic ordering
Big data
● You run asynchronous batch or real-time processing on the data
Machine learning
● You run machine learning algorithms on the data
Bigtable is designed to handle massive workloads at consistent low latency
and high throughput, so it's a great choice for both operational and analytical
applications, including IoT, user analytics, and financial data analysis.
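
For the time-series case listed above, much of the design work goes into the row key, since that is the only index. Below is a minimal sketch in plain Python of one commonly described pattern: prefix the key with an entity identifier so that each device's readings are contiguous, and append a reversed timestamp so that the newest reading sorts first. The `sensor-42` identifier, the `#` separator and the 13-digit millisecond bound are assumptions for illustration, not something taken from the answer or required by Bigtable.

```python
# Assumed upper bound for millisecond timestamps (13 decimal digits).
MAX_TS_MS = 10**13 - 1


def make_row_key(device_id: str, ts_ms: int) -> str:
    """Build '<device>#<reversed timestamp>' so newer rows sort first."""
    reversed_ts = MAX_TS_MS - ts_ms
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{device_id}#{reversed_ts:013d}"


def parse_timestamp(row_key: str) -> int:
    """Recover the original millisecond timestamp from a row key."""
    return MAX_TS_MS - int(row_key.rsplit("#", 1)[1])


readings = [1_468_000_000_000, 1_468_000_120_000, 1_468_000_060_000]

# A row-key-ordered store would return rows in this (sorted) order.
keys = sorted(make_row_key("sensor-42", ts) for ts in readings)
timestamps_in_key_order = [parse_timestamp(key) for key in keys]
```

With this layout, "the latest N readings for one device" becomes a short prefix scan starting at `sensor-42#`. Putting the raw timestamp at the front of the key instead would send all new writes to the same end of the key space.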

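【Comments】:

【Answer 9】:

Datastore is more application-oriented and suits a wide range of services, especially microservices.

The underlying technology of Datastore is Bigtable, so you can imagine that Bigtable is the more powerful of the two.

Datastore comes with 20K free operations per day, so you can expect to host a server with a reliable database at zero cost.

You can also check out this Datastore ORM library, which has a lot of great features: https://www.npmjs.com/package/ts-datastore-orm

【Comments】: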