【Question】: Assistance with troubleshooting when creating a rook-ceph cluster on a single node
【Posted】: 2021-03-18 23:37:47
【Description】:

I know you are not supposed to create a Ceph cluster on a single node. But this is just a small private project, so I have neither the resources for, nor any need of, a real multi-node cluster.

Still, I would like to set one up, and I am running into some problems. Right now my cluster is down and reports the following health issues.

[root@rook-ceph-tools-6bdcd78654-vq7kn /]# ceph status
  cluster:
    id:     12d9fbb9-73f3-4229-9ef4-6b7670324629
    health: HEALTH_WARN
            Reduced data availability: 33 pgs inactive
            68 slow ops, oldest one blocked for 26686 sec, osd.0 has slow ops
 
  services:
    mon: 1 daemons, quorum g (age 15m)
    mgr: a(active, since 44m)
    osd: 1 osds: 1 up (since 8m), 1 in (since 9m)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   1.0 GiB used, 465 GiB / 466 GiB avail
    pgs:     100.000% pgs unknown
             33 unknown

[root@rook-ceph-tools-6bdcd78654-vq7kn /]# ceph health detail
HEALTH_WARN Reduced data availability: 33 pgs inactive; 68 slow ops, oldest one blocked for 26691 sec, osd.0 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 33 pgs inactive
    pg 2.0 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.0 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.2 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.3 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.4 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.5 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.6 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.7 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.8 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.9 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.a is stuck inactive for 44m, current state unknown, last acting []
    pg 3.b is stuck inactive for 44m, current state unknown, last acting []
    pg 3.c is stuck inactive for 44m, current state unknown, last acting []
    pg 3.d is stuck inactive for 44m, current state unknown, last acting []
    pg 3.e is stuck inactive for 44m, current state unknown, last acting []
    pg 3.f is stuck inactive for 44m, current state unknown, last acting []
    pg 3.10 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.11 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.12 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.13 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.14 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.15 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.16 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.17 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.18 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.19 is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1a is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1b is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1c is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1d is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1e is stuck inactive for 44m, current state unknown, last acting []
    pg 3.1f is stuck inactive for 44m, current state unknown, last acting []
[WRN] SLOW_OPS: 68 slow ops, oldest one blocked for 26691 sec, osd.0 has slow ops

ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

kubeadm version: &version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:56:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

If anyone knows where to start or how to fix this, please help!

【Discussion】:

  • There are some defaults at play here, e.g. a replication size of 3 for new pools (Ceph is designed as a fault-tolerant storage system, so it expects redundancy). That means you need three OSDs before all PGs can become active. Add two more disks and your cluster will most likely reach a healthy state. If you cannot add more disks, you can try reducing the pool's size and min_size to 1 (which is dangerous); for that you also need the setting osd_crush_chooseleaf_type = 0. In general, though, if you have no redundancy, why use Ceph at all instead of a disk with a regular filesystem?
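As a rough sketch of the single-replica route the commenter describes: the commands below would be run from inside the rook-ceph toolbox pod (the same place the `ceph status` output above came from). The pool name `replicapool` is a placeholder, not taken from this cluster, and note that Octopus guards pool size 1 behind an extra config flag.

```shell
# List the actual pool names first; "replicapool" below is a placeholder.
ceph osd pool ls

# Octopus refuses size=1 unless this monitor guard is lifted:
ceph config set global mon_allow_pool_size_one true

# Drop replication to a single copy (dangerous: any disk failure loses the data):
ceph osd pool set replicapool size 1 --yes-i-really-mean-it
ceph osd pool set replicapool min_size 1
```

With only one OSD, the CRUSH map must also be allowed to place replicas below the host level, which is what the commenter's `osd_crush_chooseleaf_type = 0` setting does; in Rook that is typically supplied via the ceph.conf override ConfigMap before the cluster is created.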

Tags: linux kubernetes ceph kubernetes-rook


【Solution 1】:

Yes, I agree with eblock above. If you want three replicas of each object, you should have at least 3 OSDs (at minimum 3 disks, or 3 volumes... whatever). The objects of a placement group are stored on a set of OSDs; placement groups do not own their OSDs exclusively, they share them with other placement groups from the same pool, or even from other pools.

  • If one OSD fails, all the copies of objects it contained are lost. For every object in the placement group, the number of replicas suddenly drops from three to two. Ceph starts recovering the placement group by choosing a new OSD on which to re-create the third copy of each object.

  • If a second OSD in the same placement group fails before the new OSD is fully populated with the third copies, some objects are left with only a single surviving copy.

  • If a third OSD in the same placement group fails before recovery completes, and that OSD held the only remaining copy of an object, the object is permanently lost.

This is why it is so important to choose the right PG count when creating a pool:

Total PGs = (OSDs × 100) / pool size

where pool size is the number of replicas (3 in this case).
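The rule of thumb above can be sketched in a few lines of Python. The power-of-two rounding step follows the usual Ceph guidance, and the function name `pg_count` is mine:

```python
def pg_count(osds: int, pool_size: int) -> int:
    """Rule-of-thumb PG count: (OSDs * 100) / pool size,
    rounded up to the next power of two."""
    raw = osds * 100 / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power

# Three OSDs with 3 replicas: 3 * 100 / 3 = 100, next power of two is 128.
print(pg_count(3, 3))   # -> 128
# Nine OSDs with 3 replicas: 300 -> 512.
print(pg_count(9, 3))   # -> 512
```

This is only a starting value; pools can be created with it and adjusted later, and recent Ceph releases can autoscale PG counts for you.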

【Comments】:
