作者：李晓辉

联系方式：

1. 微信：Lxh_Chat

2. 邮箱：939958092@qq.com

Ceph Manager (MGR)

红帽 Ceph 存储管理器 (MGR) 的作用在于收集集群统计数据。当 MGR 节点关闭时，客户端 I/O 操作会继续正常运行，但集群统计数据的查询会失败。每个集群至少部署两个 MGR，以提供高可用性

使用 ceph mgr fail 命令从活跃 MGR 手动故障转移到备用 MGR。

使用 ceph mgr stat 命令可查看 MGR 状态。

[root@serverc ~]# ceph mgr stat
{
    "epoch": 23,
    "available": true,
    "active_name": "serverc.fumvff",
    "num_standby": 1
}
[root@serverc ~]# ceph mgr fail serverc.fumvff
[root@serverc ~]# ceph mgr stat
{
    "epoch": 26,
    "available": true,
    "active_name": "servere.twqmsd",
    "num_standby": 0
}

查看启用的 MGR 模块

[root@serverc ~]# ceph mgr module ls | more
{
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator",
        "pg_autoscaler",
        "progress",
        "rbd_support",
        "status",
        "telemetry",
        "volumes"
    ],
    "enabled_modules": [
        "cephadm",
        "dashboard",
        "iostat",
        "prometheus",
        "restful"
    ],
    "disabled_modules": [
        {
            "name": "alerts",
            "can_run": true,

查看dashboard服务信息

[root@serverc ~]# ceph mgr services
{
    "dashboard": "https://172.25.250.12:8443/",
    "prometheus": "http://172.25.250.12:9283/"
}

监控集群运行状况

你可以使用 ceph health 命令快速验证集群的状态。此命令会返回下列状态之一：

HEALTH_OK 表示集群正在正常运行。
HEALTH_WARN 表示集群处于警告状态。例如，某个 OSD 停机，但有足够的 OSD 正常工作，可以让集群正常运行。
HEALTH_ERR 表示集群处于错误状态。例如，一个全满的 OSD 可能会对集群的功能造成影响。

如果 Ceph 集群处于警告或错误状态，ceph health detail 命令会提供更多详细信息。

1 2	[root@serverc ~]# ceph health detail HEALTH_OK

ceph -w 命令显示 Ceph 集群中正在发生的事件的其他实时监控信息。

[root@serverc ~]# ceph -w TAB
alerts             dashboard          healthcheck        mgr                progress           service            versions
auth               device             influx             mon                prometheus         status             zabbix
balancer           df                 insights           nfs                quorum_status      telegraf
cephadm            features           iostat             node               rbd                telemetry
config             fs                 k8sevents          orch               report             tell
config-key         fsid               log                osd                restful            test_orchestrator
crash              health             mds                pg                 rgw                time-sync-status

管理Ceph服务

容器化服务由容器主机系统上的 systemd 控制。在容器主机系统运行 systemctl 命令，以启动、停止或重新启动集群守护进程。

集群守护进程通过类型 $daemon 和守护进程 $id 来引用。$daemon 的类型是 mon、mgr、mds、osd、 rgw、rbd-mirror、crash 或 cephfs-mirror。

MON、MGR、RGW 和 MDS 的守护进程 $id 是主机名。OSD 的守护进程 $id 是 OSD ID。MDS 的守护进程 $id 是文件系统名后跟主机名。

使用 ceph orch ps 命令可列出所有集群守护进程。使用 --daemon_type=DAEMON 选项可筛选特定守护进程类型。

[root@serverc ~]# ceph orch ps --daemon_type=osd
NAME   HOST                     STATUS         REFRESHED  AGE  PORTS  VERSION           IMAGE ID      CONTAINER ID
osd.0  serverc.lab.example.com  running (36m)  5m ago     23M  -      16.2.0-117.el8cp  2142b60d7974  51f0813a7153
osd.1  serverc.lab.example.com  running (36m)  5m ago     23M  -      16.2.0-117.el8cp  2142b60d7974  3373c2f4fd1f
osd.2  serverc.lab.example.com  running (36m)  5m ago     23M  -      16.2.0-117.el8cp  2142b60d7974  5d8fad5f53ed
osd.3  serverd.lab.example.com  running (36m)  5m ago     22M  -      16.2.0-117.el8cp  2142b60d7974  d61622e0007b
osd.4  servere.lab.example.com  running (36m)  5m ago     22M  -      16.2.0-117.el8cp  2142b60d7974  137e9060a389
osd.5  serverd.lab.example.com  running (36m)  5m ago     22M  -      16.2.0-117.el8cp  2142b60d7974  189d14d860fa
osd.6  servere.lab.example.com  running (36m)  5m ago     22M  -      16.2.0-117.el8cp  2142b60d7974  f8d0f0c0f629
osd.7  serverd.lab.example.com  running (36m)  5m ago     22M  -      16.2.0-117.el8cp  2142b60d7974  36055cff31ca
osd.8  servere.lab.example.com  running (36m)  5m ago     22M  -      16.2.0-117.el8cp  2142b60d7974  c6372a72a5b3

若要在主机上停止、启动或重新启动守护进程，请使用 systemctl 命令和守护进程名称

[root@serverc ~]# systemctl list-units 'ceph*'
UNIT                                                                                 LOAD   ACTIVE SUB     DESCRIPTION
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@alertmanager.serverc.service               loaded active running Ceph alertmanager.serverc for 2ae6d05a-229>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@crash.serverc.service                      loaded active running Ceph crash.serverc for 2ae6d05a-229a-11ec->
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@grafana.serverc.service                    loaded active running Ceph grafana.serverc for 2ae6d05a-229a-11e>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mgr.serverc.lab.example.com.aiqepd.service loaded active running Ceph mgr.serverc.lab.example.com.aiqepd fo>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@mon.serverc.lab.example.com.service        loaded active running Ceph mon.serverc.lab.example.com for 2ae6d>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@node-exporter.serverc.service              loaded active running Ceph node-exporter.serverc for 2ae6d05a-22>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.0.service                              loaded active running Ceph osd.0 for 2ae6d05a-229a-11ec-925e-525>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.1.service                              loaded active running Ceph osd.1 for 2ae6d05a-229a-11ec-925e-525>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.2.service                              loaded active running Ceph osd.2 for 2ae6d05a-229a-11ec-925e-525>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@prometheus.serverc.service                 loaded active running Ceph prometheus.serverc for 2ae6d05a-229a->
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@rgw.realm.zone.serverc.bqwjcv.service      loaded active running Ceph rgw.realm.zone.serverc.bqwjcv for 2ae>
ceph-2ae6d05a-229a-11ec-925e-52540000fa0c.target                                     loaded active active  Ceph cluster 2ae6d05a-229a-11ec-925e-52540>
ceph.target                                                                          loaded active active  All Ceph clusters and services

使用ceph.target管理集群节点上所有守护进程

1	[root@serverc ~]# systemctl restart ceph.target

也可以使用ceph orch命令来管理集群服务。首先，使用ceph orch ls命令获取服务名称。例如，查找集群OSDs的服务名称，重新启动服务

1	[root@serverc ~]# ceph orch ls/ps

1 2	[root@serverc ~]# ceph orch restart prometheus Scheduled to restart prometheus.ceph1 on host 'serverc'

可以使用ceph orch daemon命令管理单个集群守护进程

1	[ceph: root@serverc /]# ceph orch daemon restart osd.1

关闭或重启集群

在重启集群或执行集群维护时，应当设置一些标志，可使用集群标志来限制集群组件故障的影响，或预防集群性能问题。

使用 ceph osd set 和 ceph osd unset 命令来管理这些标志：

noup

OSD在启动时将不会自动被标记为”up”状态

nodown

即使OSD停止，MON也不会将其标记为”down”状态

noout

此标志的作用是防止MON将任何OSD从Crush映射中删除。当对OSD进行维护时，可以设置此标志以避免CRUSH在OSD停止时自动进行数据重平衡

noin

设置noin标志可以防止数据自动分配到OSD上。这在某些维护或故障排除情况下可能很有用，以避免数据在不适当的时间被分配

norecover

阻止集群尝试恢复任何处于恢复状态的PG，这通常在维护期间使用，以避免在维护窗口期间进行数据恢复操作

nobackfill

防止集群进行数据回填操作。回填是指当一个OSD重新上线后，将之前复制到其他OSD的数据重新复制回该OSD，以保证数据副本的数量。设置此标志可以避免在OSD重新加入集群时触发数据迁移。

norebalance

阻止CRUSH算法重新平衡数据。这通常在集群扩容或缩容时使用，以避免在调整资源时进行不必要的数据迁移

noscrub

禁用PG的清理操作。清理操作是Ceph用来验证存储在同一PG内的数据副本是否一致的过程。在低带宽环境中或者在执行维护时，可以设置此标志以减少网络和磁盘I/O的负载

nodeep-scrub

与noscrub类似，nodeep-scrub是更深层次的清理禁用，它不仅阻止了清理操作，还阻止了对PG的更深入的校验，这可以在需要减少集群负载时使用

集群关机

执行下列步骤以关闭整个集群：

防止客户端访问集群。
确保集群运行正常 (HEALTH_OK)，并且所有 PG 都处于 active+clean 状态，然后继续。
关闭 CephFS。
设置 noout、norecover、norebalance、nobackfill、nodown 和 pause 标志。
关闭所有 Ceph 对象网关 (RGW) 和 iSCSI 网关。
逐一关闭 OSD 节点。
逐一关闭 MON 和 MGR 节点。
关闭管理节点。

集群开机

执行下列步骤以打开集群：

按照以下顺序执行集群开机：管理节点、MON 和 MGR 节点、OSD 节点、MDS 节点。
清除 noout、norecover、norebalance、nobackfill、nodown 和 pause 标志。
启动 Ceph 对象网关和 iSCSI 网关。
启动 CephFS。

监控集群

使用 ceph mon stat 或 ceph quorum_status -f json-pretty 命令查看 MON 仲裁状态

[root@serverc ~]# ceph mon stat
e4: 4 mons at {clienta=[v2:172.25.250.10:3300/0,v1:172.25.250.10:6789/0],serverc.lab.example.com=[v2:172.25.250.12:3300/0,v1:172.25.250.12:6789/0],serverd=[v2:172.25.250.13:3300/0,v1:172.25.250.13:6789/0],servere=[v2:172.25.250.14:3300/0,v1:172.25.250.14:6789/0]}, election epoch 46, leader 0 serverc.lab.example.com, quorum 0,1,2,3 serverc.lab.example.com,clienta,serverd,servere
[root@serverc ~]# ceph quorum_status -f json-pretty

{
    "election_epoch": 46,
    "quorum": [
        0,
        1,
        2,
        3
    ],
    "quorum_names": [
        "serverc.lab.example.com",
        "clienta",
        "serverd",
        "servere"
    ],
    "quorum_leader_name": "serverc.lab.example.com",
    "quorum_age": 595,

也可以在“Dashboard”中查看MONs的状态

查看守护进程日志

若要查看守护进程日志，请使用 journalctl -u $daemon@$id 命令

[root@serverc ~]# journalctl -u ceph-2ae6d05a-229a-11ec-925e-52540000fa0c@osd.1.service
-- Logs begin at Fri 2023-09-08 12:04:22 EDT, end at Fri 2023-09-08 12:52:56 EDT. --
Sep 08 12:04:33 serverc.lab.example.com systemd[1]: Starting Ceph osd.1 for 2ae6d05a-229a-11ec-925e-52540000fa0c...
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-3cc209>
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: Running command: /usr/bin/ln -snf /dev/ceph-3cc20990-17be-43da-a816-14983a1f7adf/osd-block-3f7513>
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-1
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1
Sep 08 12:04:39 serverc.lab.example.com bash[1192]: --> ceph-volume lvm activate successful for osd ID: 1
Sep 08 12:04:40 serverc.lab.example.com bash[1192]: 3373c2f4fd1f347cad910a1e2a1e48ba2a963c6a3eebff8fdb958d4cd558d81a
Sep 08 12:04:40 serverc.lab.example.com systemd[1]: Started Ceph osd.1 for 2ae6d05a-229a-11ec-925e-52540000fa0c.

Ceph 容器为各个守护进程写入单独的日志文件。通过将守护进程的 log_to_file 设置配置为 TRUE，为每个特定 Ceph 守护进程启用日志记录。本例为 MON 节点启用了日志记录。

1	[root@serverc ~]# ceph config set mon log_to_file true

监控osd

如果集群运行状况不正常，Ceph 会显示包含以下信息的详细状态报告：

OSD 的当前状态 (up/down/out/in)
OSD 接近容量限制信息（nearfull 或 full）。
放置组 (PG) 的当前状态

与 ceph status 和 ceph health 命令报告空间相关的警告或错误情况。各种 ceph osd 子命令可报告 OSD 的使用详情、状态和位置信息。

分析OSD使用

ceph osd df 命令显示池使用量统计数据。使用 ceph osd df tree 命令，可在命令输出中显示 CRUSH 树。

[root@serverc ~]# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP     META     AVAIL   %USE  VAR   PGS  STATUS
 0    hdd  0.00980   1.00000  10 GiB   18 MiB  2.4 MiB      0 B   16 MiB  10 GiB  0.18  0.85   34      up
 1    hdd  0.00980   1.00000  10 GiB   27 MiB  2.4 MiB      0 B   25 MiB  10 GiB  0.26  1.28   42      up
 2    hdd  0.00980   1.00000  10 GiB   12 MiB  2.4 MiB      0 B  9.5 MiB  10 GiB  0.12  0.56   29      up
 3    hdd  0.00980   1.00000  10 GiB   28 MiB  2.4 MiB      0 B   25 MiB  10 GiB  0.27  1.31   39      up
 5    hdd  0.00980   1.00000  10 GiB   15 MiB  2.4 MiB      0 B   12 MiB  10 GiB  0.14  0.70   31      up
 7    hdd  0.00980   1.00000  10 GiB   18 MiB  2.4 MiB      0 B   16 MiB  10 GiB  0.18  0.86   35      up
 4    hdd  0.00980   1.00000  10 GiB   27 MiB  2.4 MiB      0 B   24 MiB  10 GiB  0.26  1.26   34      up
 6    hdd  0.00980   1.00000  10 GiB   19 MiB  2.4 MiB      0 B   17 MiB  10 GiB  0.19  0.91   39      up
 8    hdd  0.00980   1.00000  10 GiB   27 MiB  2.4 MiB      0 B   24 MiB  10 GiB  0.26  1.25   32      up
                       TOTAL  90 GiB  190 MiB   22 MiB  8.5 KiB  169 MiB  90 GiB  0.21
MIN/MAX VAR: 0.56/1.31  STDDEV: 0.06

使用 ceph osd perf 命令查看 OSD 性能统计信息。

[root@serverc ~]# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
  8                   0                  0
  7                   0                  0
  6                   0                  0
  1                   0                  0
  0                   0                  0
  2                   0                  0
  3                   0                  0
  4                   0                  0
  5                   0                  0

解读 OSD 状态

根据这两个标志的组合，OSD 守护进程可以处于以下四种状态之一：

down 或 up ，表示守护进程是否正在运行和正在与 MON 通信。
out 或 in ，表示 OSD 是否参与了集群数据放置。

正常运行中的 OSD 的状态为 up 和 in。

例如，简单的网络中断可能会导致 OSD 失去与集群的通信，并且临时报告为 down。在由 mon_osd_down_out_interval 配置选项短暂控制（默认为五分钟）后，集群会报告 OSD 处于 down 和 out 状态。这时，分配至故障 OSD 的 PG 会迁移到其他 OSD。

如果故障 OSD 之后返回到 up 和 in 状态，集群会根据新的 OSD 集并通过重新平衡集群中的对象来重新分配 PG。

OSD 在固定的时间间隔（默认为六秒钟）互相验证状态。它们默认每隔 120 秒会将自己的状态报告给 MON。如果某个 OSD 处于 down 状态，则其他 OSD 或 MON 不会收到来自该 down OSD 的心跳响应。

监控OSD容量

达到或超过 mon_osd_full_ratio 设置的值时，集群会停止接受来自客户端的写入请求，并进入 HEALTH_ERR 状态。默认情况下，全满比率为集群中可用存储空间的 0.95 (95%)。

mon_osd_nearfull_ratio 设置是一个更为保守的限值。达到或超过 mon_osd_nearfull_ratio 限制值时，集群会进入 HEALTH_WARN 状态。这是为了在达到全满比率之前，提醒你需要添加 OSD 到集群或者修复问题。默认情况下，将满比率为集群中可用存储空间的 0.85 (85%)。

mon_osd_backfillfull_ratio 设置是集群 OSD 被视为太满而无法开始回填操作的阈值。默认情况下，回填全满比率为集群中可用存储空间的 0.90 (90%)。

使用 ceph osd set-full-ratio、ceph osd set-nearfull-ratio 和 ceph osd set-backfillfull-ratio 命令来配置这些设置。

[root@serverc ~]# ceph osd set-full-ratio .85
osd set-full-ratio 0.85
[root@serverc ~]# ceph osd set-nearfull-ratio .75
osd set-nearfull-ratio 0.75
[root@serverc ~]# ceph osd set-backfillfull-ratio .80
osd set-backfillfull-ratio 0.8

[root@serverc ~]# ceph osd dump  | grep full
full_ratio 0.85
backfillfull_ratio 0.8
nearfull_ratio 0.75

默认的比率设置适用于小型集群，生产集群通常需要较低的比例

监控放置组

每个放置组 (PG) 都分配有一个状态字符串，用于指示其运行状况。当所有放置组都为 active+clean 状态时，集群运行正常。PG 状态 scrubbing 或 deep-scrubbing 也可能发生于正常运行的集群，不表示存在问题。

放置组 scrubbing 是一个后台进程，通过将对象大小和其他元数据与其在其他 OSD 上的副本进行比较并报告不一致情况来验证数据一致性。Deep scrubbing 是一个资源密集型过程，通过按位比较来比较数据对象的内容，并重新计算校验和，以识别驱动器上的坏扇区。

放置组可以具有下列状态：

PG 状态	描述
`creating`	正在进行 PG 创建。
`peering`	OSD 已经就 PG 中对象的当前状态达成一致。
`active`	peering 操作已经完成。PG 可用于读取和写入请求。
`clean`	PG 具有正确数量的副本，没有离群的副本。
`degraded`	PG 含有副本数量不正确的对象。
`recovering`	对象正在迁移，或者与副本同步。
`recovery_wait`	PG 正在等待本地或远程预留。
`undersized`	PG 被配置为存储较多的副本，超过 PG 中可用的 OSD 数。
`inconsistent`	此 PG 中的副本不一致。PG 中有一个或多个副本不同，表明存在某种形式的 PG 损坏。
`replay`	在 OSD 崩溃后，PG 正在等待客户端从日志中重演操作。
`repair`	PG 已计划了修复。
`backfill`、 `backfill_wait`、 `backfill_toofull`	回填操作正在等待、正在执行，或因为存储不足而无法完成。
`incomplete`	PG 的历史日志中缺少有关可能发生的写入的信息。这可能表明 OSD 已发生故障或尚未启动。
`stale`	PG 处于未知状态（OSD 报告超时）。
`inactive`	PG 已太长时间处于不活跃状态。
`unclean`	PG 已太长时间处于不干净状态。
`remapped`	执行集合已被更改，PG 在主要 OSD 恢复或回填期间已暂时重新映射到另一组 OSD。
`down`	PG 为离线状态。
`splitting`	PG 正在被分割；PG 的数量正在增加。
`scrubbing`、`deep-scrubbing`	正在进行 PG 清理或深度清理操作。

当 OSD 添加到 PG 时，PG 会进入 peering 状态，以确保所有节点就 PG 的状态达成一致。Peering 操作完成后，如果 PG 可以处理读取和写入请求，则会报告 active 状态。如果 PG 的所有对象都具有正确数量的副本，则会报告 clean 状态。在写入完成后，PG 的正常运作状态为 active+clean。

当对象写入到 PG 的主要 OSD 时，该 PG 会报告 degraded 状态，直到所有副本 OSD 确认它们也已写入了这个对象。

backfill 状态表示正在复制或迁移数据，以在 OSD 之间重新平衡 PG。如果有新的 OSD 添加到 PG 中，它会逐渐回填对象，以避免网络流量过高。回填在后台进行，以最大限度降低对集群的性能影响。出现 backfill_wait 状态表明回填操作待处理。出现 backfill 状态表明回填操作正在进行中。出现 backfill_too_full 状态表明已请求回填操作，但由于存储容量不足而无法完成。

被标记为inconsistent PG 可能具有与其他 PG 不同的副本，从检测上看表现为一个或多个副本存在不同的数据校验和或元数据大小。Ceph 集群中的时钟偏移和对象内容损坏也可能会触发inconsistent的 PG 状态。

识别卡滞的 PG

PG 在故障后转换到 degraded 或 peering 状态。如果某个 PG 长时间停留在其中某个状态，MON 会将其标记为 stuck。卡滞的 PG 可能会处于以下一个或多个状态：

inactive PG 可能具有 peering 问题。
unclean PG 可能在故障后恢复期间遇到问题。
stale PG 没有 OSD 报告，可能表明所有 OSD 都处于 down 和 out 状态。
undersized PG 没有足够的 OSD 来存储所配置的副本数量

如果有许多 PG 停留在 peering 状态，可通过 ceph osd blocked-by 命令显示妨碍 OSD peering 操作的具体 OSD

集群升级

使用ceph orch upgrade命令升级你的Red Hat ceph Storage 5集群。

首先，通过运行cephadm-ansible preflight剧本，并将upgrade_cepph_packages选项设置为true来更新cephadm

使用ceph orch upgrade命令升级Red Hat ceph Storage 5集群

首先，更新cephadm通过运行cephadm-ansible preflight playbook与upgrade_cepph_packages选项设置为true

1 2	[root@node ~]# ansible-playbook -i /etc/ansible/hosts/cephadm-preflight.yml \ --extra-vars "ceph_origin=rhcs upgrade_ceph_packages=true"

然后执行ceph orch upgrade start –ceph-version VERSION命令

1	[ceph: root@node /]# ceph orch upgrade start --ceph-version 16.2.0-117.el8cp

执行ceph status命令，查询升级进度

[ceph: root@node /)# ceph status 
... output omitted ... 
progress: 
Upgrade to 16.2.0-115 .el8cp (ls)

不要将使用不同版本的Red Hat Ceph Storage的客户端和集群节点混合在同一个集群中。客户端包括RADOS网关、iSCSI网关以及其他使用librados、librbd或libceph的应用程序。

在集群升级后，使用ceph versions命令检查是否安装了匹配的版本

1	[ceph: root@node /]# ceph versions

使用Balancer模块

Ceph 存储提供了一个名为 balancer（平衡器）的 MGR 模块，它可自动优化 OSD 之间的 PG 放置来实现均衡分布。此模块也可以手动运行。

使用 ceph balancer status 命令来显示平衡器状态。

[root@serverc ~]# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.003807",
    "last_optimize_started": "Thu Aug  8 14:45:31 2024",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}

Red Hat Ceph Storage提供了一个名为balancer的MGR模块，它可以自动优化PGs在osd之间的放置，以实现均衡分布。该模块也可以手动运行

缺省情况下，balancer模块处于启用状态。使用ceph balancer on和ceph balancer off命令启用或禁用平衡器。

ceph balancer status命令用来查看均衡器状态。

1	[ceph: root@node /]# ceph balancer status

自动平衡

自动平衡采用以下模式之一：

crush-compat

此模式使用 compat weight-set 功能来计算和管理 CRUSH 层次结构中设备的另一组权重。平衡器可以优化这些权重集值，以较小的增量进行上下调整，从而达成与目标分布尽可能匹配的分布。

此模式全面向后兼容较旧的客户端。

upmap

PG upmap 模式支持在单个 OSD 映射中存储单个 OSD 的显式 PG 映射，以作为普通 CRUSH 放置计算的例外。upmap 模式分析 PG 放置，然后运行所需的 pg-upmap-items 命令，以优化 PG 放置并实现均衡分布。

由于这些 upmap 条目提供了对 PG 映射的精细控制，因此 upmap 模式通常能够在 OSD 之间均匀分布 PG，如果存在奇数个 PG，则相差 +/-1 个 PG。

将模式设置为 upmap 需要所有客户端都是 Luminous 或更高版本。使用 ceph osd set-require-min-compat-client luminous 命令来设置所需的最低客户端版本。

使用 ceph balancer mode upmap 命令将平衡器模式设置为 upmap。

1	[root@serverc ~]# ceph balancer mode upmap

手动平衡

可以手动运行balancer来控制何时进行平衡，并且在执行平衡器计划之前进行评估。若要手动运行平衡器，请使用下列命令来禁用自动平衡，然后生成和执行计划。

关闭自动平衡

1	[root@serverc ~]# ceph balancer off

评估并为集群的当前分布打分

1 2	[root@serverc ~]# ceph balancer eval current cluster score 0.133397 (lower is better)

对特定池的当前分布进行评估和评分

1 2	[root@serverc ~]# ceph balancer eval cephfs.lxhfsname-2.data pool "cephfs.lxhfsname-2.data" score 0.004847 (lower is better)

生成PG优化计划并为其命名

这个不是报错，是说你的集群已经很好了，不用优化

1 2	[root@ceph1 ~]# ceph balancer optimize lxh Error EALREADY: Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect

显示计划的内容

1	[root@serverc ~]# ceph balancer show lxh

分析计划执行的预期结果

1	[root@serverc ~]# ceph balancer eval lxh

如果你认可预期的结果，那么就执行计划
1
[root@serverc ~]# ceph balancer execute lxh

只有在你希望它改善分布时才执行计划。计划执行后被丢弃

使用ceph balancer ls/rm命令显示和删除当前记录的计划

1	[root@serverc ~]# ceph balancer ls/rm lxh