Author: Li Xiaohui

Contact:

1. WeChat: Lxh_Chat

2. Email: 939958092@qq.com

Mapping Out the Business Blueprint

The following example sketches the basic structure of an existing cloud computing platform.

A cloud computing company called lxhcloud does business in the city of Shanghai. It has one machine room in Shanghai's XuHui district; the room holds two racks, each rack holds several servers, and each server has several disks.

Let's now design a Ceph CRUSH map that reflects this business layout.

lxhcloud
└── Shanghai
    └── Datacenter-XuHui
        ├── Rack-1
        │   ├── Serverc
        │   │   ├── osd.0
        │   │   ├── osd.1
        │   │   └── osd.2
        │   └── Serverd
        │       ├── osd.3
        │       ├── osd.5
        │       └── osd.7
        └── Rack-2
            └── Servere
                ├── osd.4
                ├── osd.6
                └── osd.8
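The bucket types used here (root, region, datacenter, rack, host, osd) are all part of CRUSH's default type set. If you want to double-check which types your cluster's map defines before creating buckets, one option is to decompile the current map and look at its types section. A small sketch, run from inside cephadm shell where crushtool is available (/tmp/cm.bin is just an illustrative path):

[ceph: root@serverc /]# ceph osd getcrushmap -o /tmp/cm.bin
[ceph: root@serverc /]# crushtool -d /tmp/cm.bin | grep '^type'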

First, check the existing CRUSH map in Ceph

[root@serverc ~]# ceph osd crush tree
ID  CLASS  WEIGHT   TYPE NAME
-1         0.08817  root default
-3         0.02939      host serverc
 0    hdd  0.00980          osd.0
 1    hdd  0.00980          osd.1
 2    hdd  0.00980          osd.2
-5         0.02939      host serverd
 3    hdd  0.00980          osd.3
 5    hdd  0.00980          osd.5
 7    hdd  0.00980          osd.7
-7         0.02939      host servere
 4    hdd  0.00980          osd.4
 6    hdd  0.00980          osd.6
 8    hdd  0.00980          osd.8

First, create the bucket for the company

[root@serverc ~]# ceph osd crush add-bucket lxhcloud root
added bucket lxhcloud type root to crush map

Then create the city of Shanghai

[root@serverc ~]# ceph osd crush add-bucket Shanghai region
added bucket Shanghai type region to crush map

Next, create the XuHui data center in Shanghai

[root@serverc ~]# ceph osd crush add-bucket Datacenter-XuHui datacenter
added bucket Datacenter-XuHui type datacenter to crush map

Then create the two racks

[root@serverc ~]# ceph osd crush add-bucket Rack-1 rack
added bucket Rack-1 type rack to crush map
[root@serverc ~]# ceph osd crush add-bucket Rack-2 rack
added bucket Rack-2 type rack to crush map

Now link these buckets together according to the business layout to form the hierarchy

[root@serverc ~]# ceph osd crush move Shanghai root=lxhcloud
moved item id -10 name 'Shanghai' to location {root=lxhcloud} in crush map
[root@serverc ~]# ceph osd crush move Datacenter-XuHui region=Shanghai
moved item id -11 name 'Datacenter-XuHui' to location {region=Shanghai} in crush map
[root@serverc ~]# ceph osd crush move Rack-1 datacenter=Datacenter-XuHui
moved item id -15 name 'Rack-1' to location {datacenter=Datacenter-XuHui} in crush map
[root@serverc ~]# ceph osd crush move Rack-2 datacenter=Datacenter-XuHui
moved item id -16 name 'Rack-2' to location {datacenter=Datacenter-XuHui} in crush map

Query the CRUSH map again: the hierarchy is there, but the hosts are still under the default root

[root@serverc ~]# ceph osd crush tree
ID   CLASS  WEIGHT   TYPE NAME
 -9         0        root lxhcloud
-10         0            region Shanghai
-11         0                datacenter Datacenter-XuHui
-15         0                    rack Rack-1
-16         0                    rack Rack-2
 -1         0.08817  root default
 -3         0.02939      host serverc
  0    hdd  0.00980          osd.0
  1    hdd  0.00980          osd.1
  2    hdd  0.00980          osd.2
 -5         0.02939      host serverd
  3    hdd  0.00980          osd.3
  5    hdd  0.00980          osd.5
  7    hdd  0.00980          osd.7
 -7         0.02939      host servere
  4    hdd  0.00980          osd.4
  6    hdd  0.00980          osd.6
  8    hdd  0.00980          osd.8

Move the servers into the racks

[root@serverc ~]# ceph osd crush move serverc rack=Rack-1
moved item id -3 name 'serverc' to location {rack=Rack-1} in crush map
[root@serverc ~]# ceph osd crush move serverd rack=Rack-1
moved item id -5 name 'serverd' to location {rack=Rack-1} in crush map
[root@serverc ~]# ceph osd crush move servere rack=Rack-2
moved item id -7 name 'servere' to location {rack=Rack-2} in crush map

Query the CRUSH map once more: it now follows the business plan

[root@serverc ~]# ceph osd crush tree
ID   CLASS  WEIGHT   TYPE NAME
 -9         0.08817  root lxhcloud
-10         0.08817      region Shanghai
-11         0.08817          datacenter Datacenter-XuHui
-15         0.05878              rack Rack-1
 -3         0.02939                  host serverc
  0    hdd  0.00980                      osd.0
  1    hdd  0.00980                      osd.1
  2    hdd  0.00980                      osd.2
 -5         0.02939                  host serverd
  3    hdd  0.00980                      osd.3
  5    hdd  0.00980                      osd.5
  7    hdd  0.00980                      osd.7
-16         0.02939              rack Rack-2
 -7         0.02939                  host servere
  4    hdd  0.00980                      osd.4
  6    hdd  0.00980                      osd.6
  8    hdd  0.00980                      osd.8
 -1         0        root default

Modifying the CRUSH Map Placement Rule

Building the CRUSH hierarchy alone is not enough; the existing placement rule must also be modified so that it uses our new hierarchy by default instead of the old one.

Export a binary copy of the current map

[root@serverc ~]# cephadm shell
[ceph: root@serverc /]# ceph osd getcrushmap -o lxh.bin
32

Decompile the CRUSH map binary into a text file

[ceph: root@serverc /]# crushtool -d lxh.bin -o lxh.txt

Edit lxh.txt as shown below. The file is long; only the # rules section at the very bottom is changed here.

[ceph: root@serverc /]# vi lxh.txt

Here, the default after step take is changed to lxhcloud, and the host after step chooseleaf firstn 0 type is changed to osd.

# rules
rule replicated_rule {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take lxhcloud
	step chooseleaf firstn 0 type osd
	step emit
}

Recompile the CRUSH map from the text file

[ceph: root@serverc /]# crushtool -c lxh.txt -o new.bin

Do a dry run against the binary CRUSH map, simulating placement group creation

[ceph: root@serverc /]# crushtool -i new.bin --test --show-mappings | more
CRUSH rule 0 x 0 [6]
CRUSH rule 0 x 1 [5]
CRUSH rule 0 x 2 [8]
CRUSH rule 0 x 3 [4]
CRUSH rule 0 x 4 [5]
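The dry run above prints a single OSD per input value. As an optional extra check, you can simulate placement with three replicas against rule 0 and confirm that each line returns three distinct OSDs. A sketch using standard crushtool test options:

[ceph: root@serverc /]# crushtool -i new.bin --test --rule 0 --num-rep 3 --show-mappings | head -n 5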

The simulation looks fine, so import our new map into the cluster.

Note that once the import succeeds, existing data will be relocated. This generates a large amount of I/O and network traffic, and the cluster will temporarily be in a warning state.

[ceph: root@serverc /]# ceph osd setcrushmap -i new.bin
33
[ceph: root@serverc /]# ceph -s
  cluster:
    id:     2ae6d05a-229a-11ec-925e-52540000fa0c
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive, 21 pgs peering
            Degraded data redundancy: 172/663 objects degraded (25.943%), 25 pgs degraded

  services:
    mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age 6m)
    mgr: serverc.lab.example.com.aiqepd(active, since 26h), standbys: clienta.nncugs, servere.kjwyko, serverd.klrkci
    osd: 9 osds: 9 up (since 26h), 9 in (since 2y); 28 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   5 pools, 105 pgs
    objects: 221 objects, 4.9 KiB
    usage:   507 MiB used, 89 GiB / 90 GiB avail
    pgs:     48.571% pgs not active
             172/663 objects degraded (25.943%)
             76/663 objects misplaced (11.463%)
             41 active+clean
             23 remapped+peering
             14 activating+undersized+degraded+remapped
              8 active+recovery_wait+degraded
              7 activating
              4 activating+undersized+remapped
              3 active+recovery_wait+undersized+degraded+remapped
              3 activating+remapped
              1 active+recovering+undersized+remapped
              1 active

  io:
    recovery: 0 B/s, 0 objects/s
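The cluster returns to HEALTH_OK on its own once peering and recovery complete. If the rebalancing traffic needs to be kept in check on a busy cluster, one option is to temporarily constrain backfill and recovery while the data moves. A sketch using standard OSD options; the values are only illustrative, and the defaults differ between releases:

[ceph: root@serverc /]# ceph config set osd osd_max_backfills 1
[ceph: root@serverc /]# ceph config set osd osd_recovery_max_active 1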

Creating New CRUSH Map Placement Rules

If you would rather not modify the default rule for now, or are not comfortable with the export-and-recompile workflow, you can instead create new rules and then simply point each pool at the rule it should use.

The command formats are:

ceph osd crush rule create-replicated <name> <root> <type> [<class>]
ceph osd crush rule create-erasure <name> [<profile>]

Create a replicated rule that places data on different OSDs under lxhcloud

Replicated pool

[root@serverc ~]# ceph osd crush rule create-replicated lxhrep lxhcloud osd
[root@serverc ~]# ceph osd crush rule ls
replicated_rule
lxhrep
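As a quick sanity check, you can dump the new rule and confirm that it takes lxhcloud and chooses leaves of type osd, i.e. the same steps we edited by hand in the previous section. A small sketch:

[root@serverc ~]# ceph osd crush rule dump lxhrep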

Erasure-coded pool

[root@serverc ~]# ceph osd erasure-code-profile set lxhecprofile k=4 m=2 crush-root=lxhcloud crush-failure-domain=osd
[root@serverc ~]# ceph osd erasure-code-profile get lxhecprofile
crush-device-class=
crush-failure-domain=osd
crush-root=lxhcloud
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Ceph automatically creates a rule for every erasure-coded pool you create; the rule is named after the new pool.

[root@serverc ~]# ceph osd pool create ecpool erasure lxhecprofile
pool 'ecpool' created
[root@serverc ~]# ceph osd crush rule ls
replicated_rule
lxhrep
ecpool

Changing the CRUSH Rule of Existing Pools

Switch the CRUSH rule of the existing pools to the lxhrep rule.

The following command changes the placement rule of every existing pool to lxhrep in one batch:

[root@serverc ~]# for pool in $(ceph osd pool ls);do ceph osd pool set $pool crush_rule lxhrep;done
set pool 1 crush_rule to lxhrep
set pool 2 crush_rule to lxhrep
set pool 3 crush_rule to lxhrep
set pool 4 crush_rule to lxhrep
set pool 5 crush_rule to lxhrep
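The same loop pattern can be used to verify the change afterwards. A small sketch:

[root@serverc ~]# for pool in $(ceph osd pool ls); do ceph osd pool get $pool crush_rule; done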

Optimizing Placement Groups

The placement group autoscaler can be used to optimize PG distribution and is enabled by default. When necessary, you can also set the number of PGs for each pool manually; Red Hat recommends roughly 100 to 200 PGs per OSD.
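For example, the sketch below (using the ecpool created earlier, with purely illustrative values) shows how to review the autoscaler's view of each pool, put one pool into warn-only mode, and pin a PG count by hand:

[root@serverc ~]# ceph osd pool autoscale-status
[root@serverc ~]# ceph osd pool set ecpool pg_autoscale_mode warn
[root@serverc ~]# ceph osd pool set ecpool pg_num 32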

Calculating the Number of Placement Groups

For a cluster with a single pool, you can use the following formula, which targets 100 placement groups per OSD:

Total PGs = (OSDs * 100)/Number of replicas 
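As a worked example for the nine-OSD cluster in this article, assuming 3-way replicated pools:

Total PGs = (9 * 100) / 3 = 300

which would then typically be rounded to a nearby power of two (for example 256) before being set as pg_num.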

Red Hat recommends using the per-pool Ceph Placement Groups calculator: https://access.redhat.com/labs/cephpgc/manual/

Manually Mapping PGs

Use the ceph osd pg-upmap-items command to map a PG to specific OSDs by hand. This is only supported by luminous and later Ceph clients, so first set the cluster's minimum required client version to luminous.

[root@serverc ~]# ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous
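This command will refuse to proceed if clients older than luminous are still connected (unless forced). Before running it, you can optionally check what is connected; a small sketch, ceph features summarizes the feature releases of connected clients and daemons:

[root@serverc ~]# ceph features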

The following example takes PG 7.16, whose current up set is [3,0,4,2,7,5], and sets the upmap pairs 1→2, 3→4, and 5→6 (each pair means "replace the first OSD with the second wherever it appears in this PG's mapping"):

[root@serverc ~]# ceph pg map 7.16
osdmap e273 pg 7.16 (7.16) -> up [3,0,4,2,7,5] acting [3,0,4,2,7,5]
[root@serverc ~]# ceph osd pg-upmap-items 7.16 1 2 3 4 5 6
set 7.16 pg_upmap_items mapping to [1->2,3->4,5->6]
[root@serverc ~]# ceph pg map 7.16

Remapping hundreds of PGs one by one like this is impractical. Instead, use the osdmaptool command: it takes the actual map of a pool, analyzes it, and generates the ceph osd pg-upmap-items commands you need to run to reach an optimal distribution.

  1. Export the map to a file. The following command saves the map to ./om:
[ceph: root@serverc /]# ceph osd getmap -o ./om
got osdmap epoch 278
  2. Use the --test-map-pgs option of the osdmaptool command to display the actual PG distribution. The following command prints the distribution for the pool with ID 6:

We can see that the count column is fairly well balanced, so no further optimization is strictly needed.

[ceph: root@serverc /]# osdmaptool ./om --test-map-pgs --pool 6
osdmaptool: osdmap file './om'
pool 6 pg_num 32
#osd    count  first  primary  c wt        wt
osd.0   13     2      2        0.00999451  1
osd.1   13     4      4        0.00999451  1
osd.2   17     3      3        0.00999451  1
osd.3   13     1      1        0.00999451  1
osd.4   15     5      5        0.00999451  1
osd.5   12     3      3        0.00999451  1
osd.6   12     6      6        0.00999451  1
osd.7   15     2      2        0.00999451  1
osd.8   18     6      6        0.00999451  1
 in 9
 avg 14 stddev 2.0548 (0.146772x) (expected 3.55556 0.253968x))
 min osd.5 12
 max osd.8 18
size 4  32

The output shows that osd.5 has the fewest PGs (12) while osd.8 has the most (18).

  3. Generate the commands to rebalance the PGs. Use the --upmap option of the osdmaptool command to save the commands to a file:

The --upmap-deviation parameter does not normally need to be specified; I set it here only to make sure some commands are generated. Its default value is 5, and it is the allowed deviation in the PG counts shown above.

[ceph: root@serverc /]# osdmaptool ./om --upmap ./cmds.txt --pool 6 --upmap-deviation 1
osdmaptool: osdmap file './om'
writing upmap command output to: ./cmds.txt
checking for upmap cleanups
upmap, max-count 10, max deviation 1
pools ecpool abc device_health_metrics default.rgw.meta .rgw.root default.rgw.log default.rgw.control
prepared 10/10 changes
[ceph: root@serverc /]# cat cmds.txt
ceph osd pg-upmap-items 7.6 7 5
ceph osd pg-upmap-items 7.7 3 6
ceph osd pg-upmap-items 7.9 4 5
ceph osd pg-upmap-items 7.16 5 6
ceph osd pg-upmap-items 7.17 2 8 7 6
ceph osd pg-upmap-items 7.18 0 5
ceph osd pg-upmap-items 7.19 3 6
ceph osd pg-upmap-items 7.1a 3 8
ceph osd pg-upmap-items 7.1b 2 6
  4. Run the commands:
[ceph: root@serverc /]# bash cmds.txt
set 7.6 pg_upmap_items mapping to [7->5]
set 7.7 pg_upmap_items mapping to [3->6]
set 7.9 pg_upmap_items mapping to [4->5]
set 7.16 pg_upmap_items mapping to [5->6]
set 7.17 pg_upmap_items mapping to [2->8,7->6]
set 7.18 pg_upmap_items mapping to [0->5]
set 7.19 pg_upmap_items mapping to [3->6]
set 7.1a pg_upmap_items mapping to [3->8]
set 7.1b pg_upmap_items mapping to [2->6]
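Once applied, the exception-table entries can be seen in the OSD map, and an individual PG can be checked with ceph pg map as before. A small verification sketch:

[ceph: root@serverc /]# ceph osd dump | grep pg_upmap_items
[ceph: root@serverc /]# ceph pg map 7.17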