CRUSH Location
The location of an OSD in terms of the CRUSH map’s hierarchy is referred to as a crush location. This location specifier takes the form of a list of key and value pairs describing a position. For example, if an OSD is in a particular row, rack, chassis and host, and is part of the ‘default’ CRUSH tree (this is the case for the vast majority of clusters), its crush location could be described as:
root=default row=a rack=a2 chassis=a2a host=a2a1
Note: The order of the keys does not matter.
The key name (left of =) must be a valid CRUSH type. By default these include root, datacenter, room, row, pod, pdu, rack, chassis and host, but those types can be customized to be anything appropriate by modifying the CRUSH map.
Not all keys need to be specified. For example, by default, Ceph automatically sets a ceph-osd daemon’s location to be root=default host=HOSTNAME (based on the output from hostname -s).
The crush location for an OSD is normally expressed via the crush location config option being set in the ceph.conf file. Each time the OSD starts, it verifies that it is in the correct location in the CRUSH map and, if it is not, it moves itself. To disable this automatic CRUSH map management, add the following to your configuration file in the [osd] section:
osd crush update on start = false
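As a concrete example, on a host whose OSDs all sit in the location described earlier, the crush location option could be set in that host’s ceph.conf as shown below (with automatic updates left enabled, each OSD will move itself to this location at startup; the row, rack, and chassis values are the placeholders from the example above):

[osd]
    crush location = root=default row=a rack=a2 chassis=a2a host=a2a1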
Custom location hooks
A customized location hook can be used to generate a more complete crush location on startup. The sample ceph-crush-location utility will generate a CRUSH location string for a given daemon. The location is based on, in order of preference:
A crush location option in ceph.conf.
A default of root=default host=HOSTNAME, where the hostname is generated with the hostname -s command.
This is not useful by itself, as the OSD itself has the exact same
behavior. However, the script can be modified to provide additional
location fields (for example, the rack or datacenter), and then the
hook enabled via the config option:
crush location hook = /path/to/customized-ceph-crush-location
This hook is passed several arguments (below) and should output a single line to stdout with the CRUSH location description:
$ ceph-crush-location --cluster CLUSTER --id ID --type TYPE
where the cluster name is typically ‘ceph’, the id is the daemon identifier (the OSD number), and the daemon type is typically osd.
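For example, a minimal customized hook might look like the following sketch. The datacenter and rack values here are placeholders that a real hook would look up from your own inventory, and the arguments passed by Ceph are accepted but ignored:

#!/bin/sh
# Hypothetical customized CRUSH location hook.
# Ceph invokes it as: customized-ceph-crush-location --cluster CLUSTER --id ID --type TYPE
# The arguments are not used in this simple sketch.
# Output: a single line with the daemon's CRUSH location.
echo "root=default datacenter=dc1 rack=a2 host=$(hostname -s)"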
CRUSH structure
The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (ceph-osd
daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., ‘three replicas in different
racks’).
Devices
Devices are individual ceph-osd daemons that can store data. You will normally have one defined here for each OSD daemon in your cluster. Devices are identified by an id (a non-negative integer) and a name, normally osd.N where N is the device id.
Devices may also have a device class associated with them (e.g., hdd or ssd), allowing them to be conveniently targeted by a crush rule.
Types and Buckets
A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of types that are
used to describe these nodes. By default, these types include:
osd (or device)
host
chassis
rack
row
pdu
pod
room
datacenter
region
root
Most clusters make use of only a handful of these types, and others
can be defined as needed.
The hierarchy is built with devices (normally type osd) at the leaves, interior nodes with non-device types, and a root node of type root.
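For example, a small hypothetical cluster might be organized as a single root bucket containing two host buckets, each holding two OSD devices:

root default
    host node1
        osd.0
        osd.1
    host node2
        osd.2
        osd.3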
Each node (device or bucket) in the hierarchy has a weight
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).
You can get a simple view of the CRUSH hierarchy for your cluster, including the weights, with:
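ceph osd crush tree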
Rules
Rules define policy about how data is distributed across the devices
in the hierarchy.
CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data, and more
specifically to Section 3.2.
In almost all cases, CRUSH rules can be created via the CLI by
specifying the pool type they will be used for (replicated or
erasure coded), the failure domain, and optionally a device class.
In rare cases rules must be written by hand by manually editing the
CRUSH map.
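The manual workflow, roughly, is to extract and decompile the current map, edit the text form, and then recompile and inject the result; the file names below are arbitrary:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt as needed
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new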
You can see what rules are defined for your cluster with:
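ceph osd crush rule ls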
You can view the contents of the rules with:
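ceph osd crush rule dump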
Device classes
Each device can optionally have a class associated with it. By
default, OSDs automatically set their class on startup to either
hdd, ssd, or nvme based on the type of device they are backed
by.
The device class for one or more OSDs can be explicitly set with:
ceph osd crush set-device-class <class> <osd-name> [...]
Once a device class is set, it cannot be changed to another class
until the old class is unset with:
ceph osd crush rm-device-class <osd-name> [...]
This allows administrators to set device classes without the class
being changed on OSD restart or by some other script.
A placement rule that targets a specific device class can be created with:
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
A pool can then be changed to use the new rule with:
ceph osd pool set <pool-name> crush_rule <rule-name>
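For example (the OSD names, rule name, and pool name below are hypothetical), to dedicate two OSDs to a replicated rule that places data only on ssd devices with host as the failure domain, and switch an existing pool to it:

ceph osd crush rm-device-class osd.0 osd.1
ceph osd crush set-device-class ssd osd.0 osd.1
ceph osd crush rule create-replicated fast_rule default host ssd
ceph osd pool set mypool crush_rule fast_rule

The rm-device-class step is only needed if those OSDs already had a class assigned automatically at startup.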
Device classes are implemented by creating a “shadow” CRUSH hierarchy
for each device class in use that contains only devices of that class.
Rules can then distribute data over the shadow hierarchy. One nice
thing about this approach is that it is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with:
ceph osd crush tree --show-shadow
For older clusters created before Luminous that relied on manually
crafted CRUSH maps to maintain per-device-type hierarchies, there is a
reclassify tool available to help transition to device classes
without triggering data movement (see Migrating from a legacy SSD rule to device classes).
Weight sets
A weight set is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we should be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a die sixty times will not result in exactly 10 ones and 10 sixes. Weight sets allow the cluster to do a numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.
There are two types of weight sets supported:
A compat weight set is a single alternative set of weights for
each device and node in the cluster. This is not well-suited for
correcting for all anomalies (for example, placement groups for
different pools may be different sizes and have different load
levels, but will be mostly treated the same by the balancer).
However, compat weight sets have the huge advantage that they are
backward compatible with previous versions of Ceph, which means
that even though weight sets were first introduced in Luminous
v12.2.z, older clients (e.g., firefly) can still connect to the
cluster when a compat weight set is being used to balance data.
A per-pool weight set is more flexible in that it allows
placement to be optimized for each data pool. Additionally,
weights can be adjusted for each position of placement, allowing
the optimizer to correct for a subtle skew of data toward devices with small weights relative to their peers (an effect that is usually only apparent in very large clusters but which can cause balancing problems).
When weight sets are in use, the weights associated with each node in the hierarchy are visible as a separate column (labeled either (compat) or the pool name) in the CRUSH tree output.
When both compat and per-pool weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If neither is present, it will use the normal CRUSH weights.
Although weight sets can be set up and manipulated by hand, it is
recommended that the balancer module be enabled to do so
automatically.
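For example, a minimal way to hand this over to the balancer module is shown below; the crush-compat mode maintains the compat weight set described above, and upmap is another available mode on recent releases:

ceph balancer mode crush-compat
ceph balancer on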