docs: device-mapper: move it to the admin-guide

The DM support describes lots of aspects related to mapped disk partitions
from the userspace PoV.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

commit 6cf2a73cb2 (parent 159a5e78bd)
35 changed files with 5 additions and 6 deletions

Documentation/admin-guide/device-mapper/cache-policies.rst (new file, 131 lines)

=============================
Guidance for writing policies
=============================

Try to keep transactionality out of it.  The core is careful to
avoid asking about anything that is migrating.  This is a pain, but
makes it easier to write the policies.

Mappings are loaded into the policy at construction time.

Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS or issue a migration.

Currently there's no way for the policy to issue background work,
e.g. to start writing back dirty blocks that are going to be evicted
soon.

Because we map bios, rather than requests it's easy for the policy
to get fooled by many small bios.  For this reason the core target
issues periodic ticks to the policy.  It's suggested that the policy
doesn't update states (eg, hit counts) for a block more than once
for each tick.  The core ticks by watching bios complete, and so
trying to see when the io scheduler has let the ios run.


Overview of supplied cache replacement policies
===============================================

multiqueue (mq)
---------------

This policy is now an alias for smq (see below).

The following tunables are accepted, but have no effect::

    'sequential_threshold <#nr_sequential_ios>'
    'random_threshold <#nr_random_ios>'
    'read_promote_adjustment <value>'
    'write_promote_adjustment <value>'
    'discard_promote_adjustment <value>'

Stochastic multiqueue (smq)
---------------------------

This policy is the default.

The smq policy (vs mq) offers the promise of less memory utilization,
improved performance and increased adaptability in the face of changing
workloads.  smq also does not have any cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target.  Doing so will cause all of the
mq policy's hints to be dropped.  Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
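
A minimal sketch of such a reload with dmsetup might look like the
following; the device name "my_cache" is illustrative and the sed
expression assumes the table line ends with the mq policy fields, so
adapt both to your setup::

    #!/bin/sh
    # Swap the policy fields of the current table line for "smq 0",
    # then load the edited line and resume the device.
    DEV=my_cache
    TABLE=$(dmsetup table "$DEV" | sed 's/ mq .*$/ smq 0/')
    dmsetup suspend "$DEV"
    dmsetup load "$DEV" --table "$TABLE"
    dmsetup resume "$DEV"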

Memory usage
^^^^^^^^^^^^

The mq policy used a lot of memory; 88 bytes per cache block on a 64
bit machine.

smq uses 28-bit indexes to implement its data structures rather than
pointers.  It avoids storing an explicit hit count for each block.  It
has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All this means smq uses ~25 bytes per cache block.  Still a lot of
memory, but a substantial improvement nonetheless.

Level balancing
^^^^^^^^^^^^^^^

mq placed entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)).  This meant the bottom
levels generally had the most entries, and the top ones had very
few.  Having unbalanced levels like this reduced the efficacy of the
multiqueue.

smq does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above.  The overall
ordering is a side effect of this stochastic process.  With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.

Adaptability
^^^^^^^^^^^^

The mq policy maintained a hit count for each cache block.  For a
different block to get promoted to the cache its hit count has to
exceed the lowest currently in the cache.  This meant it could take a
long time for the cache to adapt between varying IO patterns.

smq doesn't maintain hit counts, so a lot of this problem just goes
away.  In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote.  If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels.  This lets it adapt to new IO patterns very quickly.

Performance
^^^^^^^^^^^

Testing smq shows substantially better performance than mq.

cleaner
-------

The cleaner writes back all dirty blocks in a cache to decommission it.

Examples
========

The syntax for a table is::

    cache <metadata dev> <cache dev> <origin dev> <block size>
    <#feature_args> [<feature arg>]*
    <policy> <#policy_args> [<policy arg>]*

The syntax to send a message using the dmsetup command is::

    dmsetup message <mapped device> 0 sequential_threshold 1024
    dmsetup message <mapped device> 0 random_threshold 8

Using dmsetup::

    dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
        /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"

creates a 128GB large mapped device named 'blah' with the
sequential_threshold set to 1024 and the random_threshold set to 8.

Documentation/admin-guide/device-mapper/cache.rst (new file, 337 lines)

=====
Cache
=====

Introduction
============

dm-cache is a device mapper target written by Joe Thornber, Heinz
Mauelshagen, and Mike Snitzer.

It aims to improve performance of a block device (eg, a spindle) by
dynamically migrating some of its data to a faster, smaller device
(eg, an SSD).

This device-mapper solution allows us to insert this caching at
different levels of the dm stack, for instance above the data device for
a thin-provisioning pool.  Caching solutions that are integrated more
closely with the virtual memory system should give better performance.

The target reuses the metadata library used in the thin-provisioning
library.

The decision as to what data to migrate and when is left to a plug-in
policy module.  Several of these have been written as we experiment,
and we hope other people will contribute others for specific io
scenarios (eg. a vm image server).

Glossary
========

  Migration
    Movement of the primary copy of a logical block from one
    device to the other.
  Promotion
    Migration from slow device to fast device.
  Demotion
    Migration from fast device to slow device.

The origin device always contains a copy of the logical block, which
may be out of date or kept in sync with the copy on the cache device
(depending on policy).

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with
other parameters detailed later):

1. An origin device - the big, slow one.

2. A cache device - the small, fast one.

3. A small metadata device - records which blocks are in the cache,
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
   e.g. as a mirror for extra robustness.  This metadata device may only
   be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size.  This block size
is configurable when you first create the cache.  Typically we've been
using block sizes of 256KB - 1024KB.  The block size must be between 64
sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).

Having a fixed block size simplifies the target a lot.  But it is
something of a compromise.  For instance, a small part of a block may be
getting hit a lot, yet the whole block will be promoted to the cache.
So large block sizes are bad because they waste cache space.  And small
block sizes are bad because they increase the amount of metadata (both
in core and on disk).

Cache operating modes
---------------------

The cache has three operating modes: writeback, writethrough and
passthrough.

If writeback, the default, is selected then a write to a block that is
cached will go only to the cache and the block will be marked dirty in
the metadata.

If writethrough is selected then a write to a cached block will not
complete until it has hit both the origin and cache devices.  Clean
blocks should remain clean.

If passthrough is selected, useful when the cache contents are not known
to be coherent with the origin device, then all reads are served from
the origin device (all reads miss the cache) and all writes are
forwarded to the origin device; additionally, write hits cause cache
block invalidates.  To enable passthrough mode the cache must be clean.
Passthrough mode allows a cache device to be activated without having to
worry about coherency.  Coherency that exists is maintained, although
the cache will gradually cool as writes take place.  If the coherency of
the cache can later be verified, or established through use of the
"invalidate_cblocks" message, the cache device can be transitioned to
writethrough or writeback mode while still warm.  Otherwise, the cache
contents can be discarded prior to transitioning to the desired
operating mode.

A simple cleaner policy is provided, which will clean (write back) all
dirty blocks in a cache.  Useful for decommissioning a cache or when
shrinking a cache.  Shrinking the cache's fast device requires all cache
blocks, in the area of the cache being removed, to be clean.  If the
area being removed from the cache still contains dirty blocks the resize
will fail.  Care must be taken to never reduce the volume used for the
cache's fast device until the cache is clean.  This is of particular
importance if writeback mode is used.  Writethrough and passthrough
modes already maintain a clean cache.  Future support to partially clean
the cache, above a specified threshold, will allow for keeping the cache
warm and in writeback mode during resize.
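
A minimal decommissioning sketch along these lines, assuming a cache
device named "my_cache" and reusing the device geometry from the
examples at the end of this document (only the policy fields change to
"cleaner 0"), might be::

    #!/bin/sh
    # Reload with the cleaner policy, wait for writeback to finish,
    # then remove the cache.  The awk field number assumes #dirty is
    # the 14th field of "dmsetup status" output as described below.
    DEV=my_cache
    dmsetup suspend "$DEV"
    dmsetup load "$DEV" --table "0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 512 0 cleaner 0"
    dmsetup resume "$DEV"
    while [ "$(dmsetup status "$DEV" | awk '{print $14}')" != "0" ]; do
        sleep 1
    done
    dmsetup remove "$DEV"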

Migration throttling
--------------------

Migrating data between the origin and cache device uses bandwidth.
The user can set a throttle to prevent more than a certain amount of
migration occurring at any one time.  Currently we're not taking any
account of normal io traffic going to the devices.  More work needs
doing here to avoid migrating during those peak io moments.

For the time being, a message "migration_threshold <#sectors>"
can be used to set the maximum number of sectors being migrated,
the default being 2048 sectors (1MB).
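
For example, raising the throttle to 8MB worth of sectors on a device
named "my_cache" (the name is illustrative) could be done with::

    dmsetup message my_cache 0 migration_threshold 16384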

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second.  This
means the cache behaves like a physical disk that has a volatile write
cache.  If power is lost you may lose some recent writes.  The metadata
should always be consistent in spite of any crash.

The 'dirty' state for a cache block changes far too frequently for us
to keep updating it on the fly.  So we treat it as a hint.  In normal
operation it will be written when the dm device is suspended.  If the
system crashes all cache blocks will be assumed dirty when restarted.

Per-block policy hints
----------------------

Policy plug-ins can store a chunk of data per cache block.  It's up to
the policy how big this chunk is, but it should be kept small.  Like the
dirty flags this data is lost if there's a crash so a safe fallback
value should always be possible.

Policy hints affect performance, not correctness.

Policy messaging
----------------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  Refer to cache-policies.rst.

Discard bitset resolution
-------------------------

We can avoid copying data during migration if we know the block has
been discarded.  A prime example of this is when mkfs discards the
whole block device.  We store a bitset tracking the discard state of
blocks.  However, we allow this bitset to have a different block size
from the cache blocks.  This is because we need to track the discard
state for all of the origin device (compare with the dirty bitset
which is just for the smaller cache device).

Target interface
================

Constructor
-----------

 ::

   cache <metadata dev> <cache dev> <origin dev> <block size>
         <#feature args> [<feature arg>]*
         <policy> <#policy args> [policy args]*

================ =======================================================
metadata dev     fast device holding the persistent metadata
cache dev        fast device holding cached data blocks
origin dev       slow device holding original data blocks
block size       cache unit size in sectors

#feature args    number of feature arguments passed
feature args     writethrough or passthrough (The default is writeback.)

policy           the replacement policy to use
#policy args     an even number of arguments corresponding to
                 key/value pairs passed to the policy
policy args      key/value pairs passed to the policy
                 E.g. 'sequential_threshold 1024'
                 See cache-policies.rst for details.
================ =======================================================

Optional feature arguments are:

==================== ========================================================
writethrough         write through caching that prohibits cache block
                     content from being different from origin block content.
                     Without this argument, the default behaviour is to write
                     back cache block contents later for performance reasons,
                     so they may differ from the corresponding origin blocks.

passthrough          a degraded mode useful for various cache coherency
                     situations (e.g., rolling back snapshots of
                     underlying storage).  Reads and writes always go to
                     the origin.  If a write goes to a cached origin
                     block, then the cache block is invalidated.
                     To enable passthrough mode the cache must be clean.

metadata2            use version 2 of the metadata.  This stores the dirty
                     bits in a separate btree, which improves speed of
                     shutting down the cache.

no_discard_passdown  disable passing down discards from the cache
                     to the origin's data device.
==================== ========================================================

A policy called 'default' is always registered.  This is an alias for
the policy we currently think is giving best all round performance.

As the default policy could vary between kernels, if you are relying on
the characteristics of a specific policy, always request it by name.
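
For example, naming smq explicitly instead of relying on 'default' only
changes the policy fields of the table line (device names and sizes
below are illustrative)::

    dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 512 0 smq 0'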

Status
------

::

  <metadata block size> <#used metadata blocks>/<#total metadata blocks>
  <cache block size> <#used cache blocks>/<#total cache blocks>
  <#read hits> <#read misses> <#write hits> <#write misses>
  <#demotions> <#promotions> <#dirty> <#features> <features>*
  <#core args> <core args>* <policy name> <#policy args> <policy args>*
  <cache metadata mode>


========================= =====================================================
metadata block size       Fixed block size for each metadata block in
                          sectors
#used metadata blocks     Number of metadata blocks used
#total metadata blocks    Total number of metadata blocks
cache block size          Configurable block size for the cache device
                          in sectors
#used cache blocks        Number of blocks resident in the cache
#total cache blocks       Total number of cache blocks
#read hits                Number of times a READ bio has been mapped
                          to the cache
#read misses              Number of times a READ bio has been mapped
                          to the origin
#write hits               Number of times a WRITE bio has been mapped
                          to the cache
#write misses             Number of times a WRITE bio has been
                          mapped to the origin
#demotions                Number of times a block has been removed
                          from the cache
#promotions               Number of times a block has been moved to
                          the cache
#dirty                    Number of blocks in the cache that differ
                          from the origin
#feature args             Number of feature args to follow
feature args              'writethrough' (optional)
#core args                Number of core arguments (must be even)
core args                 Key/value pairs for tuning the core
                          e.g. migration_threshold
policy name               Name of the policy
#policy args              Number of policy arguments to follow (must be even)
policy args               Key/value pairs e.g. sequential_threshold
cache metadata mode       ro if read-only, rw if read-write

                          In serious cases where even a read-only mode is
                          deemed unsafe no further I/O will be permitted and
                          the status will just contain the string 'Fail'.
                          The userspace recovery tools should then be used.
needs_check               'needs_check' if set, '-' if not set
                          A metadata operation has failed, resulting in the
                          needs_check flag being set in the metadata's
                          superblock.  The metadata device must be
                          deactivated and checked/repaired before the
                          cache can be made fully operational again.
                          '-' indicates needs_check is not set.
========================= =====================================================

Messages
--------

Policies will have different tunables, specific to each one, so we
need a generic way of getting and setting these.  Device-mapper
messages are used.  (A sysfs interface would also be possible.)

The message format is::

    <key> <value>

E.g.::

    dmsetup message my_cache 0 sequential_threshold 1024


Invalidation is removing an entry from the cache without writing it
back.  Cache blocks can be invalidated via the invalidate_cblocks
message, which takes an arbitrary number of cblock ranges.  Each cblock
range's end value is "one past the end", meaning 5-10 expresses a range
of values from 5 to 9.  Each cblock must be expressed as a decimal
value; in the future a variant message that takes cblock ranges
expressed in hexadecimal may be needed to better support efficient
invalidation of larger caches.  The cache must be in passthrough mode
when invalidate_cblocks is used::

    invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.::

    dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

Examples
========

The test suite can be found here:

https://github.com/jthornber/device-mapper-test-suite

::

    dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
    dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
        /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
        mq 4 sequential_threshold 1024 random_threshold 8'

Documentation/admin-guide/device-mapper/delay.rst (new file, 31 lines)

========
dm-delay
========

Device-Mapper's "delay" target delays reads and/or writes
and maps them to different devices.

Parameters::

    <device> <offset> <delay> [<write_device> <write_offset> <write_delay>
                               [<flush_device> <flush_offset> <flush_delay>]]

With separate write parameters, the first set is only used for reads.
Offsets are specified in sectors.
Delays are specified in milliseconds.

Example scripts
===============

::

    #!/bin/sh
    # Create device delaying rw operation for 500ms
    echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed

::

    #!/bin/sh
    # Create device delaying only write operation for 500ms and
    # splitting reads and writes to different devices $1 $2
    echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
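
An illustrative variant that also gives flush I/O its own parameters,
leaving reads to $1 undelayed, delaying writes to $2 by 500ms and
flushes to $3 by 100ms::

    #!/bin/sh
    # Route reads, writes and flushes to three devices with separate delays
    echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500 $3 0 100" | dmsetup create delayed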

Documentation/admin-guide/device-mapper/dm-crypt.rst (new file, 173 lines)

========
dm-crypt
========

Device-Mapper's "crypt" target provides transparent encryption of block devices
using the kernel crypto API.

For a more detailed description of supported parameters see:
https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt

Parameters::

    <cipher> <key> <iv_offset> <device path> \
        <offset> [<#opt_params> <opt_params>]

<cipher>
    Encryption cipher, encryption mode and Initial Vector (IV) generator.

    The cipher specifications format is::

       cipher[:keycount]-chainmode-ivmode[:ivopts]

    Examples::

       aes-cbc-essiv:sha256
       aes-xts-plain64
       serpent-xts-plain64

    Cipher format also supports direct specification with kernel crypt API
    format (selected by capi: prefix).  The IV specification is the same
    as for the first format type.
    This format is mainly used for specification of authenticated modes.

    The crypto API cipher specifications format is::

        capi:cipher_api_spec-ivmode[:ivopts]

    Examples::

        capi:cbc(aes)-essiv:sha256
        capi:xts(aes)-plain64

    Examples of authenticated modes::

        capi:gcm(aes)-random
        capi:authenc(hmac(sha256),xts(aes))-random
        capi:rfc7539(chacha20,poly1305)-random

    /proc/crypto contains a list of currently loaded crypto modes.

<key>
    Key used for encryption.  It is encoded either as a hexadecimal number
    or it can be passed as <key_string> prefixed with single colon
    character (':') for keys residing in kernel keyring service.
    You can only use key sizes that are valid for the selected cipher
    in combination with the selected iv mode.
    Note that for some iv modes the key string can contain additional
    keys (for example IV seed) so the key contains more parts concatenated
    into a single string.

<key_string>
    The kernel keyring key is identified by string in following format:
    <key_size>:<key_type>:<key_description>.

<key_size>
    The encryption key size in bytes.  The kernel key payload size must match
    the value passed in <key_size>.

<key_type>
    Either 'logon' or 'user' kernel key type.

<key_description>
    The kernel keyring key description crypt target should look for
    when loading key of <key_type>.

<keycount>
    Multi-key compatibility mode.  You can define <keycount> keys and
    then sectors are encrypted according to their offsets (sector 0 uses key0;
    sector 1 uses key1 etc.).  <keycount> must be a power of two.

<iv_offset>
    The IV offset is a sector count that is added to the sector number
    before creating the IV.

<device path>
    This is the device that is going to be used as backend and contains the
    encrypted data.  You can specify it as a path like /dev/xxx or a device
    number <major>:<minor>.

<offset>
    Starting sector within the device where the encrypted data begins.

<#opt_params>
    Number of optional parameters.  If there are no optional parameters,
    the optional parameters section can be skipped or #opt_params can be zero.
    Otherwise #opt_params is the number of following arguments.

    Example of optional parameters section::

        3 allow_discards same_cpu_crypt submit_from_crypt_cpus

allow_discards
    Block discard requests (a.k.a. TRIM) are passed through the crypt device.
    The default is to ignore discard requests.

    WARNING: Assess the specific security risks carefully before enabling this
    option.  For example, allowing discards on encrypted devices may lead to
    the leak of information about the ciphertext device (filesystem type,
    used space etc.) if the discarded blocks can be located easily on the
    device later.

same_cpu_crypt
    Perform encryption using the same cpu that IO was submitted on.
    The default is to use an unbound workqueue so that encryption work
    is automatically balanced between available CPUs.

submit_from_crypt_cpus
    Disable offloading writes to a separate thread after encryption.
    There are some situations where offloading write bios from the
    encryption threads to a single thread degrades performance
    significantly.  The default is to offload write bios to the same
    thread because it benefits CFQ to have writes submitted using the
    same context.

integrity:<bytes>:<type>
    The device requires additional <bytes> metadata per-sector stored
    in per-bio integrity structure.  This metadata must be provided
    by underlying dm-integrity target.

    The <type> can be "none" if metadata is used only for persistent IV.

    For Authenticated Encryption with Additional Data (AEAD)
    the <type> is "aead".  An AEAD mode additionally calculates and verifies
    integrity for the encrypted device.  The additional space is then
    used for storing authentication tag (and persistent IV if needed).

sector_size:<bytes>
    Use <bytes> as the encryption unit instead of 512-byte sectors.
    This option can be in range 512 - 4096 bytes and must be a power of two.
    Virtual device will announce this size as a minimal IO and logical sector.

iv_large_sectors
    IV generators will use sector number counted in <sector_size> units
    instead of default 512-byte sectors.

    For example, if <sector_size> is 4096 bytes, plain64 IV for the second
    sector will be 8 (without flag) and 1 if iv_large_sectors is present.
    The <iv_offset> must be a multiple of <sector_size> (in 512-byte units)
    if this flag is specified.

Example scripts
===============
LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
encryption with dm-crypt using the 'cryptsetup' utility, see
https://gitlab.com/cryptsetup/cryptsetup

::

    #!/bin/sh
    # Create a crypt device using dmsetup
    dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"

::

    #!/bin/sh
    # Create a crypt device using dmsetup when encryption key is stored in keyring service
    dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"

::

    #!/bin/sh
    # Create a crypt device using cryptsetup and LUKS header with default cipher
    cryptsetup luksFormat $1
    cryptsetup luksOpen $1 crypt1
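
An illustrative variant with optional parameters, enabling discards and a
4096-byte encryption sector (the key and device are placeholders, and the
device size must be a multiple of the encryption sector size)::

    #!/bin/sh
    # Create a crypt device with discards enabled and 4096-byte encryption sectors
    dmsetup create crypt3 --table "0 `blockdev --getsz $1` crypt aes-xts-plain64 babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0 $1 0 2 allow_discards sector_size:4096"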

Documentation/admin-guide/device-mapper/dm-dust.txt (new file, 272 lines)

dm-dust
=======

This target emulates the behavior of bad sectors at arbitrary
locations, and the ability to enable the emulation of the failures
at an arbitrary time.

This target behaves similarly to a linear target.  At a given time,
the user can send a message to the target to start failing read
requests on specific blocks (to emulate the behavior of a hard disk
drive with bad sectors).

When the failure behavior is enabled (i.e.: when the output of
"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
in the "bad block list" will fail with EIO ("Input/output error").

Writes of blocks in the "bad block list" will result in the following:

1. Remove the block from the "bad block list".
2. Successfully complete the write.

This emulates the "remapped sector" behavior of a drive with bad
sectors.

Normally, a drive that is encountering bad sectors will most likely
encounter more bad sectors, at an unknown time or location.
With dm-dust, the user can use the "addbadblock" and "removebadblock"
messages to add arbitrary bad blocks at new locations, and the
"enable" and "disable" messages to modulate the state of whether the
configured "bad blocks" will be treated as bad, or bypassed.
This allows the pre-writing of test data and metadata prior to
simulating a "failure" event where bad sectors start to appear.

Table parameters:
-----------------
<device_path> <offset> <blksz>

Mandatory parameters:
    <device_path>: path to the block device.
    <offset>: offset to data area from start of device_path
    <blksz>: block size in bytes
             (minimum 512, maximum 1073741824, must be a power of 2)

Usage instructions:
-------------------

First, find the size (in 512-byte sectors) of the device to be used:

$ sudo blockdev --getsz /dev/vdb1
33552384

Create the dm-dust device:
(For a device with a block size of 512 bytes)
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512'

(For a device with a block size of 4096 bytes)
$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096'

Check the status of the read behavior ("bypass" indicates that all I/O
will be passed through to the underlying device):
$ sudo dmsetup status dust1
0 33552384 dust 252:17 bypass

$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
128+0 records in
128+0 records out

$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
128+0 records in
128+0 records out

Adding and removing bad blocks:
-------------------------------

At any time (i.e.: whether the device has the "bad block" emulation
enabled or disabled), bad blocks may be added or removed from the
device via the "addbadblock" and "removebadblock" messages:

$ sudo dmsetup message dust1 0 addbadblock 60
kernel: device-mapper: dust: badblock added at block 60

$ sudo dmsetup message dust1 0 addbadblock 67
kernel: device-mapper: dust: badblock added at block 67

$ sudo dmsetup message dust1 0 addbadblock 72
kernel: device-mapper: dust: badblock added at block 72

These bad blocks will be stored in the "bad block list".
While the device is in "bypass" mode, reads and writes will succeed:

$ sudo dmsetup status dust1
0 33552384 dust 252:17 bypass

Enabling block read failures:
-----------------------------

To enable the "fail read on bad block" behavior, send the "enable" message:

$ sudo dmsetup message dust1 0 enable
kernel: device-mapper: dust: enabling read failures on bad sectors

$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block

With the device in "fail read on bad block" mode, attempting to read a
block will encounter an "Input/output error":

$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
dd: error reading '/dev/mapper/dust1': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.00040651 s, 0.0 kB/s

...and writing to the bad blocks will remove the blocks from the list,
therefore emulating the "remap" behavior of hard disk drives:

$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
128+0 records in
128+0 records out

kernel: device-mapper: dust: block 60 removed from badblocklist by write
kernel: device-mapper: dust: block 67 removed from badblocklist by write
kernel: device-mapper: dust: block 72 removed from badblocklist by write
kernel: device-mapper: dust: block 87 removed from badblocklist by write

Bad block add/remove error handling:
------------------------------------

Attempting to add a bad block that already exists in the list will
result in an "Invalid argument" error, as well as a helpful message:

$ sudo dmsetup message dust1 0 addbadblock 88
device-mapper: message ioctl on dust1 failed: Invalid argument
kernel: device-mapper: dust: block 88 already in badblocklist

Attempting to remove a bad block that doesn't exist in the list will
result in an "Invalid argument" error, as well as a helpful message:

$ sudo dmsetup message dust1 0 removebadblock 87
device-mapper: message ioctl on dust1 failed: Invalid argument
kernel: device-mapper: dust: block 87 not found in badblocklist

Counting the number of bad blocks in the bad block list:
--------------------------------------------------------

To count the number of bad blocks configured in the device, run the
following message command:

$ sudo dmsetup message dust1 0 countbadblocks

A message will print with the number of bad blocks currently
configured on the device:

kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found

Querying for specific bad blocks:
---------------------------------

To find out if a specific block is in the bad block list, run the
following message command:

$ sudo dmsetup message dust1 0 queryblock 72

The following message will print if the block is in the list:
device-mapper: dust: queryblock: block 72 found in badblocklist

The following message will print if the block is not in the list:
device-mapper: dust: queryblock: block 72 not found in badblocklist

The "queryblock" message command will work in both the "enabled"
and "disabled" modes, allowing the verification of whether a block
will be treated as "bad" without having to issue I/O to the device,
or having to "enable" the bad block emulation.

Clearing the bad block list:
----------------------------

To clear the bad block list (without needing to individually run
a "removebadblock" message command for every block), run the
following message command:

$ sudo dmsetup message dust1 0 clearbadblocks

After clearing the bad block list, the following message will appear:

kernel: device-mapper: dust: clearbadblocks: badblocks cleared

If there were no bad blocks to clear, the following message will
appear:

kernel: device-mapper: dust: clearbadblocks: no badblocks found

Message commands list:
----------------------

Below is a list of the messages that can be sent to a dust device:

Operations on blocks (requires a <blknum> argument):

addbadblock <blknum>
queryblock <blknum>
removebadblock <blknum>

...where <blknum> is a block number within range of the device
(corresponding to the block size of the device.)

Single argument message commands:

countbadblocks
clearbadblocks
disable
enable
quiet
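
The block-level messages can also be scripted; for example, an
illustrative loop that marks a contiguous range of blocks bad and then
counts them (block numbers and device name are arbitrary):

$ for blk in $(seq 100 199); do sudo dmsetup message dust1 0 addbadblock "$blk"; done

$ sudo dmsetup message dust1 0 countbadblocks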

Device removal:
---------------

When finished, remove the device via the "dmsetup remove" command:

$ sudo dmsetup remove dust1

Quiet mode:
-----------

On test runs with many bad blocks, it may be desirable to avoid
excessive logging (from bad blocks added, removed, or "remapped").
This can be done by enabling "quiet mode" via the following message:

$ sudo dmsetup message dust1 0 quiet

This will suppress log messages from add / remove / removed by write
operations.  Log messages from "countbadblocks" or "queryblock"
message commands will still print in quiet mode.

The status of quiet mode can be seen by running "dmsetup status":

$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block quiet

To disable quiet mode, send the "quiet" message again:

$ sudo dmsetup message dust1 0 quiet

$ sudo dmsetup status dust1
0 33552384 dust 252:17 fail_read_on_bad_block verbose

(The presence of "verbose" indicates normal logging.)

"Why not...?"
-------------

scsi_debug has a "medium error" mode that can fail reads on one
specified sector (sector 0x1234, hardcoded in the source code), but
it uses RAM for the persistent storage, which drastically decreases
the potential device size.

dm-flakey fails all I/O from all block locations at a specified time
interval, rather than at a given point in time.

When a bad sector occurs on a hard disk drive, reads to that sector
are failed by the device, usually resulting in an error code of EIO
("I/O error") or ENODATA ("No data available").  However, a write to
the sector may succeed, and result in the sector becoming readable
after the device controller no longer experiences errors reading the
sector (or after a reallocation of the sector).  However, there may
be bad sectors that occur on the device in the future, in a different,
unpredictable location.

This target seeks to provide a device that can exhibit the behavior
of a bad sector at a known sector location, at a known time, based
on a large storage device (at least tens of gigabytes, not occupying
system memory).

Documentation/admin-guide/device-mapper/dm-flakey.rst (new file, 74 lines)

=========
dm-flakey
=========

This target is the same as the linear target except that it exhibits
unreliable behaviour periodically.  It's been found useful in simulating
failing devices for testing purposes.

Starting from the time the table is loaded, the device is available for
<up interval> seconds, then exhibits unreliable behaviour for <down
interval> seconds, and then this cycle repeats.

Also, consider using this in combination with the dm-delay target,
which can delay reads and writes and/or send them to different
underlying devices.

Table parameters
----------------

::

  <dev path> <offset> <up interval> <down interval> \
    [<num_features> [<feature arguments>]]

Mandatory parameters:

    <dev path>:
        Full pathname to the underlying block-device, or a
        "major:minor" device-number.
    <offset>:
        Starting sector within the device.
    <up interval>:
        Number of seconds device is available.
    <down interval>:
        Number of seconds device returns errors.

Optional feature parameters:

  If no feature parameters are present, during the periods of
  unreliability, all I/O returns errors.

  drop_writes:
      All write I/O is silently ignored.
      Read I/O is handled correctly.

  error_writes:
      All write I/O is failed with an error signalled.
      Read I/O is handled correctly.

  corrupt_bio_byte <Nth_byte> <direction> <value> <flags>:
      During <down interval>, replace <Nth_byte> of the data of
      each matching bio with <value>.

      <Nth_byte>:
          The offset of the byte to replace.
          Counting starts at 1, to replace the first byte.
      <direction>:
          Either 'r' to corrupt reads or 'w' to corrupt writes.
          'w' is incompatible with drop_writes.
      <value>:
          The value (from 0-255) to write.
      <flags>:
          Perform the replacement only if bio->bi_opf has all the
          selected flags set.

Examples:

Replaces the 32nd byte of READ bios with the value 1::

  corrupt_bio_byte 32 r 1 0

Replaces the 224th byte of REQ_META (=32) bios with the value 0::

  corrupt_bio_byte 224 w 0 32
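
For instance, an illustrative table that keeps /dev/sdb1 healthy for 59
seconds out of every 60 and errors all writes during the 1-second down
interval (device and mapped-device names are placeholders)::

  echo "0 `blockdev --getsz /dev/sdb1` flakey /dev/sdb1 0 59 1 1 error_writes" | dmsetup create flaky1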

Documentation/admin-guide/device-mapper/dm-init.rst (new file, 125 lines)

================================
Early creation of mapped devices
================================

It is possible to configure a device-mapper device to act as the root device for
your system in two ways.

The first is to build an initial ramdisk which boots to a minimal userspace
which configures the device, then pivot_root(8) in to it.

The second is to create one or more device-mappers using the module parameter
"dm-mod.create=" through the kernel boot command line argument.

The format is specified as a string of data separated by commas and optionally
semi-colons, where:

 - a comma is used to separate fields like name, uuid, flags and table
   (specifies one device)
 - a semi-colon is used to separate devices.

So the format will look like this::

 dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]

Where::

	<name>          ::= The device name.
	<uuid>          ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
	<minor>         ::= The device minor number | ""
	<flags>         ::= "ro" | "rw"
	<table>         ::= <start_sector> <num_sectors> <target_type> <target_args>
	<target_type>   ::= "verity" | "linear" | ... (see list below)

The dm line should be equivalent to the one used by the dmsetup tool with the
`--concise` argument.
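
For instance, the "lroot" boot-line example shown below corresponds to
this (illustrative) dmsetup invocation::

  dmsetup create --concise "lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0"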

Target types
============

Not all target types are available as there are serious risks in allowing
activation of certain DM targets without first using userspace tools to check
the validity of associated metadata.

======================= =======================================================
`cache`                 constrained, userspace should verify cache device
`crypt`                 allowed
`delay`                 allowed
`era`                   constrained, userspace should verify metadata device
`flakey`                constrained, meant for test
`linear`                allowed
`log-writes`            constrained, userspace should verify metadata device
`mirror`                constrained, userspace should verify main/mirror device
`raid`                  constrained, userspace should verify metadata device
`snapshot`              constrained, userspace should verify src/dst device
`snapshot-origin`       allowed
`snapshot-merge`        constrained, userspace should verify src/dst device
`striped`               allowed
`switch`                constrained, userspace should verify dev path
`thin`                  constrained, requires dm target message from userspace
`thin-pool`             constrained, requires dm target message from userspace
`verity`                allowed
`writecache`            constrained, userspace should verify cache device
`zero`                  constrained, not meant for rootfs
======================= =======================================================

If the target is not listed above, it is constrained by default (not tested).

Examples
========

An example of booting to a linear array made up of user-mode linux block
devices::

  dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0

This will boot to a rw dm-linear target of 8192 sectors split across two block
devices identified by their major:minor numbers.  After boot, udev will rename
this target to /dev/mapper/lroot (depending on the rules).  No uuid was assigned.

An example of multiple device-mappers, with the dm-mod.create="..." contents
shown here split on multiple lines for readability::

  dm-linear,,1,rw,
    0 32768 linear 8:1 0,
    32768 1024000 linear 8:2 0;
  dm-verity,,3,ro,
    0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256
    ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42
    5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51

Other examples (per target):

"crypt"::

  dm-crypt,,8,ro,
    0 1048576 crypt aes-xts-plain64
    babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
    /dev/sda 0 1 allow_discards

"delay"::

  dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500

"linear"::

  dm-linear,,,rw,
    0 32768 linear /dev/sda1 0,
    32768 1024000 linear /dev/sda2 0,
    1056768 204800 linear /dev/sda3 0,
    1261568 512000 linear /dev/sda4 0

"snapshot-origin"::

  dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2

"striped"::

  dm-striped,,4,ro,0 1638400 striped 4 4096
  /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0

"verity"::

  dm-verity,,4,ro,
    0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
    fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
    51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584

Documentation/admin-guide/device-mapper/dm-integrity.rst (new file, 259 lines)

============
dm-integrity
============

The dm-integrity target emulates a block device that has additional
per-sector tags that can be used for storing integrity information.

A general problem with storing integrity tags with every sector is that
writing the sector and the integrity tag must be atomic - i.e. in case of
crash, either both sector and integrity tag or none of them is written.

To guarantee write atomicity, the dm-integrity target uses a journal: it
writes sector data and integrity tags into a journal, commits the journal
and then copies the data and integrity tags to their respective location.

The dm-integrity target can be used with the dm-crypt target - in this
situation the dm-crypt target creates the integrity data and passes them
to the dm-integrity target via bio_integrity_payload attached to the bio.
In this mode, the dm-crypt and dm-integrity targets provide authenticated
disk encryption - if the attacker modifies the encrypted device, an I/O
error is returned instead of random data.

The dm-integrity target can also be used as a standalone target, in this
mode it calculates and verifies the integrity tag internally.  In this
mode, the dm-integrity target can be used to detect silent data
corruption on the disk or in the I/O path.

There's an alternate mode of operation where dm-integrity uses a bitmap
instead of a journal.  If a bit in the bitmap is 1, the corresponding
region's data and integrity tags are not synchronized - if the machine
crashes, the unsynchronized regions will be recalculated.  The bitmap mode
is faster than the journal mode, because we don't have to write the data
twice, but it is also less reliable, because if data corruption happens
when the machine crashes, it may not be detected.

When loading the target for the first time, the kernel driver will format
the device.  But it will only format the device if the superblock contains
zeroes.  If the superblock is neither valid nor zeroed, the dm-integrity
target can't be loaded.

To use the target for the first time (a sketch of these steps follows the
list):

1. overwrite the superblock with zeroes
2. load the dm-integrity target with one-sector size, the kernel driver
   will format the device
3. unload the dm-integrity target
4. read the "provided_data_sectors" value from the superblock
5. load the dm-integrity target with the target size
   "provided_data_sectors"
6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
   with the size "provided_data_sectors"
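
A minimal sketch of these steps for a standalone device, where /dev/sdb
and the device name "integ" are placeholders and the awk field assumes
provided_data_sectors is the second target-specific value printed by
"dmsetup status" (verify against your kernel)::

	#!/bin/sh
	DEV=/dev/sdb
	# 1. overwrite the superblock (the first 4kiB of the device) with zeroes
	dd if=/dev/zero of=$DEV bs=4096 count=1 oflag=direct
	# 2. load a one-sector target so the kernel driver formats the device
	dmsetup create integ --table "0 1 integrity $DEV 0 - J 1 internal_hash:crc32"
	# 4. read provided_data_sectors while the target is still loaded ...
	SECTORS=$(dmsetup status integ | awk '{print $5}')
	# 3. ... then unload the temporary target
	dmsetup remove integ
	# 5. load the target again with the real size
	dmsetup create integ --table "0 $SECTORS integrity $DEV 0 - J 1 internal_hash:crc32"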


Target arguments:

1. the underlying block device

2. the number of reserved sectors at the beginning of the device - the
   dm-integrity target won't read or write these sectors

3. the size of the integrity tag (if "-" is used, the size is taken from
   the internal-hash algorithm)

4. mode:

	D - direct writes (without journal)
		in this mode, journaling is
		not used and data sectors and integrity tags are written
		separately.  In case of crash, it is possible that the data
		and integrity tag doesn't match.
	J - journaled writes
		data and integrity tags are written to the
		journal and atomicity is guaranteed.  In case of crash,
		either both data and tag or none of them are written.  The
		journaled mode degrades write throughput twice because the
		data have to be written twice.
	B - bitmap mode - data and metadata are written without any
		synchronization, the driver maintains a bitmap of dirty
		regions where data and metadata don't match.  This mode can
		only be used with internal hash.
	R - recovery mode - in this mode, journal is not replayed,
		checksums are not checked and writes to the device are not
		allowed.  This mode is useful for data recovery if the
		device cannot be activated in any of the other standard
		modes.

5. the number of additional arguments
|
||||
|
||||
Additional arguments:
|
||||
|
||||
journal_sectors:number
|
||||
The size of journal, this argument is used only if formatting the
|
||||
device. If the device is already formatted, the value from the
|
||||
superblock is used.
|
||||
|
||||
interleave_sectors:number
|
||||
The number of interleaved sectors. This values is rounded down to
|
||||
a power of two. If the device is already formatted, the value from
|
||||
the superblock is used.
|
||||
|
||||
meta_device:device
|
||||
Don't interleave the data and metadata on on device. Use a
|
||||
separate device for metadata.
|
||||
|
||||
buffer_sectors:number
|
||||
The number of sectors in one buffer. The value is rounded down to
|
||||
a power of two.
|
||||
|
||||
The tag area is accessed using buffers, the buffer size is
|
||||
configurable. The large buffer size means that the I/O size will
|
||||
be larger, but there could be less I/Os issued.
|
||||
|
||||
journal_watermark:number
|
||||
The journal watermark in percents. When the size of the journal
|
||||
exceeds this watermark, the thread that flushes the journal will
|
||||
be started.
|
||||
|
||||
commit_time:number
|
||||
Commit time in milliseconds. When this time passes, the journal is
|
||||
written. The journal is also written immediatelly if the FLUSH
|
||||
request is received.
|
||||
|
||||
internal_hash:algorithm(:key) (the key is optional)
|
||||
Use internal hash or crc.
|
||||
When this argument is used, the dm-integrity target won't accept
|
||||
integrity tags from the upper target, but it will automatically
|
||||
generate and verify the integrity tags.
|
||||
|
||||
You can use a crc algorithm (such as crc32), then integrity target
|
||||
will protect the data against accidental corruption.
|
||||
You can also use a hmac algorithm (for example
|
||||
"hmac(sha256):0123456789abcdef"), in this mode it will provide
|
||||
cryptographic authentication of the data without encryption.
|
||||
|
||||
When this argument is not used, the integrity tags are accepted
|
||||
from an upper layer target, such as dm-crypt. The upper layer
|
||||
target should check the validity of the integrity tags.
|
||||
|
||||
recalculate
|
||||
Recalculate the integrity tags automatically. It is only valid
|
||||
when using internal hash.
|
||||
|
||||
journal_crypt:algorithm(:key) (the key is optional)
|
||||
Encrypt the journal using given algorithm to make sure that the
|
||||
attacker can't read the journal. You can use a block cipher here
|
||||
(such as "cbc(aes)") or a stream cipher (for example "chacha20",
|
||||
"salsa20", "ctr(aes)" or "ecb(arc4)").
|
||||
|
||||
The journal contains history of last writes to the block device,
|
||||
an attacker reading the journal could see the last sector nubmers
|
||||
that were written. From the sector numbers, the attacker can infer
|
||||
the size of files that were written. To protect against this
|
||||
situation, you can encrypt the journal.
|
||||
|
||||
journal_mac:algorithm(:key) (the key is optional)
|
||||
Protect sector numbers in the journal from accidental or malicious
|
||||
modification. To protect against accidental modification, use a
|
||||
crc algorithm, to protect against malicious modification, use a
|
||||
hmac algorithm with a key.
|
||||
|
||||
This option is not needed when using internal-hash because in this
|
||||
mode, the integrity of journal entries is checked when replaying
|
||||
the journal. Thus, modified sector number would be detected at
|
||||
this stage.
|
||||
|
||||
block_size:number
|
||||
The size of a data block in bytes. The larger the block size the
|
||||
less overhead there is for per-block integrity metadata.
|
||||
Supported values are 512, 1024, 2048 and 4096 bytes. If not
|
||||
specified the default block size is 512 bytes.
|
||||
|
||||
sectors_per_bit:number
|
||||
In the bitmap mode, this parameter specifies the number of
|
||||
512-byte sectors that correspond to one bitmap bit.
|
||||
|
||||
bitmap_flush_interval:number
|
||||
The bitmap flush interval in milliseconds. The metadata buffers
|
||||
are synchronized when this interval expires.
|
||||
|
||||
|
||||
The journal mode (D/J), buffer_sectors, journal_watermark and commit_time can
|
||||
be changed when reloading the target (load an inactive table and swap the
|
||||
tables with suspend and resume). The other arguments should not be changed
|
||||
when reloading the target because the layout of disk data depends on them
|
||||
and the reloaded target would be non-functional.
|
||||
|
||||
|
||||
The layout of the formatted block device:
|
||||
|
||||
* reserved sectors
|
||||
(they are not used by this target, they can be used for
|
||||
storing LUKS metadata or for other purposes), the size of the reserved
|
||||
area is specified in the target arguments
|
||||
|
||||
* superblock (4kiB)
|
||||
* magic string - identifies that the device was formatted
|
||||
* version
|
||||
* log2(interleave sectors)
|
||||
* integrity tag size
|
||||
* the number of journal sections
|
||||
* provided data sectors - the number of sectors that this target
|
||||
provides (i.e. the size of the device minus the size of all
|
||||
metadata and padding). The user of this target should not send
|
||||
bios that access data beyond the "provided data sectors" limit.
|
||||
* flags
|
||||
SB_FLAG_HAVE_JOURNAL_MAC
|
||||
- a flag is set if journal_mac is used
|
||||
SB_FLAG_RECALCULATING
|
||||
- recalculating is in progress
|
||||
SB_FLAG_DIRTY_BITMAP
|
||||
- journal area contains the bitmap of dirty
|
||||
blocks
|
||||
* log2(sectors per block)
|
||||
* a position where recalculating finished
|
||||
* journal
|
||||
The journal is divided into sections, each section contains:
|
||||
|
||||
* metadata area (4kiB), it contains journal entries
|
||||
|
||||
- every journal entry contains:
|
||||
|
||||
* logical sector (specifies where the data and tag should
|
||||
be written)
|
||||
* last 8 bytes of data
|
||||
* integrity tag (the size is specified in the superblock)
|
||||
|
||||
- every metadata sector ends with
|
||||
|
||||
* mac (8 bytes); all the macs in 8 metadata sectors form a
|
||||
64-byte value. It is used to store the hmac of sector
|
||||
numbers in the journal section, to protect against the
|
||||
possibility that an attacker tampers with sector
|
||||
numbers in the journal.
|
||||
* commit id
|
||||
|
||||
* data area (the size is variable; it depends on how many journal
|
||||
entries fit into the metadata area)
|
||||
|
||||
- every sector in the data area contains:
|
||||
|
||||
* data (504 bytes of data, the last 8 bytes are stored in
|
||||
the journal entry)
|
||||
* commit id
|
||||
|
||||
To test if the whole journal section was written correctly, every
|
||||
512-byte sector of the journal ends with 8-byte commit id. If the
|
||||
commit id matches on all sectors in a journal section, then it is
|
||||
assumed that the section was written correctly. If the commit id
|
||||
doesn't match, the section was written partially and it should not
|
||||
be replayed.
|
||||
|
||||
* one or more runs of interleaved tags and data.
|
||||
Each run contains:
|
||||
|
||||
* tag area - it contains integrity tags. There is one tag for each
|
||||
sector in the data area
|
||||
* data area - it contains data sectors. The number of data sectors
|
||||
in one run must be a power of two. log2 of this value is stored
|
||||
in the superblock.
|
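Purely as an illustration of the layout just described, the superblock fields can be pictured as a small C structure. This is a sketch based only on the list above; the field names, types and ordering are assumptions, not the kernel's actual on-disk definitions::

    /* Illustrative sketch of the 4kiB superblock described above. */
    struct integrity_sb_example {
            __u8  magic[8];                 /* identifies a formatted device */
            __u8  version;
            __s8  log2_interleave_sectors;
            __u16 integrity_tag_size;       /* bytes of tag per data block */
            __u32 journal_sections;
            __u64 provided_data_sectors;    /* sectors usable by the upper layer */
            __u32 flags;                    /* SB_FLAG_HAVE_JOURNAL_MAC, ... */
            __u8  log2_sectors_per_block;
            __u64 recalc_sector;            /* where recalculating finished */
            /* the remainder of the 4kiB block is unused/padding */
    };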
75
Documentation/admin-guide/device-mapper/dm-io.rst
Normal file
|
@ -0,0 +1,75 @@
|
|||
=====
|
||||
dm-io
|
||||
=====
|
||||
|
||||
Dm-io provides synchronous and asynchronous I/O services. There are three
|
||||
types of I/O services available, and each type has a sync and an async
|
||||
version.
|
||||
|
||||
The user must set up an io_region structure to describe the desired location
|
||||
of the I/O. Each io_region indicates a block-device along with the starting
|
||||
sector and size of the region::
|
||||
|
||||
struct io_region {
|
||||
struct block_device *bdev;
|
||||
sector_t sector;
|
||||
sector_t count;
|
||||
};
|
||||
|
||||
Dm-io can read from one io_region or write to one or more io_regions. Writes
|
||||
to multiple regions are specified by an array of io_region structures.
|
||||
|
||||
The first I/O service type takes a list of memory pages as the data buffer for
|
||||
the I/O, along with an offset into the first page::
|
||||
|
||||
struct page_list {
|
||||
struct page_list *next;
|
||||
struct page *page;
|
||||
};
|
||||
|
||||
int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw,
|
||||
struct page_list *pl, unsigned int offset,
|
||||
unsigned long *error_bits);
|
||||
int dm_io_async(unsigned int num_regions, struct io_region *where, int rw,
|
||||
struct page_list *pl, unsigned int offset,
|
||||
io_notify_fn fn, void *context);
|
||||
|
||||
The second I/O service type takes an array of bio vectors as the data buffer
|
||||
for the I/O. This service can be handy if the caller has a pre-assembled bio,
|
||||
but wants to direct different portions of the bio to different devices::
|
||||
|
||||
int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
|
||||
int rw, struct bio_vec *bvec,
|
||||
unsigned long *error_bits);
|
||||
int dm_io_async_bvec(unsigned int num_regions, struct io_region *where,
|
||||
int rw, struct bio_vec *bvec,
|
||||
io_notify_fn fn, void *context);
|
||||
|
||||
The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
|
||||
data buffer for the I/O. This service can be handy if the caller needs to do
|
||||
I/O to a large region but doesn't want to allocate a large number of individual
|
||||
memory pages::
|
||||
|
||||
int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
|
||||
void *data, unsigned long *error_bits);
|
||||
int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw,
|
||||
void *data, io_notify_fn fn, void *context);
|
||||
|
||||
Callers of the asynchronous I/O services must include the name of a completion
|
||||
callback routine and a pointer to some context data for the I/O::
|
||||
|
||||
typedef void (*io_notify_fn)(unsigned long error, void *context);
|
||||
|
||||
The "error" parameter in this callback, as well as the `*error` parameter in
|
||||
all of the synchronous versions, is a bitset (instead of a simple error value).
|
||||
In the case of a write-I/O to multiple regions, this bitset allows dm-io to
|
||||
indicate success or failure on each individual region.
|
||||
|
||||
Before using any of the dm-io services, the user should call dm_io_get()
|
||||
and specify the number of pages they expect to perform I/O on concurrently.
|
||||
Dm-io will attempt to resize its mempool to make sure enough pages are
|
||||
always available in order to avoid unnecessary waiting while performing I/O.
|
||||
|
||||
When the user is finished using the dm-io services, they should call
|
||||
dm_io_put() and specify the same number of pages that were given on the
|
||||
dm_io_get() call.
|
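Putting the pieces above together, a minimal usage sketch of the third (vmalloc'd buffer) service type might look like the following. The function, buffer size and use of READ as the rw flag are illustrative assumptions, and error handling is simplified::

    /* Read 8 sectors (one 4k page) from the start of 'bdev' into a
     * vmalloc'd buffer using dm_io_sync_vm(). */
    static int example_read(struct block_device *bdev)
    {
            struct io_region where = {
                    .bdev   = bdev,
                    .sector = 0,
                    .count  = 8,            /* 8 x 512-byte sectors */
            };
            unsigned long error_bits = 0;
            void *data = vmalloc(8 << 9);
            int r;

            if (!data)
                    return -ENOMEM;

            dm_io_get(1);                   /* expect I/O on one page at a time */
            r = dm_io_sync_vm(1, &where, READ, data, &error_bits);
            dm_io_put(1);

            if (!r && error_bits)           /* per-region failures are in the bitset */
                    r = -EIO;

            vfree(data);
            return r;
    }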
57
Documentation/admin-guide/device-mapper/dm-log.rst
Normal file
|
@ -0,0 +1,57 @@
|
|||
=====================
|
||||
Device-Mapper Logging
|
||||
=====================
|
||||
The device-mapper logging code is used by some of the device-mapper
|
||||
RAID targets to track regions of the disk that are not consistent.
|
||||
A region (or portion of the address space) of the disk may be
|
||||
inconsistent because a RAID stripe is currently being operated on or
|
||||
a machine died while the region was being altered. In the case of
|
||||
mirrors, a region would be considered dirty/inconsistent while you
|
||||
are writing to it because the writes need to be replicated for all
|
||||
the legs of the mirror and may not reach the legs at the same time.
|
||||
Once all writes are complete, the region is considered clean again.
|
||||
|
||||
There is a generic logging interface that the device-mapper RAID
|
||||
implementations use to perform logging operations (see
|
||||
dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
|
||||
logging implementations are available and provide different
|
||||
capabilities. The list includes:
|
||||
|
||||
============== ==============================================================
|
||||
Type Files
|
||||
============== ==============================================================
|
||||
disk drivers/md/dm-log.c
|
||||
core drivers/md/dm-log.c
|
||||
userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
|
||||
============== ==============================================================
|
||||
|
||||
The "disk" log type
|
||||
-------------------
|
||||
This log implementation commits the log state to disk. This way, the
|
||||
logging state survives reboots/crashes.
|
||||
|
||||
The "core" log type
|
||||
-------------------
|
||||
This log implementation keeps the log state in memory. The log state
|
||||
will not survive a reboot or crash, but there may be a small boost in
|
||||
performance. This method can also be used if no storage device is
|
||||
available for storing log state.
|
||||
|
||||
The "userspace" log type
|
||||
------------------------
|
||||
This log type simply provides a way to export the log API to userspace,
|
||||
so log implementations can be done there. This is done by forwarding most
|
||||
logging requests to userspace, where a daemon receives and processes the
|
||||
request.
|
||||
|
||||
The structures used for communication between kernel and userspace are
|
||||
located in include/linux/dm-log-userspace.h. Due to the frequency,
|
||||
diversity, and 2-way communication nature of the exchanges between
|
||||
kernel and userspace, 'connector' is used as the interface for
|
||||
communication.
|
||||
|
||||
There are currently two userspace log implementations that leverage this
|
||||
framework - "clustered-disk" and "clustered-core". These implementations
|
||||
provide a cluster-coherent log for shared-storage. Device-mapper mirroring
|
||||
can be used in a shared-storage environment when the cluster log implementations
|
||||
are employed.
|
48
Documentation/admin-guide/device-mapper/dm-queue-length.rst
Normal file
|
@ -0,0 +1,48 @@
|
|||
===============
|
||||
dm-queue-length
|
||||
===============
|
||||
|
||||
dm-queue-length is a path selector module for device-mapper targets,
|
||||
which selects a path with the least number of in-flight I/Os.
|
||||
The path selector name is 'queue-length'.
|
||||
|
||||
Table parameters for each path: [<repeat_count>]
|
||||
|
||||
::
|
||||
|
||||
<repeat_count>: The number of I/Os to dispatch using the selected
|
||||
path before switching to the next path.
|
||||
If not given, internal default is used. To check
|
||||
the default value, see the activated table.
|
||||
|
||||
Status for each path: <status> <fail-count> <in-flight>
|
||||
|
||||
::
|
||||
|
||||
<status>: 'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>: The number of path failures.
|
||||
<in-flight>: The number of in-flight I/Os on the path.
|
||||
|
||||
|
||||
Algorithm
|
||||
=========
|
||||
|
||||
dm-queue-length increments/decrements 'in-flight' when an I/O is
|
||||
dispatched/completed respectively.
|
||||
dm-queue-length selects a path with the minimum 'in-flight'.
|
||||
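In other words, path selection is just a minimum search over per-path counters. The following is a simplified sketch of that idea only, not the module's actual code; it ignores repeat_count, failed paths and locking::

    struct example_path {
            struct example_path *next;
            unsigned in_flight;     /* ++ on dispatch, -- on completion */
    };

    /* Return the path with the fewest in-flight I/Os. */
    static struct example_path *select_path(struct example_path *paths)
    {
            struct example_path *p, *best = NULL;

            for (p = paths; p; p = p->next)
                    if (!best || p->in_flight < best->in_flight)
                            best = p;

            return best;
    }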
|
||||
|
||||
Examples
|
||||
========
|
||||
In case that 2 paths (sda and sdb) are used with repeat_count == 128.
|
||||
|
||||
::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
|
419
Documentation/admin-guide/device-mapper/dm-raid.rst
Normal file
|
@ -0,0 +1,419 @@
|
|||
=======
|
||||
dm-raid
|
||||
=======
|
||||
|
||||
The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
|
||||
It allows the MD RAID drivers to be accessed using a device-mapper
|
||||
interface.
|
||||
|
||||
|
||||
Mapping Table Interface
|
||||
-----------------------
|
||||
The target is named "raid" and it accepts the following parameters::
|
||||
|
||||
<raid_type> <#raid_params> <raid_params> \
|
||||
<#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
|
||||
|
||||
<raid_type>:
|
||||
|
||||
============= ===============================================================
|
||||
raid0 RAID0 striping (no resilience)
|
||||
raid1 RAID1 mirroring
|
||||
raid4 RAID4 with dedicated last parity disk
|
||||
raid5_n RAID5 with dedicated last parity disk supporting takeover
|
||||
Same as raid4
|
||||
|
||||
- Transitory layout
|
||||
raid5_la RAID5 left asymmetric
|
||||
|
||||
- rotating parity 0 with data continuation
|
||||
raid5_ra RAID5 right asymmetric
|
||||
|
||||
- rotating parity N with data continuation
|
||||
raid5_ls RAID5 left symmetric
|
||||
|
||||
- rotating parity 0 with data restart
|
||||
raid5_rs RAID5 right symmetric
|
||||
|
||||
- rotating parity N with data restart
|
||||
raid6_zr RAID6 zero restart
|
||||
|
||||
- rotating parity zero (left-to-right) with data restart
|
||||
raid6_nr RAID6 N restart
|
||||
|
||||
- rotating parity N (right-to-left) with data restart
|
||||
raid6_nc RAID6 N continue
|
||||
|
||||
- rotating parity N (right-to-left) with data continuation
|
||||
raid6_n_6 RAID6 with dedicated parity disks
|
||||
|
||||
- parity and Q-syndrome on the last 2 disks;
|
||||
layout for takeover from/to raid4/raid5_n
|
||||
raid6_la_6 Same as "raid5_la" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_la from/to raid6
|
||||
raid6_ra_6 Same as "raid5_ra" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_ra from/to raid6
|
||||
raid6_ls_6 Same as "raid5_ls" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_ls from/to raid6
|
||||
raid6_rs_6 Same as "raid5_rs" plus dedicated last Q-syndrome disk
|
||||
|
||||
- layout for takeover from raid5_rs from/to raid6
|
||||
raid10 Various RAID10 inspired algorithms chosen by additional params
|
||||
(see raid10_format and raid10_copies below)
|
||||
|
||||
- RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
|
||||
- RAID1E: Integrated Adjacent Stripe Mirroring
|
||||
- RAID1E: Integrated Offset Stripe Mirroring
|
||||
- and other similar RAID10 variants
|
||||
============= ===============================================================
|
||||
|
||||
Reference: Chapter 4 of
|
||||
http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
|
||||
|
||||
<#raid_params>: The number of parameters that follow.
|
||||
|
||||
<raid_params> consists of
|
||||
|
||||
Mandatory parameters:
|
||||
<chunk_size>:
|
||||
Chunk size in sectors. This parameter is often known as
|
||||
"stripe size". It is the only mandatory parameter and
|
||||
is placed first.
|
||||
|
||||
followed by optional parameters (in any order):
|
||||
[sync|nosync]
|
||||
Force or prevent RAID initialization.
|
||||
|
||||
[rebuild <idx>]
|
||||
Rebuild drive number 'idx' (first drive is 0).
|
||||
|
||||
[daemon_sleep <ms>]
|
||||
Interval between runs of the bitmap daemon that
|
||||
clears bits. A longer interval means less bitmap I/O but
|
||||
resyncing after a failure is likely to take longer.
|
||||
|
||||
[min_recovery_rate <kB/sec/disk>]
|
||||
Throttle RAID initialization
|
||||
[max_recovery_rate <kB/sec/disk>]
|
||||
Throttle RAID initialization
|
||||
[write_mostly <idx>]
|
||||
Mark drive index 'idx' write-mostly.
|
||||
[max_write_behind <sectors>]
|
||||
See '--write-behind=' (man mdadm)
|
||||
[stripe_cache <sectors>]
|
||||
Stripe cache size (RAID 4/5/6 only)
|
||||
[region_size <sectors>]
|
||||
The region_size multiplied by the number of regions is the
|
||||
logical size of the array. The bitmap records the device
|
||||
synchronisation state for each region.
|
||||
|
||||
[raid10_copies <# copies>], [raid10_format <near|far|offset>]
|
||||
These two options are used to alter the default layout of
|
||||
a RAID10 configuration. The number of copies can be
|
||||
specified, but the default is 2. There are also three
|
||||
variations to how the copies are laid down - the default
|
||||
is "near". Near copies are what most people think of with
|
||||
respect to mirroring. If these options are left unspecified,
|
||||
or 'raid10_copies 2' and/or 'raid10_format near' are given,
|
||||
then the layouts for 2, 3 and 4 devices are:
|
||||
|
||||
======== ========== ==============
|
||||
2 drives 3 drives 4 drives
|
||||
======== ========== ==============
|
||||
A1 A1 A1 A1 A2 A1 A1 A2 A2
|
||||
A2 A2 A2 A3 A3 A3 A3 A4 A4
|
||||
A3 A3 A4 A4 A5 A5 A5 A6 A6
|
||||
A4 A4 A5 A6 A6 A7 A7 A8 A8
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ========== ==============
|
||||
|
||||
The 2-device layout is equivalent to 2-way RAID1. The 4-device
|
||||
layout is what a traditional RAID10 would look like. The
|
||||
3-device layout is what might be called a 'RAID1E - Integrated
|
||||
Adjacent Stripe Mirroring'.
|
||||
|
||||
If 'raid10_copies 2' and 'raid10_format far', then the layouts
|
||||
for 2, 3 and 4 devices are:
|
||||
|
||||
======== ============ ===================
|
||||
2 drives 3 drives 4 drives
|
||||
======== ============ ===================
|
||||
A1 A2 A1 A2 A3 A1 A2 A3 A4
|
||||
A3 A4 A4 A5 A6 A5 A6 A7 A8
|
||||
A5 A6 A7 A8 A9 A9 A10 A11 A12
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
A2 A1 A3 A1 A2 A2 A1 A4 A3
|
||||
A4 A3 A6 A4 A5 A6 A5 A8 A7
|
||||
A6 A5 A9 A7 A8 A10 A9 A12 A11
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ============ ===================
|
||||
|
||||
If 'raid10_copies 2' and 'raid10_format offset', then the
|
||||
layouts for 2, 3 and 4 devices are:
|
||||
|
||||
======== ========== ================
|
||||
2 drives 3 drives 4 drives
|
||||
======== ========== ================
|
||||
A1 A2 A1 A2 A3 A1 A2 A3 A4
|
||||
A2 A1 A3 A1 A2 A2 A1 A4 A3
|
||||
A3 A4 A4 A5 A6 A5 A6 A7 A8
|
||||
A4 A3 A6 A4 A5 A6 A5 A8 A7
|
||||
A5 A6 A7 A8 A9 A9 A10 A11 A12
|
||||
A6 A5 A9 A7 A8 A10 A9 A12 A11
|
||||
.. .. .. .. .. .. .. .. ..
|
||||
======== ========== ================
|
||||
|
||||
Here we see layouts closely akin to 'RAID1E - Integrated
|
||||
Offset Stripe Mirroring'.
|
||||
|
||||
[delta_disks <N>]
|
||||
The delta_disks option value (-251 < N < +251) triggers
|
||||
device removal (negative value) or device addition (positive
|
||||
value) to any reshape supporting raid levels 4/5/6 and 10.
|
||||
RAID levels 4/5/6 allow for addition of devices (metadata
|
||||
and data device tuple), raid10_near and raid10_offset only
|
||||
allow for device addition. raid10_far does not support any
|
||||
reshaping at all.
|
||||
A minimum number of devices has to be kept to enforce resilience,
|
||||
which is 3 devices for raid4/5 and 4 devices for raid6.
|
||||
|
||||
[data_offset <sectors>]
|
||||
This option value defines the offset into each data device
|
||||
where the data starts. This is used to provide out-of-place
|
||||
reshaping space to avoid writing over data while
|
||||
changing the layout of stripes, hence an interruption/crash
|
||||
may happen at any time without the risk of losing data.
|
||||
E.g. when adding devices to an existing raid set during
|
||||
forward reshaping, the out-of-place space will be allocated
|
||||
at the beginning of each raid device. The kernel raid4/5/6/10
|
||||
MD personalities supporting such device addition will read the data from
|
||||
the existing first stripes (those with smaller number of stripes)
|
||||
starting at data_offset to fill up a new stripe with the larger
|
||||
number of stripes, calculate the redundancy blocks (CRC/Q-syndrome)
|
||||
and write that new stripe to offset 0. Same will be applied to all
|
||||
N-1 other new stripes. This out-of-place scheme is used to change
|
||||
the RAID type (i.e. the allocation algorithm) as well, e.g.
|
||||
changing from raid5_ls to raid5_n.
|
||||
|
||||
[journal_dev <dev>]
|
||||
This option adds a journal device to raid4/5/6 raid sets and
|
||||
uses it to close the 'write hole' caused by the non-atomic updates
|
||||
to the component devices which can cause data loss during recovery.
|
||||
The journal device is used as writethrough thus causing writes to
|
||||
be throttled versus non-journaled raid4/5/6 sets.
|
||||
Takeover/reshape is not possible with a raid4/5/6 journal device;
|
||||
it has to be deconfigured before requesting these.
|
||||
|
||||
[journal_mode <mode>]
|
||||
This option sets the caching mode on journaled raid4/5/6 raid sets
|
||||
(see 'journal_dev <dev>' above) to 'writethrough' or 'writeback'.
|
||||
If 'writeback' is selected the journal device has to be resilient
|
||||
and must not suffer from the 'write hole' problem itself (e.g. use
|
||||
raid1 or raid10) to avoid a single point of failure.
|
||||
|
||||
<#raid_devs>: The number of devices composing the array.
|
||||
Each device consists of two entries. The first is the device
|
||||
containing the metadata (if any); the second is the one containing the
|
||||
data. A maximum of 64 metadata/data device entries are supported
|
||||
up to target version 1.8.0.
|
||||
1.9.0 and later support up to 253, as enforced by the MD kernel runtime.
|
||||
|
||||
If a drive has failed or is missing at creation time, a '-' can be
|
||||
given for both the metadata and data drives for a given position.
|
||||
|
||||
|
||||
Example Tables
|
||||
--------------
|
||||
|
||||
::
|
||||
|
||||
# RAID4 - 4 data drives, 1 parity (no metadata devices)
|
||||
# No metadata devices specified to hold superblock/bitmap info
|
||||
# Chunk size of 1MiB
|
||||
# (Lines separated for easy reading)
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 1 2048 \
|
||||
5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
|
||||
|
||||
# RAID4 - 4 data drives, 1 parity (with metadata devices)
|
||||
# Chunk size of 1MiB, force RAID initialization,
|
||||
# min recovery rate at 20 kiB/sec/disk
|
||||
|
||||
0 1960893648 raid \
|
||||
raid4 4 2048 sync min_recovery_rate 20 \
|
||||
5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
|
||||
|
||||
|
||||
Status Output
|
||||
-------------
|
||||
'dmsetup table' displays the table used to construct the mapping.
|
||||
The optional parameters are always printed in the order listed
|
||||
above with "sync" or "nosync" always output ahead of the other
|
||||
arguments, regardless of the order used when originally loading the table.
|
||||
Arguments that can be repeated are ordered by value.
|
||||
|
||||
|
||||
'dmsetup status' yields information on the state and health of the array.
|
||||
The output is as follows (normally a single line, but expanded here for
|
||||
clarity)::
|
||||
|
||||
1: <s> <l> raid \
|
||||
2: <raid_type> <#devices> <health_chars> \
|
||||
3: <sync_ratio> <sync_action> <mismatch_cnt>
|
||||
|
||||
Line 1 is the standard output produced by device-mapper.
|
||||
|
||||
Line 2 & 3 are produced by the raid target and are best explained by example::
|
||||
|
||||
0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
|
||||
|
||||
Here we can see the RAID type is raid4, there are 5 devices - all of
|
||||
which are 'A'live, and the array is 2/490221568 complete with its initial
|
||||
recovery. Here is a fuller description of the individual fields:
|
||||
|
||||
=============== =========================================================
|
||||
<raid_type> Same as the <raid_type> used to create the array.
|
||||
<health_chars> One char for each device, indicating:
|
||||
|
||||
- 'A' = alive and in-sync
|
||||
- 'a' = alive but not in-sync
|
||||
- 'D' = dead/failed.
|
||||
<sync_ratio> The ratio indicating how much of the array has undergone
|
||||
the process described by 'sync_action'. If the
|
||||
'sync_action' is "check" or "repair", then the process
|
||||
of "resync" or "recover" can be considered complete.
|
||||
<sync_action> One of the following possible states:
|
||||
|
||||
idle
|
||||
- No synchronization action is being performed.
|
||||
frozen
|
||||
- The current action has been halted.
|
||||
resync
|
||||
- Array is undergoing its initial synchronization
|
||||
or is resynchronizing after an unclean shutdown
|
||||
(possibly aided by a bitmap).
|
||||
recover
|
||||
- A device in the array is being rebuilt or
|
||||
replaced.
|
||||
check
|
||||
- A user-initiated full check of the array is
|
||||
being performed. All blocks are read and
|
||||
checked for consistency. The number of
|
||||
discrepancies found are recorded in
|
||||
<mismatch_cnt>. No changes are made to the
|
||||
array by this action.
|
||||
repair
|
||||
- The same as "check", but discrepancies are
|
||||
corrected.
|
||||
reshape
|
||||
- The array is undergoing a reshape.
|
||||
<mismatch_cnt> The number of discrepancies found between mirror copies
|
||||
in RAID1/10 or wrong parity values found in RAID4/5/6.
|
||||
This value is valid only after a "check" of the array
|
||||
is performed. A healthy array has a 'mismatch_cnt' of 0.
|
||||
<data_offset> The current data offset to the start of the user data on
|
||||
each component device of a raid set (see the respective
|
||||
raid parameter to support out-of-place reshaping).
|
||||
<journal_char> - 'A' - active write-through journal device.
|
||||
- 'a' - active write-back journal device.
|
||||
- 'D' - dead journal device.
|
||||
- '-' - no journal device.
|
||||
=============== =========================================================
|
||||
|
||||
|
||||
Message Interface
|
||||
-----------------
|
||||
The dm-raid target will accept certain actions through the 'message' interface.
|
||||
('man dmsetup' for more information on the message interface.) These actions
|
||||
include:
|
||||
|
||||
========= ================================================
|
||||
"idle" Halt the current sync action.
|
||||
"frozen" Freeze the current sync action.
|
||||
"resync" Initiate/continue a resync.
|
||||
"recover" Initiate/continue a recover process.
|
||||
"check" Initiate a check (i.e. a "scrub") of the array.
|
||||
"repair" Initiate a repair of the array.
|
||||
========= ================================================
|
||||
|
||||
|
||||
Discard Support
|
||||
---------------
|
||||
The implementation of discard support among hardware vendors varies.
|
||||
When a block is discarded, some storage devices will return zeroes when
|
||||
the block is read. These devices set the 'discard_zeroes_data'
|
||||
attribute. Other devices will return random data. Confusingly, some
|
||||
devices that advertise 'discard_zeroes_data' will not reliably return
|
||||
zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks
|
||||
from a number of devices to calculate parity blocks and (for performance
|
||||
reasons) relies on 'discard_zeroes_data' being reliable, it is important
|
||||
that the devices be consistent. Blocks may be discarded in the middle
|
||||
of a RAID 4/5/6 stripe and if subsequent read results are not
|
||||
consistent, the parity blocks may be calculated differently at any time;
|
||||
making the parity blocks useless for redundancy. It is important to
|
||||
understand how your hardware behaves with discards if you are going to
|
||||
enable discards with RAID 4/5/6.
|
||||
|
||||
Since the behavior of storage devices is unreliable in this respect,
|
||||
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
|
||||
discard support is disabled -- this ensures data integrity at the
|
||||
expense of losing some performance.
|
||||
|
||||
Storage devices that properly support 'discard_zeroes_data' are
|
||||
increasingly whitelisted in the kernel and can thus be trusted.
|
||||
|
||||
For trusted devices, the following dm-raid module parameter can be set
|
||||
to safely enable discard support for RAID 4/5/6:
|
||||
|
||||
'devices_handle_discard_safely'
|
||||
|
||||
|
||||
Version History
|
||||
---------------
|
||||
|
||||
::
|
||||
|
||||
1.0.0 Initial version. Support for RAID 4/5/6
|
||||
1.1.0 Added support for RAID 1
|
||||
1.2.0 Handle creation of arrays that contain failed devices.
|
||||
1.3.0 Added support for RAID 10
|
||||
1.3.1 Allow device replacement/rebuild for RAID 10
|
||||
1.3.2 Fix/improve redundancy checking for RAID10
|
||||
1.4.0 Non-functional change. Removes arg from mapping function.
|
||||
1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
|
||||
1.4.2 Add RAID10 "far" and "offset" algorithm support.
|
||||
1.5.0 Add message interface to allow manipulation of the sync_action.
|
||||
New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
|
||||
1.5.1 Add ability to restore transiently failed devices on resume.
|
||||
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
|
||||
1.6.0 Add discard support (and devices_handle_discard_safely module param).
|
||||
1.7.0 Add support for MD RAID0 mappings.
|
||||
1.8.0 Explicitly check for compatible flags in the superblock metadata
|
||||
and reject to start the raid set if any are set by a newer
|
||||
target version, thus avoiding data corruption on a raid set
|
||||
with a reshape in progress.
|
||||
1.9.0 Add support for RAID level takeover/reshape/region size
|
||||
and set size reduction.
|
||||
1.9.1 Fix activation of existing RAID 4/10 mapped devices
|
||||
1.9.2 Don't emit '- -' on the status table line in case the constructor
|
||||
fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
|
||||
'D' on the status line. If '- -' is passed into the constructor, emit
|
||||
'- -' on the table line and '-' as the status line health character.
|
||||
1.10.0 Add support for raid4/5/6 journal device
|
||||
1.10.1 Fix data corruption on reshape request
|
||||
1.11.0 Fix table line argument order
|
||||
(wrong raid10_copies/raid10_format sequence)
|
||||
1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
|
||||
1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
|
||||
1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
|
||||
1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size and
|
||||
state races.
|
||||
1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
|
||||
1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
|
||||
deadlock/potential data corruption. Update superblock when
|
||||
specific devices are requested via rebuild. Fix RAID leg
|
||||
rebuild errors.
|
101
Documentation/admin-guide/device-mapper/dm-service-time.rst
Normal file
|
@ -0,0 +1,101 @@
|
|||
===============
|
||||
dm-service-time
|
||||
===============
|
||||
|
||||
dm-service-time is a path selector module for device-mapper targets,
|
||||
which selects a path with the shortest estimated service time for
|
||||
the incoming I/O.
|
||||
|
||||
The service time for each path is estimated by dividing the total size
|
||||
of in-flight I/Os on a path by the performance value of the path.
|
||||
The performance value is a relative throughput value among all paths
|
||||
in a path-group, and it can be specified as a table argument.
|
||||
|
||||
The path selector name is 'service-time'.
|
||||
|
||||
Table parameters for each path:
|
||||
|
||||
[<repeat_count> [<relative_throughput>]]
|
||||
<repeat_count>:
|
||||
The number of I/Os to dispatch using the selected
|
||||
path before switching to the next path.
|
||||
If not given, internal default is used. To check
|
||||
the default value, see the activated table.
|
||||
<relative_throughput>:
|
||||
The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
The valid range is 0-100.
|
||||
If not given, minimum value '1' is used.
|
||||
If '0' is given, the path isn't selected while
|
||||
other paths having a positive value are available.
|
||||
|
||||
Status for each path:
|
||||
|
||||
<status> <fail-count> <in-flight-size> <relative_throughput>
|
||||
<status>:
|
||||
'A' if the path is active, 'F' if the path is failed.
|
||||
<fail-count>:
|
||||
The number of path failures.
|
||||
<in-flight-size>:
|
||||
The size of in-flight I/Os on the path.
|
||||
<relative_throughput>:
|
||||
The relative throughput value of the path
|
||||
among all paths in the path-group.
|
||||
|
||||
|
||||
Algorithm
|
||||
=========
|
||||
|
||||
dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
|
||||
dispatched and subtracts when completed.
|
||||
Basically, dm-service-time selects a path having minimum service time
|
||||
which is calculated by::
|
||||
|
||||
('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
|
||||
|
||||
However, some optimizations below are used to reduce the calculation
|
||||
as much as possible.
|
||||
|
||||
1. If the paths have the same 'relative_throughput', skip
|
||||
the division and just compare the 'in-flight-size'.
|
||||
|
||||
2. If the paths have the same 'in-flight-size', skip the division
|
||||
and just compare the 'relative_throughput'.
|
||||
|
||||
3. If some paths have non-zero 'relative_throughput' and others
|
||||
have zero 'relative_throughput', ignore those paths with zero
|
||||
'relative_throughput'.
|
||||
|
||||
If such optimizations can't be applied, calculate service time, and
|
||||
compare service time.
|
||||
If calculated service time is equal, the path having maximum
|
||||
'relative_throughput' may be better. So compare 'relative_throughput'
|
||||
then.
|
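Taken together, the comparison between two candidate paths can be sketched as below. This is only an illustration of the rule and the optimizations above, not the module's actual code; it avoids the division by cross-multiplying and assumes the products do not overflow::

    struct example_path {
            unsigned long long in_flight_size;
            unsigned relative_throughput;           /* 0 - 100 */
    };

    /* Return the preferred path for an incoming I/O of 'bytes' bytes. */
    static struct example_path *better_path(struct example_path *a,
                                            struct example_path *b,
                                            unsigned bytes)
    {
            unsigned long long sa, sb;

            /* Optimization 3: zero-throughput paths are only a last resort. */
            if (!a->relative_throughput)
                    return b->relative_throughput ? b : a;
            if (!b->relative_throughput)
                    return a;

            /* Optimizations 1 and 2: equal fields need no division. */
            if (a->relative_throughput == b->relative_throughput)
                    return a->in_flight_size <= b->in_flight_size ? a : b;
            if (a->in_flight_size == b->in_flight_size)
                    return a->relative_throughput >= b->relative_throughput ? a : b;

            /* General case: compare service times via cross-multiplication. */
            sa = (a->in_flight_size + bytes) * b->relative_throughput;
            sb = (b->in_flight_size + bytes) * a->relative_throughput;
            if (sa == sb)           /* tie: prefer the higher throughput */
                    return a->relative_throughput >= b->relative_throughput ? a : b;
            return sa < sb ? a : b;
    }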
||||
|
||||
|
||||
Examples
|
||||
========
|
||||
In case that 2 paths (sda and sdb) are used with repeat_count == 128
|
||||
and sda has an average throughput 1GB/s and sdb has 4GB/s,
|
||||
'relative_throughput' value may be '1' for sda and '4' for sdb::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
|
||||
|
||||
|
||||
Or '2' for sda and '8' for sdb would also be valid::
|
||||
|
||||
# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
|
||||
dmsetup create test
|
||||
#
|
||||
# dmsetup table
|
||||
test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
|
||||
#
|
||||
# dmsetup status
|
||||
test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
|
110
Documentation/admin-guide/device-mapper/dm-uevent.rst
Normal file
|
@ -0,0 +1,110 @@
|
|||
====================
|
||||
device-mapper uevent
|
||||
====================
|
||||
|
||||
The device-mapper uevent code adds the capability to device-mapper to create
|
||||
and send kobject uevents (uevents). Previously device-mapper events were only
|
||||
available through the ioctl interface. The advantage of the uevents interface
|
||||
is that the event contains environment attributes, providing increased context for
|
||||
the event and avoiding the need to query the state of the device-mapper device after
|
||||
the event is received.
|
||||
|
||||
There are two functions currently for device-mapper events. The first function
|
||||
listed creates the event and the second function sends the event(s)::
|
||||
|
||||
void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
|
||||
const char *path, unsigned nr_valid_paths)
|
||||
|
||||
void dm_send_uevents(struct list_head *events, struct kobject *kobj)
|
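A rough sketch of how these fit together is shown below. The enum value is an assumption based on the DM_ACTION values listed later, and 'ti', 'nr_valid_paths', 'events' and 'kobj' are placeholders for the target instance, its valid path count, the device's queued event list and its kobject; this is not code copied from the kernel::

    /* Target side: queue a "path failed" event for later delivery. */
    dm_path_uevent(DM_UEVENT_PATH_FAILED, ti, "8:32", nr_valid_paths);

    /* Core side: deliver everything queued for this device. */
    dm_send_uevents(&events, kobj);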
||||
|
||||
|
||||
The variables added to the uevent environment are:
|
||||
|
||||
Variable Name: DM_TARGET
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description:
|
||||
:Value: Name of device-mapper target that generated the event.
|
||||
|
||||
Variable Name: DM_ACTION
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description:
|
||||
:Value: Device-mapper specific action that caused the uevent action.
|
||||
PATH_FAILED - A path has failed;
|
||||
PATH_REINSTATED - A path has been reinstated.
|
||||
|
||||
Variable Name: DM_SEQNUM
|
||||
------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: unsigned integer
|
||||
:Description: A sequence number for this specific device-mapper device.
|
||||
:Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_PATH
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: Major and minor number of the path device pertaining to this
|
||||
event.
|
||||
:Value: Path name in the form of "Major:Minor"
|
||||
|
||||
Variable Name: DM_NR_VALID_PATHS
|
||||
--------------------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: unsigned integer
|
||||
:Description:
|
||||
:Value: Valid unsigned integer range.
|
||||
|
||||
Variable Name: DM_NAME
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: Name of the device-mapper device.
|
||||
:Value: Name
|
||||
|
||||
Variable Name: DM_UUID
|
||||
----------------------
|
||||
:Uevent Action(s): KOBJ_CHANGE
|
||||
:Type: string
|
||||
:Description: UUID of the device-mapper device.
|
||||
:Value: UUID. (Empty string if there isn't one.)
|
||||
|
||||
An example of the uevents generated as captured by udevmonitor is shown
|
||||
below
|
||||
|
||||
1.) Path failure::
|
||||
|
||||
UEVENT[1192521009.711215] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_FAILED
|
||||
DM_SEQNUM=1
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=0
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1130
|
||||
|
||||
2.) Path reinstate::
|
||||
|
||||
UEVENT[1192521132.989927] change@/block/dm-3
|
||||
ACTION=change
|
||||
DEVPATH=/block/dm-3
|
||||
SUBSYSTEM=block
|
||||
DM_TARGET=multipath
|
||||
DM_ACTION=PATH_REINSTATED
|
||||
DM_SEQNUM=2
|
||||
DM_PATH=8:32
|
||||
DM_NR_VALID_PATHS=1
|
||||
DM_NAME=mpath2
|
||||
DM_UUID=mpath-35333333000002328
|
||||
MINOR=3
|
||||
MAJOR=253
|
||||
SEQNUM=1131
|
146
Documentation/admin-guide/device-mapper/dm-zoned.rst
Normal file
|
@ -0,0 +1,146 @@
|
|||
========
|
||||
dm-zoned
|
||||
========
|
||||
|
||||
The dm-zoned device mapper target exposes a zoned block device (ZBC and
|
||||
ZAC compliant devices) as a regular block device without any write
|
||||
pattern constraints. In effect, it implements a drive-managed zoned
|
||||
block device which hides from the user (a file system or an application
|
||||
doing raw block device accesses) the sequential write constraints of
|
||||
host-managed zoned block devices and can mitigate the potential
|
||||
device-side performance degradation due to excessive random writes on
|
||||
host-aware zoned block devices.
|
||||
|
||||
For a more detailed description of the zoned block device models and
|
||||
their constraints see (for SCSI devices):
|
||||
|
||||
http://www.t10.org/drafts.htm#ZBC_Family
|
||||
|
||||
and (for ATA devices):
|
||||
|
||||
http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
|
||||
|
||||
The dm-zoned implementation is simple and minimizes system overhead (CPU
|
||||
and memory usage as well as storage capacity loss). For a 10TB
|
||||
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
|
||||
instance is at most 4.5 MB and as little as 5 zones will be used
|
||||
internally for storing metadata and performing reclaim operations.
|
||||
|
||||
dm-zoned target devices are formatted and checked using the dmzadm
|
||||
utility available at:
|
||||
|
||||
https://github.com/hgst/dm-zoned-tools
|
||||
|
||||
Algorithm
|
||||
=========
|
||||
|
||||
dm-zoned implements an on-disk buffering scheme to handle non-sequential
|
||||
write accesses to the sequential zones of a zoned block device.
|
||||
Conventional zones are used for caching as well as for storing internal
|
||||
metadata.
|
||||
|
||||
The zones of the device are separated into 2 types:
|
||||
|
||||
1) Metadata zones: these are conventional zones used to store metadata.
|
||||
Metadata zones are not reported as usable capacity to the user.
|
||||
|
||||
2) Data zones: all remaining zones, the vast majority of which will be
|
||||
sequential zones used exclusively to store user data. The conventional
|
||||
zones of the device may also be used for buffering user random writes.
|
||||
Data in these zones may be directly mapped to the conventional zone, but
|
||||
later moved to a sequential zone so that the conventional zone can be
|
||||
reused for buffering incoming random writes.
|
||||
|
||||
dm-zoned exposes a logical device with a sector size of 4096 bytes,
|
||||
irrespective of the physical sector size of the backend zoned block
|
||||
device being used. This allows reducing the amount of metadata needed to
|
||||
manage valid blocks (blocks written).
|
||||
|
||||
The on-disk metadata format is as follows:
|
||||
|
||||
1) The first block of the first conventional zone found contains the
|
||||
super block which describes the on disk amount and position of metadata
|
||||
blocks.
|
||||
|
||||
2) Following the super block, a set of blocks is used to describe the
|
||||
mapping of the logical device blocks. The mapping is done per chunk of
|
||||
blocks, with the chunk size equal to the device zone size. The
|
||||
mapping table is indexed by chunk number and each mapping entry
|
||||
indicates the zone number of the device storing the chunk of data. Each
|
||||
mapping entry may also indicate the zone number of a conventional
|
||||
zone used to buffer random modifications to the data zone.
|
||||
|
||||
3) A set of blocks used to store bitmaps indicating the validity of
|
||||
blocks in the data zones follows the mapping table. A valid block is
|
||||
defined as a block that was written and not discarded. For a buffered
|
||||
data chunk, a block is always valid only in the data zone mapping the
|
||||
chunk or in the buffer zone of the chunk.
|
||||
|
||||
For a logical chunk mapped to a conventional zone, all write operations
|
||||
are processed by directly writing to the zone. If the mapping zone is a
|
||||
sequential zone, the write operation is processed directly only if the
|
||||
write offset within the logical chunk is equal to the write pointer
|
||||
offset of the sequential data zone (i.e. the write operation is
|
||||
aligned on the zone write pointer). Otherwise, write operations are
|
||||
processed indirectly using a buffer zone. In that case, an unused
|
||||
conventional zone is allocated and assigned to the chunk being
|
||||
accessed. Writing a block to the buffer zone of a chunk will
|
||||
automatically invalidate the same block in the sequential zone mapping
|
||||
the chunk. If all blocks of the sequential zone become invalid, the zone
|
||||
is freed and the chunk buffer zone becomes the primary zone mapping the
|
||||
chunk, resulting in native random write performance similar to a regular
|
||||
block device.
|
||||
|
||||
Read operations are processed according to the block validity
|
||||
information provided by the bitmaps. Valid blocks are read either from
|
||||
the sequential zone mapping a chunk, or if the chunk is buffered, from
|
||||
the buffer zone assigned. If the accessed chunk has no mapping, or the
|
||||
accessed blocks are invalid, the read buffer is zeroed and the read
|
||||
operation terminated.
|
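The lookup implied by this layout can be sketched as follows. The data structures are purely illustrative (not dm-zoned's internal ones) and assume the zone size in 4096-byte blocks is known; a return value of 0 means the block is unmapped or invalid and reads back as zeroes::

    #define EX_UNMAPPED  (~0u)

    struct ex_chunk_map {
            unsigned int data_zone;         /* zone storing the chunk, or EX_UNMAPPED */
            unsigned int buffer_zone;       /* conventional zone buffering it, or EX_UNMAPPED */
    };

    static int ex_lookup_block(struct ex_chunk_map *map, unsigned long *valid,
                               unsigned long block, unsigned long blocks_per_zone,
                               unsigned int *zone, unsigned long *offset)
    {
            unsigned long chunk = block / blocks_per_zone;

            *offset = block % blocks_per_zone;

            /* A buffered block is valid in the buffer zone, not the data zone. */
            if (map[chunk].buffer_zone != EX_UNMAPPED &&
                test_bit(map[chunk].buffer_zone * blocks_per_zone + *offset, valid)) {
                    *zone = map[chunk].buffer_zone;
                    return 1;
            }

            if (map[chunk].data_zone != EX_UNMAPPED &&
                test_bit(map[chunk].data_zone * blocks_per_zone + *offset, valid)) {
                    *zone = map[chunk].data_zone;
                    return 1;
            }

            return 0;
    }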
||||
|
||||
After some time, the limited number of conventional zones available may
|
||||
be exhausted (all used to map chunks or buffer sequential zones) and
|
||||
unaligned writes to unbuffered chunks become impossible. To avoid this
|
||||
situation, a reclaim process regularly scans used conventional zones and
|
||||
tries to reclaim the least recently used zones by copying the valid
|
||||
blocks of the buffer zone to a free sequential zone. Once the copy
|
||||
completes, the chunk mapping is updated to point to the sequential zone
|
||||
and the buffer zone freed for reuse.
|
||||
|
||||
Metadata Protection
|
||||
===================
|
||||
|
||||
To protect metadata against corruption in case of sudden power loss or
|
||||
system crash, 2 sets of metadata zones are used. One set, the primary
|
||||
set, is used as the main metadata region, while the secondary set is
|
||||
used as a staging area. Modified metadata is first written to the
|
||||
secondary set and validated by updating the super block in the secondary
|
||||
set, a generation counter is used to indicate that this set contains the
|
||||
newest metadata. Once this operation completes, in place of metadata
|
||||
block updates can be done in the primary metadata set. This ensures that
|
||||
one of the set is always consistent (all modifications committed or none
|
||||
at all). Flush operations are used as a commit point. Upon reception of
|
||||
a flush request, metadata modification activity is temporarily blocked
|
||||
(for both incoming BIO processing and reclaim process) and all dirty
|
||||
metadata blocks are staged and updated. Normal operation is then
|
||||
resumed. Flushing metadata thus only temporarily delays write and
|
||||
discard requests. Read requests can be processed concurrently while
|
||||
metadata flush is being executed.
|
||||
|
||||
Usage
|
||||
=====
|
||||
|
||||
A zoned block device must first be formatted using the dmzadm tool. This
|
||||
will analyze the device zone configuration, determine where to place the
|
||||
metadata sets on the device and initialize the metadata sets.
|
||||
|
||||
Ex::
|
||||
|
||||
dmzadm --format /dev/sdxx
|
||||
|
||||
For a formatted device, the target can be created normally with the
|
||||
dmsetup utility. The only parameter that dm-zoned requires is the
|
||||
underlying zoned block device name. Ex::
|
||||
|
||||
echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
|
||||
dmsetup create dmz-`basename ${dev}`
|
116
Documentation/admin-guide/device-mapper/era.rst
Normal file
|
@ -0,0 +1,116 @@
|
|||
======
|
||||
dm-era
|
||||
======
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
dm-era is a target that behaves similarly to the linear target. In
|
||||
addition it keeps track of which blocks were written within a user
|
||||
defined period of time called an 'era'. Each era target instance
|
||||
maintains the current era as a monotonically increasing 32-bit
|
||||
counter.
|
||||
|
||||
Use cases include tracking changed blocks for backup software, and
|
||||
partially invalidating the contents of a cache to restore cache
|
||||
coherency after rolling back a vendor snapshot.
|
||||
|
||||
Constructor
|
||||
===========
|
||||
|
||||
era <metadata dev> <origin dev> <block size>
|
||||
|
||||
================ ======================================================
|
||||
metadata dev fast device holding the persistent metadata
|
||||
origin dev device holding data blocks that may change
|
||||
block size block size of origin data device, granularity that is
|
||||
tracked by the target
|
||||
================ ======================================================
|
||||
|
||||
Messages
|
||||
========
|
||||
|
||||
None of the dm messages take any arguments.
|
||||
|
||||
checkpoint
|
||||
----------
|
||||
|
||||
Possibly move to a new era. You shouldn't assume the era has
|
||||
incremented. After sending this message, you should check the
|
||||
current era via the status line.
|
||||
|
||||
take_metadata_snap
|
||||
------------------
|
||||
|
||||
Create a clone of the metadata, to allow a userland process to read it.
|
||||
|
||||
drop_metadata_snap
|
||||
------------------
|
||||
|
||||
Drop the metadata snapshot.
|
||||
|
||||
Status
|
||||
======
|
||||
|
||||
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
|
||||
<current era> <held metadata root | '-'>
|
||||
|
||||
========================= ==============================================
|
||||
metadata block size Fixed block size for each metadata block in
|
||||
sectors
|
||||
#used metadata blocks Number of metadata blocks used
|
||||
#total metadata blocks Total number of metadata blocks
|
||||
current era The current era
|
||||
held metadata root The location, in blocks, of the metadata root
|
||||
that has been 'held' for userspace read
|
||||
access. '-' indicates there is no held root
|
||||
========================= ==============================================
|
||||
|
||||
Detailed use case
|
||||
=================
|
||||
|
||||
The scenario of invalidating a cache when rolling back a vendor
|
||||
snapshot was the primary use case when developing this target:
|
||||
|
||||
Taking a vendor snapshot
|
||||
------------------------
|
||||
|
||||
- Send a checkpoint message to the era target
|
||||
- Make a note of the current era in its status line
|
||||
- Take vendor snapshot (the era and snapshot should be forever
|
||||
associated now).
|
||||
|
||||
Rolling back to a vendor snapshot
|
||||
----------------------------------
|
||||
|
||||
- Cache enters passthrough mode (see: dm-cache's docs in cache.txt)
|
||||
- Rollback vendor storage
|
||||
- Take metadata snapshot
|
||||
- Ascertain which blocks have been written since the snapshot was taken
|
||||
by checking each block's era
|
||||
- Invalidate those blocks in the caching software
|
||||
- Cache returns to writeback/writethrough mode
|
||||
|
||||
Memory usage
|
||||
============
|
||||
|
||||
The target uses a bitset to record writes in the current era. It also
|
||||
has a spare bitset ready for switching over to a new era. Other than
|
||||
that it uses a few 4k blocks for updating metadata::
|
||||
|
||||
(4 * nr_blocks) bytes + buffers
|
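As a worked example of the formula above (figures are illustrative): tracking a 1 TiB origin device at a 64 KiB block size gives 2^24 blocks, so (4 * nr_blocks) comes to about 64 MiB, plus the buffers and the two bitsets (roughly 2 MiB each).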
||||
|
||||
Resilience
|
||||
==========
|
||||
|
||||
Metadata is updated on disk before a write to a previously unwritten
|
||||
block is performed. As such dm-era should not be affected by a hard
|
||||
crash such as power failure.
|
||||
|
||||
Userland tools
|
||||
==============
|
||||
|
||||
Userland tools are found in the increasingly poorly named
|
||||
thin-provisioning-tools project:
|
||||
|
||||
https://github.com/jthornber/thin-provisioning-tools
|
42
Documentation/admin-guide/device-mapper/index.rst
Normal file
|
@ -0,0 +1,42 @@
|
|||
=============
|
||||
Device Mapper
|
||||
=============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
cache-policies
|
||||
cache
|
||||
delay
|
||||
dm-crypt
|
||||
dm-flakey
|
||||
dm-init
|
||||
dm-integrity
|
||||
dm-io
|
||||
dm-log
|
||||
dm-queue-length
|
||||
dm-raid
|
||||
dm-service-time
|
||||
dm-uevent
|
||||
dm-zoned
|
||||
era
|
||||
kcopyd
|
||||
linear
|
||||
log-writes
|
||||
persistent-data
|
||||
snapshot
|
||||
statistics
|
||||
striped
|
||||
switch
|
||||
thin-provisioning
|
||||
unstriped
|
||||
verity
|
||||
writecache
|
||||
zero
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
=======
|
||||
|
||||
* :ref:`genindex`
|
47
Documentation/admin-guide/device-mapper/kcopyd.rst
Normal file
|
@ -0,0 +1,47 @@
|
|||
======
|
||||
kcopyd
|
||||
======
|
||||
|
||||
Kcopyd provides the ability to copy a range of sectors from one block-device
|
||||
to one or more other block-devices, with an asynchronous completion
|
||||
notification. It is used by dm-snapshot and dm-mirror.
|
||||
|
||||
Users of kcopyd must first create a client and indicate how many memory pages
|
||||
to set aside for their copy jobs. This is done with a call to
|
||||
kcopyd_client_create()::
|
||||
|
||||
int kcopyd_client_create(unsigned int num_pages,
|
||||
struct kcopyd_client **result);
|
||||
|
||||
To start a copy job, the user must set up io_region structures to describe
|
||||
the source and destinations of the copy. Each io_region indicates a
|
||||
block-device along with the starting sector and size of the region. The source
|
||||
of the copy is given as one io_region structure, and the destinations of the
|
||||
copy are given as an array of io_region structures::
|
||||
|
||||
struct io_region {
|
||||
struct block_device *bdev;
|
||||
sector_t sector;
|
||||
sector_t count;
|
||||
};
|
||||
|
||||
To start the copy, the user calls kcopyd_copy(), passing in the client
|
||||
pointer, pointers to the source and destination io_regions, the name of a
|
||||
completion callback routine, and a pointer to some context data for the copy::
|
||||
|
||||
int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
|
||||
unsigned int num_dests, struct io_region *dests,
|
||||
unsigned int flags, kcopyd_notify_fn fn, void *context);
|
||||
|
||||
typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err,
|
||||
void *context);
|
||||
|
||||
When the copy completes, kcopyd will call the user's completion routine,
|
||||
passing back the user's context pointer. It will also indicate if a read or
|
||||
write error occurred during the copy.
|
||||
|
||||
When a user is done with all their copy jobs, they should call
|
||||
kcopyd_client_destroy() to delete the kcopyd client, which will release the
|
||||
associated memory pages::
|
||||
|
||||
void kcopyd_client_destroy(struct kcopyd_client *kc);
|
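Putting the calls above together, a minimal usage sketch might look like the following. The page count, region sizes and use of a completion for synchronisation are illustrative choices, and error handling is simplified::

    static void example_done(int read_err, unsigned int write_err, void *context)
    {
            /* kcopyd calls this when the copy finishes; 'context' is ours. */
            complete((struct completion *)context);
    }

    /* Copy the first 1024 sectors of 'src' to 'dst' and wait for completion. */
    static int example_copy(struct block_device *src, struct block_device *dst)
    {
            struct io_region from = { .bdev = src, .sector = 0, .count = 1024 };
            struct io_region to   = { .bdev = dst, .sector = 0, .count = 1024 };
            struct kcopyd_client *kc;
            struct completion done;
            int r;

            r = kcopyd_client_create(32, &kc);      /* set aside 32 pages */
            if (r)
                    return r;

            init_completion(&done);
            r = kcopyd_copy(kc, &from, 1, &to, 0, example_done, &done);
            if (!r)
                    wait_for_completion(&done);

            kcopyd_client_destroy(kc);
            return r;
    }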
63
Documentation/admin-guide/device-mapper/linear.rst
Normal file
|
@ -0,0 +1,63 @@
|
|||
=========
|
||||
dm-linear
|
||||
=========
|
||||
|
||||
Device-Mapper's "linear" target maps a linear range of the Device-Mapper
|
||||
device onto a linear range of another device. This is the basic building
|
||||
block of logical volume managers.
|
||||
|
||||
Parameters: <dev path> <offset>
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Create an identity mapping for a device
|
||||
echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
|
||||
|
||||
::
|
||||
|
||||
#!/bin/sh
|
||||
# Join 2 devices together
|
||||
size1=`blockdev --getsz $1`
|
||||
size2=`blockdev --getsz $2`
|
||||
echo "0 $size1 linear $1 0
|
||||
$size1 $size2 linear $2 0" | dmsetup create joined
|
||||
|
||||
::
|
||||
|
||||
#!/usr/bin/perl -w
|
||||
# Split a device into 4M chunks and then join them together in reverse order.
|
||||
|
||||
my $name = "reverse";
|
||||
my $extent_size = 4 * 1024 * 2;
|
||||
my $dev = $ARGV[0];
|
||||
my $table = "";
|
||||
my $count = 0;
|
||||
|
||||
if (!defined($dev)) {
|
||||
die("Please specify a device.\n");
|
||||
}
|
||||
|
||||
my $dev_size = `blockdev --getsz $dev`;
|
||||
my $extents = int($dev_size / $extent_size) -
|
||||
(($dev_size % $extent_size) ? 1 : 0);
|
||||
|
||||
while ($extents > 0) {
|
||||
my $this_start = $count * $extent_size;
|
||||
$extents--;
|
||||
$count++;
|
||||
my $this_offset = $extents * $extent_size;
|
||||
|
||||
$table .= "$this_start $extent_size linear $dev $this_offset\n";
|
||||
}
|
||||
|
||||
`echo \"$table\" | dmsetup create $name`;
|
145
Documentation/admin-guide/device-mapper/log-writes.rst
Normal file
|
@ -0,0 +1,145 @@
|
|||
=============
|
||||
dm-log-writes
|
||||
=============
|
||||
|
||||
This target takes 2 devices, one to pass all IO to normally, and one to log all
|
||||
of the write operations to. This is intended for file system developers wishing
|
||||
to verify the integrity of metadata or data as the file system is written to.
|
||||
There is a log_write_entry written for every WRITE request and the target is
|
||||
able to take arbitrary data from userspace to insert into the log. The data
|
||||
that is in the WRITE requests is copied into the log to make the replay happen
|
||||
exactly as it happened originally.
|
||||
|
||||
Log Ordering
|
||||
============
|
||||
|
||||
We log things in order of completion once we are sure the write is no longer in
|
||||
cache. This means that normal WRITE requests are not actually logged until the
|
||||
next REQ_PREFLUSH request. This is to make it easier for userspace to replay
|
||||
the log in a way that correlates to what is on disk and not what is in cache,
|
||||
to make it easier to detect improper waiting/flushing.
|
||||
|
||||
This works by attaching all WRITE requests to a list once the write completes.
|
||||
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
|
||||
the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
|
||||
completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
|
||||
simulate the worst case scenario with regard to power failures. Consider the
|
||||
following example (W means write, C means complete):
|
||||
|
||||
W1,W2,W3,C3,C2,Wflush,C1,Cflush
|
||||
|
||||
The log would show the following:
|
||||
|
||||
W3,W2,flush,W1....
|
||||
|
||||
Again this is to simulate what is actually on disk, this allows us to detect
|
||||
cases where a power failure at a particular point in time would create an
|
||||
inconsistent file system.
|
||||
|
||||
Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
|
||||
they complete as those requests will obviously bypass the device cache.
|
||||
|
||||
Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
|
||||
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
|
||||
request. Consider the following example:
|
||||
|
||||
WRITE block 1, DISCARD block 1, FLUSH
|
||||
|
||||
If we logged DISCARD when it completed, the replay would look like this:
|
||||
|
||||
DISCARD 1, WRITE 1, FLUSH
|
||||
|
||||
which isn't quite what happened and wouldn't be caught during the log replay.
|
||||
|
||||
Target interface
|
||||
================
|
||||
|
||||
i) Constructor
|
||||
|
||||
log-writes <dev_path> <log_dev_path>
|
||||
|
||||
============= ==============================================
|
||||
dev_path Device that all of the IO will go to normally.
|
||||
log_dev_path Device where the log entries are written to.
|
||||
============= ==============================================
|
||||
|
||||
ii) Status
|
||||
|
||||
<#logged entries> <highest allocated sector>
|
||||
|
||||
=========================== ========================
|
||||
#logged entries Number of logged entries
|
||||
highest allocated sector Highest allocated sector
|
||||
=========================== ========================
|
||||
|
||||
iii) Messages
|
||||
|
||||
mark <description>
|
||||
|
||||
You can use a dmsetup message to set an arbitrary mark in a log.
|
||||
For example say you want to fsck a file system after every
|
||||
write, but first you need to replay up to the mkfs to make sure
|
||||
we're fsck'ing something reasonable, you would do something like
|
||||
this::
|
||||
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
<run test>
|
||||
|
||||
This would allow you to replay the log up to the mkfs mark and
|
||||
then replay from that point on doing the fsck check in the
|
||||
interval that you want.
|
||||
|
||||
Every log has a mark at the end labeled "dm-log-writes-end".
|
||||
|
||||
Userspace component
|
||||
===================
|
||||
|
||||
There is a userspace tool that will replay the log for you in various ways.
|
||||
It can be found here: https://github.com/josefbacik/log-writes
|
||||
|
||||
Example usage
|
||||
=============
|
||||
|
||||
Say you want to test fsync on your file system. You would do something like
|
||||
this::
|
||||
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<some test that does fsync at the end>
|
||||
dmsetup message log 0 mark fsync
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
umount /mnt/btrfs-test
|
||||
|
||||
dmsetup remove log
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
|
||||
mount /dev/sdb /mnt/btrfs-test
|
||||
md5sum /mnt/btrfs-test/foo
|
||||
<verify md5sum's are correct>
|
||||
|
||||
Another option is to do a complicated file system operation and verify the file
|
||||
system is consistent during the entire operation. You could do this with::
|
||||
|
||||
TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
|
||||
dmsetup create log --table "$TABLE"
|
||||
mkfs.btrfs -f /dev/mapper/log
|
||||
dmsetup message log 0 mark mkfs
|
||||
|
||||
mount /dev/mapper/log /mnt/btrfs-test
|
||||
<fsstress to dirty the fs>
|
||||
btrfs filesystem balance /mnt/btrfs-test
|
||||
umount /mnt/btrfs-test
|
||||
dmsetup remove log
|
||||
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
|
||||
btrfsck /dev/sdb
|
||||
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
|
||||
--fsck "btrfsck /dev/sdb" --check fua
|
||||
|
||||
And that will replay the log until it sees a FUA request, run the fsck command
|
||||
and if the fsck passes it will replay to the next FUA, until it is completed or
|
||||
the fsck command exits abnormally.
|
Documentation/admin-guide/device-mapper/persistent-data.rst
===============
|
||||
Persistent data
|
||||
===============
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
The more-sophisticated device-mapper targets require complex metadata
|
||||
that is managed in kernel. In late 2010 we were seeing that various
|
||||
different targets were rolling their own data structures, for example:
|
||||
|
||||
- Mikulas Patocka's multisnap implementation
|
||||
- Heinz Mauelshagen's thin provisioning target
|
||||
- Another btree-based caching target posted to dm-devel
|
||||
- Another multi-snapshot target based on a design of Daniel Phillips
|
||||
|
||||
Maintaining these data structures takes a lot of work, so if possible
|
||||
we'd like to reduce the number.
|
||||
|
||||
The persistent-data library is an attempt to provide a re-usable
|
||||
framework for people who want to store metadata in device-mapper
|
||||
targets. It's currently used by the thin-provisioning target and an
|
||||
upcoming hierarchical storage target.
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
The main documentation is in the header files which can all be found
|
||||
under drivers/md/persistent-data.
|
||||
|
||||
The block manager
|
||||
-----------------
|
||||
|
||||
dm-block-manager.[hc]
|
||||
|
||||
This provides access to the data on disk in fixed-sized blocks. There
is a read/write locking interface to prevent concurrent accesses, and
to keep data that is being used in the cache.
|
||||
|
||||
Clients of persistent-data are unlikely to use this directly.
|
||||
|
||||
The transaction manager
|
||||
-----------------------
|
||||
|
||||
dm-transaction-manager.[hc]
|
||||
|
||||
This restricts access to blocks and enforces copy-on-write semantics.
|
||||
The only way you can get hold of a writable block through the
|
||||
transaction manager is by shadowing an existing block (i.e. doing
|
||||
copy-on-write) or allocating a fresh one. Shadowing is elided within
|
||||
the same transaction so performance is reasonable. The commit method
|
||||
ensures that all data is flushed before it writes the superblock.
|
||||
On power failure your metadata will be as it was when last committed.
|
||||
|
||||
The Space Maps
|
||||
--------------
|
||||
|
||||
dm-space-map.h
|
||||
dm-space-map-metadata.[hc]
|
||||
dm-space-map-disk.[hc]
|
||||
|
||||
On-disk data structures that keep track of reference counts of blocks.
|
||||
Also acts as the allocator of new blocks. Currently two
|
||||
implementations: a simpler one for managing blocks on a different
|
||||
device (eg. thinly-provisioned data blocks); and one for managing
|
||||
the metadata space. The latter is complicated by the need to store
|
||||
its own data within the space it's managing.
|
||||
|
||||
The data structures
|
||||
-------------------
|
||||
|
||||
dm-btree.[hc]
|
||||
dm-btree-remove.c
|
||||
dm-btree-spine.c
|
||||
dm-btree-internal.h
|
||||
|
||||
Currently there is only one data structure, a hierarchical btree.
|
||||
There are plans to add more. For example, something with an
|
||||
array-like interface would see a lot of use.
|
||||
|
||||
The btree is 'hierarchical' in that you can define it to be composed
|
||||
of nested btrees, and take multiple keys. For example, the
|
||||
thin-provisioning target uses a btree with two levels of nesting.
|
||||
The first maps a device id to a mapping tree, and that in turn maps a
|
||||
virtual block to a physical block.
|
||||
|
||||
Values stored in the btrees can have arbitrary size. Keys are always
|
||||
64bits, although nesting allows you to use multiple keys.
|
Documentation/admin-guide/device-mapper/snapshot.rst
==============================
|
||||
Device-mapper snapshot support
|
||||
==============================
|
||||
|
||||
Device-mapper allows you, without massive data copying:
|
||||
|
||||
- To create snapshots of any block device i.e. mountable, saved states of
|
||||
the block device which are also writable without interfering with the
|
||||
original content;
|
||||
- To create device "forks", i.e. multiple different versions of the
|
||||
same data stream.
|
||||
- To merge a snapshot of a block device back into the snapshot's origin
|
||||
device.
|
||||
|
||||
In the first two cases, dm copies only the chunks of data that get
|
||||
changed and uses a separate copy-on-write (COW) block device for
|
||||
storage.
|
||||
|
||||
For snapshot merge the contents of the COW storage are merged back into
|
||||
the origin device.
|
||||
|
||||
|
||||
There are three dm targets available:
|
||||
snapshot, snapshot-origin, and snapshot-merge.
|
||||
|
||||
- snapshot-origin <origin>
|
||||
|
||||
which will normally have one or more snapshots based on it.
|
||||
Reads will be mapped directly to the backing device. For each write, the
|
||||
original data will be saved in the <COW device> of each snapshot to keep
|
||||
its visible content unchanged, at least until the <COW device> fills up.
|
||||
|
||||
|
||||
- snapshot <origin> <COW device> <persistent?> <chunksize>
|
||||
[<# feature args> [<arg>]*]
|
||||
|
||||
A snapshot of the <origin> block device is created. Changed chunks of
|
||||
<chunksize> sectors will be stored on the <COW device>. Writes will
|
||||
only go to the <COW device>. Reads will come from the <COW device> or
|
||||
from <origin> for unchanged data. <COW device> will often be
|
||||
smaller than the origin and if it fills up the snapshot will become
|
||||
useless and be disabled, returning errors. So it is important to monitor
|
||||
the amount of free space and expand the <COW device> before it fills up.
|
||||
|
||||
<persistent?> is P (Persistent) or N (Not persistent - will not survive
|
||||
after reboot). O (Overflow) can be added as a persistent store option
|
||||
to allow userspace to advertise its support for seeing "Overflow" in the
|
||||
snapshot status. So supported store types are "P", "PO" and "N".
|
||||
|
||||
The difference between persistent and transient is that with transient
snapshots less metadata must be saved on disk - it can be kept in
memory by the kernel.
|
||||
|
||||
When loading or unloading the snapshot target, the corresponding
|
||||
snapshot-origin or snapshot-merge target must be suspended. A failure to
|
||||
suspend the origin target could result in data corruption.
|
||||
|
||||
Optional features:
|
||||
|
||||
discard_zeroes_cow - a discard issued to the snapshot device that
|
||||
maps to entire chunks will zero the corresponding exception(s) in
|
||||
the snapshot's exception store.
|
||||
|
||||
discard_passdown_origin - a discard to the snapshot device is passed
|
||||
down to the snapshot-origin's underlying device. This doesn't cause
|
||||
copy-out to the snapshot exception store because the snapshot-origin
|
||||
target is bypassed.
|
||||
|
||||
The discard_passdown_origin feature depends on the discard_zeroes_cow
|
||||
feature being enabled.
|
||||
|
||||
|
||||
- snapshot-merge <origin> <COW device> <persistent> <chunksize>
|
||||
[<# feature args> [<arg>]*]
|
||||
|
||||
takes the same table arguments as the snapshot target except it only
|
||||
works with persistent snapshots. This target assumes the role of the
|
||||
"snapshot-origin" target and must not be loaded if the "snapshot-origin"
|
||||
is still present for <origin>.
|
||||
|
||||
Creates a merging snapshot that takes control of the changed chunks
|
||||
stored in the <COW device> of an existing snapshot, through a handover
|
||||
procedure, and merges these chunks back into the <origin>. Once merging
|
||||
has started (in the background) the <origin> may be opened and the merge
|
||||
will continue while I/O is flowing to it. Changes to the <origin> are
|
||||
deferred until the merging snapshot's corresponding chunk(s) have been
|
||||
merged. Once merging has started the snapshot device, associated with
|
||||
the "snapshot" target, will return -EIO when accessed.
|
||||
|
||||
|
||||
How snapshot is used by LVM2
|
||||
============================
|
||||
When you create the first LVM2 snapshot of a volume, four dm devices are used:
|
||||
|
||||
1) a device containing the original mapping table of the source volume;
|
||||
2) a device used as the <COW device>;
|
||||
3) a "snapshot" device, combining #1 and #2, which is the visible snapshot
|
||||
volume;
|
||||
4) the "original" volume (which uses the device number used by the original
|
||||
source volume), whose table is replaced by a "snapshot-origin" mapping
|
||||
from device #1.
|
||||
|
||||
A fixed naming scheme is used, so with the following commands::
|
||||
|
||||
lvcreate -L 1G -n base volumeGroup
|
||||
lvcreate -L 100M --snapshot -n snap volumeGroup/base
|
||||
|
||||
we'll have this situation (with volumes in above order)::
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
|
||||
volumeGroup-base: 0 2097152 snapshot-origin 254:11
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
|
||||
brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
|
||||
brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
|
||||
|
||||
|
||||
How snapshot-merge is used by LVM2
|
||||
==================================
|
||||
A merging snapshot assumes the role of the "snapshot-origin" while
|
||||
merging. As such the "snapshot-origin" is replaced with
|
||||
"snapshot-merge". The "-real" device is not changed and the "-cow"
|
||||
device is renamed to <origin name>-cow to aid LVM2's cleanup of the
|
||||
merging snapshot after it completes. The "snapshot" that hands over its
|
||||
COW device to the "snapshot-merge" is deactivated (unless using lvchange
|
||||
--refresh); but if it is left active it will simply return I/O errors.
|
||||
|
||||
A snapshot will merge into its origin with the following command::
|
||||
|
||||
lvconvert --merge volumeGroup/snap
|
||||
|
||||
we'll now have this situation::
|
||||
|
||||
# dmsetup table|grep volumeGroup
|
||||
|
||||
volumeGroup-base-real: 0 2097152 linear 8:19 384
|
||||
volumeGroup-base-cow: 0 204800 linear 8:19 2097536
|
||||
volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
|
||||
|
||||
# ls -lL /dev/mapper/volumeGroup-*
|
||||
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
|
||||
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
|
||||
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
|
||||
|
||||
|
||||
How to determine when a merging is complete
|
||||
===========================================
|
||||
The snapshot-merge and snapshot status lines end with:
|
||||
|
||||
<sectors_allocated>/<total_sectors> <metadata_sectors>
|
||||
|
||||
Both <sectors_allocated> and <total_sectors> include both data and metadata.
|
||||
During merging, the number of sectors allocated gets smaller and
|
||||
smaller. Merging has finished when the number of sectors holding data
|
||||
is zero, in other words <sectors_allocated> == <metadata_sectors>.
|
||||
|
||||
Here is a practical example (using a hybrid of lvm and dmsetup commands)::
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
||||
snap volumeGroup swi-a- 1.00g base 18.97
|
||||
|
||||
# dmsetup status volumeGroup-snap
|
||||
0 8388608 snapshot 397896/2097152 1560
|
||||
^^^^ metadata sectors
|
||||
|
||||
# lvconvert --merge -b volumeGroup/snap
|
||||
Merging of volume snap started.
|
||||
|
||||
# lvs volumeGroup/snap
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup Owi-a- 4.00g 17.23
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 281688/2097152 1104
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 180480/2097152 712
|
||||
|
||||
# dmsetup status volumeGroup-base
|
||||
0 8388608 snapshot-merge 16/2097152 16
|
||||
|
||||
Merging has finished.
|
||||
|
||||
::
|
||||
|
||||
# lvs
|
||||
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
|
||||
base volumeGroup owi-a- 4.00g
|
Documentation/admin-guide/device-mapper/statistics.rst
=============
|
||||
DM statistics
|
||||
=============
|
||||
|
||||
Device Mapper supports the collection of I/O statistics on user-defined
|
||||
regions of a DM device. If no regions are defined no statistics are
|
||||
collected so there isn't any performance impact. Only bio-based DM
|
||||
devices are currently supported.
|
||||
|
||||
Each user-defined region specifies a starting sector, length and step.
|
||||
Individual statistics will be collected for each step-sized area within
|
||||
the range specified.
|
||||
|
||||
The I/O statistics counters for each step-sized area of a region are
|
||||
in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
|
||||
Documentation/iostats.txt). But two extra counters (12 and 13) are
|
||||
provided: total time spent reading and writing. When the histogram
|
||||
argument is used, the 14th parameter is reported that represents the
|
||||
histogram of latencies. All these counters may be accessed by sending
|
||||
the @stats_print message to the appropriate DM device via dmsetup.
|
||||
|
||||
The reported times are in milliseconds and the granularity depends on
|
||||
the kernel ticks. When the option precise_timestamps is used, the
|
||||
reported times are in nanoseconds.
|
||||
|
||||
Each region has a corresponding unique identifier, which we call a
|
||||
region_id, that is assigned when the region is created. The region_id
|
||||
must be supplied when querying statistics about the region, deleting the
|
||||
region, etc. Unique region_ids enable multiple userspace programs to
|
||||
request and process statistics for the same DM device without stepping
|
||||
on each other's data.
|
||||
|
||||
The creation of DM statistics will allocate memory via kmalloc or
|
||||
fallback to using vmalloc space. At most, 1/4 of the overall system
|
||||
memory may be allocated by DM statistics. The admin can see how much
|
||||
memory is used by reading:
|
||||
|
||||
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
|
||||
|
||||
Messages
|
||||
========
|
||||
|
||||
@stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
|
||||
Create a new region and return the region_id.
|
||||
|
||||
<range>
|
||||
"-"
|
||||
whole device
|
||||
"<start_sector>+<length>"
|
||||
a range of <length> 512-byte sectors
|
||||
starting with <start_sector>.
|
||||
|
||||
<step>
|
||||
"<area_size>"
|
||||
the range is subdivided into areas each containing
|
||||
<area_size> sectors.
|
||||
"/<number_of_areas>"
|
||||
the range is subdivided into the specified
|
||||
number of areas.
|
||||
|
||||
<number_of_optional_arguments>
|
||||
The number of optional arguments
|
||||
|
||||
<optional_arguments>
|
||||
The following optional arguments are supported:
|
||||
|
||||
precise_timestamps
|
||||
use precise timer with nanosecond resolution
|
||||
instead of the "jiffies" variable. When this argument is
|
||||
used, the resulting times are in nanoseconds instead of
|
||||
milliseconds. Precise timestamps are a little bit slower
|
||||
to obtain than jiffies-based timestamps.
|
||||
histogram:n1,n2,n3,n4,...
|
||||
collect histogram of latencies. The
|
||||
numbers n1, n2, etc are times that represent the boundaries
|
||||
of the histogram. If precise_timestamps is not used, the
|
||||
times are in milliseconds, otherwise they are in
|
||||
nanoseconds. For each range, the kernel will report the
|
||||
number of requests that completed within this range. For
|
||||
example, if we use "histogram:10,20,30", the kernel will
|
||||
report four numbers a:b:c:d. a is the number of requests
|
||||
that took 0-10 ms to complete, b is the number of requests
|
||||
that took 10-20 ms to complete, c is the number of requests
|
||||
that took 20-30 ms to complete and d is the number of
|
||||
requests that took more than 30 ms to complete.
|
||||
|
||||
<program_id>
|
||||
An optional parameter. A name that uniquely identifies
|
||||
the userspace owner of the range. This groups ranges together
|
||||
so that userspace programs can identify the ranges they
|
||||
created and ignore those created by others.
|
||||
The kernel returns this string back in the output of
|
||||
@stats_list message, but it doesn't use it for anything else.
|
||||
If we omit the number of optional arguments, program id must not
|
||||
be a number, otherwise it would be interpreted as the number of
|
||||
optional arguments.
|
||||
|
||||
<aux_data>
|
||||
An optional parameter. A word that provides auxiliary data
|
||||
that is useful to the client program that created the range.
|
||||
The kernel returns this string back in the output of
|
||||
@stats_list message, but it doesn't use this value for anything.
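
Putting the arguments together: a region covering the whole of a device
named 'vol', split into 1000 areas, with a latency histogram and a program
id of 'my_tool' (both names are illustrative), could be created with::

  dmsetup message vol 0 @stats_create - /1000 1 histogram:10,20,30 my_tool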
|
||||
|
||||
@stats_delete <region_id>
|
||||
Delete the region with the specified id.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
@stats_clear <region_id>
|
||||
Clear all the counters except the in-flight i/o counters.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
@stats_list [<program_id>]
|
||||
List all regions registered with @stats_create.
|
||||
|
||||
<program_id>
|
||||
An optional parameter.
|
||||
If this parameter is specified, only matching regions
|
||||
are returned.
|
||||
If it is not specified, all regions are returned.
|
||||
|
||||
Output format:
|
||||
<region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
|
||||
precise_timestamps histogram:n1,n2,n3,...
|
||||
|
||||
The strings "precise_timestamps" and "histogram" are printed only
|
||||
if they were specified when creating the region.
|
||||
|
||||
@stats_print <region_id> [<starting_line> <number_of_lines>]
|
||||
Print counters for each step-sized area of a region.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
<starting_line>
|
||||
The index of the starting line in the output.
|
||||
If omitted, all lines are returned.
|
||||
|
||||
<number_of_lines>
|
||||
The number of lines to include in the output.
|
||||
If omitted, all lines are returned.
|
||||
|
||||
Output format for each step-sized area of a region:
|
||||
|
||||
<start_sector>+<length>
|
||||
counters
|
||||
|
||||
The first 11 counters have the same meaning as
|
||||
`/sys/block/*/stat` or `/proc/diskstats`.
|
||||
|
||||
Please refer to Documentation/iostats.txt for details.
|
||||
|
||||
1. the number of reads completed
|
||||
2. the number of reads merged
|
||||
3. the number of sectors read
|
||||
4. the number of milliseconds spent reading
|
||||
5. the number of writes completed
|
||||
6. the number of writes merged
|
||||
7. the number of sectors written
|
||||
8. the number of milliseconds spent writing
|
||||
9. the number of I/Os currently in progress
|
||||
10. the number of milliseconds spent doing I/Os
|
||||
11. the weighted number of milliseconds spent doing I/Os
|
||||
|
||||
Additional counters:
|
||||
|
||||
12. the total time spent reading in milliseconds
|
||||
13. the total time spent writing in milliseconds
|
||||
|
||||
@stats_print_clear <region_id> [<starting_line> <number_of_lines>]
|
||||
Atomically print and then clear all the counters except the
|
||||
in-flight i/o counters. Useful when the client consuming the
|
||||
statistics does not want to lose any statistics (those updated
|
||||
between printing and clearing).
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
<starting_line>
|
||||
The index of the starting line in the output.
|
||||
If omitted, all lines are printed and then cleared.
|
||||
|
||||
<number_of_lines>
|
||||
The number of lines to process.
|
||||
If omitted, all lines are printed and then cleared.
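
For example (using the same device name as in the Examples section below),
the counters of region 0 can be dumped and reset in a single step with::

  dmsetup message vol 0 @stats_print_clear 0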
|
||||
|
||||
@stats_set_aux <region_id> <aux_data>
|
||||
Store auxiliary data aux_data for the specified region.
|
||||
|
||||
<region_id>
|
||||
region_id returned from @stats_create
|
||||
|
||||
<aux_data>
|
||||
The string that identifies data which is useful to the client
|
||||
program that created the range. The kernel returns this
|
||||
string back in the output of @stats_list message, but it
|
||||
doesn't use this value for anything.
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
Subdivide the DM device 'vol' into 100 pieces and start collecting
|
||||
statistics on them::
|
||||
|
||||
dmsetup message vol 0 @stats_create - /100
|
||||
|
||||
Set the auxiliary data string to "foo bar baz" (the escape for each
|
||||
space must also be escaped, otherwise the shell will consume them)::
|
||||
|
||||
dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
|
||||
|
||||
List the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_list
|
||||
|
||||
Print the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_print 0
|
||||
|
||||
Delete the statistics::
|
||||
|
||||
dmsetup message vol 0 @stats_delete 0
|
Documentation/admin-guide/device-mapper/striped.rst
=========
|
||||
dm-stripe
|
||||
=========
|
||||
|
||||
Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
|
||||
device across one or more underlying devices. Data is written in "chunks",
|
||||
with consecutive chunks rotating among the underlying devices. This can
|
||||
potentially provide improved I/O throughput by utilizing several physical
|
||||
devices in parallel.
|
||||
|
||||
Parameters: <num devs> <chunk size> [<dev path> <offset>]+
|
||||
<num devs>:
|
||||
Number of underlying devices.
|
||||
<chunk size>:
|
||||
Size of each chunk of data. Must be at least as
|
||||
large as the system's PAGE_SIZE.
|
||||
<dev path>:
|
||||
Full pathname to the underlying block-device, or a
|
||||
"major:minor" device-number.
|
||||
<offset>:
|
||||
Starting sector within the device.
|
||||
|
||||
One or more underlying devices can be specified. The striped device size must
|
||||
be a multiple of the chunk size multiplied by the number of underlying devices.
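
For instance, a minimal sketch (device names and size are illustrative)
striping two devices with 64KB (128-sector) chunks; 262144 sectors is a
multiple of 2 * 128::

  echo "0 262144 striped 2 128 /dev/sdb 0 /dev/sdc 0" | dmsetup create stripe_dev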
|
||||
|
||||
|
||||
Example scripts
|
||||
===============
|
||||
|
||||
::
|
||||
|
||||
#!/usr/bin/perl -w
|
||||
# Create a striped device across any number of underlying devices. The device
|
||||
# will be called "stripe_dev" and have a chunk-size of 128k.
|
||||
|
||||
my $chunk_size = 128 * 2;
|
||||
my $dev_name = "stripe_dev";
|
||||
my $num_devs = @ARGV;
|
||||
my @devs = @ARGV;
|
||||
my ($min_dev_size, $stripe_dev_size, $i);
|
||||
|
||||
if (!$num_devs) {
|
||||
die("Specify at least one device\n");
|
||||
}
|
||||
|
||||
$min_dev_size = `blockdev --getsz $devs[0]`;
|
||||
for ($i = 1; $i < $num_devs; $i++) {
|
||||
my $this_size = `blockdev --getsz $devs[$i]`;
|
||||
$min_dev_size = ($min_dev_size < $this_size) ?
|
||||
$min_dev_size : $this_size;
|
||||
}
|
||||
|
||||
$stripe_dev_size = $min_dev_size * $num_devs;
|
||||
$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
|
||||
|
||||
$table = "0 $stripe_dev_size striped $num_devs $chunk_size";
|
||||
for ($i = 0; $i < $num_devs; $i++) {
|
||||
$table .= " $devs[$i] 0";
|
||||
}
|
||||
|
||||
`echo $table | dmsetup create $dev_name`;
|
Documentation/admin-guide/device-mapper/switch.rst
=========
|
||||
dm-switch
|
||||
=========
|
||||
|
||||
The device-mapper switch target creates a device that supports an
|
||||
arbitrary mapping of fixed-size regions of I/O across a fixed set of
|
||||
paths. The path used for any specific region can be switched
|
||||
dynamically by sending the target a message.
|
||||
|
||||
It maps I/O to underlying block devices efficiently when there is a large
|
||||
number of fixed-sized address regions but there is no simple pattern
|
||||
that would allow for a compact representation of the mapping such as
|
||||
dm-stripe.
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
Dell EqualLogic and some other iSCSI storage arrays use a distributed
|
||||
frameless architecture. In this architecture, the storage group
|
||||
consists of a number of distinct storage arrays ("members") each having
|
||||
independent controllers, disk storage and network adapters. When a LUN
|
||||
is created it is spread across multiple members. The details of the
|
||||
spreading are hidden from initiators connected to this storage system.
|
||||
The storage group exposes a single target discovery portal, no matter
|
||||
how many members are being used. When iSCSI sessions are created, each
|
||||
session is connected to an eth port on a single member. Data to a LUN
|
||||
can be sent on any iSCSI session, and if the blocks being accessed are
|
||||
stored on another member the I/O will be forwarded as required. This
|
||||
forwarding is invisible to the initiator. The storage layout is also
|
||||
dynamic, and the blocks stored on disk may be moved from member to
|
||||
member as needed to balance the load.
|
||||
|
||||
This architecture simplifies the management and configuration of both
|
||||
the storage group and initiators. In a multipathing configuration, it
|
||||
is possible to set up multiple iSCSI sessions to use multiple network
|
||||
interfaces on both the host and target to take advantage of the
|
||||
increased network bandwidth. An initiator could use a simple round
|
||||
robin algorithm to send I/O across all paths and let the storage array
|
||||
members forward it as necessary, but there is a performance advantage to
|
||||
sending data directly to the correct member.
|
||||
|
||||
A device-mapper table already lets you map different regions of a
|
||||
device onto different targets. However in this architecture the LUN is
|
||||
spread with an address region size on the order of 10s of MBs, which
|
||||
means the resulting table could have more than a million entries and
|
||||
consume far too much memory.
|
||||
|
||||
Using this device-mapper switch target we can now build a two-layer
|
||||
device hierarchy:
|
||||
|
||||
Upper Tier - Determine which array member the I/O should be sent to.
|
||||
Lower Tier - Load balance amongst paths to a particular member.
|
||||
|
||||
The lower tier consists of a single dm multipath device for each member.
|
||||
Each of these multipath devices contains the set of paths directly to
|
||||
the array member in one priority group, and leverages existing path
|
||||
selectors to load balance amongst these paths. We also build a
|
||||
non-preferred priority group containing paths to other array members for
|
||||
failover reasons.
|
||||
|
||||
The upper tier consists of a single dm-switch device. This device uses
|
||||
a bitmap to look up the location of the I/O and choose the appropriate
|
||||
lower tier device to route the I/O. By using a bitmap we are able to
|
||||
use 4 bits for each address range in a 16 member group (which is very
|
||||
large for us). This is a much denser representation than the dm table
|
||||
b-tree can achieve.
|
||||
|
||||
Construction Parameters
|
||||
=======================
|
||||
|
||||
<num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
|
||||
<num_paths>
|
||||
The number of paths across which to distribute the I/O.
|
||||
|
||||
<region_size>
|
||||
The number of 512-byte sectors in a region. Each region can be redirected
|
||||
to any of the available paths.
|
||||
|
||||
<num_optional_args>
|
||||
The number of optional arguments. Currently, no optional arguments
|
||||
are supported and so this must be zero.
|
||||
|
||||
<dev_path>
|
||||
The block device that represents a specific path to the device.
|
||||
|
||||
<offset>
|
||||
The offset of the start of data on the specific <dev_path> (in units
|
||||
of 512-byte sectors). This number is added to the sector number when
|
||||
forwarding the request to the specific path. Typically it is zero.
|
||||
|
||||
Messages
|
||||
========
|
||||
|
||||
set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
|
||||
|
||||
Modify the region table by specifying which regions are redirected to
|
||||
which paths.
|
||||
|
||||
<index>
|
||||
The region number (region size was specified in constructor parameters).
|
||||
If index is omitted, the next region (previous index + 1) is used.
|
||||
Expressed in hexadecimal (WITHOUT any prefix like 0x).
|
||||
|
||||
<path_nr>
|
||||
The path number in the range 0 ... (<num_paths> - 1).
|
||||
Expressed in hexadecimal (WITHOUT any prefix like 0x).
|
||||
|
||||
R<n>,<m>
|
||||
This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
|
||||
are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
|
||||
slots.
|
||||
|
||||
Status
|
||||
======
|
||||
|
||||
No status line is reported.
|
||||
|
||||
Example
|
||||
=======
|
||||
|
||||
Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
|
||||
the same size.
|
||||
|
||||
Create a switch device with 64kB region size::
|
||||
|
||||
dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
|
||||
switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
|
||||
|
||||
Set mappings for the first 7 entries to point to devices switch0, switch1,
|
||||
switch2, switch0, switch1, switch2, switch1::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
|
||||
|
||||
Set repetitive mapping. This command::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
|
||||
|
||||
is equivalent to::
|
||||
|
||||
dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
|
||||
:1 :2 :1 :2 :1 :2 :1 :2 :1 :2
|
Documentation/admin-guide/device-mapper/thin-provisioning.rst
=================
|
||||
Thin provisioning
|
||||
=================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
This document describes a collection of device-mapper targets that
|
||||
between them implement thin-provisioning and snapshots.
|
||||
|
||||
The main highlight of this implementation, compared to the previous
|
||||
implementation of snapshots, is that it allows many virtual devices to
|
||||
be stored on the same data volume. This simplifies administration and
|
||||
allows the sharing of data between volumes, thus reducing disk usage.
|
||||
|
||||
Another significant feature is support for an arbitrary depth of
|
||||
recursive snapshots (snapshots of snapshots of snapshots ...). The
|
||||
previous implementation of snapshots did this by chaining together
|
||||
lookup tables, and so performance was O(depth). This new
|
||||
implementation uses a single data structure to avoid this degradation
|
||||
with depth. Fragmentation may still be an issue, however, in some
|
||||
scenarios.
|
||||
|
||||
Metadata is stored on a separate device from data, giving the
|
||||
administrator some freedom, for example to:
|
||||
|
||||
- Improve metadata resilience by storing metadata on a mirrored volume
|
||||
but data on a non-mirrored one.
|
||||
|
||||
- Improve performance by storing the metadata on SSD.
|
||||
|
||||
Status
|
||||
======
|
||||
|
||||
These targets are considered safe for production use. But different use
|
||||
cases will have different performance characteristics, for example due
|
||||
to fragmentation of the data volume.
|
||||
|
||||
If you find this software is not performing as expected please mail
|
||||
dm-devel@redhat.com with details and we'll try our best to improve
|
||||
things for you.
|
||||
|
||||
Userspace tools for checking and repairing the metadata have been fully
|
||||
developed and are available as 'thin_check' and 'thin_repair'. The name
|
||||
of the package that provides these utilities varies by distribution (on
|
||||
a Red Hat distribution it is named 'device-mapper-persistent-data').
|
||||
|
||||
Cookbook
|
||||
========
|
||||
|
||||
This section describes some quick recipes for using thin provisioning.
|
||||
They use the dmsetup program to control the device-mapper driver
|
||||
directly. End users will be advised to use a higher-level volume
|
||||
manager such as LVM2 once support has been added.
|
||||
|
||||
Pool device
|
||||
-----------
|
||||
|
||||
The pool device ties together the metadata volume and the data volume.
|
||||
It maps I/O linearly to the data volume and updates the metadata via
|
||||
two mechanisms:
|
||||
|
||||
- Function calls from the thin targets
|
||||
|
||||
- Device-mapper 'messages' from userspace which control the creation of new
|
||||
virtual devices amongst other things.
|
||||
|
||||
Setting up a fresh pool device
|
||||
------------------------------
|
||||
|
||||
Setting up a pool device requires a valid metadata device, and a
|
||||
data device. If you do not have an existing metadata device you can
|
||||
make one by zeroing the first 4k to indicate empty metadata::
|
||||
|
||||
dd if=/dev/zero of=$metadata_dev bs=4096 count=1
|
||||
|
||||
The amount of metadata you need will vary according to how many blocks
|
||||
are shared between thin devices (i.e. through snapshots). If you have
|
||||
less sharing than average you'll need a larger-than-average metadata device.
|
||||
|
||||
As a guide, we suggest you calculate the number of bytes to use in the
|
||||
metadata device as 48 * $data_dev_size / $data_block_size but round it up
|
||||
to 2MB if the answer is smaller. If you're creating large numbers of
|
||||
snapshots which are recording large amounts of change, you may find you
|
||||
need to increase this.
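
As an illustrative sketch of that calculation (the numbers are made up), a
100GiB data device with a 64KiB block size works out to roughly 75MiB of
metadata::

  data_dev_size=$((100 * 1024 * 1024 * 1024))      # bytes
  data_block_size=$((64 * 1024))                   # bytes
  echo $((48 * data_dev_size / data_block_size))   # 78643200 bytes, ~75MiB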
|
||||
|
||||
The largest size supported is 16GB: If the device is larger,
|
||||
a warning will be issued and the excess space will not be used.
|
||||
|
||||
Reloading a pool table
|
||||
----------------------
|
||||
|
||||
You may reload a pool's table; indeed, this is how the pool is resized
|
||||
if it runs out of space. (N.B. While specifying a different metadata
|
||||
device when reloading is not forbidden at the moment, things will go
|
||||
wrong if it does not route I/O to exactly the same on-disk location as
|
||||
previously.)
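
As a sketch of such a resize (the new size in sectors is illustrative and
the variables are those of the example below)::

  dmsetup suspend pool
  dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev $data_dev \
          $data_block_size $low_water_mark"
  dmsetup resume pool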
|
||||
|
||||
Using an existing pool device
|
||||
-----------------------------
|
||||
|
||||
::
|
||||
|
||||
dmsetup create pool \
|
||||
--table "0 20971520 thin-pool $metadata_dev $data_dev \
|
||||
$data_block_size $low_water_mark"
|
||||
|
||||
$data_block_size gives the smallest unit of disk space that can be
|
||||
allocated at a time expressed in units of 512-byte sectors.
|
||||
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
|
||||
multiple of 128 (64KB). $data_block_size cannot be changed after the
|
||||
thin-pool is created. People primarily interested in thin provisioning
|
||||
may want to use a value such as 1024 (512KB). People doing lots of
|
||||
snapshotting may want a smaller value such as 128 (64KB). If you are
|
||||
not zeroing newly-allocated data, a larger $data_block_size in the
|
||||
region of 256000 (128MB) is suggested.
|
||||
|
||||
$low_water_mark is expressed in blocks of size $data_block_size. If
|
||||
free space on the data device drops below this level then a dm event
|
||||
will be triggered which a userspace daemon should catch allowing it to
|
||||
extend the pool device. Only one such event will be sent.
|
||||
|
||||
No special event is triggered if a just resumed device's free space is below
|
||||
the low water mark. However, resuming a device always triggers an
|
||||
event; a userspace daemon should verify that free space exceeds the low
|
||||
water mark when handling this event.
|
||||
|
||||
A low water mark for the metadata device is maintained in the kernel and
|
||||
will trigger a dm event if free space on the metadata device drops below
|
||||
it.
|
||||
|
||||
Updating on-disk metadata
|
||||
-------------------------
|
||||
|
||||
On-disk metadata is committed every time a FLUSH or FUA bio is written.
|
||||
If no such requests are made then commits will occur every second. This
|
||||
means the thin-provisioning target behaves like a physical disk that has
|
||||
a volatile write cache. If power is lost you may lose some recent
|
||||
writes. The metadata should always be consistent in spite of any crash.
|
||||
|
||||
If data space is exhausted the pool will either error or queue IO
|
||||
according to the configuration (see: error_if_no_space). If metadata
|
||||
space is exhausted or a metadata operation fails: the pool will error IO
|
||||
until the pool is taken offline and repair is performed to 1) fix any
|
||||
potential inconsistencies and 2) clear the flag that imposes repair.
|
||||
Once the pool's metadata device is repaired it may be resized, which
|
||||
will allow the pool to return to normal operation. Note that if a pool
|
||||
is flagged as needing repair, the pool's data and metadata devices
|
||||
cannot be resized until repair is performed. It should also be noted
|
||||
that when the pool's metadata space is exhausted the current metadata
|
||||
transaction is aborted. Given that the pool will cache IO whose
|
||||
completion may have already been acknowledged to upper IO layers
|
||||
(e.g. filesystem) it is strongly suggested that consistency checks
|
||||
(e.g. fsck) be performed on those layers when repair of the pool is
|
||||
required.
|
||||
|
||||
Thin provisioning
|
||||
-----------------
|
||||
|
||||
i) Creating a new thinly-provisioned volume.
|
||||
|
||||
To create a new thinly-provisioned volume you must send a message to an
|
||||
active pool device, /dev/mapper/pool in this example::
|
||||
|
||||
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||||
|
||||
Here '0' is an identifier for the volume, a 24-bit number. It's up
|
||||
to the caller to allocate and manage these identifiers. If the
|
||||
identifier is already in use, the message will fail with -EEXIST.
|
||||
|
||||
ii) Using a thinly-provisioned volume.
|
||||
|
||||
Thinly-provisioned volumes are activated using the 'thin' target::
|
||||
|
||||
dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
|
||||
|
||||
The last parameter is the identifier for the thinp device.
|
||||
|
||||
Internal snapshots
|
||||
------------------
|
||||
|
||||
i) Creating an internal snapshot.
|
||||
|
||||
Snapshots are created with another message to the pool.
|
||||
|
||||
N.B. If the origin device that you wish to snapshot is active, you
|
||||
must suspend it before creating the snapshot to avoid corruption.
|
||||
This is NOT enforced at the moment, so please be careful!
|
||||
|
||||
::
|
||||
|
||||
dmsetup suspend /dev/mapper/thin
|
||||
dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
|
||||
dmsetup resume /dev/mapper/thin
|
||||
|
||||
Here '1' is the identifier for the volume, a 24-bit number. '0' is the
|
||||
identifier for the origin device.
|
||||
|
||||
ii) Using an internal snapshot.
|
||||
|
||||
Once created, the user doesn't have to worry about any connection
|
||||
between the origin and the snapshot. Indeed the snapshot is no
|
||||
different from any other thinly-provisioned device and can be
|
||||
snapshotted itself via the same method. It's perfectly legal to
|
||||
have only one of them active, and there's no ordering requirement on
|
||||
activating or removing them both. (This differs from conventional
|
||||
device-mapper snapshots.)
|
||||
|
||||
Activate it exactly the same way as any other thinly-provisioned volume::
|
||||
|
||||
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
|
||||
|
||||
External snapshots
|
||||
------------------
|
||||
|
||||
You can use an external **read only** device as an origin for a
|
||||
thinly-provisioned volume. Any read to an unprovisioned area of the
|
||||
thin device will be passed through to the origin. Writes trigger
|
||||
the allocation of new blocks as usual.
|
||||
|
||||
One use case for this is VM hosts that want to run guests on
|
||||
thinly-provisioned volumes but have the base image on another device
|
||||
(possibly shared between many VMs).
|
||||
|
||||
You must not write to the origin device if you use this technique!
|
||||
Of course, you may write to the thin device and take internal snapshots
|
||||
of the thin volume.
|
||||
|
||||
i) Creating a snapshot of an external device
|
||||
|
||||
This is the same as creating a thin device.
|
||||
You don't mention the origin at this stage.
|
||||
|
||||
::
|
||||
|
||||
dmsetup message /dev/mapper/pool 0 "create_thin 0"
|
||||
|
||||
ii) Using a snapshot of an external device.
|
||||
|
||||
Append an extra parameter to the thin target specifying the origin::
|
||||
|
||||
dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
|
||||
|
||||
N.B. All descendants (internal snapshots) of this snapshot require the
|
||||
same extra origin parameter.
|
||||
|
||||
Deactivation
|
||||
------------
|
||||
|
||||
All devices using a pool must be deactivated before the pool itself
|
||||
can be.
|
||||
|
||||
::
|
||||
|
||||
dmsetup remove thin
|
||||
dmsetup remove snap
|
||||
dmsetup remove pool
|
||||
|
||||
Reference
|
||||
=========
|
||||
|
||||
'thin-pool' target
|
||||
------------------
|
||||
|
||||
i) Constructor
|
||||
|
||||
::
|
||||
|
||||
thin-pool <metadata dev> <data dev> <data block size (sectors)> \
|
||||
<low water mark (blocks)> [<number of feature args> [<arg>]*]
|
||||
|
||||
Optional feature arguments:
|
||||
|
||||
skip_block_zeroing:
|
||||
Skip the zeroing of newly-provisioned blocks.
|
||||
|
||||
ignore_discard:
|
||||
Disable discard support.
|
||||
|
||||
no_discard_passdown:
|
||||
Don't pass discards down to the underlying
|
||||
data device, but just remove the mapping.
|
||||
|
||||
read_only:
|
||||
Don't allow any changes to be made to the pool
|
||||
metadata. This mode is only available after the
|
||||
thin-pool has been created and first used in full
|
||||
read/write mode. It cannot be specified on initial
|
||||
thin-pool creation.
|
||||
|
||||
error_if_no_space:
|
||||
Error IOs, instead of queueing, if no space.
|
||||
|
||||
Data block size must be between 64KB (128 sectors) and 1GB
|
||||
(2097152 sectors) inclusive.
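
For instance, a pool that skips block zeroing and errors I/O instead of
queueing it when space runs out could be created with (a sketch; device
names and sizes are illustrative)::

  dmsetup create pool \
      --table "0 20971520 thin-pool $metadata_dev $data_dev \
               128 32768 2 skip_block_zeroing error_if_no_space"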
|
||||
|
||||
|
||||
ii) Status
|
||||
|
||||
::
|
||||
|
||||
<transaction id> <used metadata blocks>/<total metadata blocks>
|
||||
<used data blocks>/<total data blocks> <held metadata root>
|
||||
ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
|
||||
needs_check|- metadata_low_watermark
|
||||
|
||||
transaction id:
|
||||
A 64-bit number used by userspace to help synchronise with metadata
|
||||
from volume managers.
|
||||
|
||||
used data blocks / total data blocks
|
||||
If the number of free blocks drops below the pool's low water mark a
|
||||
dm event will be sent to userspace. This event is edge-triggered and
|
||||
it will occur only once after each resume so volume manager writers
|
||||
should register for the event and then check the target's status.
|
||||
|
||||
held metadata root:
|
||||
The location, in blocks, of the metadata root that has been
|
||||
'held' for userspace read access. '-' indicates there is no
|
||||
held root.
|
||||
|
||||
discard_passdown|no_discard_passdown
|
||||
Whether or not discards are actually being passed down to the
|
||||
underlying device. When this is enabled when loading the table,
|
||||
it can get disabled if the underlying device doesn't support it.
|
||||
|
||||
ro|rw|out_of_data_space
|
||||
If the pool encounters certain types of device failures it will
|
||||
drop into a read-only metadata mode in which no changes to
|
||||
the pool metadata (like allocating new blocks) are permitted.
|
||||
|
||||
In serious cases where even a read-only mode is deemed unsafe
|
||||
no further I/O will be permitted and the status will just
|
||||
contain the string 'Fail'. The userspace recovery tools
|
||||
should then be used.
|
||||
|
||||
error_if_no_space|queue_if_no_space
|
||||
If the pool runs out of data or metadata space, the pool will
|
||||
either queue or error the IO destined to the data device. The
|
||||
default is to queue the IO until more space is added or the
|
||||
'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
|
||||
module parameter can be used to change this timeout -- it
|
||||
defaults to 60 seconds but may be disabled using a value of 0.
|
||||
|
||||
needs_check
|
||||
A metadata operation has failed, resulting in the needs_check
|
||||
flag being set in the metadata's superblock. The metadata
|
||||
device must be deactivated and checked/repaired before the
|
||||
thin-pool can be made fully operational again. '-' indicates
|
||||
needs_check is not set.
|
||||
|
||||
metadata_low_watermark:
|
||||
Value of metadata low watermark in blocks. The kernel sets this
|
||||
value internally but userspace needs to know this value to
|
||||
determine if an event was caused by crossing this threshold.
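
As an illustration (all numbers invented), a healthy, writable pool might
report::

  # dmsetup status pool
  0 20971520 thin-pool 1 320/4096 10240/163840 - rw discard_passdown queue_if_no_space - 64

i.e. transaction id 1, 320 of 4096 metadata blocks used, 10240 of 163840
data blocks used, no held metadata root, read-write mode, discards passed
down, I/O queued when out of space, needs_check not set, and a metadata
low watermark of 64 blocks.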
|
||||
|
||||
iii) Messages
|
||||
|
||||
create_thin <dev id>
|
||||
Create a new thinly-provisioned device.
|
||||
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||||
the caller.
|
||||
|
||||
create_snap <dev id> <origin id>
|
||||
Create a new snapshot of another thinly-provisioned device.
|
||||
<dev id> is an arbitrary unique 24-bit identifier chosen by
|
||||
the caller.
|
||||
<origin id> is the identifier of the thinly-provisioned device
|
||||
of which the new device will be a snapshot.
|
||||
|
||||
delete <dev id>
|
||||
Deletes a thin device. Irreversible.
|
||||
|
||||
set_transaction_id <current id> <new id>
|
||||
Userland volume managers, such as LVM, need a way to
|
||||
synchronise their external metadata with the internal metadata of the
|
||||
pool target. The thin-pool target offers to store an
|
||||
arbitrary 64-bit transaction id and return it on the target's
|
||||
status line. To avoid races you must provide what you think
|
||||
the current transaction id is when you change it with this
|
||||
compare-and-swap message.
|
||||
|
||||
reserve_metadata_snap
|
||||
Reserve a copy of the data mapping btree for use by userland.
|
||||
This allows userland to inspect the mappings as they were when
|
||||
this message was executed. Use the pool's status command to
|
||||
get the root block associated with the metadata snapshot.
|
||||
|
||||
release_metadata_snap
|
||||
Release a previously reserved copy of the data mapping btree.
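
For example (a sketch; the pool device name matches the Cookbook above),
bumping the transaction id from 0 to 1 and then taking and dropping a
metadata snapshot::

  dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"
  dmsetup message /dev/mapper/pool 0 reserve_metadata_snap
  # inspect the held root via the pool's status line, then:
  dmsetup message /dev/mapper/pool 0 release_metadata_snap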
|
||||
|
||||
'thin' target
|
||||
-------------
|
||||
|
||||
i) Constructor
|
||||
|
||||
::
|
||||
|
||||
thin <pool dev> <dev id> [<external origin dev>]
|
||||
|
||||
pool dev:
|
||||
the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
|
||||
|
||||
dev id:
|
||||
the internal device identifier of the device to be
|
||||
activated.
|
||||
|
||||
external origin dev:
|
||||
an optional block device outside the pool to be treated as a
|
||||
read-only snapshot origin: reads to unprovisioned areas of the
|
||||
thin target will be mapped to this device.
|
||||
|
||||
The pool doesn't store any size against the thin devices. If you
|
||||
load a thin target that is smaller than you've been using previously,
|
||||
then you'll have no access to blocks mapped beyond the end. If you
|
||||
load a target that is bigger than before, then extra blocks will be
|
||||
provisioned as and when needed.
|
||||
|
||||
ii) Status
|
||||
|
||||
<nr mapped sectors> <highest mapped sector>
|
||||
If the pool has encountered device errors and failed, the status
|
||||
will just contain the string 'Fail'. The userspace recovery
|
||||
tools should then be used.
|
||||
|
||||
In the case where <nr mapped sectors> is 0, there is no highest
|
||||
mapped sector and the value of <highest mapped sector> is unspecified.
|
Documentation/admin-guide/device-mapper/unstriped.rst
================================
|
||||
Device-mapper "unstriped" target
|
||||
================================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
The device-mapper "unstriped" target provides a transparent mechanism to
|
||||
unstripe a device-mapper "striped" target to access the underlying disks
|
||||
without having to touch the true backing block-device. It can also be
|
||||
used to unstripe a hardware RAID-0 to access backing disks.
|
||||
|
||||
Parameters:
|
||||
<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
|
||||
|
||||
<number of stripes>
|
||||
The number of stripes in the RAID 0.
|
||||
|
||||
<chunk size>
|
||||
The size of each chunk, in 512B sectors.
|
||||
|
||||
<dev_path>
|
||||
The block device you wish to unstripe.
|
||||
|
||||
<stripe #>
|
||||
The stripe number within the device that corresponds to the physical
drive you wish to unstripe. This must be 0 indexed.
|
||||
|
||||
|
||||
Why use this module?
|
||||
====================
|
||||
|
||||
An example of undoing an existing dm-stripe
|
||||
-------------------------------------------
|
||||
|
||||
This small bash script will setup 4 loop devices and use the existing
|
||||
striped target to combine the 4 devices into one. It then will use
|
||||
the unstriped target on top of the striped device to access the
|
||||
individual backing loop devices. We write data to the newly exposed
|
||||
unstriped devices and verify the data written matches the correct
|
||||
underlying device on the striped array::
|
||||
|
||||
#!/bin/bash
|
||||
|
||||
MEMBER_SIZE=$((128 * 1024 * 1024))
|
||||
NUM=4
|
||||
SEQ_END=$((${NUM}-1))
|
||||
CHUNK=256
|
||||
BS=4096
|
||||
|
||||
RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
|
||||
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
|
||||
COUNT=$((${MEMBER_SIZE} / ${BS}))
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
|
||||
losetup /dev/loop${i} member-${i}
|
||||
DM_PARMS+=" /dev/loop${i} 0"
|
||||
done
|
||||
|
||||
echo $DM_PARMS | dmsetup create raid0
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
|
||||
done;
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
|
||||
diff /dev/mapper/set-${i} member-${i}
|
||||
done;
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
dmsetup remove set-${i}
|
||||
done
|
||||
|
||||
dmsetup remove raid0
|
||||
|
||||
for i in $(seq 0 ${SEQ_END}); do
|
||||
losetup -d /dev/loop${i}
|
||||
rm -f member-${i}
|
||||
done
|
||||
|
||||
Another example
|
||||
---------------
|
||||
|
||||
Intel NVMe drives contain two cores on the physical device.
|
||||
Each core of the drive has segregated access to its LBA range.
|
||||
The current LBA model has a RAID 0 128k chunk on each core, resulting
|
||||
in a 256k stripe across the two cores::
|
||||
|
||||
Core 0: Core 1:
|
||||
__________ __________
|
||||
| LBA 512| | LBA 768|
|
||||
| LBA 0 | | LBA 256|
|
||||
---------- ----------
|
||||
|
||||
The purpose of this unstriping is to provide better QoS in noisy
|
||||
neighbor environments. When two partitions are created on the
|
||||
aggregate drive without this unstriping, reads on one partition
|
||||
can affect writes on another partition. This is because the partitions
|
||||
are striped across the two cores. When we unstripe this hardware RAID 0
|
||||
and make partitions on each new exposed device the two partitions are now
|
||||
physically separated.
|
||||
|
||||
With the dm-unstriped target we're able to segregate an fio script that
|
||||
has read and write jobs that are independent of each other. Compared to
|
||||
when we run the test on a combined drive with partitions, we were able
|
||||
to get a 92% reduction in read latency using this device mapper target.
|
||||
|
||||
|
||||
Example dmsetup usage
|
||||
=====================
|
||||
|
||||
unstriped ontop of Intel NVMe device that has 2 cores
|
||||
-----------------------------------------------------
|
||||
|
||||
::
|
||||
|
||||
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
|
||||
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
|
||||
|
||||
There will now be two devices that expose Intel NVMe core 0 and 1
|
||||
respectively::
|
||||
|
||||
/dev/mapper/nvmset0
|
||||
/dev/mapper/nvmset1
|
||||
|
||||
unstriped ontop of striped with 4 drives using 128K chunk size
|
||||
--------------------------------------------------------------
|
||||
|
||||
::
|
||||
|
||||
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
|
||||
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
|
229
Documentation/admin-guide/device-mapper/verity.rst
Normal file

@@ -0,0 +1,229 @@
=========
dm-verity
=========

Device-Mapper's "verity" target provides transparent integrity checking of
block devices using a cryptographic digest provided by the kernel crypto API.
This target is read-only.

Construction Parameters
=======================

::

    <version> <dev> <hash_dev>
    <data_block_size> <hash_block_size>
    <num_data_blocks> <hash_start_block>
    <algorithm> <digest> <salt>
    [<#opt_params> <opt_params>]

<version>
    This is the type of the on-disk hash format.

    0 is the original format used in the Chromium OS.
      The salt is appended when hashing, digests are stored continuously and
      the rest of the block is padded with zeroes.

    1 is the current format that should be used for new devices.
      The salt is prepended when hashing and each digest is
      padded with zeroes to the power of two.

<dev>
    This is the device containing data, the integrity of which needs to be
    checked. It may be specified as a path, like /dev/sdaX, or a device number,
    <major>:<minor>.

<hash_dev>
    This is the device that supplies the hash tree data. It may be
    specified similarly to the device path and may be the same device. If the
    same device is used, the hash_start should be outside the configured
    dm-verity device.

<data_block_size>
    The block size on a data device in bytes.
    Each block corresponds to one digest on the hash device.

<hash_block_size>
    The size of a hash block in bytes.

<num_data_blocks>
    The number of data blocks on the data device. Additional blocks are
    inaccessible. You can place hashes on the same partition as the data; in
    this case the hashes are placed after <num_data_blocks>.

<hash_start_block>
    This is the offset, in <hash_block_size>-blocks, from the start of hash_dev
    to the root block of the hash tree.

<algorithm>
    The cryptographic hash algorithm used for this device. This should
    be the name of the algorithm, like "sha1".

<digest>
    The hexadecimal encoding of the cryptographic hash of the root hash block
    and the salt. This hash should be trusted as there is no other authenticity
    beyond this point.

<salt>
    The hexadecimal encoding of the salt value.

<#opt_params>
    Number of optional parameters. If there are no optional parameters,
    the optional parameters section can be skipped or #opt_params can be zero.
    Otherwise #opt_params is the number of following arguments
    (a fuller table-line sketch follows this parameter list).

    Example of optional parameters section:
        1 ignore_corruption

ignore_corruption
    Log corrupted blocks, but allow read operations to proceed normally.

restart_on_corruption
    Restart the system when a corrupted block is discovered. This option is
    not compatible with ignore_corruption and requires user space support to
    avoid restart loops.

ignore_zero_blocks
    Do not verify blocks that are expected to contain zeroes and always return
    zeroes instead. This may be useful if the partition contains unused blocks
    that are not guaranteed to contain zeroes.

use_fec_from_device <fec_dev>
    Use forward error correction (FEC) to recover from corruption if hash
    verification fails. Use encoding data from the specified device. This
    may be the same device where data and hash blocks reside, in which case
    fec_start must be outside the data and hash areas.

    If the encoding data covers additional metadata, it must be accessible
    on the hash device after the hash blocks.

    Note: block sizes for data and hash devices must match. Also, if the
    verity <dev> is encrypted the <fec_dev> should be too.

fec_roots <num>
    Number of generator roots. This equals the number of parity bytes in
    the encoding data. For example, in RS(M, N) encoding, the number of roots
    is M-N.

fec_blocks <num>
    The number of encoding data blocks on the FEC device. The block size for
    the FEC device is <data_block_size>.

fec_start <offset>
    This is the offset, in <data_block_size> blocks, from the start of the
    FEC device to the beginning of the encoding data.

check_at_most_once
    Verify data blocks only the first time they are read from the data device,
    rather than every time. This reduces the overhead of dm-verity so that it
    can be used on systems that are memory and/or CPU constrained. However, it
    provides a reduced level of security because only offline tampering of the
    data device's content will be detected, not online tampering.

    Hash blocks are still verified each time they are read from the hash device,
    since verification of hash blocks is less performance critical than data
    blocks, and a hash block will not be verified any more after all the data
    blocks it covers have been verified anyway.
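As a sketch only (the device paths and sizes are assumed, and <digest>/<salt>
stand in for real hexadecimal values), a table line using optional parameters
might look like this; note that #opt_params counts every following token, so a
parameter that takes a value, such as use_fec_from_device <fec_dev>, counts as
two::

    dmsetup create vroot --readonly --table \
      "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
      "<digest> <salt> "\
      "2 ignore_zero_blocks check_at_most_once"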
Theory of operation
===================

dm-verity is meant to be set up as part of a verified boot path. This
may be anything ranging from a boot using tboot or trustedgrub to just
booting from a known-good device (like a USB drive or CD).

When a dm-verity device is configured, it is expected that the caller
has been authenticated in some way (cryptographic signatures, etc).
After instantiation, all hashes will be verified on-demand during
disk access. If they cannot be verified up to the root node of the
tree, the root hash, then the I/O will fail. This should detect
tampering with any data on the device and the hash data.

Cryptographic hashes are used to assert the integrity of the device on a
per-block basis. This allows for a lightweight hash computation on first read
into the page cache. Block hashes are stored linearly, aligned to the nearest
block size.

If forward error correction (FEC) support is enabled any recovery of
corrupted data will be verified using the cryptographic hash of the
corresponding data. This is why combining error correction with
integrity checking is essential.

Hash Tree
---------

Each node in the tree is a cryptographic hash. If it is a leaf node, the hash
of some data block on disk is calculated. If it is an intermediary node,
the hash of a number of child nodes is calculated.

Each entry in the tree is a collection of neighboring nodes that fit in one
block. The number is determined based on block_size and the size of the
selected cryptographic digest algorithm. The hashes are linearly-ordered in
this entry and any unaligned trailing space is ignored but included when
calculating the parent node.
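For example, with the parameters used in the illustration below (block_size =
4096 and sha256, whose digest is 32 bytes), one entry holds 4096 / 32 = 128
hashes, which is why each interior entry fans out to 128 children.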
The tree looks something like:

alg = sha256, num_blocks = 32768, block_size = 4096

::

                                 [   root    ]
                                /    . . .    \
                     [entry_0]                 [entry_1]
                    /  . . .  \                 . . .   \
         [entry_0_0]   . . .  [entry_0_127]    . . . .  [entry_1_127]
           / ... \             /   . . .  \             /           \
     blk_0 ... blk_127  blk_16256   blk_16383      blk_32640 . . . blk_32767


On-disk format
==============

The verity kernel code does not read the verity metadata on-disk header.
It only reads the hash blocks which directly follow the header.
It is expected that a user-space tool will verify the integrity of the
verity header.

Alternatively, the header can be omitted and the dmsetup parameters can
be passed via the kernel command-line in a rooted chain of trust where
the command-line is verified.

Directly following the header (and with sector number padded to the next hash
block boundary) are the hash blocks which are stored a depth at a time
(starting from the root), sorted in order of increasing index.

The full specification of kernel parameters and on-disk metadata format
is available at the cryptsetup project's wiki page

  https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity

Status
======

V (for Valid) is returned if every check performed so far was valid.
If any check failed, C (for Corruption) is returned.
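A minimal sketch of checking this from user space (the device name vroot
matches the example below; the sample output is illustrative only)::

    # The single character after the target parameters is the verity status.
    dmsetup status vroot
    # e.g.  0 2097152 verity V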

Example
=======

Set up a device::

  # dmsetup create vroot --readonly --table \
    "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
    "4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
    "1234000000000000000000000000000000000000000000000000000000000000"

A command line tool veritysetup is available to compute or verify
the hash tree or activate the kernel device. This is available from
the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
(as a libcryptsetup extension).

Create hash on the device::

  # veritysetup format /dev/sda1 /dev/sda2
  ...
  Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076

Activate the device::

  # veritysetup create vroot /dev/sda1 /dev/sda2 \
    4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
79
Documentation/admin-guide/device-mapper/writecache.rst
Normal file

@@ -0,0 +1,79 @@
=================
Writecache target
=================

The writecache target caches writes on persistent memory or on SSD. It
doesn't cache reads because reads are supposed to be cached in page cache
in normal RAM.

When the device is constructed, the first sector should be zeroed or the
first sector should contain a valid superblock from a previous invocation.

Constructor parameters:

1. type of the cache device - "p" or "s"

   - p - persistent memory
   - s - SSD
2. the underlying device that will be cached
3. the cache device
4. block size (4096 is recommended; the maximum block size is the page
   size)
5. the number of optional parameters (the parameters with an argument
   count as two); see the construction sketch after this list

   start_sector n		(default: 0)
	offset from the start of cache device in 512-byte sectors
   high_watermark n		(default: 50)
	start writeback when the number of used blocks reaches this
	watermark
   low_watermark x		(default: 45)
	stop writeback when the number of used blocks drops below
	this watermark
   writeback_jobs n		(default: unlimited)
	limit the number of blocks that are in flight during
	writeback. Setting this value reduces writeback
	throughput, but it may improve latency of read requests
   autocommit_blocks n		(default: 64 for pmem, 65536 for ssd)
	when the application writes this number of blocks without
	issuing the FLUSH request, the blocks are automatically
	committed
   autocommit_time ms		(default: 1000)
	autocommit time in milliseconds. The data is automatically
	committed if this time passes and no FLUSH request is
	received
   fua				(by default on)
	applicable only to persistent memory - use the FUA flag
	when writing data from persistent memory back to the
	underlying device
   nofua
	applicable only to persistent memory - don't use the FUA
	flag when writing back data and send the FLUSH request
	afterwards

	- some underlying devices perform better with fua, some
	  with nofua. The user should test it
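A construction sketch only; the device paths, the device length of 4194304
sectors and the optional values below are all assumed for illustration::

  # SSD-backed writecache of /dev/sdb onto /dev/sdc with 4096-byte blocks;
  # "4" counts the optional tokens: two parameters, each with one argument.
  dmsetup create wc --table \
    "0 4194304 writecache s /dev/sdb /dev/sdc 4096 4 high_watermark 60 writeback_jobs 1024"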

Status:

1. error indicator - 0 if there was no error, otherwise error number
2. the number of blocks
3. the number of free blocks
4. the number of blocks under writeback

Messages:
	flush
		flush the cache device. The message returns successfully
		if the cache device was flushed without an error
	flush_on_suspend
		flush the cache device on next suspend. Use this message
		when you are going to remove the cache device. The proper
		sequence for removing the cache device is (a shell sketch
		of this sequence follows below):

		1. send the "flush_on_suspend" message
		2. load an inactive table with a linear target that maps
		   to the underlying device
		3. suspend the device
		4. ask for status and verify that there are no errors
		5. resume the device, so that it will use the linear
		   target
		6. the cache device is now inactive and it can be deleted
37
Documentation/admin-guide/device-mapper/zero.rst
Normal file

@@ -0,0 +1,37 @@
=======
dm-zero
=======

Device-Mapper's "zero" target provides a block-device that always returns
zero'd data on reads and silently drops writes. This is similar behavior to
/dev/zero, but as a block-device instead of a character-device.

Dm-zero has no target-specific parameters.

One very interesting use of dm-zero is for creating "sparse" devices in
conjunction with dm-snapshot. A sparse device reports a device-size larger
than the amount of actual storage space available for that device. A user can
write data anywhere within the sparse device and read it back like a normal
device. Reads to previously unwritten areas will return a zero'd buffer. When
enough data has been written to fill up the actual storage space, the sparse
device is deactivated. This can be very useful for testing device and
filesystem limitations.

To create a sparse device, start by creating a dm-zero device that's the
desired size of the sparse device. For this example, we'll assume a 10TB
sparse device::

  TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2`   # 10 TB in sectors
  echo "0 $TEN_TERABYTES zero" | dmsetup create zero1

Then create a snapshot of the zero device, using any available block-device as
the COW device. The size of the COW device will determine the amount of real
space available to the sparse device. For this example, we'll assume /dev/sdb1
is an available 10GB partition::

  echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
     dmsetup create sparse1

This will create a 10TB sparse device called /dev/mapper/sparse1 that has
10GB of actual storage space available. If more than 10GB of data is written
to this device, it will start returning I/O errors.
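As a hedged aside (sparse1 is the device created above and the sample output
is illustrative, not from a real run), how full the COW device is can be
watched through the snapshot target's status::

  # For a snapshot target, dmsetup status reports
  # <sectors allocated>/<total sectors> <metadata sectors>.
  dmsetup status sparse1
  # e.g.  0 21474836480 snapshot 16/20971520 16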