テストステ論 (On Testosterone)

The chairman of the High-Testosterone Association brings you information about testosterone.

(writeboost report) Working on the documentation

Satoru pointed out that "the documentation hasn't been updated at all," so, with a sigh, I decided to fix it and ended up redesigning it completely.

Backlog:

  • Write the text documentation in both en and ja
  • Update the PDF documentation
  • Clean up the code
  • Add two tests

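For now, I have written the documentation (en). The Japanese version can be fairly rough (it is my native language, so even a rough write-up gets the point across, and it is not the official one), so I am leaving it alone for now. Next up is the PDF, I suppose...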

I am pasting the en version into a Gist, so if you find mistakes or unnatural phrasing in the English, please point them out (a comment on the Gist is fine, Twitter is fine, anything is fine).

https://gist.github.com/akiradeveloper/505857264eb9fe5f7fd3

I am also pasting it into the blog for now.

Writeboost
==========
The Writeboost target provides block-level log-structured caching.
Accepted bios are put into a big "log" and the log is written to the cache
device sequentially.


Mechanism
=========
Writeboost caches only writes; reads are not cached.
However, this does not mean that it cannot improve the read performance of the
whole system.

For most storage systems, writes are more burdensome than reads
(cf. RAID penalty).
If the write load on the backing device drops, read performance improves
because the backing device can focus on processing reads.

There are two mechanisms that reduce the write load on the backing device:
1. Writeboost can cut the writes to the backing device by processing them on the
   cache device.
2. In Writeboost's writeback, the data are sorted by destination address and
   then submitted asynchronously. Therefore, the average write load on the
   backing device is always lower than it would be without Writeboost.

Additionally, the cached write data, which are typically what was written back
from the page cache, are likely to be hit again on read. Needless to say, this
also improves reads.

For these reasons, Writeboost can improve not only writes but also reads.

The lifetime of the NAND SSD used as the cache device is as great a concern as
performance in real-world operation. Caching on read
1. shortens the lifetime of the cache device, and
2. sometimes has no effect because the data are duplicated in the page cache.

For the sake of both performance and the lifetime of the cache device,
Writeboost doesn't stage on read, and therefore the value of Writeboost lies in
its optimized operation as a pure write cache.

Basic Mechanism
---------------
Writeboost controls three different layers: the RAM buffer (rambuf), the cache
device (cache_dev, e.g. SSD) and the backing device (backing_dev, e.g. HDD).
Write data are first stored in the RAM buffer, and when the buffer is full
Writeboost adds a metadata block to the RAM buffer to create a "log".
Afterward, the log is written to the cache device sequentially in the
background, and thereafter written back to the backing device, also in the
background.
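
The flow above can be pictured with a small Python model. This is only a
sketch: the names (`RamBuffer`, `Log`, `flush_queue`), the capacity counted in
blocks, and the metadata layout (a plain list of destination sectors) are all
assumptions made for illustration and do not reflect the driver's real data
structures.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    log_id: int
    metadata: list  # destination sector of each data block (hypothetical layout)
    data: list      # payloads, in arrival order


@dataclass
class RamBuffer:
    capacity: int  # data blocks per log (hypothetical unit)
    next_id: int = 1
    entries: list = field(default_factory=list)
    flush_queue: list = field(default_factory=list)

    def write(self, sector, payload):
        """Store a write in RAM; seal the buffer into a log once it is full."""
        self.entries.append((sector, payload))
        if len(self.entries) == self.capacity:
            self._seal()

    def _seal(self):
        # The metadata block describes where each cached block belongs.
        metadata = [sector for sector, _ in self.entries]
        data = [payload for _, payload in self.entries]
        self.flush_queue.append(Log(self.next_id, metadata, data))
        self.next_id += 1
        self.entries = []


rambuf = RamBuffer(capacity=3)
for sector in (80, 8, 40, 16):
    rambuf.write(sector, f"data@{sector}".encode())
```

After the four writes, one log holding the first three blocks is waiting in
`flush_queue`, and the fourth write sits in the fresh buffer.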

Persistent Logging
------------------
Writeboost can extend its functionality by specifying "type" at initialization.
Type 0 provides only the basic mechanism and type 1 additionally provides
"Persistent Logging" (or plog).
Plog aims to reduce the penalty of the FLUSH operation by storing the write data
to both the RAM buffer and a persistent device (plog_dev).
This extended functionality is similar to full-data journaling in filesystems.
As of now, only a block device is supported as plog_dev, but other media will be
supported in the future.

Log Replay
----------
On reboot, Writeboost replays the logs written on the cache device to restore
the in-memory metadata.
Logs are chronologically ordered, thus it is theoretically possible to restore
the state of the storage system at the moment of your choice.
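
The idea of replaying chronologically ordered logs can be sketched as follows.
Each log is modeled as an `(id, [destination sectors])` tuple, which is an
assumption for illustration and not the on-disk format; `replay` and `upto_id`
are hypothetical names.

```python
def replay(logs, upto_id=None):
    """Rebuild a sector -> (log_id, index) table by applying logs oldest first."""
    table = {}
    for log_id, sectors in sorted(logs, key=lambda log: log[0]):
        if upto_id is not None and log_id > upto_id:
            break  # stop early to restore an earlier point in time
        for index, sector in enumerate(sectors):
            table[sector] = (log_id, index)  # newer logs override older ones
    return table


logs = [(2, [8, 24]), (1, [8, 16]), (3, [16])]
latest = replay(logs)
earlier = replay(logs, upto_id=2)
```

Here `latest` maps sector 16 to log 3, while `earlier` still maps it to log 1,
which is the state as of log 2.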


Processings
===========
Writeboost consists of one foreground task and six background tasks.

Foreground Processing
---------------------
When a bio is accepted, the driver looks up the cache and acts on the result.
All write data are stored in the RAM buffer. Later, when the buffer is full, a
log is created and queued as a flush job.

Background Processings
----------------------
(1) Flusher Daemon
This daemon dequeues a flush job from the queue and writes the log to the cache
device.

(2) Migrate Daemon
This daemon writes back the dirty data on the cache device to the backing
device. Writeboost calls writeback "Migration".

If `allow_migrate` is false, writeback never starts unless the situation is
urgent. Here, an urgent situation is one in which there is no room to append
any more logs without writing back some segments to clean them up.

There are two major optimizations in writeback:
1. Multiple segments are written back at a time. `nr_max_batched_migration` is
the maximum number of segments to write back at a time.
2. The blocks to write back are sorted by their destination address on the
backing device.
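
The two optimizations can be combined in a short sketch. Segments are modeled
as lists of `(destination sector, payload)` pairs, oldest segment first; this
representation and the function name `plan_writeback` are assumptions for
illustration only.

```python
def plan_writeback(dirty_segments, nr_max_batched_migration):
    """Pick up to nr_max_batched_migration segments and order their blocks."""
    batch = dirty_segments[:nr_max_batched_migration]
    remaining = dirty_segments[nr_max_batched_migration:]
    blocks = [block for segment in batch for block in segment]
    # list.sort is stable, so blocks for the same sector keep their
    # chronological order and the newest data is submitted last.
    blocks.sort(key=lambda block: block[0])
    return blocks, remaining


segments = [[(900, "a"), (10, "b")], [(500, "c"), (20, "d")], [(5, "e")]]
ordered, remaining = plan_writeback(segments, 2)
```

With a batch size of 2, the blocks of the first two segments are submitted in
ascending address order (10, 20, 500, 900) and the third segment waits for the
next round.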

(3) Migration Modulator
Writeback should be suppressed when the backing device is under high load.
This daemon monitors the load of the backing device and stops writeback under
high load by setting `allow_migrate` to false.
This daemon is enabled only when `enable_migration_modulator` is true, and the
threshold for turning the switch on/off is determined by `migrate_threshold`.
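
The decision made by the modulator can be sketched as a single function. How
the driver measures the load, and whether the comparison is strict, are not
stated in this document; `load_percent` and the `>` comparison below are
assumptions.

```python
def next_allow_migrate(current, load_percent,
                       enable_migration_modulator, migrate_threshold=70):
    """Return the value allow_migrate should take after one observation."""
    if not enable_migration_modulator:
        return current  # the modulator leaves the user's setting alone
    # Assumed rule: suppress writeback while the load exceeds the threshold.
    return 0 if load_percent > migrate_threshold else 1


busy = next_allow_migrate(1, 90, enable_migration_modulator=True)
idle = next_allow_migrate(0, 30, enable_migration_modulator=True)
untouched = next_allow_migrate(0, 30, enable_migration_modulator=False)
```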
 
(4) Superblock Recorder
This daemon periodically (at the interval specified by `update_record_interval`)
records in the superblock the ID of the last segment that was written back.
Doing so lets log replay skip unnecessary restoring and thus shortens the
reboot time.
 
(5) Sync Daemon
The data in the RAM buffer are lost in case of power failure.
Additionally, the data in the RAM cache of the cache device (an SSD typically
has such a small cache) are also lost in such a failure.
This daemon flushes them all periodically (at the interval specified by
`sync_interval`).

(6) Barrier Deadline (enabled for type 0 only)
Without Persistent Logging, the flush operation carries a high penalty. It
sometimes results in creating a log that is not full.
To mitigate this penalty, Writeboost has an optimization that delays the ack to
such an operation by at most `barrier_deadline_ms` (ms).
By doing this, the log can be filled when multiple processes share the storage
and submit writes.
 
 
Target Interfaces
=================
Use the dmsetup command for operations.

Initialization (Constructor)
----------------------------

<type>
<essential args>
<#optional args> <optional args>
<#tunable args> <tunable args>

- For <type>, see `Mechanism`
- <essential args> differ by <type>
- <optional args> and <tunable args> are unordered lists of key-value pairs.

type 0 (these apply to all <type>):
  <essential args>
  backing_dev: A block device holding the original data (e.g. HDD)
  cache_dev: A block device holding the cache (e.g. SSD)

  <optional args>
  segment_size_order : Determines the size of a RAM buffer.
                       The RAM buffer size will be 1 << n (sectors).
                       4 <= n <= 10
                       default 10
  nr_rambuf_pool     : The number of RAM buffers to allocate
                       default 8
                     
  <tunable args>
  see `Messages`
                                    
E.g.
dmsetup create wbdev --table "0 $sz writeboost 0 $BACKING $CACHE"
dmsetup create wbdev --table "0 $sz writeboost 0 $BACKING $CACHE \
                              4 nr_rambuf_pool 32 segment_size_order 8 \
                              2 allow_migrate 1"
dmsetup create wbdev --table "0 $sz writeboost 0 $BACKING $CACHE \
                              0 \
                              2 allow_migrate 1"
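
The three example tables above follow one pattern: each key-value section is
preceded by its token count. The helper below rebuilds them as a sketch;
`writeboost_table` is a hypothetical name, and the handling of a table with
optional args but no tunable section is an assumption, since no example above
shows that case.

```python
def writeboost_table(size, backing, cache, optional=None, tunable=None, wb_type=0):
    """Build a dmsetup table line shaped like the examples in this document."""
    def kv_args(pairs):
        # Flatten {key: value} into tokens, prefixed by the token count.
        tokens = [str(t) for kv in (pairs or {}).items() for t in kv]
        return [str(len(tokens))] + tokens

    fields = ["0", str(size), "writeboost", str(wb_type), backing, cache]
    if optional or tunable:
        fields += kv_args(optional)  # emits "0" when only tunables are given
    if tunable:
        fields += kv_args(tunable)
    return " ".join(fields)


plain = writeboost_table("$sz", "$BACKING", "$CACHE")
full = writeboost_table("$sz", "$BACKING", "$CACHE",
                        optional={"nr_rambuf_pool": 32, "segment_size_order": 8},
                        tunable={"allow_migrate": 1})
tunable_only = writeboost_table("$sz", "$BACKING", "$CACHE",
                                tunable={"allow_migrate": 1})
```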
                     
type 1:
  <essential args>
  backing_dev
  cache_dev
  plog_dev_desc      : A string descriptor to specify the plog device

E.g.
dmsetup create wbdev --table "0 $sz writeboost 1 $BACKING $CACHE $PLOG"

Initialization (Reformatting)
-----------------------------
Reformatting of the cache device and the plog is triggered only if the first
sector of the cache device is zeroed out.
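
A way to check the condition above before creating the device might look like
the following sketch. The 512-byte sector size and the helper name
`first_sector_is_zeroed` are assumptions, and a temporary regular file stands
in for the cache device so that the demo needs neither root nor a real disk.

```python
import os
import tempfile

SECTOR_BYTES = 512  # assumed sector size


def first_sector_is_zeroed(path):
    """Return True if the first sector of the file or device is all zero."""
    with open(path, "rb") as dev:
        head = dev.read(SECTOR_BYTES)
    return len(head) == SECTOR_BYTES and head == bytes(SECTOR_BYTES)


# Demo on a temporary file instead of a real cache device.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(4096))
    demo_path = tmp.name
zeroed_before = first_sector_is_zeroed(demo_path)
with open(demo_path, "r+b") as tmp:
    tmp.write(b"\x01")  # dirty the first byte, as a formatted device would be
zeroed_after = first_sector_is_zeroed(demo_path)
os.remove(demo_path)
```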

Messages
--------
Some behaviors of a Writeboost device can be tuned online.
Use dmsetup message for this purpose.

(1) Tunables
The tunables in the constructor can be altered online.
See `Background Processings` for details.

barrier_deadline_ms (ms)
  default: 10

allow_migrate (bool)
  default: 0

enable_migration_modulator (bool) and migrate_threshold (%)
  default: 0 and 70

nr_max_batched_migration
  default: 1 << (15 - segment_size_order)

update_record_interval (sec)
  default: 0

sync_interval (sec)
  default: 0

E.g.
dmsetup message wbdev 0 enable_migration_modulator 0
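
The default of `nr_max_batched_migration` depends on `segment_size_order`, so a
little arithmetic helps when choosing values. The sketch below assumes
512-byte sectors; the function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size


def rambuf_sectors(segment_size_order=10):
    """Size of one RAM buffer in sectors: 1 << n, with 4 <= n <= 10."""
    if not 4 <= segment_size_order <= 10:
        raise ValueError("segment_size_order must be between 4 and 10")
    return 1 << segment_size_order


def default_nr_max_batched_migration(segment_size_order=10):
    """Default batch size: 1 << (15 - segment_size_order)."""
    return 1 << (15 - segment_size_order)


summary = {
    order: (rambuf_sectors(order) * SECTOR_BYTES,
            default_nr_max_batched_migration(order))
    for order in (4, 8, 10)
}
```

Under these defaults, the product of the two values is always 1 << 15 sectors,
so one default batch covers the same amount of data whatever segment size is
chosen.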

(2) Others
clear_stats
  Clears the statistics (see `Status`).
drop_caches
  Waits for all dirty data on the cache device to be written back to the backing
  device.

E.g.
dmsetup message wbdev 0 drop_caches

Status
------
<cursor pos>
<#cache blocks>
<#segments>
<current id>
<last flushed id>
<last migrated id>
<#dirty cache blocks>
<stat (w/r) x (hit/miss) x (on buffer?) x (fullsize?)>
<#non-full flushed>
<#tunable args> <tunable args>
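
The stat field is a cross product of four two-valued attributes, which gives
16 counters. The sketch below only enumerates the combinations; the label
names and the order in which the driver prints the counters are assumptions.

```python
from itertools import product

# Labels are illustrative; the real output order is not documented here.
STAT_KEYS = list(product(("write", "read"),
                         ("hit", "miss"),
                         ("on_buffer", "not_on_buffer"),
                         ("fullsize", "partial")))
```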