add idl4k kernel firmware version 1.13.0.105

Jaroslav Kysela
2015-03-26 17:22:37 +01:00
parent 5194d2792e
commit e9070cdc77
31064 changed files with 12769984 additions and 0 deletions


@@ -0,0 +1,20 @@
00-INDEX
- This file
as-iosched.txt
- Anticipatory IO scheduler
barrier.txt
- I/O Barriers
biodoc.txt
- Notes on the Generic Block Layer Rewrite in Linux 2.5
capability.txt
- Generic Block Device Capability (/sys/block/<disk>/capability)
deadline-iosched.txt
- Deadline IO scheduler tunables
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
- The members of struct request (in include/linux/blkdev.h)
stat.txt
- Block layer statistics in /sys/block/<dev>/stat
switching-sched.txt
- Switching I/O schedulers at runtime


@@ -0,0 +1,172 @@
Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003
Attention! Database servers, especially those using "TCQ" disks should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.
If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users don't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.
Also, users with hardware RAID controllers, doing striping, may find
highly variable performance results when using the as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.
However, setting antic_expire (see tunable parameters below) to zero produces
very similar behavior to the deadline IO scheduler.
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies
to determine when an IO request is dispatched to the disk controller.
Here are the policies outlined, in order of application.
1. one-way Elevator algorithm.
The elevator algorithm is similar to that used in deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request in back of the elevator is less than half the seek distance to
the request in front of the elevator, then the request in back can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.
2. FIFO expiration times for reads and for writes.
This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists is tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
3. Read and write request batching
A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.
In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.
When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.
The read and write fifo expiration times described in policy 2 above
are checked only when scheduling IO of a batch for the corresponding
(read/write) type. So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).
When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is on the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.
4. Read anticipation.
Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.
When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined. If it seems
likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to (antic_expire)
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.
To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process. One observation is that these statistics
are associated with each process, but those statistics are not associated
with a specific IO device. So for example, if a process is doing IO
on several file systems on separate devices, the statistics will be
a combination of IO behavior from all those devices.
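To make the decision above concrete, here is a rough sketch in C. It is an
illustration of the policy only, not the actual as-iosched code; the names
(proc_stats, NEARBY_SECTORS, the helper itself) and the threshold value are
invented for the example.

#define NEARBY_SECTORS	1024		/* arbitrary "close enough" threshold */

struct proc_stats {
	unsigned long mean_think_time_ms;	/* mean time between this process' reads */
	unsigned long long mean_seek_sectors;	/* mean seek distance of its reads */
};

enum antic_decision { DISPATCH_NOW, WAIT_FOR_PROCESS };

/* Called conceptually when a read completes and the next candidate read
 * comes from a different process and is not "very close". */
static enum antic_decision anticipate(const struct proc_stats *p,
				      unsigned long antic_expire_ms)
{
	if (antic_expire_ms == 0)
		return DISPATCH_NOW;		/* anticipation disabled */

	/* Likely to submit another nearby read before the wait expires? */
	if (p->mean_think_time_ms < antic_expire_ms &&
	    p->mean_seek_sectors < NEARBY_SECTORS)
		return WAIT_FOR_PROCESS;	/* idle for up to antic_expire_ms */

	return DISPATCH_NOW;			/* fall back to the sorted read queue */
}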
Tuning the anticipatory IO scheduler
------------------------------------
When using 'as' (the anticipatory IO scheduler), there are 5 parameters under
/sys/block/*/queue/iosched/. All are in units of milliseconds.
The parameters are:
* read_expire
Controls how long until a read request becomes "expired". It also controls the
interval at which expired requests are served, so if set to 50, a request
might take anywhere under 100ms to be serviced _if_ it is the next on the
expired list. Obviously request expiration strategies won't make the disk
go faster. The result basically equates to the timeslice a single reader
gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
very roughly the % streaming read efficiency your disk should get with
multiple readers.
* read_batch_expire
Controls how much time a batch of reads is given before pending writes are
served. A higher value is more efficient. This might be set below read_expire
if writes are to be given higher priority than reads, but reads are to be
as efficient as possible when there are no writes. Generally though, it
should be some multiple of read_expire.
* write_expire, and
* write_batch_expire are equivalent to the above, for writes.
* antic_expire
Controls the maximum amount of time we can anticipate a good read (one
with a short seek distance from the most recently completed request) before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. Should be a bit higher
for big seek time devices though not a linear correspondence - most
processes have only a few ms thinktime.
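For example, the anticipation window can be inspected or switched off per
device just like the other tunables (the device name and the value shown are
purely illustrative):

# cat /sys/block/sda/queue/iosched/antic_expire
6
# echo 0 > /sys/block/sda/queue/iosched/antic_expire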
In addition to the tunables above there is a read-only file named est_time
which, when read, will show:
- The probability of a task exiting without a cooperating task
submitting an anticipated IO.
- The current mean think time.
- The seek distance used to determine if an incoming IO is better.


@@ -0,0 +1,261 @@
I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005
I/O barrier requests are used to guarantee ordering around the barrier
requests. Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints. All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).
In other words, I/O barrier requests have the following two properties.
1. Request ordering
Requests cannot pass the barrier request. Preceding requests are
processed before the barrier and following requests after.
Depending on what features a drive supports, this can be done in one
of the following three ways.
i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met. Most modern SCSI controllers/drives should support this.
NOTE: SCSI ordered tag isn't currently used due to limitation in the
SCSI midlayer, see the following random notes section.
ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finishes before issuing the barrier request. Also,
it defers requests following the barrier until the barrier request is
finished. Older SCSI controllers/drives and SATA drives fall in this
category.
iii. Devices which have queue depth of 1. This is a degenerate case
of ii. Just keeping issue order suffices. Ancient SCSI
controllers/drives and IDE drives are in this category.
2. Forced flushing to physical medium
Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other events abruptly stop the drive from operating and possibly make
the drive lose data in its cache. So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.
There are four cases,
i. No write-back cache. Keeping requests ordered is enough.
ii. Write-back cache but no flush operation. There's no way to
guarantee physical-medium commit order. This kind of device can't be
used for I/O barriers.
iii. Write-back cache and flush operation but no FUA (forced unit
access). We need two cache flushes - before and after the barrier
request.
iv. Write-back cache, flush operation and FUA. We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.
How to support barrier requests in drivers
------------------------------------------
All barrier handling is done inside the block layer proper. All a low
level driver has to do is implement its prepare_flush_fn and use one of
the following two functions to indicate what barrier type it supports
and how to prepare flush requests. Note that the term 'ordered' is
used to indicate the whole sequence of performing barrier requests
including draining and flushing.
typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);
int blk_queue_ordered(struct request_queue *q, unsigned ordered,
prepare_flush_fn *prepare_flush_fn);
@q : the queue in question
@ordered : the ordered mode the driver/device supports
@prepare_flush_fn : this function should prepare @rq such that it
flushes cache to physical medium when executed
For example, SCSI disk driver's prepare_flush_fn looks like the
following.
static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
	rq->cmd_len = 10;
}
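A driver then advertises its ordered mode with a single blk_queue_ordered()
call. As an illustration only (the mode chosen here is what a disk with a
write-back cache but no FUA support might use; the ordered modes themselves
are listed below, and 'q' is the device's request queue):

	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, sd_prepare_flush);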
The following seven ordered modes are supported. The following table
shows which mode should be used depending on what features a
device/driver supports. In the leftmost column of table,
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.
The table is followed by description of each mode. Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.
                  write-back cache   ordered tag   flush   FUA
---------------------------------------------------------------
NONE              yes/no             N/A           no      N/A
DRAIN             no                 no            N/A     N/A
DRAIN_FLUSH       yes                no            yes     no
DRAIN_FUA         yes                no            yes     yes
TAG               no                 yes           N/A     N/A
TAG_FLUSH         yes                yes           yes     no
TAG_FUA           yes                yes           yes     yes
QUEUE_ORDERED_NONE
I/O barriers are not needed and/or supported.
Sequence: N/A
QUEUE_ORDERED_DRAIN
Requests are ordered by draining the request queue and cache
flushing isn't needed.
Sequence: drain => barrier
QUEUE_ORDERED_DRAIN_FLUSH
Requests are ordered by draining the request queue and both
pre-barrier and post-barrier cache flushings are needed.
Sequence: drain => preflush => barrier => postflush
QUEUE_ORDERED_DRAIN_FUA
Requests are ordered by draining the request queue and
pre-barrier cache flushing is needed. By using FUA on barrier
request, post-barrier flushing can be skipped.
Sequence: drain => preflush => barrier
QUEUE_ORDERED_TAG
Requests are ordered by ordered tag and cache flushing isn't
needed.
Sequence: barrier
QUEUE_ORDERED_TAG_FLUSH
Requests are ordered by ordered tag and both pre-barrier and
post-barrier cache flushings are needed.
Sequence: preflush -> barrier -> postflush
QUEUE_ORDERED_TAG_FUA
Requests are ordered by ordered tag and pre-barrier cache
flushing is needed. By using FUA on barrier request,
post-barrier flushing can be skipped.
Sequence: preflush -> barrier
Random notes/caveats
--------------------
* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it. The problem is that the SCSI midlayer
request dispatch function is not atomic. It releases the queue lock and
switches to the SCSI host lock during issue, and it is possible and likely
that requests change their relative positions in that window. Once
this problem is solved, TAG ordering can be enabled.
* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress. All I/O barriers are held off by
block layer until the previous I/O barrier is complete. This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to the low level driver *might* be helpful if they are very frequent. Well,
this certainly is a non-issue. I'm writing this just to make clear that no
two I/O barriers are ever passed to the low-level driver at the same time.
* Completion order. Requests in ordered sequence are issued in order
but not required to finish in order. Barrier implementation can
handle out-of-order completion of ordered sequence. IOW, the requests
MUST be processed in order but the hardware/software completion paths
are allowed to reorder completion notifications - eg. current SCSI
midlayer doesn't preserve completion order during error handling.
* Requeueing order. Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request(). As barrier sequence should be kept in order
when requeued, generic elevator code takes care of putting requests in
order around barrier. See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.
Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence. Currently, no
error checking is done against this.
* Error handling. Currently, block layer will report error to upper
layer if any of requests in an ordered sequence fails. Unfortunately,
this doesn't seem to be enough. Look at the following request flow.
QUEUE_ORDERED_TAG_FLUSH is in use.
  [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
                                            still in elevator
Let's say request [2], [3] are write requests to update file system
metadata (journal or whatever) and [barrier] is used to mark that
those updates are valid. Consider the following sequence.
i. Requests [0] ~ [post] leave the request queue and enter the
low-level driver.
ii. After a while, unfortunately, something goes wrong and the
drive fails [2]. Note that any of [0], [1] and [3] could have
completed by this time, but [pre] couldn't have been finished
as the drive must process it in order and it failed before
processing that command.
iii. Error handling kicks in and determines that the error is
unrecoverable and fails [2], and resumes operation.
iv. [pre] [barrier] [post] gets processed.
v. *BOOM* power fails
The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that. Sadly, that
isn't true in this case anymore. IOW, the success of an I/O barrier
should also depend on the success of some of the preceding requests,
where only the upper layer (filesystem) knows what 'some' is.
This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after the
block layer tells them to do so.
As the probability of this happening is very low and the drive would
already have to be faulty, implementing the fix is probably overkill. But, still,
it's there.
* In previous drafts of barrier implementation, there was fallback
mechanism such that, if FUA or ordered TAG fails, less fancy ordered
mode can be selected and the failed barrier request is retried
automatically. The rationale for this feature was that as FUA is
pretty new in ATA world and ordered tag was never used widely, there
could be devices which report to support those features but choke when
actually given such requests.
This was removed for two reasons: 1. it's overkill and 2. it's
impossible to implement properly when TAG ordering is used, as low
level drivers resume after an error automatically. If it's ever
needed, adding it back and modifying low level drivers accordingly
shouldn't be difficult.

File diff suppressed because it is too large


@@ -0,0 +1,15 @@
Generic Block Device Capability
===============================================================================
This file documents the sysfs file block/<disk>/capability
capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h
Capability                                      Value
-------------------------------------------------------------------------------
GENHD_FL_MEDIA_CHANGE_NOTIFY                    4
When this bit is set, the disk supports Asynchronous Notification
of media change events. These events will be broadcast to user
space via kernel uevent.


@@ -0,0 +1,327 @@
----------------------------------------------------------------------
1. INTRODUCTION
Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.
The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. And
for some protection schemes also that the I/O is written to the right
place on disk.
Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path. The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.
----------------------------------------------------------------------
2. THE DATA INTEGRITY EXTENSIONS
As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.
The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.
Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.
The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.
The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.
----------------------------------------------------------------------
3. KERNEL CHANGES
The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.
The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.
Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.
However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.
The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.
The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.
Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device. I.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as they see fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
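In other words, a 4KB block spans eight 512-byte sectors, so 8 * 2 bytes = 16
bytes of opaque tag data can ride along with each filesystem block.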
This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.
----------------------------------------------------------------------
4. BLOCK LAYER IMPLEMENTATION DETAILS
4.1 BIO
The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
to a struct bip which contains the bio integrity payload. Essentially
a bip is a trimmed down struct bio which holds a bio_vec containing
the integrity metadata and the required housekeeping information (bvec
pool, vector count, etc.)
A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.
Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().
bio_free() will automatically free the bip.
4.2 BLOCK DEVICE
Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().
The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.
Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.
----------------------------------------------------------------------
5.0 BLOCK LAYER INTEGRITY API
5.1 NORMAL FILESYSTEM
The normal filesystem is unaware that the underlying block device
is capable of sending/receiving integrity metadata. The IMD will
be automatically generated by the block layer at submit_bio() time
in case of a WRITE. A READ request will cause the I/O integrity
to be verified upon completion.
IMD generation and verification can be toggled using the
/sys/block/<bdev>/integrity/write_generate
and
/sys/block/<bdev>/integrity/read_verify
flags.
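For example, automatic generation and verification can be switched off for a
single device like this (the device name is illustrative):

# echo 0 > /sys/block/sdb/integrity/write_generate
# echo 0 > /sys/block/sdb/integrity/read_verify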
5.2 INTEGRITY-AWARE FILESYSTEM
A filesystem that is integrity-aware can prepare I/Os with IMD
attached. It can also use the application tag space if this is
supported by the block device.
int bdev_integrity_enabled(block_device, int rw);
bdev_integrity_enabled() will return 1 if the block device
supports integrity metadata transfer for the data direction
specified in 'rw'.
bdev_integrity_enabled() honors the write_generate and
read_verify flags in sysfs and will respond accordingly.
int bio_integrity_prep(bio);
To generate IMD for WRITE and to set up buffers for READ, the
filesystem must call bio_integrity_prep(bio).
Prior to calling this function, the bio data direction and start
sector must be set, and the bio should have all data pages
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.
bio_integrity_prep() should only be called if
bio_integrity_enabled() returned 1.
int bio_integrity_tag_size(bio);
If the filesystem wants to use the application tag space it will
first have to find out how much storage space is available.
Because tag space is generally limited (usually 2 bytes per
sector regardless of sector size), the integrity framework
supports interleaving the information between the sectors in an
I/O.
Filesystems can call bio_integrity_tag_size(bio) to find out how
many bytes of storage are available for that particular bio.
Another option is bdev_get_tag_size(block_device) which will
return the number of available bytes per hardware sector.
int bio_integrity_set_tag(bio, void *tag_buf, len);
After a successful return from bio_integrity_prep(),
bio_integrity_set_tag() can be used to attach an opaque tag
buffer to a bio. Obviously this only makes sense if the I/O is
a WRITE.
int bio_integrity_get_tag(bio, void *tag_buf, len);
Similarly, at READ I/O completion time the filesystem can
retrieve the tag buffer using bio_integrity_get_tag().
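Putting the calls above together, a WRITE issued by an integrity-aware
filesystem might look roughly like the sketch below. bdev, bio and
fs_tag_buf are assumed to be set up already, error handling is omitted,
and the return-value conventions are simplified:

	int len;

	if (bdev_integrity_enabled(bdev, WRITE)) {
		bio_integrity_prep(bio);		/* allocate bip, generate IMD */

		len = bio_integrity_tag_size(bio);	/* tag space for this bio */
		if (len)
			bio_integrity_set_tag(bio, fs_tag_buf, len);
	}
	submit_bio(WRITE, bio);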
5.3 PASSING EXISTING INTEGRITY METADATA
Filesystems that either generate their own integrity metadata or
are capable of transferring IMD from user space can use the
following calls:
struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
Allocates the bio integrity payload and hangs it off of the bio.
nr_pages indicate how many pages of protection data need to be
stored in the integrity bio_vec list (similar to bio_alloc()).
The integrity payload will be freed at bio_free() time.
int bio_integrity_add_page(bio, page, len, offset);
Attaches a page containing integrity metadata to an existing
bio. The bio must have an existing bip,
i.e. bio_integrity_alloc() must have been called. For a WRITE,
the integrity metadata in the pages must be in a format
understood by the target device with the notable exception that
the sector numbers will be remapped as the request traverses the
I/O stack. This implies that the pages added using this call
will be modified during I/O! The first reference tag in the
integrity metadata must have a value of bip->bip_sector.
Pages can be added using bio_integrity_add_page() as long as
there is room in the bip bio_vec array (nr_pages).
Upon completion of a READ operation, the attached pages will
contain the integrity metadata received from the storage device.
It is up to the receiver to process them and verify data
integrity upon completion.
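A rough sketch of these calls for a filesystem supplying its own protection
information follows; prot_page and prot_len are assumed to hold pre-formatted
IMD, the NULL-on-failure convention is an assumption, and error handling is
otherwise omitted:

	struct bip *bip;

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);	/* room for one integrity page */
	if (!bip)
		return -ENOMEM;
	bio_integrity_add_page(bio, prot_page, prot_len, 0);
	submit_bio(WRITE, bio);				/* bip is freed at bio_free() time */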
5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
METADATA
To enable integrity exchange on a block device the gendisk must be
registered as capable:
int blk_integrity_register(gendisk, blk_integrity);
The blk_integrity struct is a template and should contain the
following:
static struct blk_integrity my_profile = {
	.name		= "STANDARDSBODY-TYPE-VARIANT-CSUM",
	.generate_fn	= my_generate_fn,
	.verify_fn	= my_verify_fn,
	.get_tag_fn	= my_get_tag_fn,
	.set_tag_fn	= my_set_tag_fn,
	.tuple_size	= sizeof(struct my_tuple_size),
	.tag_size	= <tag bytes per hw sector>,
};
'name' is a text string which will be visible in sysfs. This is
part of the userland API so choose it carefully and never change
it. The format is standards body-type-variant.
E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
'generate_fn' generates appropriate integrity metadata (for WRITE).
'verify_fn' verifies that the data buffer matches the integrity
metadata.
'tuple_size' must be set to match the size of the integrity
metadata per sector. I.e. 8 for DIF and EPP.
'tag_size' must be set to identify how many bytes of tag space
are available per hardware sector. For DIF this is either 2 or
0 depending on the value of the Control Mode Page ATO bit.
See 6.2 for a description of get_tag_fn and set_tag_fn.
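Once the profile is filled in, registration itself is a single call; 'disk'
below stands for whatever struct gendisk pointer the driver already holds:

	blk_integrity_register(disk, &my_profile);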
----------------------------------------------------------------------
2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>


@@ -0,0 +1,75 @@
Deadline IO scheduler tunables
==============================
This little file attempts to document how the deadline io scheduler works.
In particular, it will clarify the meaning of the exposed tunables that may be
of interest to power users.
Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
********************************************************************************
read_expire (in ms)
-----------
The goal of the deadline io scheduler is to attempt to guarantee a start
service time for a request. As we focus mainly on read latencies, this is
tunable. When a read request first enters the io scheduler, it is assigned
a deadline that is the current time + the read_expire value in units of
milliseconds.
write_expire (in ms)
-----------
Similar to read_expire mentioned above, but for writes.
fifo_batch (number of requests)
----------
Requests are grouped into ``batches'' of a particular data direction (read or
write) which are serviced in increasing sector order. To limit extra seeking,
deadline expiries are only checked between batches. fifo_batch controls the
maximum number of requests per batch.
This parameter tunes the balance between per-request latency and aggregate
throughput. When low latency is the primary concern, smaller is better (where
a value of 1 yields first-come first-served behaviour). Increasing fifo_batch
generally improves throughput, at the cost of latency variation.
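For example, to trade throughput for latency on a single disk (the device
name and the value shown are illustrative):

# cat /sys/block/sda/queue/iosched/fifo_batch
16
# echo 1 > /sys/block/sda/queue/iosched/fifo_batch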
writes_starved (number of dispatches)
--------------
When we have to move requests from the io scheduler queue to the block
device dispatch queue, we always give a preference to reads. However, we
don't want to starve writes indefinitely either. So writes_starved controls
how many times we give preference to reads over writes. When that has been
done writes_starved number of times, we dispatch some writes based on the
same criteria as reads.
front_merges (bool)
------------
Sometimes it happens that a request enters the io scheduler that is contiguous
with a request that is already on the queue. Either it fits in the back of that
request, or it fits at the front. That is called either a back merge candidate
or a front merge candidate. Due to the way files are typically laid out,
back merges are much more common than front merges. For some work loads, you
may even know that it is a waste of time to spend any time attempting to
front merge requests. Setting front_merges to 0 disables this functionality.
Front merges may still occur due to the cached last_merge hint, but since
that comes at basically 0 cost we leave that on. We simply disable the
rbtree front sector lookup when the io scheduler merge function is called.
Nov 11 2002, Jens Axboe <jens.axboe@oracle.com>


@@ -0,0 +1,183 @@
Block io priorities
===================
Intro
-----
With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
priorities are supported for reads on files. This enables users to io nice
processes or process groups, similar to what has been possible with cpu
scheduling for ages. This document mainly details the current possibilities
with cfq; other io schedulers do not support io priorities thus far.
Scheduling classes
------------------
CFQ implements three generic scheduling classes that determine how io is
served for a process.
IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is given
higher priority than any other in the system, processes from this class are
given first access to the disk every time. Thus it needs to be used with some
care, one io RT process can starve the entire system. Within the RT class,
there are 8 levels of class data that determine exactly how much time this
process needs the disk for on each service. In the future this might change
to be more directly mappable to performance, by passing in a wanted data
rate instead.
IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the default
for any process that hasn't set a specific io priority. The class data
determines how much io bandwidth the process will get, it's directly mappable
to the cpu nice levels just more coarsely implemented. 0 is the highest
BE prio level, 7 is the lowest. The mapping between cpu nice level and io
nice level is determined as: io_nice = (cpu_nice + 20) / 5.
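For example, a process at the default cpu nice level 0 therefore maps to io
nice level (0 + 20) / 5 = 4.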
IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
level only get io time when no one else needs the disk. The idle class has no
class data, since it doesn't really apply here.
Tools
-----
See below for a sample ionice tool. Usage:
# ionice -c<class> -n<level> -p<pid>
If pid isn't given, the current process is assumed. IO priority settings
are inherited on fork, so you can use ionice to start the process at a given
level:
# ionice -c2 -n0 /bin/ls
will run ls at the best-effort scheduling class at the highest priority.
For a running process, you can give the pid instead:
# ionice -c1 -n2 -p100
will change pid 100 to run at the realtime scheduling class, at priority 2.
---> snip ionice.c tool <---
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <getopt.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <asm/unistd.h>

extern int sys_ioprio_set(int, int, int);
extern int sys_ioprio_get(int, int);

#if defined(__i386__)
#define __NR_ioprio_set		289
#define __NR_ioprio_get		290
#elif defined(__ppc__)
#define __NR_ioprio_set		273
#define __NR_ioprio_get		274
#elif defined(__x86_64__)
#define __NR_ioprio_set		251
#define __NR_ioprio_get		252
#elif defined(__ia64__)
#define __NR_ioprio_set		1274
#define __NR_ioprio_get		1275
#else
#error "Unsupported arch"
#endif

static inline int ioprio_set(int which, int who, int ioprio)
{
	return syscall(__NR_ioprio_set, which, who, ioprio);
}

static inline int ioprio_get(int which, int who)
{
	return syscall(__NR_ioprio_get, which, who);
}

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
};

#define IOPRIO_CLASS_SHIFT	13

const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };

int main(int argc, char *argv[])
{
	int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
	int c, pid = 0;

	while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) {
		switch (c) {
		case 'n':
			ioprio = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'c':
			ioprio_class = strtol(optarg, NULL, 10);
			set = 1;
			break;
		case 'p':
			pid = strtol(optarg, NULL, 10);
			break;
		}
	}

	switch (ioprio_class) {
	case IOPRIO_CLASS_NONE:
		ioprio_class = IOPRIO_CLASS_BE;
		break;
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
		break;
	case IOPRIO_CLASS_IDLE:
		ioprio = 7;
		break;
	default:
		printf("bad prio class %d\n", ioprio_class);
		return 1;
	}

	if (!set) {
		if (!pid && argv[optind])
			pid = strtol(argv[optind], NULL, 10);

		ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);

		printf("pid=%d, %d\n", pid, ioprio);

		if (ioprio == -1)
			perror("ioprio_get");
		else {
			ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
			ioprio = ioprio & 0xff;
			printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
		}
	} else {
		if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
			perror("ioprio_set");
			return 1;
		}

		if (argv[optind])
			execvp(argv[optind], &argv[optind]);
	}

	return 0;
}
---> snip ionice.c tool <---
March 11 2005, Jens Axboe <jens.axboe@oracle.com>


@@ -0,0 +1,63 @@
Queue sysfs files
=================
This text file will detail the queue files that are located in the sysfs tree
for each block device. Note that stacked devices typically do not export
any settings, since their queue merely functions as a remapping target.
These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with a RO postfix are readonly and the RW postfix means
read-write.
hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO merging
requests in the block layer. Merging may still occur through a direct
1-hit cache, since that comes for (almost) free. The IO scheduler will not
waste cycles doing tree/hash lookups for merges if nomerges is 1. Defaults
to 0, enabling all merges.
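For example, merge lookups can be switched off for one device like so (the
device name is illustrative):

# echo 1 > /sys/block/sda/queue/nomerges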
nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).
read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.
rq_affinity (RW)
----------------
If this option is enabled, the block layer will migrate request completions
to the CPU that originally submitted the request. For some workloads
this provides a significant reduction in CPU cycles due to caching effects.
scheduler (RW)
--------------
When read, this file will display the current and available IO schedulers
for this block device. The currently active IO scheduler will be enclosed
in [] brackets. Writing an IO scheduler name to this file will switch
control of this block device to that new IO scheduler. Note that writing
an IO scheduler name to this file will attempt to load that IO scheduler
module, if it isn't already present in the system.
Jens Axboe <jens.axboe@oracle.com>, February 2009


@@ -0,0 +1,88 @@
struct request documentation
Jens Axboe <jens.axboe@oracle.com> 27/05/02
1.0
Index
2.0 Struct request members classification
2.1 struct request members explanation
3.0
2.0
Short explanation of request members
Classification flags:
D driver member
B block layer member
I I/O scheduler member
Unless an entry contains a D classification, a device driver must not access
this member. Some members may contain D classifications, but should only be
accessed through certain macros or functions (eg ->flags).
<linux/blkdev.h>
2.1
Member                                  Flag    Comment
------                                  ----    -------
struct list_head queuelist              BI      Organization on various internal
                                                queues

void *elevator_private                  I       I/O scheduler private data

unsigned char cmd[16]                   D       Driver can use this for setting up
                                                a cdb before execution, see
                                                blk_queue_prep_rq

unsigned long flags                     DBI     Contains info about data direction,
                                                request type, etc.

int rq_status                           D       Request status bits

kdev_t rq_dev                           DBI     Target device

int errors                              DB      Error counts

sector_t sector                         DBI     Target location

sector_t hard_sector                    B       Used to keep sector sane

unsigned long nr_sectors                DBI     Total number of sectors in request

unsigned long hard_nr_sectors           B       Used to keep nr_sectors sane

unsigned short nr_phys_segments         DB      Number of physical scatter gather
                                                segments in a request

unsigned short nr_hw_segments           DB      Number of hardware scatter gather
                                                segments in a request

unsigned int current_nr_sectors         DB      Number of sectors in first segment
                                                of request

unsigned int hard_cur_sectors           B       Used to keep current_nr_sectors sane

int tag                                 DB      TCQ tag, if assigned

void *special                           D       Free to be used by driver

char *buffer                            D       Map of first segment, also see
                                                section on bouncing SECTION

struct completion *waiting              D       Can be used by driver to get signalled
                                                on request completion

struct bio *bio                         DBI     First bio in request

struct bio *biotail                     DBI     Last bio in request

struct request_queue *q                 DB      Request queue this request belongs to

struct request_list *rl                 B       Request list this request came from


@@ -0,0 +1,82 @@
Block layer statistics in /sys/block/<dev>/stat
===============================================
This file documents the contents of the /sys/block/<dev>/stat file.
The stat file provides several statistics about the state of block
device <dev>.
Q. Why are there multiple statistics in a single file? Doesn't sysfs
normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
represent a consistent snapshot of the state of the device. If the
statistics were exported as multiple files containing one statistic
each, it would be impossible to guarantee that a set of readings
represent a single point in time.
The stat file consists of a single line of text containing 11 decimal
values separated by whitespace. The fields are summarized in the
following table, and described in more detail below.
Name            units         description
----            -----         -----------
read I/Os       requests      number of read I/Os processed
read merges     requests      number of read I/Os merged with in-queue I/O
read sectors    sectors       number of sectors read
read ticks      milliseconds  total wait time for read requests
write I/Os      requests      number of write I/Os processed
write merges    requests      number of write I/Os merged with in-queue I/O
write sectors   sectors       number of sectors written
write ticks     milliseconds  total wait time for write requests
in_flight       requests      number of I/Os currently in flight
io_ticks        milliseconds  total time this block device has been active
time_in_queue   milliseconds  total wait time for all requests
read I/Os, write I/Os
=====================
These values increment when an I/O request completes.
read merges, write merges
=========================
These values increment when an I/O request is merged with an
already-queued I/O request.
read sectors, write sectors
===========================
These values count the number of sectors read from or written to this
block device. The "sectors" in question are the standard UNIX 512-byte
sectors, not any device- or filesystem-specific block size. The
counters are incremented when the I/O completes.
read ticks, write ticks
=======================
These values count the number of milliseconds that I/O requests have
waited on this block device. If there are multiple I/O requests waiting,
these values will increase at a rate greater than 1000/second; for
example, if 60 read requests wait for an average of 30 ms, the read_ticks
field will increase by 60*30 = 1800.
in_flight
=========
This value counts the number of I/O requests that have been issued to
the device driver but have not yet completed. It does not include I/O
requests that are in the queue but not yet issued to the device driver.
io_ticks
========
This value counts the number of milliseconds during which the device has
had I/O requests queued.
time_in_queue
=============
This value counts the number of milliseconds that I/O requests have waited
on this block device. If there are multiple I/O requests waiting, this
value will increase as the product of the number of milliseconds times the
number of requests waiting (see "read ticks" above for an example).


@@ -0,0 +1,37 @@
To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop', 'as' and 'cfq' (the default) are also available. IO schedulers are
assigned globally at boot time only presently.
Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in:
/sys/block/<device>/queue/iosched
assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted,
you can do so by typing:
# mount none /sys -t sysfs
As of the Linux 2.6.10 kernel, it is now possible to change the
IO scheduler for a given block device on the fly (thus making it possible,
for instance, to set the CFQ scheduler for the system default, but
set a specific device to use the anticipatory or noop schedulers - which
can improve that device's throughput).
To set a specific scheduler, simply do this:
echo SCHEDNAME > /sys/block/DEV/queue/scheduler
where SCHEDNAME is the name of a defined IO scheduler, and DEV is the
device name (hda, hdb, sga, or whatever you happen to have).
The list of defined schedulers can be found by simply doing
a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
will be displayed, with the currently selected scheduler in brackets:
# cat /sys/block/hda/queue/scheduler
noop anticipatory deadline [cfq]
# echo anticipatory > /sys/block/hda/queue/scheduler
# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq