Background
----------

The PCI DMA mapping functions work by taking a virtual address and mapping
it to a bus address. This works fine for I/O to pages that have a virtual
address mapping. On architectures with HIGHMEM, however, pages in the
highmem zone have no such mapping (because of the limited address space).
Highmem pages can still be mapped directly to a bus address, so if we
collapse the page -> virt -> bus translation into a simple page -> bus
translation, that's all we need. Block drivers usually use the pci_map_sg
function to do this mapping, with sg->address set to the virtual address
of the buffer. The current approach I took is to add an sg->page too, and
have pci_map_sg map sg->page (if set) or sg->address accordingly. That in
turn requires an sg->offset, which was added to struct scatterlist as well.
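
As a rough illustration of the page-or-address choice described above, here
is a minimal user-space sketch in C. It is not the kernel's pci_map_sg: all
sketch_ names, the reduced struct layout, the 4kB page size and the
0xC0000000 direct-map base are assumptions made purely for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT      12                        /* assumed 4kB pages */
#define SKETCH_DIRECT_MAP_BASE ((uintptr_t)0xC0000000UL) /* assumed lowmem base */

typedef uint64_t sketch_bus_addr_t;

/* Hypothetical stand-in for struct page: only the page frame number. */
struct sketch_page {
    uint64_t pfn;
};

/* Hypothetical stand-in for struct scatterlist with the added fields. */
struct sketch_sg {
    void *address;                  /* virtual address, or NULL if page is set */
    struct sketch_page *page;       /* page to map, or NULL */
    unsigned int offset;            /* byte offset into page */
    unsigned int length;            /* length of the segment in bytes */
    sketch_bus_addr_t dma_address;  /* filled in by sketch_map_sg() */
};

/* virt -> bus: only valid for addresses inside the direct mapping. */
static sketch_bus_addr_t sketch_virt_to_bus(const void *virt)
{
    return (sketch_bus_addr_t)((uintptr_t)virt - SKETCH_DIRECT_MAP_BASE);
}

/* page -> bus: needs no virtual mapping, so it also works for highmem. */
static sketch_bus_addr_t sketch_page_to_bus(const struct sketch_page *page)
{
    return (sketch_bus_addr_t)(page->pfn << SKETCH_PAGE_SHIFT);
}

/* Map every entry: use sg->page (plus offset) if set, else sg->address. */
static int sketch_map_sg(struct sketch_sg *sg, int nents)
{
    int i;

    for (i = 0; i < nents; i++) {
        assert(sg[i].page != NULL || sg[i].address != NULL);
        if (sg[i].page != NULL)
            sg[i].dma_address = sketch_page_to_bus(sg[i].page) + sg[i].offset;
        else
            sg[i].dma_address = sketch_virt_to_bus(sg[i].address);
    }
    return nents;
}
```

An entry with sg->page set never goes through a virtual address, which is
the whole point: the page can sit above the direct mapping and still get a
bus address.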

For single page mappings, there is pci_map_page.

Why
---

Data in the page cache can reside in highmem. When we want to write out
one of these pages, the following happens:

- WRITE buffer_head, with b_page in highmem
- Allocate new buffer_head and page in low memory
- Copy data from highmem page to newly allocated low mem page
- Continue writing out new page
- On end I/O, free bounce buffer_head and page

or for a read

- READ buffer_head, with b_page in highmem
- Allocate new buffer_head and page in low memory
- Continue reading in data
- On end I/O, copy back data to high memory page and free bounce
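
The two sequences above can be sketched as a small user-space simulation in
C. This is only a model of the bouncing steps, not the kernel code: the
sketch_ structures and functions are hypothetical, the "disk" is a single
in-memory page, and end I/O is handled inline instead of from a completion
handler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

enum sketch_rw { SKETCH_READ, SKETCH_WRITE };

/* Hypothetical, heavily reduced stand-in for struct buffer_head. */
struct sketch_bh {
    unsigned char *data;  /* contents of the page backing this buffer */
    int in_highmem;       /* nonzero if the page lives in the highmem zone */
};

/* Hypothetical one-page "disk" to do I/O against. */
struct sketch_disk {
    unsigned char sector[SKETCH_PAGE_SIZE];
};

/* The "driver": it can only touch buffers in low memory. */
static void sketch_do_io(struct sketch_disk *disk, enum sketch_rw rw,
                         struct sketch_bh *bh)
{
    assert(!bh->in_highmem);
    if (rw == SKETCH_WRITE)
        memcpy(disk->sector, bh->data, SKETCH_PAGE_SIZE);
    else
        memcpy(bh->data, disk->sector, SKETCH_PAGE_SIZE);
}

/*
 * Submit one buffer. Returns the number of extra page copies that bouncing
 * caused (0 or 1), or -1 if the bounce allocation failed.
 */
static int sketch_submit(struct sketch_disk *disk, enum sketch_rw rw,
                         struct sketch_bh *bh)
{
    struct sketch_bh *bounce;

    if (!bh->in_highmem) {
        sketch_do_io(disk, rw, bh);
        return 0;
    }

    /* Allocate new buffer_head and page in low memory. */
    bounce = malloc(sizeof(*bounce));
    if (bounce == NULL)
        return -1;
    bounce->data = malloc(SKETCH_PAGE_SIZE);
    if (bounce->data == NULL) {
        free(bounce);
        return -1;
    }
    bounce->in_highmem = 0;

    /* WRITE: copy the data down before the I/O starts. */
    if (rw == SKETCH_WRITE)
        memcpy(bounce->data, bh->data, SKETCH_PAGE_SIZE);

    sketch_do_io(disk, rw, bounce);

    /* End I/O. READ: copy the data back up to the highmem page. */
    if (rw == SKETCH_READ)
        memcpy(bh->data, bounce->data, SKETCH_PAGE_SIZE);

    /* Free the bounce buffer_head and page. */
    free(bounce->data);
    free(bounce);
    return 1;
}
```

Every highmem buffer costs two allocations and one full page copy in this
model, and the submission can fail on allocation alone.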

To access data in highmem pages, Linux plays TLB tricks by temporarily
mapping it into reserved low memory. The READ copy-back is extra expensive,
since the non-caching atomic mapping function has to be used because it
can (depending on low level driver) be called from IRQ context. Thus the
entire page copy also happens with interrupts disabled!

This is what Linux 2.2/2.4 currently does, and it doesn't take a genius
to note that this is less than optimal. Simple kernel profiling during
dbench disk testing shows the bouncing eating significant amounts of CPU.

Requiring a full page and buffer_head allocation in order to do I/O is
very awkward too, and has led to numerous deadlocks in the past.


Memory zoning
-------------



Status of bio patches
---------------------

Functional and stable for IDE and SCSI. The various RAID controllers
(cpqarray, cciss, DAC960) have not been tested as I don't have the
hardware, but they should work too. The cciss/cpqarray maintainer is
currently testing cciss.

Highmem I/O should work on:

- IDE (DMA of course, various PIO modes checked for correct
mapping and it seems good)
- cpqarray and cciss (untested)
- SCSI

SCSI and IDE both pass cerberus stress testing (aic7xxx and piix,
respectively), so I think the patch can be considered fairly stable.
SCSI support is currently enabled for aic7xxx and sym53c8xx. IDE highmem
has been tested the most and I'd be surprised if there are any show-stopper
bugs left there.

ll_rw_kio works, and brw_kiovec was greatly simplified by this. In
addition, with bio we can remove the embedded buffer_head and blocks array
in struct kiobuf, which reduces the size of each kiobuf by ~8kB.
No real performance testing on this yet, just a simple dd:

(clean kernel, 2.4.5-pre4)
bart:~ # time dd if=/dev/raw1 of=/dev/null bs=4k
128008+0 records in
128008+0 records out

real 0m43.130s
user 0m0.290s
sys 0m7.060s

(2.4.5-pre4 + bio)
bart:~ # time dd if=/dev/raw1 of=/dev/null bs=4k
128008+0 records in
128008+0 records out

real 0m38.478s
user 0m0.204s
sys 0m5.091s
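
To put the two runs side by side, the arithmetic can be sketched in C as
below. The helper names are hypothetical, and the only inputs are the
numbers printed above (128008 records of 4k each, plus the real and sys
times). This is a single dd run per kernel, so treat the result as an
indication rather than a benchmark: roughly 10.8% less wall-clock time,
27.9% less system time, and about 11.6 versus 13.0 MiB/s.

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
static double sketch_percent_saved(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* Throughput in MiB/s for `records` blocks of `block_size` bytes. */
static double sketch_mib_per_sec(unsigned long records,
                                 unsigned long block_size, double seconds)
{
    assert(seconds > 0.0);
    return (double)records * (double)block_size
           / (1024.0 * 1024.0) / seconds;
}
```

For example, sketch_percent_saved(43.130, 38.478) gives the wall-clock
saving and sketch_mib_per_sec(128008, 4096, 38.478) the throughput of the
bio run.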


Pending / TODO
--------------

- Convert more drivers whose hardware supports I/O to high memory pages to
do it directly instead of using bounce buffers.
- Drop plug_tq and convert to per-queue unplugging.
- Merge my queue-barrier patch
- Look into moving more of the per-major arrays into the request_queue
- Fix md and lvm
- Fix ide-scsi
- Fix drivers/s390 block stuff
- Probably still some drivers that aren't converted and not listed, look
- Lose kdev_t from the block layer completely

Feel free to jump in and start hacking on the TODO list :-)

Jens Axboe <axboe@suse.de>