Date: 2019-08-30

Direct IO helpers for block devices and regular files on FreeBSD, Linux, macOS and Windows. #mechanical-sympathy

Installation

npm install @ronomon/direct-io

Direct Memory Access

Direct memory access bypasses the filesystem cache and avoids memory copies to and from kernel space, writing and reading directly to and from the disk device cache. File I/O is done directly to and from user space buffers regardless of whether the file descriptor is a block device or regular file. To enable direct memory access, use the following open flag or method according to the platform:

O_DIRECT (FreeBSD, Linux, Windows)

Provide as a flag to fs.open() when opening a block device or regular file to enable direct memory access.

On Windows, O_DIRECT is supported as of libuv 1.16.0 (i.e. Node 9.2.0 and up), where O_DIRECT is mapped to FILE_FLAG_NO_BUFFERING.
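As a rough sketch of how the flag might be combined with the usual access flags, the helper below takes a constants object and ORs in O_DIRECT only when it is defined. The helper name directOpenFlags is hypothetical, and which constants object actually carries O_DIRECT on your platform (Node's fs.constants or the constants exported by this module) is an assumption you should verify.

```javascript
'use strict';
const fs = require('fs');

// Hypothetical helper: combine an access mode with O_DIRECT when available.
function directOpenFlags(constants, writable) {
  const access = writable ? constants.O_RDWR : constants.O_RDONLY;
  // O_DIRECT is undefined where the platform has no such flag (e.g. macOS),
  // so fall back to 0 and rely on setF_NOCACHE() there instead.
  const direct = constants.O_DIRECT || 0;
  return access | direct;
}

// The numeric result can be passed as the flags argument of fs.open().
const flags = directOpenFlags(fs.constants, false);
console.log('open flags:', flags);
```

Note that a filesystem may still reject O_DIRECT at open time (tmpfs on Linux is a common case), so the error from fs.open() should be handled.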

setF_NOCACHE(fd, value, callback) (macOS)

Turns data caching off or on for an open file descriptor for a block device or regular file:

A value of 1 turns data caching off (this is the nearest equivalent of O_DIRECT on macOS).

A value of 0 turns data caching back on.

Turning data caching off with F_NOCACHE will not purge any previously cached pages. Subsequent direct reads will continue to return cached pages if they exist, and concurrent processes may continue to populate the cache through non-direct reads. To ensure direct reads on macOS (for example when data scrubbing) you should set F_NOCACHE as soon as possible to avoid populating the cache.

Alternatively, if you want to ensure initial boot conditions with a cold disk buffer cache, you can purge the entire cache for all files using sudo purge. This will affect system performance.
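The advice to set F_NOCACHE as soon as possible could look something like the sketch below, which disables caching immediately after the open and before any read is issued. The helper name openUncached is hypothetical, fs and the module are passed in as parameters, and the Node-style callback(error) signature of setF_NOCACHE() is an assumption.

```javascript
'use strict';

// Hypothetical helper: open a file and turn caching off before any read.
// `directIO` is assumed to expose setF_NOCACHE(fd, value, callback).
function openUncached(fs, directIO, path, callback) {
  fs.open(path, 'r', function (openError, fd) {
    if (openError) return callback(openError);
    directIO.setF_NOCACHE(fd, 1, function (cacheError) {
      if (cacheError) {
        // Close the descriptor so it does not leak, then report the error.
        return fs.close(fd, function () {
          callback(cacheError);
        });
      }
      callback(undefined, fd);
    });
  });
}
```

On macOS you might call this as openUncached(require('fs'), require('@ronomon/direct-io'), '/dev/disk1', callback), assuming the module exports the method under that name.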

Buffer Alignment

When writing or reading to and from a block device or regular file using direct memory access, you need to make sure that your buffer is aligned correctly or you may receive an EINVAL error or be switched back silently to non-DMA mode.

To be aligned correctly, the address of the allocated memory must be a multiple of alignment, i.e. the physical sector size (not logical sector size) of the block device. Buffers allocated using Node's Buffer.alloc() and related methods will not meet these alignment requirements. Use getAlignedBuffer() to create aligned buffers:

getAlignedBuffer(size, alignment) (FreeBSD, Linux, macOS, Windows)

Returns an aligned buffer:

size must be greater than 0, and a multiple of the physical sector size of the block device (typically 512 bytes or 4096 bytes).

size must be at most require('buffer').kMaxLength bytes.

alignment must be at least 8 bytes for portability between 32-bit and 64-bit systems.

alignment must be a power of two, and a multiple of the physical sector size of the block device.

alignment must be at most 4194304 bytes for a safe arbitrary upper bound.

An alignment of 4096 bytes should be compatible with Advanced Format drives as well as backwards compatible with 512 sector drives. If you want to be sure, you should use getBlockDevice() below to get the actual physical sector size of the block device.

Buffers are instances of Node's Buffer class and have the same methods and properties, except that they are also aligned.

Buffers are allocated using the appropriate call (either posix_memalign or _aligned_malloc depending on the platform).

Buffers are zero-filled with memset() when allocated for safety.

Buffers are automatically freed using the appropriate call when garbage collected in V8, either free() or _aligned_free() depending on the platform.

getAlignedBuffer() should be used judiciously as the algorithm that realizes the alignment constraint can incur significant memory overhead.
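The size and alignment rules above can be checked before calling getAlignedBuffer(). The following is only a sketch of such a pre-check: the function name checkAlignedBufferArgs is hypothetical, it is not part of this module, and it assumes you already know a non-zero physical sector size.

```javascript
'use strict';
const { kMaxLength } = require('buffer');

// Hypothetical pre-check mirroring the documented constraints.
// Returns a list of violated rules (empty when the arguments look valid).
function checkAlignedBufferArgs(size, alignment, physicalSectorSize) {
  const errors = [];
  if (!Number.isInteger(size) || size <= 0) {
    errors.push('size must be greater than 0');
  } else if (size % physicalSectorSize !== 0) {
    errors.push('size must be a multiple of the physical sector size');
  }
  if (size > kMaxLength) {
    errors.push('size must be at most kMaxLength');
  }
  if (!Number.isInteger(alignment) || alignment < 8) {
    errors.push('alignment must be at least 8');
  }
  // A power of two has exactly one bit set, so n & (n - 1) is 0.
  if (!Number.isInteger(alignment) || (alignment & (alignment - 1)) !== 0) {
    errors.push('alignment must be a power of two');
  }
  if (alignment % physicalSectorSize !== 0) {
    errors.push('alignment must be a multiple of the physical sector size');
  }
  if (alignment > 4194304) {
    errors.push('alignment must be at most 4194304');
  }
  return errors;
}

console.log(checkAlignedBufferArgs(65536, 4096, 4096));
```

If the returned list is empty, the same size and alignment values can be passed on to getAlignedBuffer(size, alignment).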

Synchronous Writes

Direct memory access will write directly to the disk device cache but not necessarily to the disk device storage medium. To ensure that your data is flushed from the disk device cache to the disk device storage medium, you should also open the block device or regular file using the O_DSYNC or O_SYNC open flags. These are the equivalent of calling fs.fdatasync() or fs.fsync() respectively after every write, but with less system call overhead, and with the advantage that they encourage the disk device to do real work during the fs.write() call, which can be useful when overlapping compute-intensive work with IO.

Some systems implement the O_DSYNC and O_SYNC open flags by setting the Force Unit Access (FUA) flag, which works for SCSI but not for EIDE and SATA drivers.

Conversely, some systems have had bugs where calling fs.fdatasync() or fs.fsync() on a regular file would force a flush only if the page cache was dirty, so that bypassing the page cache using O_DIRECT meant the disk device cache was never flushed.

This means that the O_DSYNC and O_SYNC open flags are not sufficient on their own, but should be combined with fs.fdatasync() or fs.fsync() for durability or write barriers. This does not mean that these open flags are not useful. As we have already seen, they can reduce the latency of the eventual fsync call, eliminating latency spikes.
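A minimal sketch of combining the open flag with an explicit flush, using only Node's fs module. The helper name durableWrite and the temporary file name are hypothetical. O_DIRECT is deliberately left out so that the sketch runs on any filesystem, and O_DSYNC falls back to 0 where fs.constants does not define it.

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write with O_DSYNC, then still call fdatasync
// as the explicit durability barrier.
function durableWrite(file, data) {
  // O_DSYNC may be undefined on some platforms; the explicit fdatasync
  // below still applies in that case.
  const O_DSYNC = fs.constants.O_DSYNC || 0;
  const flags = fs.constants.O_WRONLY | fs.constants.O_CREAT |
    fs.constants.O_TRUNC | O_DSYNC;
  const fd = fs.openSync(file, flags, 0o600);
  try {
    let offset = 0;
    // writeSync may write fewer bytes than requested, so loop until done.
    while (offset < data.length) {
      offset += fs.writeSync(fd, data, offset, data.length - offset, offset);
    }
    fs.fdatasyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
}

const target = path.join(os.tmpdir(), 'durable-write-example.bin');
durableWrite(target, Buffer.alloc(4096, 1));
fs.unlinkSync(target);
```

The synchronous calls keep the sketch short. In a server you would normally use the asynchronous fs.write() and fs.fdatasync() equivalents.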

O_DSYNC (FreeBSD, Linux, macOS, Windows)

Flushes all data and only required associated metadata to the underlying hardware.

O_SYNC (FreeBSD, Linux, macOS, Windows)

Flushes all data and any associated metadata to the underlying hardware.

To understand the difference between O_DSYNC and O_SYNC, consider two pieces of file metadata: the file modification timestamp and the file length. All write operations will update the file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not required to ensure that the data can be read back successfully, but the file length is needed. Thus, O_DSYNC will only flush updates to the file length metadata, whereas O_SYNC will also flush the file modification timestamp metadata.

On Windows, synchronous writes are supported as of libuv 1.16.0 (i.e. Node 9.2.0 and up), where O_DSYNC and O_SYNC are both mapped to FILE_FLAG_WRITE_THROUGH.

Block Device Size and Sector Size

Node's fs.fstat() will not work at all for a block device on Windows, and will not report the correct size for a block device on other platforms. You should use getBlockDevice() instead:

getBlockDevice(fd, callback) (FreeBSD, Linux, macOS, Windows)

Returns an object with the following properties:

logicalSectorSize - The size of a logical sector in bytes. Some drives will advertise a backwards compatible logical sector size of 512 bytes while their physical sector size is in fact 4096 bytes.

physicalSectorSize - The size of a physical sector in bytes. You should use this to decide on the size and alignment parameters when getting aligned buffers so that reads and writes are always a multiple of the physical sector size. Some virtual devices may report a physicalSectorSize of 0 bytes.

size - The total size of the block device in bytes.

serialNumber - The serial number reported by the device. (FreeBSD, Linux)
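One way to turn the reported sector sizes into buffer parameters is sketched below. Both helper names are hypothetical, and the fallback to 4096 bytes when physicalSectorSize is 0 is an assumed policy based on the Advanced Format note earlier, not documented behavior of this module.

```javascript
'use strict';

// Hypothetical helper: pick a sector size from a getBlockDevice() result.
function chooseSectorSize(device) {
  if (device.physicalSectorSize > 0) return device.physicalSectorSize;
  // Assumed fallback for virtual devices that report 0 bytes:
  // use at least 4096, or the logical sector size if that is larger.
  return Math.max(device.logicalSectorSize || 0, 4096);
}

// Hypothetical helper: round a length up to a whole number of sectors.
function roundUpToSector(length, sectorSize) {
  return Math.ceil(length / sectorSize) * sectorSize;
}

const device = { logicalSectorSize: 512, physicalSectorSize: 4096, size: 0 };
const sector = chooseSectorSize(device);
console.log(sector, roundUpToSector(5000, sector));
```

The two results can then serve as the alignment and size arguments of getAlignedBuffer().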

Block Device Path and Permissions

You will need sudo or administrator privileges to open a block device. You can use fs.open(path, flags) to open a block device, where the path you provide will depend on the platform:

/dev/sda (FreeBSD, Linux)

/dev/disk1 (macOS)

\\.\PhysicalDrive1 (Windows)

You can use these shell commands to see which block devices are available:

$ camcontrol devlist (FreeBSD)

$ lsblk (Linux)

$ diskutil list (macOS)

$ wmic diskdrive list brief (Windows)
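If a script needs to point users at the right path style and listing command, the two lists above can be keyed on process.platform. This is just an illustrative lookup table: the function name blockDeviceHints is hypothetical, and the example paths are the ones shown above rather than devices that necessarily exist on your machine.

```javascript
'use strict';

// Example paths and listing commands from the lists above,
// keyed by the values Node reports in process.platform.
const HINTS = new Map([
  ['freebsd', { examplePath: '/dev/sda', listCommand: 'camcontrol devlist' }],
  ['linux', { examplePath: '/dev/sda', listCommand: 'lsblk' }],
  ['darwin', { examplePath: '/dev/disk1', listCommand: 'diskutil list' }],
  ['win32', {
    examplePath: '\\\\.\\PhysicalDrive1',
    listCommand: 'wmic diskdrive list brief'
  }]
]);

// Hypothetical helper: returns undefined for platforms not listed above.
function blockDeviceHints(platform) {
  return HINTS.get(platform);
}

console.log(blockDeviceHints(process.platform));
```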

Mandatory Locks

Windows has special restrictions concerning writing to block devices. You must lock the block device by providing the O_EXLOCK flag when opening the file descriptor, or else by calling setFSCTL_LOCK_VOLUME() after opening the file descriptor. On other platforms it is good practice to lock the block device by providing either the O_EXCL or O_EXLOCK flag when opening the file descriptor:

O_EXCL (Linux)

Provide as a flag to fs.open() when opening a block device to obtain an exclusive mandatory (and not just advisory) lock. When opening a regular file, behavior is undefined. In general, the behavior of O_EXCL is undefined if it is used without O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can be used without O_CREAT if the path refers to a block device. If the block device is in use by the system, e.g. if it is mounted, fs.open() will fail with an EBUSY error.

O_EXLOCK (macOS, Windows)

Provide as a flag to fs.open() when opening a block device or regular file to obtain an exclusive mandatory (and not just advisory) lock.

On macOS, when opening a regular file with O_EXLOCK, fs.open() will block until any existing lock is released. While adding O_NONBLOCK can avoid this, it also introduces other IO semantics. Using O_EXLOCK should therefore be limited to opening a block device on macOS. If the block device is in use by the system, e.g. if it is mounted, fs.open() will fail with an EBUSY error.

On Windows, O_EXLOCK is supported as of libuv 1.17.0 (i.e. Node 9.3.0 and up), where O_EXLOCK is mapped to an exclusive sharing mode of 0. If the block device or regular file is already open, fs.open() will fail with an EBUSY error.
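The platform split between O_EXCL and O_EXLOCK can be captured in a small helper, sketched here. The function name exclusiveLockFlag is hypothetical, and the constants object is assumed to carry both flags (Node's fs.constants does not define O_EXLOCK, so it would need to come from elsewhere, for example this module's exported constants, which is an assumption).

```javascript
'use strict';

// Hypothetical helper: choose the lock flag for opening a block device.
// Platforms not covered by the text above get no lock flag (0).
function exclusiveLockFlag(platform, constants) {
  if (platform === 'linux') {
    return constants.O_EXCL || 0;
  }
  if (platform === 'darwin' || platform === 'win32') {
    return constants.O_EXLOCK || 0;
  }
  return 0;
}

// Placeholder flag values for illustration only, not real platform values.
const example = { O_EXCL: 128, O_EXLOCK: 32 };
console.log(exclusiveLockFlag(process.platform, example));
```

The result would be ORed into the flags passed to fs.open(), and an EBUSY error from the open then indicates the device is already in use.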

setFSCTL_LOCK_VOLUME(fd, value, callback) (Windows)

Locks a block device on Windows if not in use:

A value of 1 locks the block device using FSCTL_LOCK_VOLUME.

A value of 0 unlocks the block device using FSCTL_UNLOCK_VOLUME if it was previously locked.

A locked block device can be accessed only through the file descriptor that locked it.

A locked block device remains locked until the application unlocks the block device, or until the file descriptor is closed, either directly through fs.close(), or indirectly when the process terminates.

If the specified block device is a system volume or contains a page file, the operation will fail.

If there are any open files on the block device, the operation will fail. Conversely, success of this operation indicates that there are no open files.

The system will flush all cached data to the block device before locking it.

The NTFS file system treats a locked block device as a dismounted volume. The FSCTL_DISMOUNT_VOLUME control code functions similarly but does not check for open files before dismounting.

Without a successful lock operation, a dismounted volume may be remounted by any process at any time.
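Since the lock should be released once the work is done, a lock, work, unlock wrapper is one possible pattern. The sketch below is not part of the module: the helper name withVolumeLock is hypothetical, the module is passed in as a parameter, and the Node-style callback(error) signature of setFSCTL_LOCK_VOLUME() is an assumption.

```javascript
'use strict';

// Hypothetical wrapper: lock the volume, run `work`, then always unlock.
// `work` receives a done(error, result) callback.
function withVolumeLock(directIO, fd, work, callback) {
  directIO.setFSCTL_LOCK_VOLUME(fd, 1, function (lockError) {
    // If the lock fails (e.g. open files on the volume), do not run work.
    if (lockError) return callback(lockError);
    work(function (workError, result) {
      directIO.setFSCTL_LOCK_VOLUME(fd, 0, function (unlockError) {
        // Report the work error first, otherwise any unlock error.
        callback(workError || unlockError, result);
      });
    });
  });
}
```

Closing the file descriptor also releases the lock, so the explicit unlock mainly matters when the descriptor stays open for further use.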

Advisory Locks

setFlock(fd, value, callback) (FreeBSD, Linux, macOS)

Apply or remove an advisory lock on an open regular file:

A value of 1 locks the regular file using flock(LOCK_EX | LOCK_NB).

A value of 0 unlocks the regular file using flock(LOCK_UN).

Advisory locks allow cooperating processes to perform consistent operations but do not guarantee consistency since other processes may not check for the presence of advisory locks.

A locked regular file remains locked until the application unlocks the regular file, or until the file descriptor is closed, either directly through fs.close(), or indirectly when the process terminates.

Benchmark

The write performance of various block sizes and open flags can vary across operating systems and between hard drives and solid state drives. Use the included write benchmark to benchmark various block sizes and open flags on the local file system (by default) or on a specific block device or regular file:

WARNING: The write benchmark will erase the contents of the specified block device or regular file if any.

[sudo] node benchmark.js [device|file]