To get the best performance out of Buurst SoftNAS®, in either on-premise or cloud-based solutions, consult and remember these best practices.
As with any storage system, NAS performance is a function of several combined factors:
- Cache memory (first level read cache or ARC)
- 2nd level cache (e.g., L2ARC) speed
- Disk drive speed and the chosen RAID configuration
- Disk controller and protocol
Cache Memory (first level)
Solid state disk (SSD) and PCIe flash cache cards offer high-speed read caching and transaction logging for synchronous writes. However, not all SSDs are created equal and some are better for these tasks than others. In particular, pay close attention to the specifications regarding 4K IOPS.
For read caching (L2ARC), both read and write IOPS matter, as do the sequential throughput specifications of the device. When running a database, VMware VMDK, or other workloads that produce large amounts of small, random (e.g., 4KB) reads and writes, ensure the SSD and flash cache devices provide high IOPS for 4K reads/writes.
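As a rough illustration of how a 4K IOPS rating relates to throughput, the following sketch converts an IOPS figure at a fixed block size into approximate MiB/s. The function name and the sample rating are illustrative assumptions, not vendor specifications.

```python
def iops_to_throughput_mib(iops: int, block_size_bytes: int = 4096) -> float:
    """Approximate throughput in MiB/s for a device sustaining `iops`
    operations per second at a fixed block size."""
    return iops * block_size_bytes / (1024 * 1024)


# Hypothetical SSD rated at 50,000 random 4K IOPS.
print(iops_to_throughput_mib(50_000))  # 195.3125
```

A device with a high sequential throughput rating but a low 4K IOPS rating would deliver far less than its headline figure on small random I/O, which is why the 4K specification deserves attention.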
For the write log (ZIL), extremely fast write IOPS is most important. The ZIL is only read after a power failure or other outage event, to replay synchronous write transactions that may not have been posted prior to the outage. ZFS always uses a ZIL (unless the property "sync=disabled" is set). By default, the ZIL uses the devices which comprise the storage pool. An "SLOG" device (called a "Write Log" in SoftNAS®) offloads the ZIL from the main pool to a separate log device, which improves performance when the right log device is chosen and configured properly.
2nd Level Read Cache
To further improve read and query performance, configure a Read Cache device for use with SoftNAS®. SoftNAS® leverages the ZFS "L2ARC" as its second level cache.
Cloud-based Read Cache
For cloud-based deployments, choose an instance type which includes local solid state disks (SSDs). The storage server will make use of as much read cache as it has been provided. Read cache devices can be added and removed at any time with no risk of data loss to an existing storage pool.
For many cloud vendors, there are two choices for SSD read cache:
- Local SSD - this is the fastest read cache available, as the local SSDs are directly attached to each instance and typically provide up to 120,000 IOPS
- Block storage Provisioned IOPS - these volumes can be assigned to SSD, providing a specified level of guaranteed IOPS
On-Premise Read Cache
For on-premise deployments, add one or more SSD devices to the local storage server. Use of a properly designed read cache is essential to get the IOPS and throughput for database, VDI, vMotion and other workloads comprised primarily of small I/O operations (e.g., lots of small files, VMDKs, database transactions, etc.).
The "write log" on SoftNAS® leverages the ZFS Intent Log (ZIL). The ZIL is a "transaction log" used to record synchronous writes (not asynchronous writes). When SoftNAS® receives synchronous write requests, before returning to the caller, ZFS first records the write in memory and then completes the write to the ZIL. By default, the ZIL is located on the same persistent storage associated with the storage pool (e.g., spinning disk media). Once the write is recorded in the ZIL, the synchronous write is completed and the NFS, CIFS or iSCSI request returns to the caller.
To increase performance of synchronous writes, add a separate write log (sometimes referred to as a "SLOG") device, as discussed in the Read Cache section above. A separate write log device enables ZFS to quickly store synchronous write data and return to the caller.
Note: This write log is only read in the event of a power failure or VM / instance crash, to replay the transactions that were not committed prior to the outage event. Writes remain in RAM cache to satisfy subsequent read requests and are staged to permanent storage during normal transaction processing (every 5 seconds by default).
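The sequence described above (record in memory, record in the log, acknowledge, commit periodically, replay after a crash) can be sketched with a toy model. This is not ZFS code; the class and method names are hypothetical and exist only to mirror the steps in the text.

```python
class ToySyncWritePath:
    """Toy model of a synchronous write path with an intent log."""

    def __init__(self):
        self.ram = {}         # recent writes; also serves reads
        self.intent_log = []  # stands in for the ZIL or a separate write log
        self.pool = {}        # stands in for permanent pool storage

    def sync_write(self, key, value):
        # Record the write in memory, then in the log, before returning.
        self.ram[key] = value
        self.intent_log.append((key, value))
        return "ack"

    def commit(self):
        # Periodic commit: writes held in RAM are stored in the pool,
        # so the logged copies are no longer needed.
        self.pool.update(self.ram)
        self.intent_log.clear()

    def read(self, key):
        # Reads are served from RAM first, then from the pool.
        return self.ram.get(key, self.pool.get(key))

    def crash_and_recover(self):
        # RAM contents are lost; the log is replayed into the pool.
        self.ram.clear()
        for key, value in self.intent_log:
            self.pool[key] = value
        self.intent_log.clear()


path = ToySyncWritePath()
path.sync_write("a", 1)
path.commit()
path.sync_write("b", 2)    # acknowledged but not yet committed
path.crash_and_recover()   # log replay restores "b"
print(path.pool)           # {'a': 1, 'b': 2}
```

In this model the log is written on every synchronous write but read only in crash_and_recover, which is why write IOPS is the figure that matters for a write log device.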
Note: Do not use local SSD or ephemeral disks attached directly to an instance for the write log, as these instance local devices are not guaranteed to be available again after reboot. Instead, use volumes with Provisioned IOPS for the Write Log (it's okay to use local SSD devices for Read Cache).
Disk Controller Considerations
Following a few disk controller best practices helps get the most performance from these cache devices:
With disk controller pass-through, the disk controller is passed through to the SoftNAS® VM, which enables the SoftNAS® OS to interact directly with the disk controller. This provides the best possible performance, but requires CPUs and motherboards which support Intel VT-d and disk controllers supported by the CentOS operating system.
Note: For servers with the disk controller built into the motherboard, it is now common to install a virtual platform and then boot from USB, freeing up the disk controller for pass-through use.
PCIe Flash Cache Cards
Flash memory plug-in cards with extremely fast NAND memory are available in PCIe form. These make fast flash storage available directly over the PCIe bus. Be sure to choose a PCIe flash memory card that is supported by the hardware's virtualization vendor.
Raw Device Mapping
Some SSD devices can be mapped directly to the SoftNAS® VM using Raw Device Mapping (RDM). Raw device access allows SCSI commands to flow directly between the SoftNAS CentOS operating system and the SSD device for peak cache performance and IOPS, and to reduce context-switching between the SoftNAS® VM running CentOS and the virtualization host.
Disk controller pass-through is preferred to RDM on systems with processors and configurations that support it.
Disk Speed and RAID
Virtual Devices and IOPS
As SoftNAS® is built atop ZFS, IOPS (I/O operations per second) are mostly a function of the number of virtual devices (vdevs) in a zpool, not of the raw number of disks in the zpool. This is probably the single most important thing to understand, and it is commonly overlooked. A vdev is a "virtual device": a single device or partition that acts as a source of storage on which a pool can be created. For example, in VMware, each vdev can be a VMDK or raw disk device assigned to the SoftNAS® VM.
A multi-device or multi-partition vdev can be in one of the following shapes:
- Stripe (technically, each chunk of a stripe is its own vdev)
- A dynamic stripe of multiple mirror and/or RaidZ child vdevs
ZFS stripes writes across vdevs (not individual disks). A vdev is typically IOPS-bound to the speed of the slowest disk within it. So with one vdev of 100 disks, a zpool's raw IOPS potential is effectively that of a single disk, not 100.
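The rule of thumb above can be expressed as a short estimate: each vdev contributes roughly the IOPS of its slowest disk, and the pool total is the sum across vdevs. This is a deliberate simplification that ignores caching and any read scaling within a vdev; the function name and the per-disk figure of 150 IOPS are illustrative assumptions.

```python
from typing import Sequence


def estimate_pool_iops(vdevs: Sequence[Sequence[int]]) -> int:
    """Rough pool IOPS estimate.

    Each inner sequence holds the per-disk IOPS ratings of one vdev.
    A vdev is treated as bound by its slowest disk.
    """
    return sum(min(disks) for disks in vdevs)


# 100 disks in a single vdev versus the same disks split into 10 vdevs.
one_wide_vdev = [[150] * 100]
ten_vdevs = [[150] * 10 for _ in range(10)]

print(estimate_pool_iops(one_wide_vdev))  # 150
print(estimate_pool_iops(ten_vdevs))      # 1500
```

The same 100 disks yield very different estimates depending on how they are grouped, which is the point of counting vdevs rather than disks.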
If the environment uses a hardware RAID which presents a unified datastore to VMware, then the actual striping of writes occurs in the RAID controller card. Be aware of where striping occurs and the implications for performance (especially for write throughput).
For information about RAID, see section RAID Considerations.
A common misunderstanding is that ZFS deduplication is free, enabling space savings on ZFS filesystems/zvols/zpools at no cost. In actuality, ZFS deduplication is performed on-the-fly as data is read and written. This can lead to a significant and sometimes unexpectedly high RAM requirement.
Every block of data in a deduplicated filesystem can end up having an entry in a database known as the DDT (DeDupe Table). DDT entries need RAM. It is not uncommon for DDTs to grow to sizes larger than available RAM on zpools that aren't even that large (a couple of TBs). If the hits against the DDT aren't being serviced primarily from RAM (or fast SSD configured as L2ARC), performance quickly drops to abysmal levels. Because enabling or disabling deduplication within ZFS doesn't actually do anything to the data already committed on disk, it is recommended not to enable deduplication without a full understanding of its RAM and caching requirements. It may be difficult to get rid of later, after many terabytes of deduplicated data are already written to disk and the system suddenly needs more RAM and/or cache. Plan cache and RAM needs around how much total deduplicated data is expected.
Note: A general rule of thumb is to provide at least 2 GB of RAM for the DDT per TB of deduplicated data (actual results will vary based on how much duplicate data is present).
Please note that the DDT requires RAM beyond whatever is needed for caching of data, so be sure to take this into account (RAM is very affordable these days, so get more than may be needed to be on the safe side).
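The rule of thumb above lends itself to a quick sizing calculation. The sketch below applies the 2 GB per TB figure and adds it on top of RAM reserved for data caching; the function names and the sample sizes are illustrative assumptions, and real DDT sizes will vary with the data.

```python
def estimate_ddt_ram_gb(dedup_data_tb: float, gb_per_tb: float = 2.0) -> float:
    """RAM to set aside for the DDT, using the rule-of-thumb ratio."""
    return dedup_data_tb * gb_per_tb


def estimate_total_ram_gb(dedup_data_tb: float, cache_ram_gb: float,
                          gb_per_tb: float = 2.0) -> float:
    """DDT RAM is needed in addition to RAM used for caching data."""
    return cache_ram_gb + estimate_ddt_ram_gb(dedup_data_tb, gb_per_tb)


# Hypothetical plan: 10 TB of deduplicated data, 32 GB wanted for data cache.
print(estimate_ddt_ram_gb(10))        # 20.0
print(estimate_total_ram_gb(10, 32))  # 52.0
```

Treat the result as a minimum for planning purposes and size upward from it.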
Extremely Large Destroy Operations - When destroying large filesystems, snapshots and cloned filesystems (e.g., in excess of a terabyte), the data is not immediately deleted - it is scheduled for background deletion processing. The deletion process touches many metadata blocks, and in a heavily deduplicated pool, must also look up and update the DDT to ensure the block reference counts are properly maintained. This results in a significant amount of additional I/O, which can impact the total IOPS available for production workloads.
For best results, schedule large destroy operations for after business hours or on weekends so that deletion processing IOPS will not impact the IOPS available for normal business day operations.
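One way to enforce this in an administrative script is to check the time before starting a large destroy. The sketch below is a minimal example; the business-hours window of 08:00 to 18:00, Monday through Friday, and the function name are assumptions to be adjusted for the actual environment.

```python
from datetime import datetime, time

BUSINESS_START = time(8, 0)   # assumed start of the business day
BUSINESS_END = time(18, 0)    # assumed end of the business day


def is_off_hours(moment: datetime) -> bool:
    """True on weekends, or on weekdays outside the assumed business hours."""
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (BUSINESS_START <= moment.time() < BUSINESS_END)


print(is_off_hours(datetime(2024, 1, 6, 11, 0)))  # Saturday 11:00 -> True
print(is_off_hours(datetime(2024, 1, 8, 11, 0)))  # Monday 11:00 -> False
print(is_off_hours(datetime(2024, 1, 8, 22, 0)))  # Monday 22:00 -> True
```

A script could call is_off_hours(datetime.now()) and defer the destroy operation when it returns False.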