Problem

You can use this guide as a reference to help determine where performance bottlenecks originate.

Finding and diagnosing performance bottlenecks with SoftNAS involves examining a few components.

Network, Load, and IO Wait

CPU load is an indicator of how many processes are waiting in the CPU run queue. As load increases, time to execution increases, which is what makes things feel 'slow'.

As a general rule, I start looking at CPU load and how that compares to IO wait and the number of processes running.

IO wait is a condition where the CPU is waiting on acknowledgement of an IO operation. In the SoftNAS context, this will be a disk operation (a read or write to object or blob storage) or a network operation such as a get, put, send, or recv.

In the public cloud, everything rides on the network. A disk attached to a SoftNAS instance is actually served over the network from a SAN in the provider's data center, so even throughput to disk is affected by the network on certain instance types. That same network connection has to serve clients, serve the object puts and gets, and serve the reads and writes to the block devices.

Network IO will always be limited by the slowest device in the chain. If the client accessing the shared storage on SoftNAS is an instance type with only moderate or burstable network performance, then you should expect only moderate or burstable throughput from that instance when it accesses the SoftNAS shares.

You can verify network throughput by looking at 'Sum' values over time in the cloud provider portal (CloudWatch for AWS, Metrics for Azure) and checking whether the throughput is what you expect for the instance type.
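If you prefer the CLI to the portal, the same numbers can be pulled with the provider tooling. Below is a minimal sketch using the AWS CLI; the instance ID and time window are placeholders, and Azure has equivalent metric queries.

  # Bytes sent, summed per 5-minute period (repeat with NetworkIn and add the two)
  aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 --metric-name NetworkOut \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --statistics Sum --period 300 \
      --start-time 2019-06-01T00:00:00Z --end-time 2019-06-01T01:00:00Z
  # Compare the Sums against the expected rating for the instance type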

You can expect somewhere around 50Mb/s on instances rated with 'low' network performance.

You can expect somewhere between 150Mb/s and 300Mb/s of burst on instances with 'moderate' network performance, depending on the instance class (c4, m4, r4, etc.).

You can expect somewhere around 1Gb/s on instances with 'high' network performance.

Instances rated at 'up to 10Gb' can burst up to 10Gb at times, but this is not dedicated 10Gb throughput; expect a baseline around 1200Mb/s.

Instances rated at a flat 10Gb network are SLA'd at that level by the cloud provider.

Refer to the example below using an m4.large, which has 'moderate' network throughput:

  Network in + out = ~5,000,000,000 bytes (5 GB) over 5 minutes (roughly 1 GB per minute, or about 133 Mbit/s)

At the same time on the SoftNAS, in the SAR logs, I see a high number of TCP sockets in time-wait (tcp-tw, the last column of the socket report).
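If you want to watch the same counters live instead of digging through the SAR history, sysstat can report per-interface throughput and socket states directly; a quick sketch (interval and sample count are arbitrary):

  # Per-interface throughput (rxkB/s, txkB/s), 5 samples at 1-second intervals
  sar -n DEV 1 5

  # Socket summary, including TCP sockets in time-wait (tcp-tw column)
  sar -n SOCK 1 5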

Conclusion regarding network bottleneck:

Averaging roughly 133 Mbit/s, this m4.large is pushing right up against its 'moderate' network rating, and the high time-wait count in SAR shows the network stack churning to keep up, so the network is the likely bottleneck. The options are a larger instance type with a better network rating or reducing the load coming from the clients.

Disk, IO Wait, and Storage Type

In the same way that congested networks can cause IO wait, exhausted disk devices will cause the same condition.

Each cloud provider offers different storage options with different characteristics.

In general, the slower the medium (object, blob, HDD) the less expensive it is.

Your storage type determines how fast data can be written and read by SoftNAS, and it ultimately sets your read/write performance.

Block Storage Devices

Block storage refers to SSDs, HDDs, and most managed disk types.

Block storage is typically allocated with an SLA'd throughput for the premium types (e.g. 250 MB/s) or a guaranteed number of IOPS (e.g. 3 IOPS per GB).

To diagnose issues with block device throughput we can use 'iostat' along with metrics provided by the cloud platform.

The telemetry data that we will look for is %iowait and %util for any block devices.

'iostat -x 1' will show you block device statistics for all devices, updated every 1 second

If you refer to the output below, you can see that there is a high %iowait along with high w_await and %util for my device labeled 'nvme1n1'.
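To reproduce that view for just the suspect device, you can point iostat at it directly; a small sketch (the device name matches this example, adjust it to your own):

  # Extended stats for a single device, refreshed every 2 seconds
  iostat -xd nvme1n1 2
  # Watch r/s and w/s (IOPS), await/w_await (latency in ms), and %util (device saturation)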


I know that the disk is 50GB, so I should get around 150 IOPS (3 IOPS per GB).

The graph below from the cloud provider shows that I pushed roughly 900 I/O operations per sample, averaged over a 5-minute window.

Conclusion regarding block device throughput:

Divide that ~900 operation average by the length of the sample interval and it works out to roughly 180 IOPS, which is about where this disk should max out.

Based on the %util from the iostat output, and with confirmation from the cloud provider metrics, I can confirm that my bottleneck is the disk that my pool is deployed on: I am exceeding the allowed IOPS for that block device.
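If you pull the numbers from CloudWatch rather than eyeballing the graph, keep in mind that the EBS operation metrics are reported as a Sum per period, so divide by the period in seconds to get IOPS. A sketch, with a placeholder volume ID and time window:

  # Write operations summed per 5-minute period for one EBS volume
  aws cloudwatch get-metric-statistics \
      --namespace AWS/EBS --metric-name VolumeWriteOps \
      --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
      --statistics Sum --period 300 \
      --start-time 2019-06-01T00:00:00Z --end-time 2019-06-01T01:00:00Z
  # IOPS for a datapoint = Sum / 300; compare against the provisioned
  # baseline (e.g. 50 GB x 3 IOPS per GB = 150 IOPS)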

Object Storage Devices

Object storage devices refer to S3, Azure Blob, and OpenStack Swift containers being used as disk devices.

Reading and writing are handled via HTTP GET requests (for reads) and HTTP PUT requests (for writes).

The public cloud platforms each limit the throughput of a single blob/object to around 500 requests per second or just under 60MB/sec, depending on the platform.

The best way to diagnose any performance issues related to object storage devices is to refer to the SoftNAS logs.

Look for 'Delaying query', 'EC', and 'Error' messages from the s3backer process.
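A quick grep is usually enough to spot these. The sketch below assumes s3backer is logging through syslog to /var/log/messages, which may differ between SoftNAS versions; adjust the path as needed.

  # Look for throttle/back-off and error messages from s3backer
  grep -iE 's3backer.*(delaying query|error)' /var/log/messages | tail -n 20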


Conclusion regarding object device throughput:

Messages like these mean that we are overrunning the allowed throughput to the object backend and the platform is telling us to back off on the requests.

In this case you either need to back the load off from your clients if you want to continue using object disks, or move the SoftNAS pools to faster block storage devices with more consistent throughput characteristics.

There is nothing we can do as these throttles are imposed by the platform.


RAM and SoftNAS Performance

Usable system memory is another key performance factor in how SoftNAS operates.

The SoftNAS kernel is tuned for file services so it does not behave as you would expect a standard virtual machine or file server to behave.

The appliance will always consume whatever physical RAM it has available for file system caching and ZFS metadata operations. That's a good thing, because serving from RAM and L2ARC is faster than serving files from disk.

The general behavior is that it will consume most of the RAM it can for file operations; however, the dynamic back pressure provided by ZFS should never allow it to swap.
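To see how much of that RAM the ZFS ARC is actually holding at any moment, you can read the kernel statistics directly. A minimal sketch for ZFS on Linux:

  # Current ARC size and its configured ceiling, in bytes
  awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats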

How much RAM it can give back depends on the number of open file handles, exported mount points, number of connected clients, and the file sharing protocol (NFS or CIFS).

The only way to reclaim any of the RAM that ZFS is currently allocating would be to export and then re-import the zpools (zpool export pool_name, zpool import pool_name).

Exporting and importing a pool would obviously cause an outage for that file system, so it's not a recommended practice. You will likely hit the same memory threshold again as files are accessed and cached.

If the instance that you are running SoftNAS on is swapping, then that is an indication that you need to scale up RAM on the instance type to handle the current number of file operations.

You can use the command 'free -m' to quickly see RAM allocation and check whether the system is swapping (below):

  free -m

               total       used       free     shared    buffers     cached
  Mem:         32149      28322       3826          0         55        348
  -/+ buffers/cache:      27918       4230
  Swap:         4095       1237       2858

Conclusions regarding RAM resources:

From the example above I can see that the system is swapping and that I need to scale it beyond 32GB of physical RAM if I want to operate at the current load.
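Keep in mind that swap space merely being allocated is not the same as actively paging; vmstat shows whether the system is swapping right now:

  # 5 one-second samples; sustained non-zero si (swap-in) / so (swap-out) means active swapping
  vmstat 1 5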

The more RAM you add to a system, the more you can serve from cache, provided the other dependent resources allow it.

Protocols like Samba/CIFS consume more RAM because they have to store extended file attributes (xattrs), for example when integrated into Active Directory.

Enabling deduplication on SoftNAS will consume more RAM, as it has to maintain deduplication tables.

If you use object-backed pools (Azure Blob, S3, or Swift) you will consume more RAM, because blocks are held in cache while they are being chunked and uploaded or downloaded.


CPU and SoftNAS Performance

CPU load is a measure of how many processes are sitting in the run queue waiting for the processor to execute them.

A true CPU load issue happens when there is not enough total CPU time available to handle the number of concurrent requests coming into the system.

CPUs work through the run queue in roughly FIFO order (first in, first out), so let's also keep that in mind.

Typically two things will be affecting your CPU load on SoftNAS:

Exhausted CPU:

Check the processes in 'top': compare the number of tasks to %wa (IO wait), %us (user), and %sy (system).

Refer to the below example showing the output of 'top' regarding number of 'Tasks'.

In this example I have 1001 simultaneous write threads going to 1001 different directories at one time and my SoftNAS only has 2 cores.

The load average is above 517 and that is because there are a total of 2279 tasks running at once.

It's difficult for 2 CPUs to handle that many long-running processes, so the run queue is backed up 517 deep.

In that example my IO wait is low (1.2%wa) and all the CPU time is going to system (58.8%sy) and user (8.1%us).

This was a classic example of just needing more CPU to handle the number of requests.
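A quick sanity check in cases like this is to compare the load average against the number of cores the instance actually has:

  # 1/5/15-minute load averages vs. available cores
  uptime
  nproc
  # Sustained load far above the core count means requests are queuing for CPU time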

Waiting CPU:

A CPU waiting on IO requests to/from the network or to/from disk will also cause a CPU load condition.

Refer to sections one and two of this doc (Network and Disk) regarding how to diagnose which component is causing the condition.


In the example above you can see that almost all of my CPU time is going to wait (44.6 %wa) and idle (46.1 %id), yet my performance is slow.

There are only 278 total tasks running, so a load average of 1.05 on 2 CPUs, with the CPUs mostly idle, points to processes stuck waiting on IO rather than a shortage of CPU.

In this example I need to go look at disk throughput and network throughput metrics to understand this CPU wait condition.

Conclusion regarding CPU bottlenecks:

The two frequent conditions behind CPU bottlenecks are going to be the number of processes, or the CPU waiting on an underlying IO subsystem.

The 'number of processes' problem grows as customers add more volumes to the replication cycle. Once a deployment reaches a high volume count, the CPU becomes consumed with long-running replication tasks, roughly 5 tasks per volume.

Enabling the ZFS compression feature will consume more CPU time.

Enabling the volume encryption feature will consume more CPU time.

Enabling in-flight encryption for replication will consume more CPU time (and also affect network performance). 

Client Type, Protocol, and SoftNAS Performance

Different network clients and operating systems will behave differently when accessing the same shared volume on SoftNAS. Sometimes mount options on the client side can help; sometimes they cannot, and the behavior is simply a characteristic of that client type. Choosing the right protocol for the right client also matters: you would probably not want to use NFS for Windows clients, and you probably would not want to use SMB/CIFS for Linux clients. Use jumbo frames whenever you can.

NFS
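As an illustration only, a Linux NFS client mount with explicit options, plus jumbo frames enabled on the client NIC, might look like the sketch below. The hostname, export path, and interface name are placeholders, and the right options depend on your workload.

  # Example NFS mount with larger read/write sizes and relaxed atime updates
  mount -t nfs -o rw,hard,rsize=1048576,wsize=1048576,noatime \
      softnas.example.com:/pool0/vol1 /mnt/vol1

  # Jumbo frames on the client interface (the network path must support MTU 9000 end to end)
  ip link set dev eth0 mtu 9000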

CIFS/SMB (and Active Directory)

If we are exporting shares as CIFS/SMB, we can assume they are most likely for Windows clients. We should also take into consideration whether these shares are integrated into Active Directory: fetching extended attributes and checking permissions against an Active Directory server adds to the transaction time.


Throttling

The public cloud pricing model is built around throttles. As you pay more money, you get devices and compute resources with higher limits and fewer throttles.

As discussed in the previous topics, you will hit a maximum throughput number set by the cloud provider for that device or compute instance.

These are the platform throttles that you have to keep in mind; they can be verified by referring to the cloud provider's SLA.

There are also artificial throttles that can be imposed at the software level on SoftNAS. (see below)

Some throttles may be unintentional. (see below)

Conclusion regarding throttling:

When it comes to throttling it's important to be aware of any limits imposed by the cloud provider, which are beyond our control, and to be mindful of any software throttles that we may have imposed intentionally or unintentionally.

Housekeeping

There are a few housekeeping issues that can also affect SoftNAS performance. The three that you commonly come across are issues related to disk space, logging, and volume fragmentation. 

Available disk space on the root drive:

Currently the root volume of SoftNAS has only a 30GB file system. If regular housekeeping on the root volume is not kept up, there is potential for it to become full. When the root volume becomes 100% utilized, processes fail and the UI stops responding because it can no longer log authentication attempts. The four main places you may have to truncate at times are:

To tell how much space the subdirectories in your root volume are consuming you can run the below command at the CLI:

  du -xh / 2>/dev/null | sort -h -r | head -n 15


Logging:

Logging can also affect performance of the appliance.


Available space in the ZFS pool:

The amount of free space affects how ZFS writes data to the pool devices. As a general rule of thumb, keep pool utilization below roughly 80%; as the pool fills past that point, ZFS has to work harder to find free blocks and write performance drops off noticeably.


Fragmentation in the ZFS pool:

The output of 'zpool list' shows a %FRAG value. FRAG is an abstract measure of how fragmented the free space in the pool is; it tells you nothing about how fragmented (or not) your existing data is, and thus how many seeks it will take to read that data back (see below). More fragmented free space does mean that new writes will be scattered into smaller gaps, driving more random reads and writes going forward, which has a big impact on how quickly the file system can perform.
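A quick way to pull just the capacity and fragmentation columns for every pool:

  # Pool size, allocation, free space, capacity used, and free-space fragmentation
  zpool list -o name,size,alloc,free,cap,frag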



The example above shows that all the remaining space in my pool is located in contiguous sectors with no fragmentation. It should be this way, because this is a newly created pool and I have not deleted any data from it or freed any sectors to be re-written by ZFS.

Conclusion regarding housekeeping:

Housekeeping issues are not always the most obvious when troubleshooting performance-related problems, but as you can tell from the above, they can have a huge impact on the performance of the appliance. We need to be mindful of available disk space in the storage pools and on the root volume, watch for fragmentation on the devices, and be careful about how we handle logging and log rotation, as all of these can impact performance.