Implementing Round-Robin on disk subsystems for AIX

In our current environment, LUNs are allocated from the SAN to multiple AIX systems. On the AIX blades we use MPIO rather than PowerPath to provide multiple paths to the disks for redundancy. By default, AIX does not apply the MPIO settings we need, either to the adapters or to disks as they are added, so several changes have to be made. The required MPIO settings are listed below.

For the fibre adapter, the default settings (on the fscsiX child devices) are dyntrk=no and fc_err_recov=delayed_fail. To change these on an adapter, its paths have to be removed before the child device can be changed. The safest approach is to have the change take effect at the next reboot; changing it live can cause I/O errors and risks data loss. To stage the change for the next reboot, run:

lsdev | grep fscsi | awk '{print "chdev -l " $1 " -a dyntrk=yes -a fc_err_recov=fast_fail -P"}' | sh
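
Because the one-liner above pipes straight into sh, it can help to look at the generated commands first. The following is a minimal sketch of that idea, not part of the original procedure: the function name fscsi_chdev_cmds and the sample lsdev lines are made up for illustration, and the function only prints commands, it never runs them.

```shell
# Print the chdev commands for every fscsiX device found on stdin.
# Input is expected to look like `lsdev` output (device name in column 1).
fscsi_chdev_cmds() {
    awk '$1 ~ /^fscsi[0-9]+$/ {
        print "chdev -l " $1 " -a dyntrk=yes -a fc_err_recov=fast_fail -P"
    }'
}

# Hypothetical `lsdev` excerpt, used here only so the sketch runs anywhere.
sample_lsdev='fcs0   Available 05-00    FC Adapter
fscsi0 Available 05-00-01 FC SCSI I/O Controller Protocol Device
fscsi1 Available 05-01-01 FC SCSI I/O Controller Protocol Device
hdisk2 Available 05-00-01 MPIO FC Disk'

printf '%s\n' "$sample_lsdev" | fscsi_chdev_cmds
```

On a live system the equivalent would be lsdev | fscsi_chdev_cmds, adding | sh only once the printed commands look right.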

Alternatively, you can bring the paths down for an adapter at a time, then make the changes, followed by re-adding the paths. This has the added benefit of not requiring a reboot (however, later the round_robin settings do require a reboot).

To do this, use this one-liner:
lspath | sort | uniq | grep fscsi0 | awk '{print "rmpath -dl " $2 " -p " $3}' | sh

Make your change:
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail

Then re-scan the bus to bring the paths back in with cfgmgr.

Repeat for each of the adapters, ensuring that you have more than one active path (before removing paths).
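
The "more than one active path" check can be scripted before any paths are removed. This is a rough sketch under assumptions: it expects the default three-column lspath output (status, name, parent), the function name disks_at_risk is invented here, and the sample lines are hypothetical. It lists every disk that would have no Enabled path left once the named adapter's paths are gone.

```shell
# disks_at_risk ADAPTER
# Reads `lspath` output (status name parent) on stdin and prints each disk
# that has no Enabled path through any adapter other than ADAPTER.
disks_at_risk() {
    awk -v adapter="$1" '
        { seen[$2] = 1 }
        $1 == "Enabled" && $3 != adapter { other[$2] = 1 }
        END {
            for (d in seen)
                if (!(d in other)) print d
        }
    ' | sort
}

# Hypothetical `lspath` output, for illustration only.
sample_lspath='Enabled hdisk2 fscsi0
Enabled hdisk2 fscsi1
Enabled hdisk3 fscsi0
Failed  hdisk3 fscsi1'

# hdisk3 only has a working path through fscsi0, so it is reported here.
printf '%s\n' "$sample_lspath" | disks_at_risk fscsi0
```

On a live system this would be run as lspath | disks_at_risk fscsi0; an empty result suggests it is safe to proceed with that adapter.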

Explanation of the options:
dyntrk = Dynamic Tracking. Detects any change in the N_Port ID (for example after a link failure) and reroutes the traffic to the new path.
fc_err_recov = With fast_fail, further I/O requests on the FC adapter are stopped in the event of a link failure, so that in a multipath environment the I/O can move to another path without loss.

The next set of changes deals with the hdisks themselves. The two important MPIO settings to set here (for every hdisk device) are:
hcheck_interval and reserve_policy.

hcheck_interval = This setting determines how often the OS checks whether a path has recovered from a failure. The value should be greater than zero (it defaults to 0), typically 30 or 60. This setting should be applied on the VIOS as well as on any LPARs.
reserve_policy = no_reserve. This determines the "lock" on a device and is very important in the case of dual VIOSes. It is required to prevent a disk from being reserved for a specific path, which would defeat failover and round robin. PR_shared applies to HBAs that support the SCSI-3 persistent reserve standard; the blade environment does not currently support this. With the default of single_path, only one SCSI initiator can access the device; it locks the device and stops alternate initiators from accessing the LUN.

To implement round robin on an AIX system there is one caveat. By default, all of the paths to your disks have a path priority of 1. This means that if a path fails, I/O fails over to the next path seen, in the order the paths were initially discovered. There seems to be a bug in how path priority interacts with round robin: if you DO NOT change the default path priority, the server can take upwards of 40 minutes to boot after a reboot. To get around this, you have to set the path priorities.

Hence, the first thing you do is change your hdisk devices. You can do that like so:

lspv | grep hdisk | grep -v hdiskpower | awk '{print "chdev -l " $1 " -a algorithm=round_robin -a hcheck_interval=60 -a reserve_policy=no_reserve -P"}' | sh
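
To confirm that a disk carries the three attributes set above, the output of lsattr -El for that disk can be compared against the expected values. The sketch below assumes the usual lsattr -El column layout (attribute, value, description, user-settable); the function name check_mpio_attrs and the sample output are hypothetical and only there so the example can be tried off-box.

```shell
# Reads `lsattr -El hdiskN` style output on stdin and reports any of the
# three MPIO attributes that is missing or differs from the value set above.
# Exit status is 0 when all three match, 1 otherwise.
check_mpio_attrs() {
    awk '
        BEGIN {
            want["algorithm"]       = "round_robin"
            want["hcheck_interval"] = "60"
            want["reserve_policy"]  = "no_reserve"
        }
        ($1 in want) {
            found[$1] = 1
            if ($2 != want[$1]) {
                print $1 "=" $2 " (expected " want[$1] ")"
                bad = 1
            }
        }
        END {
            for (a in want)
                if (!(a in found)) {
                    print a " not found"
                    bad = 1
                }
            exit bad + 0
        }
    '
}

# Hypothetical `lsattr -El hdisk2` excerpt, for illustration only.
sample_lsattr='algorithm       fail_over  Algorithm             True
hcheck_interval 60         Health Check Interval True
reserve_policy  no_reserve Reserve Policy        True'

if printf '%s\n' "$sample_lsattr" | check_mpio_attrs; then
    echo "all MPIO attributes match"
else
    echo "one or more attributes need attention"
fi
```

On a live system the input would come from lsattr -El hdisk2 (or a loop over the disks returned by lspv).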

The next item to update is the path priorities on the hdisk devices. In this instance there are 4 fibre ports (fcs0-3): fcs0 and fcs1 are the 4Gb adapters (preferred), and fcs2 and fcs3 are the 8Gb adapters. So we'll set the path priorities to prefer the 4Gb adapters, followed by the 8Gb adapters. These one-liners should take care of that.

lspath -F"status name path_id parent connection" | grep '.*fscsi0' | awk '{print "echo "$2" / "$4" / "$5" && chpath -l "$2" -p "$4" -w \""$5"\" -a priority=3"}' | sh
lspath -F"status name path_id parent connection" | grep '.*fscsi1' | awk '{print "echo "$2" / "$4" / "$5" && chpath -l "$2" -p "$4" -w \""$5"\" -a priority=5"}' | sh
lspath -F"status name path_id parent connection" | grep '.*fscsi2' | awk '{print "echo "$2" / "$4" / "$5" && chpath -l "$2" -p "$4" -w \""$5"\" -a priority=10"}' | sh
lspath -F"status name path_id parent connection" | grep '.*fscsi3' | awk '{print "echo "$2" / "$4" / "$5" && chpath -l "$2" -p "$4" -w \""$5"\" -a priority=12"}' | sh

The path priorities range from 1 to 255, with 1 being the highest priority and 255 the lowest.
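
The per-adapter priority assignment can also be driven from a single adapter-to-priority map. The following is only a sketch of that approach: the function name chpath_cmds is invented, the sample lspath lines (including the connection strings) are hypothetical, and the function prints the chpath commands for review rather than running them.

```shell
# chpath_cmds "parent=priority ..."
# Reads `lspath -F"status name path_id parent connection"` output on stdin
# and prints one chpath command per path whose parent appears in the map.
chpath_cmds() {
    awk -v map="$1" '
        BEGIN {
            n = split(map, pairs, " ")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                prio[kv[1]] = kv[2]
            }
        }
        ($4 in prio) {
            print "chpath -l " $2 " -p " $4 " -w \"" $5 "\" -a priority=" prio[$4]
        }
    '
}

# Hypothetical lspath output, for illustration only.
sample_paths='Enabled hdisk2 0 fscsi0 5005076801401234,0
Enabled hdisk2 1 fscsi1 5005076801401235,0
Enabled hdisk2 2 fscsi2 5005076801401236,0
Enabled hdisk2 3 fscsi3 5005076801401237,0'

printf '%s\n' "$sample_paths" | chpath_cmds "fscsi0=3 fscsi1=5 fscsi2=10 fscsi3=12"
```

Keeping the priorities in one string makes it easier to see the intended ordering at a glance and to reuse the same function on systems with a different adapter layout.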

Now reboot, and the hdisk and MPIO settings will be active. Once the system is back online, you can run nmon to confirm data is flowing across all of the fibre adapters.
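
Alongside nmon, a quick count of Enabled paths per adapter can show whether every adapter came back with its paths after the reboot. This is a small sketch assuming the default three-column lspath output; the function name enabled_paths_per_adapter and the sample data are hypothetical.

```shell
# Reads `lspath` output (status name parent) on stdin and prints the number
# of Enabled paths behind each parent adapter, sorted by adapter name.
enabled_paths_per_adapter() {
    awk '$1 == "Enabled" { n[$3]++ }
         END { for (a in n) print a, n[a] }' | sort
}

# Hypothetical `lspath` output, for illustration only.
sample_after_reboot='Enabled hdisk2 fscsi0
Enabled hdisk2 fscsi1
Enabled hdisk3 fscsi0
Enabled hdisk3 fscsi1'

printf '%s\n' "$sample_after_reboot" | enabled_paths_per_adapter
```

An adapter that is missing from the output, or has a lower count than the others, would be the first place to look.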
