Troubleshooting a non-responsive LPAR RMC connection

Recently I learned that one of the LPARs on the L570 had stopped communicating with the HMC via its Resource Monitoring and Control (RMC) daemon. This meant you could still control the LPAR, open terminal windows, and so on, but you could not dynamically add or remove resources on that LPAR. This, of course, led me down a path of troubleshooting. In the end, I found two URLs that helped me diagnose and fix the issue.

The HMC indicated there might be an issue with the RSCT subsystems, so that was the first thing to check. On the LPAR I ran:

lssrc -a | grep rsc
 ctrmc            rsct       151850   active
 IBM.ServiceRM    rsct_rm    435004   active
 IBM.ERRM         rsct_rm    435342   active
 IBM.CSMAgentRM   rsct_rm    438966   active
 IBM.AuditRM      rsct_rm    432716   active
 IBM.HostRM       rsct_rm    160048   active
 IBM.DRM          rsct_rm    430636   active
 ctcas            rsct                inoperative

You may have more of the listed subsystems in an “inoperative” state. I started several of the groups during my troubleshooting and haven’t put things back to how they were yet. The important one to note here is ctrmc. That daemon should be running, potentially along with IBM.HostRM and IBM.DRM. Restarting the daemons didn’t fix my problem, so I had to move on to something else.
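
To pick out the stopped subsystems without reading the whole listing, the lssrc output can be filtered on its last column. This is only a minimal sketch: it assumes the status is always the final field, as in the listing above, and the function name list_inoperative is made up for this example (it is not part of RSCT).

```shell
# Hypothetical helper (not an RSCT tool): read lssrc output on stdin and
# print the name of every subsystem whose last column is "inoperative".
list_inoperative() {
    awk '$NF == "inoperative" { print $1 }'
}

# Typical use on the LPAR, where the real lssrc command is available:
#   lssrc -a | grep rsc | list_inoperative
```

An inoperative subsystem is not automatically a fault. What matters for DLPAR is mainly whether ctrmc itself shows up in this list.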

The command /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc can be run on the LPAR and also from the IVM/HMC (as root). In my case it showed a piece of valuable information in the first field:

/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
Management Domain Status: Management Control Points
  O A 0x4fd90ac021cb301e 0001 192.168.15.112

The first field indicated a problem:

O

Indicates that the RMC connection is “Down”, as determined by the RMC heartbeat mechanism. The partition is either not active or it may also indicate that the RMC daemon on the specified node is not connecting properly. Ensure that the partition can communicate with the HMC or IVM server on the public network and that port 657 is not being blocked on the public network firewall.

Thanks to a couple of URLs I found, I picked up some other useful commands, and ultimately the command that fixed the issue. The first two commands that came in handy were:
* /usr/sbin/rsct/bin/ctsvhbal # (List the current identity or identities for the HMC and LPAR)
* /usr/sbin/rsct/bin/ctsthl -l # (This switch lists the trusted host list)

The two commands above confirmed that everything was set properly. Everything looked right, but it just didn’t function. This last command is what fixed it: it put the subsystems back in properly, and the LPAR started to communicate again.
* /usr/sbin/rsct/install/bin/recfgct
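
The whole sequence can be wrapped in a small script. The sketch below only strings together the commands already mentioned in this post. The DRY_RUN variable and the run and reset_rmc function names are invented for the example. As far as I understand it, recfgct rebuilds the RSCT configuration from scratch, so it is worth doing a dry run first and thinking twice before running it on a node that is part of a cluster.

```shell
# Paths taken from the commands used in this post.
RSCT_BIN=/usr/sbin/rsct/bin
RSCT_INSTALL=/usr/sbin/rsct/install/bin

# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper that repeats the steps from this post in order.
reset_rmc() {
    run "$RSCT_BIN/ctsvhbal"                   # list the current identities
    run "$RSCT_BIN/ctsthl" -l                  # list the trusted host list
    run "$RSCT_INSTALL/recfgct"                # reconfigure the RSCT subsystems
    run "$RSCT_BIN/rmcdomainstatus" -s ctrmc   # re-check the connection state
}

# Preview first, then run for real as root on the LPAR:
#   DRY_RUN=1; reset_rmc
#   DRY_RUN=0; reset_rmc
```

The connection may take a little while to come back after recfgct, so the final status check may need to be repeated.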

The problem is now resolved, as can be seen by running /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc again. The first field now displays I.

I
Indicates that the partition is “Up” as determined by the RMC heartbeat mechanism (i.e. an active RMC connection exists).
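
The two status codes seen in this post (O and I) can also be checked from a script. This is a rough sketch that assumes the output layout shown earlier, where each management control point line has a hex node ID in the third field and the address in the last. The function name rmc_link_state is made up for this example, and any other code in the first field is simply reported as unknown.

```shell
# Hypothetical helper (not an RSCT tool): read rmcdomainstatus output on
# stdin and print "<address> <state>" for each management control point.
# Only the two first-field codes described in this post are interpreted.
rmc_link_state() {
    awk '$3 ~ /^0x/ {
        state = "unknown"
        if ($1 == "I") state = "up"
        if ($1 == "O") state = "down"
        print $NF, state
    }'
}

# Typical use on the LPAR:
#   /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc | rmc_link_state
```

Fed the output captured before the fix, this prints the HMC address followed by "down".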

Of course, the real test was done on the HMC. I clicked on the LPAR and chose Dynamic Logical Partitioning, Memory Resources, Add, then told the system to add 1GB of RAM to the running system. It took a bit of time to allocate, but when it was done I ssh’d into the LPAR and confirmed that the additional 1GB of RAM had been added.

As promised, here are the two URLs that helped me diagnose and fix this issue:

* http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/ibmhostrm-was-inoperative-and-now-missing-from-the-list-1514480
* http://www.aixmind.com/?p=489

Category: AIX