After an upgrade or maintenance on one or more of the nodes in a VSAN cluster, one of the hosts can stop contributing performance stats. This is not a production-down issue, but it should be addressed so that you see up-to-date stats across all the nodes.
The fix for this is one of three things, but each of them involves turning off the performance service on the cluster, which causes all historical performance stats to be removed. My hope is that VMware will fix this issue in an upcoming release, because a loss of historical data is not tolerable in all environments.
1. View the health of the VSAN by logging into the vCenter web client. Navigate to the appropriate vCenter and cluster, then click the “Monitor” tab, followed by “Virtual SAN”, then click on “Health.” Expand “Performance service” and click the warning for “All hosts contributing stats.”
2. At the bottom you will now see the list of hosts that are not contributing stats
3. Now that we’ve identified the problem host, we need to disable VSAN performance service temporarily. Navigate to the “Manage” tab for this cluster then click on “Health and Performance” under “Virtual SAN”
4. Click “Turn off” in the “Performance Service” box
a. Click “OK” to confirm stopping the service which will erase all existing performance data
5. Confirm the Performance Service has been disabled by refreshing the page
6. SSH to the affected host identified in step 2, using PuTTY or a similar SSH client (you may have to enable SSH on the host before you can connect).
7. Run the command below to restart the VSAN management agent. This should have no production impact so it is safe to perform outside of a maintenance window.
a. /etc/init.d/vsanmgmtd restart
8. Once the service has been restarted, go back to the vCenter web client and click the “Edit” button for the Performance Service box
9. Select the appropriate storage policy from the drop down list, ensure the “Turn ON Virtual SAN performance service” box is checked and click “OK”
10. Confirm that the performance service is turned on and reporting healthy
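If you would rather not open an interactive session for steps 6 and 7, the restart can be sent over SSH from a workstation. This is only a minimal sketch: it assumes SSH is already enabled on the host and that you can log in as root. The function name restart_vsanmgmtd and the SSH_CMD override are made up for this example, and the only command that runs on the host is the one from step 7.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): restart the VSAN management
# agent on a single ESXi host over SSH.
# SSH_CMD can be overridden (for example with "echo") to do a dry run.
SSH_CMD="${SSH_CMD:-ssh}"

restart_vsanmgmtd() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: restart_vsanmgmtd <esxi-host>" >&2
        return 2
    fi
    # Same command as step 7, executed remotely as root (assumed login).
    $SSH_CMD "root@${host}" "/etc/init.d/vsanmgmtd restart"
}
```

For example, `restart_vsanmgmtd esx01.example.com` (a placeholder host name) restarts the agent on that host. The performance service still has to be turned off first and turned back on afterwards in the web client, as described in the steps above.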
If this does not fix the issue, you can repeat the process, but this time instead of restarting the vsanmgmtd service on the one node, do it on all of the nodes in the cluster. Once the service has been restarted on every node, turn the performance service back on and all nodes should be contributing stats.
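Restarting the agent on every node by hand gets tedious on larger clusters, so here is a rough sketch of looping over the hosts from a workstation. It assumes SSH is enabled on each host and root login works. The function name restart_vsanmgmtd_all, the SSH_CMD override and the host names are placeholders I made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): restart the VSAN management
# agent on every host passed as an argument.
# SSH_CMD can be overridden (for example with "echo") to do a dry run.
SSH_CMD="${SSH_CMD:-ssh}"

restart_vsanmgmtd_all() {
    failed=0
    for host in "$@"; do
        echo "Restarting vsanmgmtd on ${host}"
        # Same command as step 7, executed remotely as root (assumed login).
        if ! $SSH_CMD "root@${host}" "/etc/init.d/vsanmgmtd restart"; then
            echo "Restart failed on ${host}" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every host restarted cleanly.
    [ "$failed" -eq 0 ]
}
```

A call would look like `restart_vsanmgmtd_all esx01.example.com esx02.example.com esx03.example.com`. A failure on one host does not stop the loop, so you get a complete list of the hosts that need a second look.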
I have also seen a case where restarting the service on all nodes didn’t fix the problem. In that scenario I was able to fix it by putting the problem node into maintenance mode and choosing “full data migration” so all of its data would be moved off to the other hosts in the cluster. After that was complete I rebuilt the host from scratch (including wiping the disks claimed by VSAN) and then moved it back into the cluster. I haven’t heard from VMware of any other ways to fix this issue.