TROUBLESHOOTING EMC IONIX (SMARTS) – PART 2

In the last part of this series, “Troubleshooting EMC Ionix (SMARTS)”, we discussed some of the key tools that help an Enterprise Management administrator determine whether their EMC ITOI (IT Operations Intelligence) infrastructure is healthy. In this part, we will talk about dmdebug and how it can help you efficiently gather statistics about your in-memory domain manager.

A traditional Ionix ITOI infrastructure looks like this (your mileage will vary):

[Figure: Logical ITOI Domains]

 

The red arrows represent what are termed “subscriptions”; that is, one domain manager depends on metadata from the subscribed domain for a certain purpose or function.  For example, SAM subscribes to all of the underlying domain managers on which it depends for meta-topology and notifications (events).  It does so via a TCP-based subscription that is either A) timed (usually every 5 minutes, configurable in the ICS configuration) or B) triggered by an event.

We will discuss how dmdebug can help you identify bottlenecks in the subscription process.

The command

DMDEBUG is part of the dmctl command line utility.  It is executed in this fashion:

<SM_HOME>/smarts/bin/dmctl -s <DOMAIN NAME> exec dmdebug <option1> <option2> etc

You can pass --help, and the syntax for the options is printed in the log file of the <DOMAIN NAME> domain.  For the purpose of this example, we’ll say we are executing the command against domain manager ITOI-SAM.  This example domain is a Service Assurance Manager domain.
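If you script these calls, it can help to assemble the command line in one place. Below is a minimal Python sketch that only mirrors the invocation pattern shown above. The function names and the `/opt/InCharge` install path are my own illustrative assumptions, not part of the product, and the `--help` option is spelled with two ASCII hyphens.

```python
import subprocess


def build_dmdebug_command(sm_home, domain, *options):
    """Return the argument list for a dmdebug call made through dmctl.

    Follows the pattern from this article:
    <SM_HOME>/smarts/bin/dmctl -s <DOMAIN NAME> exec dmdebug <options>
    """
    if not domain:
        raise ValueError("a domain manager name is required")
    return [f"{sm_home}/smarts/bin/dmctl", "-s", domain, "exec", "dmdebug", *options]


def run_dmdebug(sm_home, domain, *options):
    """Run the command. The results go to the domain's log file, not to stdout."""
    subprocess.run(build_dmdebug_command(sm_home, domain, *options), check=True)


if __name__ == "__main__":
    # "/opt/InCharge" is a hypothetical SM_HOME; substitute your own.
    print(" ".join(build_dmdebug_command("/opt/InCharge", "ITOI-SAM", "--help")))
```

Keeping the builder separate from the runner means you can log or review the exact command before it touches a production domain manager.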

Domain Queues

A useful option is to dump queues.  Queue output will help you understand if your domain manager, such as SAM, is experiencing high I/O in relation to its subscriptions to the underlying domains.  This can happen, for example, when your underlying Availability Manager domain has recently performed a discovery, a topology reconfigure has occurred, and SAM is re-syncing its topology base.

To execute this command:

<SM_HOME>/smarts/bin/dmctl -s ITOI-SAM exec dmdebug --queues

This would generate output such as the sample below in the ITOI-SAM.log file (amongst other similar entries):

 

 

The item of interest here is “Current size”.  It seems fairly large (this sample is actually from a client who had thousands of idle active events in memory within SAM), but what matters is not the size itself; it is how quickly the domain manager is able to process the queues.

In this example, the domain manager was in fairly “good” health internally, as it was able to process the queue within a reasonable amount of time (see chart below).  What did affect this customer were the clients consuming events from this particular SAM domain manager: the Global Consoles, i.e. the Java GUI that operations staff use to see the events and root causes identified by ITOI.

Even though the SAM domain manager was able to process data quickly, the Java GUI was not so lucky, so many operators for this client experienced delays and latency in the events they received.

(NOTE:  The example above shows zeros for the history (the Size, Flow and Late “rows”) because this test was a warm “restart” of the SAM domain managers. Because domain managers are memory-based, these statistics are reset on each domain manager restart.)

Note that the timeline is in minutes (baseline through 5 minutes).  As you can see, even though the subscription front end driver queued over 6k items, it worked through them and returned to baseline levels within 2 minutes.
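To put a number on “how quickly”, you can take a “Current size” reading each minute and measure how long the queue needs to get back to baseline. Here is a rough Python sketch of that arithmetic. The sample readings are made up to resemble the case above (a peak a little over 6k, back to baseline two minutes later); they are not the client's actual figures, and the function names are my own.

```python
def minutes_to_baseline(samples, baseline, tolerance=0):
    """Minutes from the peak reading until size is back at baseline.

    samples: list of (minute, current_size) tuples in time order.
    Returns None if the queue never gets back within baseline + tolerance.
    """
    peak_index = max(range(len(samples)), key=lambda i: samples[i][1])
    peak_minute = samples[peak_index][0]
    for minute, size in samples[peak_index:]:
        if size <= baseline + tolerance:
            return minute - peak_minute
    return None


def average_drain_rate(peak_size, baseline, seconds):
    """Average number of entries worked off per second while draining."""
    return (peak_size - baseline) / seconds


# Hypothetical per-minute "Current size" readings (minute, size).
samples = [(0, 150), (1, 6150), (2, 2900), (3, 160), (4, 150), (5, 150)]

recovery = minutes_to_baseline(samples, baseline=150, tolerance=25)
rate = average_drain_rate(6150, 150, recovery * 60)
print(f"back to baseline after {recovery} min, about {rate:.0f} entries/sec")
```

With these invented readings, a backlog of roughly 6,000 entries cleared in two minutes works out to about 50 entries per second. If the same backlog took 20 minutes to clear, that would be a much stronger hint of a bottleneck inside the domain manager.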

EMC defines the parameters from this dmdebug output as follows:

The first line gives you the name of the queue and the number of workers. Note the Subscriber Front Ends like this one never have any workers since they are not used as normal server queues.

The second line gives you the current size, and an *exact* maximum size.

Finally, there is a new value, the total number of entries “processed”. To be exact, this is the total number of entries that have ever been pulled off the queue.

The next three rows provide averages. Each column reports values averaged over a different time interval: The last 3 minutes, 30 minutes, 300 minutes (5 hours), and 3000 minutes (50 hours, or about 2 days).

Each row reports on a different value:

  • Size is the queue size (i.e., it’s an average of what “Current size” reports for this moment)
  • Flow is the number of queue entries processed (i.e., it’s an average of what “Current processed entries” reports).
  • Late is the lateness: The delay between when an entry in a timer queue was scheduled to run and when it actually ran. (The Late row is always all 0’s for a server queue. It may be changed in future releases to report how long elements are staying in a server queue.)
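To make the four averaging columns more concrete, here is a small Python sketch of trailing-window averages over one-minute samples. This is only my illustration of the idea, not EMC's implementation: the class name, the one-sample-per-minute assumption, treating `processed` as entries handled during that minute, and averaging over whatever samples exist when the history is shorter than the window are all assumptions on my part.

```python
from collections import deque

# The four averaging intervals described above, in minutes.
WINDOWS_MINUTES = (3, 30, 300, 3000)


class QueueHistory:
    """Keeps one sample per minute and averages over trailing windows."""

    def __init__(self):
        # Samples older than the longest window are dropped automatically.
        self.samples = deque(maxlen=max(WINDOWS_MINUTES))

    def record(self, size, processed, lateness=0.0):
        """Store one minute's queue size, entries processed, and lateness."""
        self.samples.append((size, processed, lateness))

    def averages(self):
        """Return {"Size": [...], "Flow": [...], "Late": [...]}, one value per window."""
        rows = {"Size": [], "Flow": [], "Late": []}
        recent = list(self.samples)
        for window in WINDOWS_MINUTES:
            chunk = recent[-window:]
            for column, name in enumerate(("Size", "Flow", "Late")):
                total = sum(sample[column] for sample in chunk)
                rows[name].append(total / len(chunk) if chunk else 0.0)
        return rows


history = QueueHistory()
print(history.averages())  # empty history: every value is 0.0
for size, processed in [(6000, 3000), (3000, 3000), (0, 0), (0, 0)]:
    history.record(size, processed)
print(history.averages()["Size"])
```

An empty history gives all zeros, which matches what you see right after a domain manager restart. In this sketch the 3-minute column also reacts to a burst much sooner than the longer windows, which is why comparing columns against each other tells you whether a large queue is a recent spike or a long-running condition.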

So there you go.  This should help you get started doing some “under the hood” analysis of your ITOI domains.

In the next part, we will cover how to track down problematic devices in your IP Availability Manager domain that are causing long discovery times.

Thanks,
Vinnie

Written by

April 29, 2011