ArsDigita Mail Transport Agent Monitor

By Branimir Dolicki, part of Philip Greenspun's ArsDigita Server Architecture

Why ArsDigita MTA Monitor?

To learn why the MTA Monitor is good for you, read the section on E-mail server monitoring in ArsDigita Server Architecture.

What it does

The ArsDigita MTA Monitor is designed for monitoring a group of mail transport agents administered by one or more administrators. The basic operation of this software is as follows: every five minutes it tries to connect to the SMTP port of every machine in turn and checks whether it can talk to the MTA program. Every fifteen minutes it also sends a little email (an emailet) to each server and waits for a response. If the response doesn't come within one minute, it sends a notification via email to the administrator(s). A little Perl script that responds to incoming emailets has to be installed on each monitored server.

If multiple incidents occur in a short period of time, no more than one notification is sent every fifteen minutes. The ArsDigita MTA Monitor tries to lump as many events as possible (even if they involve different servers) into a single notification. Parameters such as the frequency of SMTP probing, the frequency of emailets, and how long the monitor waits for a response before raising an alarm can be changed in an initialization file. In addition to notifications, the MTA Monitor sends one email per day summarizing performance.
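The SMTP port check described above can be illustrated with a short Python sketch. This is not the monitor's own code: the function names, the default timeout, and the rule "a reply starting with 220 counts as healthy" are assumptions made for illustration, based only on the fact that SMTP servers greet new connections with a 220 reply.

```python
import socket


def greeting_ok(data):
    """Return True if the bytes look like an SMTP 220 service-ready greeting."""
    return data.startswith(b"220")


def probe_smtp(host, port=25, timeout=10.0):
    """Connect to the SMTP port and check the greeting.

    Hypothetical helper: returns False on any connection error or timeout,
    which the monitor would then record as an event.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            greeting = conn.recv(512)
    except OSError:
        return False
    return greeting_ok(greeting)
```

A caller would invoke probe_smtp once per monitored host on every wake-up and record a failure whenever it returns False.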

Requirements

How does it work?

The ArsDigita MTA Monitor wakes up every five minutes (configurable in the configuration file mmon.ini) and monitors both SMTP port response and MTA throughput. In this case, throughput does not mean bandwidth or mail capacity, but rather that the remote system is accepting mail and delivering it expeditiously. As it is usually not necessary to monitor MTA throughput as often as SMTP port response, the MTA Monitor allows you to specify that MTA throughput should be measured, for example, only every third time. This is controlled by the run_period and run_group columns in the table mmon_servers (see the data model). The run_period column tells the MTA Monitor how often to measure MTA throughput, counted in wake-ups. If you want to measure MTA throughput every fifteen minutes with the default five-minute wake-up, you would set run_period to 3.

You may want to test the throughput of different MTAs at different times. That's what the run_group column is for. Suppose you have three MTAs: A, B, and C. You want to test SMTP port response every five minutes and MTA throughput every fifteen minutes, but you want to distribute the throughput testing so that if A is tested at 08:00, B is tested at 08:05 and C at 08:10. You could accomplish this by setting the run_group of A to 0, of B to 1, and of C to 2.
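The interplay of run_period and run_group can be sketched in a few lines of Python. This is only an illustration under an assumption: that the monitor counts its five-minute runs and tests a server's throughput when the run counter modulo run_period equals run_group. The text above does not spell out the actual rule, and the function name and run counter are hypothetical.

```python
def throughput_due(run_number, run_period, run_group):
    """Return True if MTA throughput should be tested on this run.

    Assumption (not confirmed by the MTA Monitor source): a server is
    tested when the run counter modulo run_period equals its run_group.
    """
    return run_number % run_period == run_group


# Three MTAs, each tested every third run but staggered by run_group.
servers = {"A": 0, "B": 1, "C": 2}   # name -> run_group
run_period = 3

# Runs 0, 1, 2 correspond to 08:00, 08:05, 08:10 with a five-minute wake-up.
schedule = {
    run: [name for name, group in servers.items()
          if throughput_due(run, run_period, group)]
    for run in range(6)
}
print(schedule)
```

Under this assumed rule, A is tested on runs 0 and 3, B on runs 1 and 4, and C on runs 2 and 5, which reproduces the staggered 08:00, 08:05, 08:10 pattern from the example.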

Administrators of smaller sites with only a handful of MTAs don't need to bother with run_groups. In fact, there is an advantage to putting all the servers into a single group: if all the monitored servers seem to have problems with MTA throughput, the problem is probably not with them but with the local MTA - the one that is supposed to spawn the local Receiver script. Or maybe the whole network crashed immediately after all the emailets were sent and before any of them returned. On the other hand, on a big site it might be useful to group together all the servers that are on a single Ethernet segment, or those that use a single host as their outgoing mail relay.

There are five basic components of the ArsDigita MTA Monitor:

Sender
The Sender wakes up every five minutes and tries to connect to every monitored server in turn. If it can't connect to a server or there is some other error, it records that event in the database. Every fifteen minutes it sends a little email to every monitored mail server. We call these little emails emailets to distinguish them from "real" emails that we send to administrators notifying them about the state of all the MTAs. Every emailet has a unique ID which is contained in its Subject: line.
Bouncer
This component sits on the monitored server. Its task is very simple. Whenever the Bouncer receives an email from anywhere, it just replies to its sender, prepending the usual Re: to the Subject: line. I have implemented it as a Perl script.
Receiver
On the Monitor side, the Receiver's task is to report on any successfully bounced emailets. It consists of two parts:
receiver.pl
This Perl script is spawned by the monitor's own MTA in response to every bounced emailet. Its task is to extract the emailet ID from the Subject: Re: line of the bounced emailet and invoke receiver.tcl by fetching the local URL http://<yourserver>:<yourport>/mmon/receiver.tcl?id=<the-id>
receiver.tcl
This is an AOLserver Tcl script which enters the information about successfully bounced emailets into the database.
Checker
After the Sender has done its job, we wait 60 seconds to leave some time for the fired emailets to bounce back. For each previously sent emailet, the Checker checks whether it has bounced back. If an emailet hasn't bounced, the Checker records the event in the database.
Messenger
The Messenger looks in the event log and, if it finds any events that have not yet been reported, good or bad news, it adds an item for each to the report to be sent to administrators. Every reported event is marked in the database (reported_p='t'). The reports are sent to a single E-mail address that can be set in the configuration file. I expect this to be an administrators' mailing list address. The mailing list would be maintained independently of the MTA Monitor. In order to avoid sending too much email, the Messenger simply exits before it does anything if the last notification was sent less than fifteen minutes ago. In that case any unreported events remain marked as unreported so that a later invocation of the Messenger can report on them. Note that this implies the following behavior: if there have been no events in the last 15 minutes, a new event (or multiple new events recorded in a single Sender/Checker run) will be reported instantly in a single notification. If more bad events are recorded in the next run, five minutes later, they will not be reported then, nor will they be reported in the run ten minutes later. They won't be reported until the third run, fifteen minutes after the first instant report.
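The throttling rule described for the Messenger can be simulated with a small Python sketch. It is a hypothetical model, not the monitor's implementation: the real Messenger reads the last notification time and the unreported events from the database, whereas here they are plain arguments, and the event strings and dates are made up.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(minutes=15)   # minimum time between notifications


def messenger(now, last_sent, unreported):
    """Return (events_to_report, new_last_sent) for one Messenger run.

    Hypothetical sketch: arguments stand in for values the real
    Messenger would look up in the database.
    """
    if last_sent is not None and now - last_sent < MIN_GAP:
        return [], last_sent          # too soon: leave events unreported
    if not unreported:
        return [], last_sent          # nothing to report
    return list(unreported), now      # report everything in one notification


start = datetime(1999, 1, 1, 8, 0)
new_events = {0: ["A down"], 1: ["B down"], 2: ["C down"], 3: []}
last_sent, pending, log = None, [], []
for run in range(4):
    now = start + timedelta(minutes=5 * run)
    pending += new_events[run]
    sent, last_sent = messenger(now, last_sent, pending)
    if sent:
        pending = []                  # corresponds to reported_p = 't'
        log.append((now.strftime("%H:%M"), sent))
print(log)
```

In this simulation the first event is reported immediately at 08:00, while the events recorded at 08:05 and 08:10 are held back and sent together at 08:15.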

In fact, the Sender, the Checker and the Messenger are one and the same script, scheduled to wake up every five minutes. After the Sender finishes, the script sleeps 60 seconds and then runs the Checker. The Messenger is the last part of the scheduled script. I describe them as separate components simply because it's easier to think of them that way.
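The round trip of an emailet through the Sender, the Bouncer and the Receiver can be sketched in Python with the standard email package. The actual Bouncer and receiver.pl are Perl scripts, so this is only an illustration. The subject format ("emailet" followed by a number), the addresses, and all function names are assumptions, since the text only says that the ID is contained in the Subject: line.

```python
import re
from email.message import EmailMessage

SUBJECT_PREFIX = "emailet"   # hypothetical; the real subject format is not given


def make_emailet(emailet_id, monitor_addr, server_addr):
    """Sender: build an emailet whose Subject: line carries the ID."""
    msg = EmailMessage()
    msg["From"] = monitor_addr
    msg["To"] = server_addr
    msg["Subject"] = f"{SUBJECT_PREFIX} {emailet_id}"
    msg.set_content("ping")
    return msg


def bounce(msg):
    """Bouncer: reply to the sender, prepending Re: to the Subject: line."""
    reply = EmailMessage()
    reply["From"] = msg["To"]
    reply["To"] = msg["From"]
    reply["Subject"] = "Re: " + msg["Subject"]
    reply.set_content("pong")
    return reply


def extract_id(msg):
    """Receiver: pull the emailet ID out of a bounced Subject: line."""
    match = re.match(rf"Re: {SUBJECT_PREFIX} (\d+)$", msg["Subject"])
    return int(match.group(1)) if match else None


# Example addresses are placeholders, not real configuration values.
sent = make_emailet(42, "mmon@monitor.example.com", "bouncer@mail.example.com")
returned = bounce(sent)
print(returned["Subject"], extract_id(returned))
```

The extracted ID is what the Receiver would pass along in the id parameter of the receiver.tcl URL.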

States of reachableness

The current state of reachableness is encapsulated in two columns of the table mmon_servers: smtp_ok_p and last_unbounced_emailet_id. The boolean column smtp_ok_p can assume the values t and f. It tells us whether the SMTP port was responding the last time we checked. The column last_unbounced_emailet_id signals a problem with the MTA's throughput. If it holds an emailet ID, i.e. if it is not null, then we have a throughput problem. Otherwise, the MTA's throughput is OK. All four combinations of the two columns are possible.
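The four combinations can be enumerated with a short Python sketch. The column names come from the text above; the function name, the wording of the descriptions, and the sample emailet ID are made up for illustration.

```python
def reachability(smtp_ok_p, last_unbounced_emailet_id):
    """Describe a server's state from the two mmon_servers columns.

    smtp_ok_p is 't' or 'f'; last_unbounced_emailet_id is None when the
    column is null.
    """
    smtp = "SMTP port OK" if smtp_ok_p == "t" else "SMTP port failing"
    if last_unbounced_emailet_id is None:
        throughput = "throughput OK"
    else:
        throughput = f"throughput problem (emailet {last_unbounced_emailet_id})"
    return f"{smtp}, {throughput}"


# Enumerate all four combinations; 1017 is an arbitrary sample ID.
for smtp_ok_p in ("t", "f"):
    for emailet_id in (None, 1017):
        print(smtp_ok_p, emailet_id, "->", reachability(smtp_ok_p, emailet_id))
```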

What if an email doesn't come back?

Suppose our MTA monitor sends an emailet which doesn't come back after the required period of time. Should we try to send another one after a few minutes, or should we consider the MTA to be down until the first mail returns as it should?

I prefer the second approach. Delivering email to a black hole is perhaps the worst thing that an MTA can do. We should take the problem seriously and not ignore it by saying "other emailets are bouncing so it's OK now". We should investigate. Furthermore, if we keep sending an emailet at every scheduled run and all of those emailets stay trapped, there will be the ugly consequence of receiving many stale emailets once the problem is fixed.

So, if an emailet doesn't return, we suspend sending further emailets to that server until an administrator explicitly turns monitoring back on by editing that server's parameters in the control panel.

Installation

Debugging your setup

If something goes wrong, I recommend that you open four windows and tail -f these four files simultaneously:

Useful readings

Copyright and Legal Status of this Software

This software is copyright 1999 ArsDigita, LLC and licensed under the GNU General Public License, version 2 (June 1991).
bdolicki@arsdigita.com