Several factors combined to cause a significant, rapid, and unplanned
decline in web server performance.
Everyone in the web field has read about how quickly servers and
other infrastructure resources that had been adequate can become inadequate.
Usually these stories are about dot coms that are experiencing rapid
growth, but associations with an online presence need to watch their
server capacity carefully too. ATLA, or more specifically I, got caught
by surprise by ATLA NET's server performance.
In the fall of 1998 my boss asked me if we should budget for a new
web server in the coming fiscal year, which would begin the following August.
My response was that I didn't think it would be necessary. While traffic
and load on the server were certainly increasing, I really didn't think
we would need a new server until the following year (August 2000). I
expected that traffic on our leased line, which had recently been
upgraded to a full T1, would become the limiting factor sooner.
I couldn't have been more wrong. I could remember watching Performance
Monitor when you could literally see individual page hits register
as little CPU spikes off an almost idle state. During much of the
workday there were now periods of irregular 5 to 20% loads but still
some near idle patches. When Lyris was sending a large email, CPU would
bounce around in the 50% to near 100% range for a few minutes. Occasionally
it would hit 100% but it never stayed there. Web page response was
always fast even when the machine looked busy. Growth had been
gradual and fairly steady for about a year since we went live on our
in house system. We'd seen no instances where some special event or
circumstance drove traffic levels to several times their normal for
any sustained period.
I took a four week vacation in December, and when I returned in January
the server was clearly facing performance issues. While there were still
idle off hour periods, there were growing peak periods, midday to late
afternoon, when the server could become noticeably sluggish. Only very
occasionally were static pages visibly affected, but all CGI scripts
were taking a performance hit. Lyris scripts were the worst,
with the administrative interface becoming almost unusable in peak
hours. Member directory searches, which had nearly always been sub second
from the LAN regardless of the search, often took 15 seconds and sometimes
a minute. There were even rare CGI time outs.
As I worked with the server, and though there were a few things that
I could do to ameliorate the performance situation, it became
increasingly clear that we needed a new web server. In a period of between
two and three months, I'd gone from saying we could wait 20 months to
upgrade to saying we needed to upgrade now. There were multiple factors
that I'd failed to take into account.
First, some growth is geometric, and what looks like modest growth can
build a lot more quickly than you'd intuitively expect. More important is
that a number of performance related factors have thresholds. Stay
under certain limits and things look pretty good, but exceed those
limits and performance can degrade dramatically. There is also the
matter of how certain processes interact, in this case the web and
list servers specifically.
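To put rough numbers on these two effects, here is a minimal Python sketch. It is illustrative only: the 10% monthly growth rate and the 0.5 second service time are assumed figures, not measurements from ATLA's server, and the response time formula is the textbook single server queue approximation (M/M/1), not anything specific to NT or Lyris. The function names are made up for this example.

```python
def compound_growth(start, monthly_rate, months):
    """Load after `months` of steady percentage growth per month."""
    return start * (1 + monthly_rate) ** months


def mm1_response_time(service_time, utilization):
    """Average response time of a single server queue (M/M/1 approximation).

    `utilization` is the fraction of time the server is busy, 0 <= u < 1.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be at least 0 and below 1")
    return service_time / (1 - utilization)


if __name__ == "__main__":
    # Geometric growth: an assumed 10% per month, starting from 100 units.
    for months in (0, 6, 12, 18, 24):
        print(months, "months:", round(compound_growth(100, 0.10, months)))

    # Threshold effect: the same assumed 0.5 second request at rising load.
    for u in (0.2, 0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"{u:.0%} busy:", round(mm1_response_time(0.5, u), 1), "seconds")
```

Under these assumptions the load roughly triples in a year, and response time at 90% utilization is ten times the bare service time. That is the general shape of the problem: slow looking growth pushes the machine toward a region where each additional bit of load costs far more than the one before it.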
Also, NT does not appear to degrade as gracefully under full load
as UNIX and other true multi user systems do. NT is truly
multi tasking and multi threaded but, like Novell, is not a true
multi user system. Multiple users access server processes via
client software. Without adding third party products, multiple users do
not and cannot log in and work entirely within NT's address space.
For all the metrics that NT's Performance Monitor includes, none of
them seem to directly correlate to load average numbers so it's hard
to make a direct comparison with UNIX like systems. Since a load
average of 1 means an average of one process is actively waiting for
CPU time, I'd think that means that about 50% of the time an NT like
CPU meter would be at 100%. By the time the load average is 2 - 3 I'd
think a CPU meter would be almost continuously at 100%. This is
a typical load for a moderately busy UNIX host. A load average of
5 will surely affect user response time but such a system is still
likely to be quite usable. UNIX is expected to run for extended
periods at "100%". 100% on an NT CPU meter is indicative of performance
problems that need to be addressed.
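For readers on a UNIX like system, load average is easy to inspect directly. The sketch below uses Python's standard os.getloadavg() and os.cpu_count(). The helper name load_per_cpu is made up for illustration, and dividing by the CPU count is only a common rule of thumb for comparing hosts, not a measurement from the servers discussed here. os.getloadavg() is only available on UNIX like systems, which fits the point that NT offers no directly comparable number.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of CPUs in the host."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


if __name__ == "__main__":
    # 1, 5 and 15 minute load averages as reported by the kernel.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    for label, value in (("1 min", one_min),
                         ("5 min", five_min),
                         ("15 min", fifteen_min)):
        print(f"{label}: load {value:.2f}, "
              f"per CPU {load_per_cpu(value, cpus):.2f}")
```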
Some of these issues have already been discussed with a different
slant in List Server Issues: Don't Install a List Server on a Web Server.
In addition to doing exactly that, we were keeping unlimited searchable,
i.e. indexed, list archives. Lyris is not a full text search tool and it's
not very efficient at indexing large and continuously growing message bases.
With our settings, Lyris was steadily getting more users who were sending
more messages, each of which needed to be sent to a larger recipient pool,
and these were being stored in continuously growing databases with
constantly updated indexes. One of the first things I did to relieve
the performance load was to cut the message archives to a 120 day period.
This immediately reduced the message base sizes and significantly
limited how fast they could grow. (With a constant retention period,
rather than a message number limit, the databases will still grow in size
as long as the number of messages per day grows.) Lyris had quickly passed
from the point where message sends were normally discrete events with
significant gaps between them to one where uninterrupted message
transmissions continued for extended times. Though static page delivery
was only marginally affected, all CGI processes were significantly impacted,
and these represented too many key system functions. Something had
to be done, and the sooner the better.
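The difference between a retention period and a message count limit can be shown with a small Python sketch. It is not how Lyris manages its archives internally. The function names are hypothetical, and the figures of 100 and 200 messages per day are invented for the example.

```python
def archive_size_by_days(daily_counts, retention_days):
    """Messages kept when only the most recent `retention_days` are retained."""
    if retention_days <= 0:
        return 0
    return sum(daily_counts[-retention_days:])


def archive_size_by_count(daily_counts, max_messages):
    """Messages kept when the archive is capped at a fixed message count."""
    return min(sum(daily_counts), max_messages)


if __name__ == "__main__":
    # Assumed traffic: 120 days at 100 messages a day, then 120 at 200.
    early = [100] * 120
    later = early + [200] * 120

    print("120 day window, early:", archive_size_by_days(early, 120))
    print("120 day window, later:", archive_size_by_days(later, 120))
    print("12000 message cap, later:", archive_size_by_count(later, 12000))
```

With a 120 day window the archive doubles when daily traffic doubles, while a fixed message cap holds the size constant at the cost of a shrinking time span.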
Aside from changing the archive period, there were not a lot of options
that would yield much of a performance gain. As time passed the system
load only increased. We considered filling the second, still empty
CPU slot, but the Pentium Pro 200 was now obsolete, not easily available,
and overpriced when it could be obtained. To be sure the CPUs matched, two
would have been needed, and the box hadn't been opened in about 2 1/2 years.
Changing the insides of a machine that hasn't been opened in that long increases
the potential for system failure. The single disk drive also occasionally
made disconcerting noises. The only real option appeared to be a new server.
We ordered a dual Pentium II 400 to replace the single Pentium Pro 200.
A variety of problems, including a defective server that had to be replaced
after it had been fully configured and was almost ready to switch, as
well as dealing with a complex tape changer, caused the change to take
much longer than anticipated. I would not make the change until the
backup was fully functional. We migrated the existing configuration,
including the web server / Lyris mix, to the new server. This allowed
about a 5 minute changeover as we rebooted both the old and new servers
with switched IP addresses. We would have liked to have separated the
web and list servers, but there were simply too many unknowns to make
a major configuration change that might also require DNS changes at the same
time as switching servers. The server that ATLA switched to in the spring
of 1999 was still in use a year later, but already it was not unusual for
CGI response to be somewhat sluggish.
Copyright © 2000 - 2014 by George Shaffer. This material may be
distributed only subject to the terms and conditions set forth in
These terms are subject to change. Distribution is subject to
the current terms, or at the choice of the distributor, those
in an earlier, digitally signed electronic copy of
http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the
time of the distribution. Distribution of substantively modified
versions of GeodSoft content is prohibited without the explicit written
permission of George Shaffer. Distribution of the work or derivatives
of the work, in whole or in part, for commercial purposes is prohibited
unless prior written permission is obtained from George Shaffer.
Distribution in accordance with these terms, for unrestricted and
uncompensated public access, non profit, or internal company use is