GeodSoft

Performance Surprise

Several factors combined to cause a rapid, significant, and unplanned decline in web server performance.

Everyone in the web field has read about how quickly servers and other infrastructure resources that had been adequate can become inadequate. Usually these stories are about dot coms that are experiencing rapid growth, but associations with an online presence need to watch their server capacity carefully too. ATLA, or more specifically I, got caught by surprise by ATLA NET's server performance.

In the fall of 1998 my boss asked me if we should budget for a new web server in the coming fiscal year, which would begin the following August. My response was that I didn't think it would be necessary. While traffic and load on the server were certainly increasing, I really didn't think we would need a new server until the following year (August 2000). I expected that traffic on our leased line, which had recently been upgraded to a full T1, would become the limiting factor sooner.

I couldn't have been more wrong. I could remember watching Performance Monitor when you could literally see individual page hits register as little CPU spikes off an almost idle state. During much of the workday there were now periods of irregular 5 - 20% loads but still some near idle patches. When Lyris was sending a large email, CPU would bounce around in the 50% to near 100% range for a few minutes. Occasionally it would hit 100% but it never stayed there. Web page response was always fast even when the machine looked busy. Growth had been gradual and fairly steady for about a year since we went live on our in house system. We'd seen no instances where some special event or circumstance drove traffic levels to several times their normal level for any sustained period.

I took a four week vacation in December, and when I returned in January the server was clearly facing performance issues. While there were still idle off hour periods, there were growing peak periods, mid day to late afternoon, where the server could become noticeably sluggish. Only very occasionally were static pages noticeably affected, but all CGI scripts were taking a noticeable performance hit. Lyris scripts were the worst, with the administrative interface becoming almost unusable in peak hours. Member directory searches, which had nearly always been sub second from the LAN regardless of the search, often took 15 seconds and sometimes a minute. There were even rare CGI timeouts.

As I worked with the server, I found a few things that I could do to ameliorate the performance situation, but it was increasingly clear we needed a new web server. In a period of two to three months, I'd gone from saying we could wait 20 months to upgrade to saying we needed to upgrade now. There were multiple factors that I'd failed to take into account.

First, some growth is geometric and what looks like modest growth can build a lot quicker than you'd intuitively expect. More important is that a number of performance related factors have thresholds. Stay under certain limits and things look pretty good but exceed those limits and performance can degrade dramatically. There is also the matter of how certain processes interact, in this case the web and list servers specifically.
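The two effects in this paragraph, compound growth and a performance threshold, can be illustrated with a small sketch. This is not a model of the ATLA server. It borrows the textbook single-queue (M/M/1) approximation, in which mean response time is the service time divided by (1 - utilization), and every number in it (the 20% starting utilization, the 10% monthly growth, the half second service time) is a hypothetical value chosen only to show the shape of the curve.

```python
def months_until_saturation(utilization, monthly_growth, limit=1.0):
    """Count whole months of compound growth until utilization reaches limit."""
    months = 0
    while utilization < limit:
        utilization *= 1.0 + monthly_growth
        months += 1
    return months


def mm1_response_time(service_time, utilization):
    """Mean response time for a single M/M/1 queue: S / (1 - rho).

    This is a textbook approximation used for illustration only; it is an
    assumption here, not a measurement of any real server.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be at least 0 and below 1")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical figures: a 0.5 second CGI request at rising utilization.
    for rho in (0.2, 0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"utilization {rho:.2f}: {mm1_response_time(0.5, rho):.1f} s")
    # Hypothetical figures: 20% busy today, load growing 10% per month.
    print(months_until_saturation(0.2, 0.10), "months until saturation")
```

In this approximation, response time changes little across the lower utilization range and then climbs steeply near saturation, which is the kind of threshold behavior that makes steady growth look harmless until shortly before it is not.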

Also, NT does not appear to degrade as gracefully under full load as UNIX and other true multi user systems do. NT is truly multi tasking and multi threaded but, like Novell, is not a true multi user system. Multiple users access server processes via client software. Without adding third party products, multiple users do not and cannot log in and work entirely within NT's address space.

For all the metrics that NT's Performance Monitor includes, none of them seems to correlate directly with load average numbers, so it's hard to make a direct comparison with UNIX like systems. Since a load average of 1 means an average of one process is actively waiting for CPU time, I'd think that means that about 50% of the time an NT like CPU meter would be at 100%. By the time the load average is 2 - 3, I'd think a CPU meter would be almost continuously at 100%. This is a typical load for a moderately busy UNIX host. A load average of 5 will surely affect user response time, but such a system is still likely to be quite usable. UNIX is expected to run for extended periods at "100%". On NT, 100% on a CPU meter is indicative of performance problems that need to be addressed.

Some of these issues have already been discussed with a different slant in List Server Issues: Don't Install a List Server on a Web Server. In addition to doing this, we were keeping unlimited searchable, i.e. indexed, list archives. Lyris is not a full text search tool and it's not very efficient at indexing large and continuously growing message bases.

With our settings, Lyris was steadily getting more users who were sending more messages, each of which needed to be sent to a larger recipient pool, and these were being stored in continuously growing databases with constantly updated indexes. One of the first things I did to relieve the performance load was to cut the message archives to a 120 day period. This immediately reduced the message base sizes and significantly limited how fast they could grow. (With a constant retention period, rather than a message number limit, the databases will grow in size as long as the number of messages per day grows.) Lyris had quickly passed from the point where message sends were normally discrete events with significant gaps between them to one where uninterrupted message transmissions continued for extended times. Though static page delivery was only marginally affected, all CGI processes were significantly impacted, and these represented too many key system functions. Something had to be done, and the sooner the better.
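The parenthetical point about retention periods versus message number limits can be made concrete with a short sketch. The function names and the daily message counts below are hypothetical and have nothing to do with how Lyris actually stores or expires messages; the sketch only counts how many messages an archive would hold under each policy.

```python
def size_with_retention_window(daily_counts, retention_days):
    """Messages held at the end of each day when only the most recent
    retention_days days of traffic are kept."""
    sizes = []
    for day in range(len(daily_counts)):
        start = max(0, day - retention_days + 1)
        sizes.append(sum(daily_counts[start:day + 1]))
    return sizes


def size_with_message_cap(daily_counts, max_messages):
    """Messages held at the end of each day when the archive is trimmed
    to at most max_messages."""
    sizes = []
    held = 0
    for count in daily_counts:
        held = min(held + count, max_messages)
        sizes.append(held)
    return sizes


if __name__ == "__main__":
    # Hypothetical traffic: 100 messages a day, growing by 2 a day for a year.
    traffic = [100 + 2 * day for day in range(365)]
    windowed = size_with_retention_window(traffic, 120)
    capped = size_with_message_cap(traffic, 12000)
    for day in (119, 239, 364):
        print(day + 1, windowed[day], capped[day])
```

Under the retention window the archive keeps growing for as long as daily traffic grows, just more slowly than an unlimited archive would, while a message number limit holds the size flat once the cap is reached.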

Aside from changing the archive period there were not a lot of options that would get much of a performance gain. As time passed the system load only increased. We considered filling the second, still empty CPU slot but the Pentium Pro 200 was now obsolete and not easily available and was overpriced when it could be obtained. To be sure they matched, two CPUs were needed, and the box hadn't been open in about 2 1/2 years. Changing the insides of a machine that hadn't been opened in that long increases the potential for system failure. The single disk drive also occasionally made disconcerting noises. The only real option appeared to be a hardware upgrade.

We ordered a dual Pentium II 400 to replace the single Pentium Pro 200. A variety of problems caused the change to take much longer than anticipated. These included a defective server that had to be replaced after it had been fully configured and was almost ready to switch, as well as dealing with a complex tape changer. I would not make the change until the backup was fully functional. We migrated the existing configuration, including the web server / Lyris mix, to the new server. This allowed about a 5 minute change over as we rebooted both the old and new servers with switched IP addresses. We would have liked to have separated the web and list servers, but there were simply too many unknowns to make a major configuration change that might also require DNS changes at the same time as switching servers. The server that ATLA switched to in the Spring of 1999 was still in use a year later, but already it was not unusual for CGI response to be somewhat sluggish.


Copyright © 2000 - 2014 by George Shaffer. This material may be distributed only subject to the terms and conditions set forth in http://GeodSoft.com/terms.htm (or http://GeodSoft.com/cgi-bin/terms.pl). These terms are subject to change. Distribution is subject to the current terms, or at the choice of the distributor, those in an earlier, digitally signed electronic copy of http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the time of the distribution. Distribution of substantively modified versions of GeodSoft content is prohibited without the explicit written permission of George Shaffer. Distribution of the work or derivatives of the work, in whole or in part, for commercial purposes is prohibited unless prior written permission is obtained from George Shaffer. Distribution in accordance with these terms, for unrestricted and uncompensated public access, non profit, or internal company use is allowed.

 