GeodSoft

NT Server Down, Won't Be Fixed - 8/20/2001

Following the most recent failure of my Windows NT 4, IIS 4 web server, I've decided to discontinue maintaining my NT web server mirror. I recently (Oct. 2001) learned that the underlying cause of the problem was a bad memory chip. For some time I'd thought it was caused by the Microsoft security "rollup" patch. The server sat unused for several weeks, then I installed a much larger hard disk and tried to install Linux. I had to upgrade the BIOS but even after that Linux installs failed. Subsequently I installed OpenBSD. Within an hour or so of replacing my live OpenBSD web server, the new server was displaying "segmentation fault" and "memory fault" messages when I checked on its performance.

I then remembered I'd added a new memory chip at the same time the rollup patch was applied to the NT server. I confirmed that the errors were repeatable across reboots and not present when the new chip was removed or replaced. Linux also installed cleanly with the bad chip removed. Microsoft can't be responsible for the bad memory, but it's clear that neither NT nor Linux protected the system from the bad memory as OpenBSD did. While OpenBSD displayed meaningful errors that led directly to a fix of the underlying problem, it also continued to serve web pages without interruption. It's most unlikely that Apache would have functioned normally if it had been executing at bad memory addresses, but it does seem likely that OpenBSD would have protected the system from the results and provided useful messages through the system logging and the console.

Because NT almost totally trashed the disk system, restoring the NT server would require reinstalling a basic NT 4 system and restoring from backups. For those interested in more information about the crash, subsequent troubleshooting and the reasons I'll not restore this system, a detailed account follows.

Around 3 A.M., Sunday, August 19, 2001, one of my automated alarms went off, alerting me to the fact that the shared drive on my NT server was no longer available. I determined that the web site, which is on that drive, was also not responding. I went to the console to diagnose the problem and was confronted with a login dialog box. This was odd since I'd not logged out after last using this machine. It wasn't the standard login dialog of a Primary Domain Controller (PDC) but rather the standalone server dialog box without the domain name field. For the past year or so, about 30% of the time when the NT box reboots, it displays this inappropriate login dialog, which won't let me log in. I have to press the reset button, and normally, following the reboot, the PDC login dialog is displayed, letting me log in.

When I pressed the reset button, I was careless and pressed the reset button on the Linux server which is next to the NT server. The Linux server had been up for just over 11 months. As I looked at the front of the machine, I realized what I'd just done and screamed in rage. This was the longest any machine I've been responsible for had been up and I had expected it to reach a year, barring an extended power outage.

When I logged in on the NT machine, I started IE 5 to see if the web site was up. I was somewhat puzzled, because when the alarm triggered by the shared drive went off, the NT server was responding to automated pings but the web site was down. I don't remember the precise sequence of events but the site did not come up and a Dr. Watson dialog box appeared. CPU use went to 100% and stayed there. I tried to invoke task manager. The hourglass displayed briefly but task manager did not start. I wanted to kill the runaway process. I tried the start menu; the task bar would appear but when I clicked start, the program menu would not appear. I clicked on a few desktop icons including the command prompt icon but nothing would start. I could task switch between the few tasks that were running but otherwise could do nothing.
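The situation described here, a host that still answers pings while its shared drive and web site are down, is the reason to check each service separately rather than rely on ping alone. What follows is only a minimal sketch of that idea, not the alarm scripts actually used on this network. The function names are hypothetical, the share check assumes the NetBIOS session service on TCP port 139, and the host reachability result is assumed to come from whatever ping mechanism is already in place.

```python
import socket
import urllib.error
import urllib.request


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP error codes, bad URLs.
        return False


def summarize(host_reachable, share_reachable, web_reachable):
    """Turn three independent check results into one alarm message."""
    if not host_reachable:
        return "host down"
    failed = []
    if not share_reachable:
        failed.append("shared drive")
    if not web_reachable:
        failed.append("web site")
    if failed:
        return "host up, services down: " + ", ".join(failed)
    return "all checks passed"


def check_server(host, url, host_reachable):
    """Run the service checks; host_reachable comes from an existing ping test."""
    share = tcp_port_open(host, 139)  # assumption: NetBIOS session service
    web = http_ok(url)
    return summarize(host_reachable, share, web)
```

A call such as `check_server("ntserver.example", "http://ntserver.example/", True)`, with placeholder names, would report which services failed even though the ping succeeded, which matches what the alarms showed that night.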

Then the system spontaneously rebooted. During the boot, chkdsk started running. While chkdsk was running, it displayed numerous messages about disk corrections it was making. During this, the machine spontaneously rebooted again. This time it came up with the error message "Windows NT could not start because the following file is missing or corrupt: \WINNT\system32\l_intl.nls. You can attempt to repair this file by starting Windows NT Setup using the original Setup floppy or CD-ROM. Select 'r' at the first screen to start repair." It was then after 4 A.M. I powered off the NT machine and shut down or reconfigured the alarms that were going off because the NT server was not pingable.

The next morning when I powered up the server, I got the same "could not start" message. I reset the system and booted to the backup, minimal install system that I keep for system backup and recovery purposes. I started Zip Central and NT spontaneously rebooted. Subsequently, the backup system spontaneously rebooted twice when the blue-green background appeared. The fourth boot to the backup system completed but a "Directcd.exe - Entry Point Not Found" dialog box contained the message "The procedure entry point CopyAcceleratorTableW could not be located in the dynamic link library USER32.dll." Starting Zip Central generated a corresponding message.

As the backup system was clearly not useable, I decided to try to "repair" the main install in \WINNT. After going through the three install floppies, I was prompted for the install CD. I then got a series of messages telling me that specific files did not match the original install file and asking if I wanted to restore the file, skip it or restore all files. It was immediately obvious these were the system files, upgraded by various service packs since the original NT 4 CD. I skipped each. At first the order appeared to be alphabetic but soon the pattern ended. As files in various directories and not in any logical sequence appeared, I decided there was no meaningful order.

After approximately two hundred files, l_intl.nls was listed. I restored this and then ended the install / repair procedure ignoring messages that the install was not complete. I successfully rebooted and was able to use the control panel to change the system's IP address (so I could cover the NT server's normal IP address, with a working web server). At first, the system looked OK but when I started Zip Central, the system spontaneously rebooted again. After completing the reboot, I logged in again and again tried Zip Central. It started but soon displayed a "Zip Central - Untitled" dialog box stating "Access violation at address 00403DF8 in module 'ZIPCENTRAL.EXE' . Read of Address." I was then interrupted and unable to return to the machine until the following morning.

The machine had rebooted and was displaying the "Press Ctrl + Alt + Delete to login" dialog. I tried, the machine hung, and eventually I pressed reset. The next reboot completed, and it was clear from the Event Viewer, which turned out to be one of the few programs that actually worked, that the machine had spontaneously rebooted about three times since I left it. Notepad and several other programs came up with errors similar to the one Zip Central had displayed. Solitaire froze as soon as I tried to move a card; I had to terminate sol.exe via the non-responsive task dialog. Easy CD Creator and one other program caused immediate spontaneous reboots when selected. The web and FTP servers as well as NetBIOS services were not functioning.

By this point it was entirely clear that both systems were thoroughly corrupted, probably as a result of the disk problems indicated by the numerous correction attempts made by chkdsk. To restore the systems, I'd need to do a fresh install of the backup NT system and from that, restore the full system from recent backups.

A few hours after I had installed the patch, I received a SANS "Security Alert Consensus" message stating 'Finally, a number of you wrote in about the Microsoft post-SP6a security "rollup" patch we discussed in the last issue of SAC. It appears that the "rollup" crashed a ton of systems and created a fair amount of general chaos.' They go on to say "whenever possible, test patches should be tried on nonproduction machines." Most organizations don't have test servers essentially identical to their production servers, and testing patches on dissimilar systems is of little value.

The next paragraph is obviously obsolete as a result of learning that bad memory was the underlying cause. NT did nothing to help me identify that problem and I'm leaving the following paragraph as it was written, based on knowledge just after the crash. The bad memory was not in the machine for any other problems referred to.

Despite the time this has already taken, I don't know if a) Microsoft's latest security patch resulted in total system failure, causing damage as severe as a skilled intruder could inflict, since by comparison an erased disk would be simple to fix, or b) a previously almost stable system (other problems over the past year and a half are documented elsewhere) spontaneously self-destructed. If a), there is no way to know, without perhaps days of experimenting and testing, whether to restore to the pre-patch state and avoid the patch or to restore to the post-patch state. If b), then no course I take today can assure me this won't recur tomorrow or perhaps four months from now.

As I write this, I am on the fourth draft of a very long (over 120 page) comparison of Linux, OpenBSD and Windows server systems. In large part because of problems like these, plus Microsoft's increasing licensing costs and extraordinarily poor security record, I'd already decided not to upgrade my Windows NT server to Windows 2000 or any successor. I accept the consensus opinion that 2000 is better than NT but nothing I've seen or read suggests it comes close to remedying the fundamental architectural defects that make Windows so clearly inferior as a server platform for my specific needs.

Since I do not intend to continue with Windows servers in my business, I can see no reason to spend more time on what is now an almost obsolete Microsoft product, rehashing the same kinds of problems that I've extensively documented on this page and elsewhere. I'll migrate the backup / CD-R functions to another machine, most likely my NT workstation for the near term. When time permits, I expect to mirror my web site to a third OS which will be some as yet to be determined distribution of Linux or perhaps FreeBSD.


Copyright © 2000 - 2014 by George Shaffer. This material may be distributed only subject to the terms and conditions set forth in http://GeodSoft.com/terms.htm (or http://GeodSoft.com/cgi-bin/terms.pl). These terms are subject to change. Distribution is subject to the current terms, or at the choice of the distributor, those in an earlier, digitally signed electronic copy of http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the time of the distribution. Distribution of substantively modified versions of GeodSoft content is prohibited without the explicit written permission of George Shaffer. Distribution of the work or derivatives of the work, in whole or in part, for commercial purposes is prohibited unless prior written permission is obtained from George Shaffer. Distribution in accordance with these terms, for unrestricted and uncompensated public access, non profit, or internal company use is allowed.

 