GeodSoft

New Windows Anomalies and Some UNIX Comparisons - 4/9/01

Here I use "new" to mean anomalies that I have not previously seen or commented on. From what I know of Windows, it would surprise me greatly if others had never encountered these or similar problems.

There was a power outage here yesterday that lasted over 40 minutes. All my systems are on UPSs but generally not with enough reserve to last that long. I happen to use APC UPSs which come with the hardware and software necessary to connect Windows PCs to the UPSs. This allows an automated "managed" shutdown to be initiated after a set period of time rather than simply letting the machine crash when the battery runs out. For reasons of system integrity, a managed shutdown is preferred, especially on systems that may be engaged in active disk activity such as writing log files. Due to the fragility of the Windows registry, it's more important that Windows systems be shut down properly rather than simply be allowed to crash.

Not only the computers but all routers, switches and other equipment necessary to maintain functioning web sites are on UPSs. My sites can and do remain up and available through short local power outages. This means that the web servers and firewall are likely to be active and thus writing logs at the time of a power failure.

My NT server and workstation shut down in response to the UPS software. The firewall, which will nearly always be active if any of the web servers are, failed when the UPS battery ran down. The UPS that the Linux and OpenBSD web servers were on lasted until power was restored. The Linux and OpenBSD servers had been up 204 and 53 days respectively. Both continued their long up-time runs. (The Linux machine was up for 336 days when the reset button was accidentally pressed. The OpenBSD machine was up until it was moved.) Neither ever crashes. Except for hardware changes and operating system upgrades, it's hard to think of a reason for ever rebooting either. Any daemon (service) can be stopped, started and, if necessary, upgraded at will, and the networking setup can be changed without rebooting.

The firewall had to go through its file system checks and thus took somewhat longer than normal to boot. This is still not much longer than a typical NT server boot including starting of services. Everything worked as expected and there were no problems; the system remained up until it was upgraded a few months later.

The NT workstation was OK but the NT server didn't fare quite so well. When the NT server rebooted, it first came up with the wrong login dialog box. This server is a Primary Domain Controller and the login box is supposed to have three fields for user name, password and domain. The domain field was not displayed on the login box, just the user name and password fields. I made several attempts to log in but knew it would be futile.

I think this is the third time I've now seen this odd behavior after a reboot. The first time, I did not recognize the odd login dialog and tried repeatedly to log on with every username and password I knew, but none were accepted. I did a hardware reset and the normal login prompt appeared. I logged in successfully on the first attempt. The second and third times I saw the odd login box, I used the hardware reset after only a few failed login attempts. I think in each case the normal login appeared after a single extra reboot.

Following the reboot (1) and login, the NT server appeared normal; there were no error messages or dialog boxes indicating any failed services. The first sign of a problem was when my intrusion detection system started displaying warnings that logs from the other servers were not available. They were not available because FTP was not running on the NT server. Neither was the web server as it turned out. Manual attempts to start both failed, generating the following error message: "Could not start the World Wide Web Publishing Service service (sic) on \\NT. Error 2140. An internal Windows NT error occured." The FTP message was identical except for substituting "FTP Publishing Service". Multiple attempts produced the same results.

Another reboot (2) failed to start either service and manual attempts generated the same error messages. At no time were there either system dialog box error messages or event log entries to indicate that a service had failed to start. It was my own custom programmed warning system and no NT feature that alerted me to the problem. Microsoft's unhelpful error messages provided no useful information that I was not already aware of. As these messages were displayed only in response to manual efforts to fix an error condition that was identified by other means, no Microsoft error message or system log played any useful role in either identifying or fixing a major system problem. Even after I was aware of the problem and knew when it occurred, I could find no event log entries describing the problem.
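As an illustration only, the kind of external check that catches a silently failed service can be sketched in a few lines of Python. This is not the warning system described above, whose details are not given here; the function names and the example host name are hypothetical. The sketch assumes only that FTP and the web server listen on their standard TCP ports, 21 and 80.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(host, services, probe=port_is_open):
    """Return the names of services whose ports do not accept connections.

    services maps a display name to a TCP port number. probe is the function
    used to test one port; it can be replaced when testing the checker itself.
    """
    return [name for name, port in services.items() if not probe(host, port)]


# Example call (hypothetical host name):
#   check_services("nt.example.com", {"FTP": 21, "WWW": 80})
# returns a list such as ["FTP", "WWW"] when neither port answers.
```

A check like this only shows that a port accepts connections, not that the service behind it is working correctly, but it would have flagged both stopped services without relying on the event log.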

I guessed that something in the registry had become corrupted. I rebooted (3) to the alternate copy of NT that I install on all my NT systems as insurance against just such occurrences. When I saw that the "system" part of the registry was about 50% larger than the backup made a few hours before the power failure, even though no system changes had been made, I thought for sure I'd found the problem. I restored the backup registry over the production registry and rebooted (4) once more.

The web and FTP servers did not start and manual attempts produced the same error messages as before. At this point I was almost out of ideas and starting to consider a full reinstall. Before doing so, I decided to check all recently changed files on the system to see if any other changes might have contributed to the problem.

A search showed that MetaBase.bin had changed a couple of days before. This file stores most IIS and FTP configuration data. I had blocked a computer that was violating the Terms of Use, and causing lots of error messages, from accessing the web site. I decided to restore the earlier version of MetaBase.bin but before I did, I discovered the real cause of the problem. In the directory where MetaBase.bin is stored there was also a MetaBase.bin.bak. The .bak file matched the time stamp and size of the last modifications that I'd made. The active MetaBase.bin file was one byte smaller and time stamped during the second reboot following the power outage. For some reason, during the first reboot that allowed me to log in after the power failure, NT had replaced the proper MetaBase.bin with a damaged file and kept the original as a .bak copy.
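The two manual checks in this step, listing recently changed files and comparing a file against its .bak copy, can be sketched in Python. This is a minimal sketch, not the search tool actually used; the function names are hypothetical, and it assumes the backup sits next to the active file with a ".bak" suffix, as MetaBase.bin.bak did here.

```python
import os
import time


def recently_changed(root, days):
    """Yield (path, mtime) for files under root modified in the last `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                yield path, mtime


def compare_with_backup(path, backup_suffix=".bak"):
    """Compare a file with its backup copy by size and modification time."""
    active = os.stat(path)
    backup = os.stat(path + backup_suffix)
    return {
        # Negative when the active file is smaller than the backup.
        "size_delta": active.st_size - backup.st_size,
        "active_is_newer": active.st_mtime > backup.st_mtime,
    }
```

For the situation described above, such a comparison would report a size difference of -1 with the active file newer than the backup, which is the combination that pointed to the damaged MetaBase.bin.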

I can't even begin to imagine what caused this NT behaviour. As soon as I copied the previous MetaBase.bin file into place (the one with the new IP restrictions), I was able to start both IIS and FTP without further error messages. I rebooted (5) once more to see if the system was fixed and consistent. This time the web and FTP servers started automatically, as they are supposed to. The NT server now appears to be doing what it was prior to the power outage.

Before drawing some final conclusions, I want to comment on the one aspect in this that favors NT. When you buy an APC UPS, it comes with the software for Windows to force an orderly shutdown of the machine before the battery runs out. Corresponding software is available for the more common UNIX variants. To get it you'll need to go to a web site or send in a form as it's not included in the packaging with the UPS.

For two reasons this is not significant. First, with most UNIXes it simply is no big deal if they experience a hard crash such as one caused by a power failure. I have yet to see any UNIX variant not successfully come back from such a situation and resume normal operations afterwards. I know this is not always true and it's probably a really bad idea for a heavily used database server to count on this. Still, for many light to moderate use UNIX servers, it's quite reasonable to let them run on UPS power until the battery fails.

To use the automated shutdown features wisely, unless the UPS is high end with software that can shut down at a specified battery percentage rather than after a fixed number of minutes, you must be conservative in your estimate of the UPS battery life. Thus there will be a period of several minutes between the automated shutdown and the end of useful battery life. It's not very unusual for power to be restored during that period. Depending on the computer power switch, a computer may or may not come back on after power is restored. Switches on newer computers are less likely to restart a computer following a power outage than those on older computers. Where computers are not tended 24 hours a day, an automated shutdown in response to a power failure can turn what would have been a non-event (if the UPS had enough battery power to outlast the outage) into hours or days of down time.
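The trade-off can be made concrete with a little arithmetic. The following Python is a sketch under assumed numbers, not a formula from any UPS vendor: the function names, the 70 percent safety margin and the example figures are all hypothetical.

```python
def shutdown_delay_minutes(rated_runtime, shutdown_duration, safety_percent=70):
    """Minutes to wait on battery before starting a managed shutdown.

    rated_runtime: estimated battery runtime at the current load, in minutes
    shutdown_duration: time the OS needs to shut down cleanly, in minutes
    safety_percent: share of the rated runtime assumed to be really usable
    """
    usable = rated_runtime * safety_percent / 100
    return max(0.0, usable - shutdown_duration)


def unused_battery_minutes(rated_runtime, shutdown_duration, safety_percent=70):
    """Rated battery minutes left over once the managed shutdown completes."""
    delay = shutdown_delay_minutes(rated_runtime, shutdown_duration, safety_percent)
    return rated_runtime - delay - shutdown_duration
```

With an assumed 30 minute rated runtime and a 3 minute shutdown, the delay works out to 18 minutes and about 9 rated minutes go unused. An outage that ends during those 9 minutes finds the server already shut down, even though the battery might have carried it through.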

The other factor is that just having the Windows software and installing it does not assure the system will shut down as expected. If the UPS software does not issue the shutdown with the correct options, application or user dialog boxes may prevent the shutdown from completing. Ensuring that a UPS is properly configured, and will both shut a server down as expected and take full advantage of available battery capacity, requires time-consuming tests. If a server and UPS are not run through an actual test where power is removed; restored prior to shutdown time; removed and kept off until after a shutdown; restored and shut off after the shutdown; and restored and left on, the system has not been tested. If a UPS is added to a production server, it's almost certain that proper tests will not be conducted. Anyone sufficiently knowledgeable and willing to do the necessary testing will also be able to get the necessary software from the UPS manufacturer, if it is available. The same person would make availability part of the UPS selection process.

This latest experience is just one more example of the inadequacies of Windows NT (and 2000) as a server operating system. A UNIX (OpenBSD) server suffers a hard crash when the battery power runs out. Pressing the power button is the sum total of the recovery procedures when power is back. In contrast, an NT server, following an orderly UPS-initiated shutdown, trashes itself. No system error messages or logs announce or reveal any problems. Third party software reveals the problem. Approximately three hours of investigation, good recent backups and five reboots are required to get the system back to where it was prior to the power outage.


Copyright © 2000 - 2014 by George Shaffer. This material may be distributed only subject to the terms and conditions set forth in http://GeodSoft.com/terms.htm (or http://GeodSoft.com/cgi-bin/terms.pl). These terms are subject to change. Distribution is subject to the current terms, or at the choice of the distributor, those in an earlier, digitally signed electronic copy of http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the time of the distribution. Distribution of substantively modified versions of GeodSoft content is prohibited without the explicit written permission of George Shaffer. Distribution of the work or derivatives of the work, in whole or in part, for commercial purposes is prohibited unless prior written permission is obtained from George Shaffer. Distribution in accordance with these terms, for unrestricted and uncompensated public access, non profit, or internal company use is allowed.

 