
Site Synchronization Script - 7/7/00

Developing a script to automatically synchronize almost identical web sites on Linux, Windows NT and OpenBSD web servers. The only differences between the sites are the graphics identifying the web server and OS, and the links to the other sites. I've chosen to keep the sites almost identical; anything outside the unique page content could be very different on the three sites.

It's taken me a while to get to the site synchronization script. I figured out one key piece (automating FTP) some time ago but never could come up with a clean solution. Every idea I had seemed like too much of a kludge with obvious potential problems. The work of trying to keep a development site and three "live" sites based on it in sync manually is just too much. If you try to update all sites every time you make a minor change, there is an incredible overhead, and if you save your updates until you have a bunch, you risk forgetting some of the minor fixes you've made.

I finally decided to deal with this almost two weeks ago. There have been several other things going on, but I have about four full days invested in this and still have no real solution in sight. I have a sort of workable process to update the Linux and BSD machines. It is a kludge with a couple of potentially serious problems.

This is not a standard disk synchronization issue that can be solved with one of the standard commercial or freeware products that keep disk drives or selected directories synchronized, either automatically or initiated by a user action. First, every single page (file) has to be under separate user control as to whether or not it should be sent to the publicly visible sites. For example, one might be working on a significant new page for some time, and it might contain an outline, sentence fragments, notes to yourself or any other artifacts of the early draft stages that you surely don't want the public to see. Work on such a page might take a day or two or might be interrupted for months. In the meantime one may make any number of minor corrections or additions to existing pages that should be made visible as soon as they are completed. No process that deals with the whole site can handle this. There has to be a mechanism by which the files to be moved are identified by the user. Some possibilities include copying a file to a transfer area, entering it into a data entry screen that the transfer program uses as input, or "marking" the file in some way, such as by setting an artificial modification time stamp on the ready files. Anything that lets a process distinguish user selected files is a potential candidate.

Also, even though the bulk of the content of matching files on each of the three web sites is the same, each HTML document has site-specific areas that identify the OS and web server on which the document resides. Each site could have a different color scheme or even different navigation aids, if that were desired. The files are never in sync in the way a utility that synchronizes disks or directories understands synchronization. Following each transfer, the standardization process needs to be run against each file that's been moved. It should only process the moved files and not the whole site. Otherwise, the overhead of redoing the whole site each time there is a minor change to any file in the site negates one of the real advantages of the current methods: serving static pages that only need adjustment when their content changes or they are moved to a new system.

The approach that I chose makes use of a second, typically empty, directory tree that matches the directory tree of the web sites. On the development/sending site, files are copied to the appropriate directory when they are ready for display. A background process periodically scans this directory tree and, when it finds a file or files, sets up a transfer. It does this by writing a text file that contains the FTP commands to effect the transfer to the correct location. As the script walks the transfer directory tree, it writes pairs of cd and lcd commands. Based on file extension, it writes an ascii or binary command for each file, then a put statement with the file's name. The cd and lcd commands include ".." as the script works back up the directory tree. For each destination site, ftp is invoked and the FTP command script is piped to it.
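The article doesn't reproduce the script itself, but the idea can be sketched in a few lines of Perl. Everything below is an illustration under assumed names: the transfer tree location, destination host, login and extension list are placeholders, not the actual setup.

    #!/usr/bin/perl -w
    # Illustrative sketch only; paths, host and login are placeholders.
    use strict;

    my $xfer_root = '/home/httpd/transfer';   # assumed transfer tree
    my $cmd_file  = '/tmp/ftp.cmds';          # generated FTP command script

    open my $cmd, '>', $cmd_file or die "can't write $cmd_file: $!";
    print $cmd "user webuser secret\n";       # -n below suppresses auto-login

    walk($xfer_root);
    print $cmd "quit\n";
    close $cmd;

    # Run ftp from the transfer root so the lcd commands resolve correctly;
    # in practice this would be repeated once per destination site.
    chdir $xfer_root or die "can't chdir to $xfer_root: $!";
    system("ftp -n linux.example.com < $cmd_file") == 0
        or warn "ftp exited with an error\n";

    # Walk the tree: matching cd/lcd pairs on the way down, an ascii or
    # binary command plus a put for each file, and cd .. / lcd .. on the
    # way back up, so the local and remote directories stay in step.
    sub walk {
        my ($dir) = @_;
        opendir my $dh, $dir or die "can't read $dir: $!";
        my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;
        for my $name (@entries) {
            my $path = "$dir/$name";
            if (-d $path) {
                print $cmd "cd $name\nlcd $name\n";
                walk($path);
                print $cmd "cd ..\nlcd ..\n";
            } elsif (-f $path) {
                my $type = $name =~ /\.(html?|txt|css|pl)$/i ? 'ascii' : 'binary';
                print $cmd "$type\nput $name\n";
            }
        }
    }

For a page sitting in, say, a linux subdirectory of the transfer tree, the command file this produces would contain cd linux, lcd linux, ascii, put page.htm, then cd .. and lcd .., which is essentially the sequence described above.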

After the transfers are complete, the directory tree is cleared of files. This is the cause of one of the two serious problems with the existing process. If other files are copied into the transfer directory while a transfer is in progress, they may be erased without being transferred. As I wrote this, the solution to that occurred to me: write a second script (or batch file on NT) that mirrors the directory walk and erases only the individual files identified for transfer during that run. The same approach can be used on the receiving side, where there is a problem if an additional file or files are delivered after the receiving side script has been built but before it completes executing. The receiving side script is somewhat like the FTP command script but invokes the Perl standardization script for each transferred file.
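One way that fix could look, sketched under the same assumptions as the example above and with a hypothetical manifest file that the sending script would write while building the FTP commands, is to erase only the files actually queued during the run:

    #!/usr/bin/perl -w
    # Illustrative sketch only. Assumes the directory walk also appended each
    # queued file's path to a manifest; only those files are erased, so
    # anything copied into the transfer tree while the transfer was running
    # survives until the next pass.
    use strict;

    my $manifest = '/tmp/xfer.list';   # assumed: one absolute path per line

    open my $list, '<', $manifest or die "can't read $manifest: $!";
    while (my $path = <$list>) {
        chomp $path;
        next unless length $path && -f $path;
        unlink $path or warn "could not remove $path: $!";
    }
    close $list;
    unlink $manifest;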

Fairly early in the testing process I accidentally moved rather than copied some files from the development site to the transfer directory (drag rather than control-drag). The files were erased. Fortunately I had them on backups, but if I had worked on those files that day the changes would not have been on backups. I decided to make a backup directory on the sending system to which all transferred files were copied but not erased, and to make a backup on the receiving system as well before processing the files. It seemed like a good idea at the time, but the result was that I ended up working with 15 sets of almost identical directory trees, 12 actively on each test. Since I wasn't going to work in the real directories being used by the web servers until the transfer process was fairly thoroughly debugged, I had to create a test directory to act as the destination directory on each receiving system.

The destination directory had to have the necessary Perl scripts and library files as well as the text files from which the navigation, search form and platform descriptions were drawn. It needed to be empty of other files so I could easily tell which files were actually transferred. Later, when files were deliberately transferred over previously transferred copies, I had to rely on time stamps. At the beginning of each set of tests I wanted things as clean as practical. With a sending or destination directory, a transfer directory and a backup directory in active use on each system, it became quite tricky keeping track of which system I was on and exactly where I was in the testing process. There were as many delays from testing mistakes as from fixing problems in the scripts.

The first rounds of testing were done entirely between the NT workstation and the Red Hat Linux system. Linux performed as expected, and the problems I dealt with related entirely to script and testing issues. After I thought I was pretty close to a workable solution, the BSD system was brought into the testing. For hours, I could not even get the scripts to run. I built simplified test scripts and dumped the entire environment to files trying to figure out what was going on. I assumed there was something wrong with the script or my setup.

It took more than a day to conclude that cron was never even trying to start the jobs in the first place. I reached this conclusion after I was consistently able to start jobs by not specifying a starting time but simply using asterisks to start a job every minute. I'd watch the background processes until I saw my job, then edit crontab to stop more from starting. I always got two and killed one. The man pages for at say there's a bug related to starting jobs. Apparently cron doesn't get crontab updates instantaneously either; there is a small delay. Normally when testing background jobs, after testing the basic logic in the foreground, I go to crontab and/or at to schedule them in the background to deal with environmental issues. To minimize testing time I always try to kick off the job on the next minute unless I don't think I can get the crontab saved in time. Every other system I've worked on gets these updates immediately. OpenBSD 2.6 looks like it has a delay. It's like scheduling a job to run at 12:01 when it's already 12:01:01: it never runs because the time has passed (unless it runs the next day or hour). Anything that's in the schedule goes off when expected, and if you give a few minutes' lead time jobs start when expected, but trying to cut it close really costs you. 2.7 appears to have fixed this.
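For illustration, the difference between the every-minute workaround and a schedule cut too close looks like this in a crontab (the script path is a placeholder, not the actual one):

    # Fires every minute; handy for confirming cron actually starts the job,
    # after which the entry is edited back out.
    * * * * *       /home/httpd/bin/sitesync.pl

    # Meant to fire once at 12:01 PM; if the crontab is saved at 12:00:55 and
    # the daemon picks up the change a few seconds late, the minute has
    # already passed and the job never starts (until the next day).
    1 12 * * *      /home/httpd/bin/sitesync.pl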

The very first time I did a test with NT, I got a file in the root directory to transfer. I couldn't get any file from a subdirectory to go across. I spent time experimenting with xcopy and cp syntax and starting from the command line as well as Perl. Eventually I got back to exactly where I started from and it worked fine (for a while), so it seemed that there had been a test setup problem. After I'd done about 8 successive transfers where everything worked exactly as it should, I thought I had it. I copied 16 graphics files into the transfer directory tree. They all went to the UNIX systems but never got to the NT system. The exact same processes were still looping in memory on all four systems (they're set for a 24 hour duration by default). I never got another file to transfer to NT. The FTP server on NT is fine. I don't have a clue what changed.

For the time being I've gone back to simply dragging the files over to the NT system. Since they're developed with the NT setup, I don't need to run the standardize script there. At some point I'll return to this, but for now it's taken far too long. Once again NT displays truly bizarre behaviour, though I have to admit I spent much longer on the BSD system. There, however, I always had some idea what to try next.



Copyright © 2000 - 2014 by George Shaffer. This material may be distributed only subject to the terms and conditions set forth in http://GeodSoft.com/terms.htm (or http://GeodSoft.com/cgi-bin/terms.pl). These terms are subject to change. Distribution is subject to the current terms, or at the choice of the distributor, those in an earlier, digitally signed electronic copy of http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the time of the distribution. Distribution of substantively modified versions of GeodSoft content is prohibited without the explicit written permission of George Shaffer. Distribution of the work or derivatives of the work, in whole or in part, for commercial purposes is prohibited unless prior written permission is obtained from George Shaffer. Distribution in accordance with these terms, for unrestricted and uncompensated public access, non profit, or internal company use is allowed.

 