Data Import Process
Importing log data generated by the Distributed.Net proxy server into the
Personal Proxy Statistics database has always been the slowest and most
disk-intensive part of creating stats. Every system administrator
who has ever run ppstats has emailed me to ask whether this could be fixed
somehow, perhaps using a cache file of some sort. Because it is actually
a fairly complicated problem, my standard response has always been "Wait
for version 8.0". Well, this is version 8.0, and now we have to
deal with this problem head-on.
There are a few key goals to keep in mind when designing the data import
process. Some are obvious, but they are worth restating.
- The import should be fast. Reading in a few small log files each time
ppstats ran was never a problem, but reading in many small log files, or worse,
many large log files, on every run definitely was. The time it takes ppstats
to import data should depend only on how much new data there is, not on how many
log files exist or how big they are. Therefore the import procedure should keep
track of which files have already been imported and skip them.
- The database should not be a disk hog. Efficient use of the database is important.
This can be achieved with a database design in third normal form. Data can also be "compressed" in
the sense that similar data over a period of time can be combined. The current plan is
to group Email, Host, OS, CPU, and Version by the hour of day in which the data was submitted.
Informal testing of this kind of by-hour compression yields a 20:1 reduction over the same data uncompressed.
Testing of by-day compression yields an even better 90:1 ratio; however, this approach
sacrifices the ability to do time localization, since a daily total cannot be re-divided
along another time zone's day boundaries.
- Time localization should be possible. Many users are asking for the ability
to create stats based on their local time zone instead of GMT. This is possible if the PostgreSQL
"SET TIME ZONE" command is used effectively. All log files should be imported in a single
time zone, for instance GMT. After the import is finished, the database session can be switched
to a new time zone, and all date/time stamps are then automatically reported in the
new zone.
- No log files will be deleted, moved, archived, or compressed during import. In an earlier
release of ppstats, I committed the cardinal sin of doing an in-place decompression of a proxy
log.gz file, reading the stats from the uncompressed log file, then gzip'ing the log when
finished. This procedure modified the log files, and had the consequence of leaving them
in a bizarre and unstable state whenever the script broke because of an error, or the server
itself was rebooted or crashed in the middle of a stats run. One user even lost precious log files
in this scenario, and was never able to recreate them. This was a Bad Thing(TM). Therefore all
log files will be opened in read-only mode, will not be moved, renamed, or deleted for any reason, and
compressed log files will be extracted to a temporary file, leaving the original as is. If
the stats admin wants to compress his log files to save space, he is certainly welcome and
encouraged to do so. But compression of existing log files will not be a feature of ppstats
(it can actually be done using the proxyper server itself).
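
To make the first goal concrete, here is a minimal sketch in Python of skipping
files that were already imported. It is an illustration only, not ppstats code:
the function names and the "manifest" dictionary are hypothetical, and a real
implementation would more likely keep this bookkeeping in a database table.

```python
import os


def files_to_import(log_dir, manifest):
    """Return (name, signature) pairs for logs that are new or have changed.

    `manifest` is a hypothetical bookkeeping dict mapping a file name to the
    (size, mtime) signature recorded after that file was last imported.
    """
    pending = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # ignore subdirectories and other non-files
        info = os.stat(path)
        signature = (info.st_size, info.st_mtime_ns)
        if manifest.get(name) != signature:
            pending.append((name, signature))
    return pending


def mark_imported(manifest, name, signature):
    """Record that `name` was imported in the state described by `signature`."""
    manifest[name] = signature
```

With this approach the cost of a run grows with the number of new or changed
files, not with the size of the whole log directory. Note that a file that has
grown since the last run is listed again in full, so a real importer would also
have to avoid counting its old lines twice, for example by remembering how far
into the file it had read.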
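
The by-hour grouping described above can be sketched in a similar way. The
record layout here (a GMT timestamp, the five grouped fields, and a count of
submitted units) is an assumption made for the example and is not the actual
proxy log format; the function names are hypothetical as well.

```python
from collections import Counter
from datetime import timedelta, timezone


def compress_by_hour(records):
    """Combine records that share an hour, Email, Host, OS, CPU, and Version.

    Each record is assumed to be a tuple of
    (timestamp, email, host, os_name, cpu, version, units),
    where timestamp is a timezone-aware datetime in GMT.
    """
    totals = Counter()
    for stamp, email, host, os_name, cpu, version, units in records:
        # Truncate the timestamp to the start of its hour.
        hour = stamp.replace(minute=0, second=0, microsecond=0)
        totals[(hour, email, host, os_name, cpu, version)] += units
    return totals


def localize(hour, utc_offset_hours):
    """Express a GMT hour bucket in a zone with the given fixed offset."""
    return hour.astimezone(timezone(timedelta(hours=utc_offset_hours)))
```

Because each bucket still carries its hour, it can be shifted into another time
zone afterwards, at least for zones that differ from GMT by a whole number of
hours. A bucket that only records the day cannot be shifted this way.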
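
Finally, a sketch of reading a log without ever modifying it. Again this is
only an illustration under stated assumptions: the function name is
hypothetical, and compressed logs are assumed to be recognizable by a ".gz"
suffix.

```python
import gzip
import shutil
import tempfile


def open_log_for_import(path):
    """Return a binary file object holding the uncompressed log contents.

    The original file is only ever opened for reading. A .gz log is
    decompressed into an anonymous temporary file, which disappears when it
    is closed, so the original stays exactly as it was.
    """
    if not path.endswith(".gz"):
        return open(path, "rb")
    scratch = tempfile.TemporaryFile()
    try:
        with gzip.open(path, "rb") as compressed:
            shutil.copyfileobj(compressed, scratch)
    except BaseException:
        scratch.close()  # do not leave the temporary copy behind on failure
        raise
    scratch.seek(0)  # rewind so the caller reads from the first line
    return scratch
```

If the script breaks or the server goes down halfway through, the worst case
with this design is a stray temporary file; the log itself is untouched.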
$Id: import.php3,v 1.2 2000/03/26 16:12:00 kpesce Exp $