From: Carl Daniel (cpdaniel_at_[hidden])
Date: 2002-03-19 09:15:21


----- Original Message -----
From: "Gerhard Häring" <gerhard_at_[hidden]>
> Beman Dawes wrote:
> > At 08:31 PM 3/18/2002, Carl Daniel wrote:
> >
> > > The archive is still a bit confused. Try again tomorrow - I'm
> > > uploading a
> > > file with messages up through 27136 right now. In
> > > the current archive the threads are messed up, and the messages are
> > > in an odd order.
> >
> > A little background - YahooGroups doesn't have a function to export an
> > archive in bulk. Carl had to write a script to view each message one at
> > a time. He then had to write a script to convert from HTML to mbox
> > formats. Since the messages date partially from the time when they were
> > eGroups messages (before Yahoo groups bought eGroups), he had to cope
> > with format changes in mid-stream. [...]
>
> Doesn't anybody here have an mbox archive of the old messages that you
> can use?

No one had one that went all the way back, so we had to make one. Between November 1998 and March 2002, there were
27113 messages posted to the eGroups/Yahoo list. The mbox file is 80 MB in size.

Early tests with the new system raised a concern about how much header information the archiver retains and displays
with each message. In some cases the headers are larger than the message itself, so it's really a question of
efficient use of screen real estate. We've been experimenting with removing unneeded headers from the messages to
make better use of the space. The mbox file with only the essential headers remaining is 54 MB.
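
For anyone wanting to try the same kind of header trimming, here is a minimal sketch using Python's standard
mailbox module. It is not the script used for the Boost archive. The whitelist in ESSENTIAL_HEADERS and the names
strip_headers and strip_mbox are assumptions made for illustration; which headers count as "essential" depends on
what the archiver needs for threading and display.

```python
import mailbox

# Assumed whitelist (not taken from the original post). The threading headers
# and the MIME headers are kept so replies still link up and bodies still decode.
ESSENTIAL_HEADERS = frozenset({
    "from", "to", "cc", "date", "subject",
    "message-id", "in-reply-to", "references",
    "mime-version", "content-type", "content-transfer-encoding",
})


def strip_headers(msg, keep=ESSENTIAL_HEADERS):
    """Remove every header whose lowercased name is not in `keep`; edits `msg` in place."""
    for name in {key.lower() for key in msg.keys()}:
        if name not in keep:
            # Deleting by name removes all occurrences (e.g. repeated Received:
            # lines); header lookup is case-insensitive.
            del msg[name]
    return msg


def strip_mbox(src_path, dst_path, keep=ESSENTIAL_HEADERS):
    """Copy each message from src_path to dst_path with non-essential headers removed."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for msg in src:
            dst.add(strip_headers(msg, keep))
        dst.flush()
    finally:
        dst.close()
        src.close()
```

A call such as strip_mbox("full.mbox", "trimmed.mbox") (both file names hypothetical) writes the trimmed copy and
leaves the full archive untouched, so the original headers are never lost.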

The other concern was the behavior of the site with such a large number of messages. As you can see on the site now,
you get a page with over 100 links - it's a bit daunting. So we spent some time experimenting with separating the
archives into monthly files. In the end that strategy had to be abandoned due to difficulties with the archiver, but
it's a possibility for the future.
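
As an illustration of what a monthly split could look like, here is a rough sketch that buckets messages by the
year and month of their Date header. It is not what we ran, and it does nothing about the archiver difficulties
mentioned above. The names month_key and split_by_month, the "YYYY-MM.mbox" file naming, and the "unknown" bucket
for missing or unparseable dates are all assumptions.

```python
import mailbox
import os
from email.utils import parsedate_to_datetime


def month_key(msg):
    """Return 'YYYY-MM' from the Date header, or 'unknown' if it is missing or unparseable."""
    raw = msg["Date"]
    if raw is None:
        return "unknown"
    try:
        when = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for bad dates, newer ones ValueError.
        return "unknown"
    return f"{when.year:04d}-{when.month:02d}"


def split_by_month(src_path, out_dir):
    """Write one mbox per month into out_dir; return the sorted list of month keys."""
    os.makedirs(out_dir, exist_ok=True)
    src = mailbox.mbox(src_path, create=False)
    outputs = {}
    try:
        for msg in src:
            key = month_key(msg)
            if key not in outputs:
                outputs[key] = mailbox.mbox(os.path.join(out_dir, f"{key}.mbox"))
            outputs[key].add(msg)
    finally:
        for box in outputs.values():
            box.flush()
            box.close()
        src.close()
    return sorted(outputs)
```

The month is taken from the Date header exactly as written, with no timezone normalization, so a message sent
near a month boundary lands in the sender's local month.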

Finally, we will of course preserve a copy of the 80 MB mbox file so that, going forward, we can move to another system
or restructure this one with far less pain than was incurred in getting everything off Yahoo.

-cd


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk