From: Carl Daniel (cpdaniel_at_[hidden])
Date: 2002-02-19 11:37:54
From: "David Abrahams" <david.abrahams_at_[hidden]>
> Beman and I are trying to prepare Boost for a move from YahooGroups to
> Mailman, and I'm afraid I need some help from someone who knows a bit more
> about www protocols than I do.
> I want to collect an archive of past messages. Beman wrote a simple Python script
> a few months ago which would download the yahoogroups web pages for the
> messages using urllib. Unfortunately, yahoogroups added periodic redirection
> to pages containing advertisements, so the old script doesn't work. The
> nature of the beast is that if you visit
> http://groups.yahoo.com/group/boost/message/1000 in your web browser, you'll
> often end up at
> 000 instead, a page containing an advertisement. The latter page contains a
> link to "/group/boost/message/1000" which always takes you to the right
> place. It looks to me as though that link needs to be in the context of the
> ad page in order to work properly, because I can't figure out how to make
> urllib retrieve the right one. I'm sure I'm just missing something simple.
This is a problem I solved recently by writing what is, in effect, a scriptable browser: it supports cookies, follows
redirections, and has sophisticated pattern-matching facilities for locating links on pages.
Unfortunately, the work belongs to the company for whom I did the contract, so I can't provide it to Boost just yet.
I'll inquire with them about it, though - since they're in a completely different industry, perhaps they'd let me
use the code for the good of the C++ community.
In case they don't, my approach was roughly this: run the HTML through an HTML-to-XML converter, parse the result into
an XML DOM, and use XPath as the query language in an XSLT-like navigation language of my own design. That language can
locate and follow links, capturing any desired information along the way and formatting it as XML using XSLT or
custom-made converters. That would seem to solve more than the problem you've run into, but it's a non-trivial amount
of work.
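
A much smaller version of the same idea can be sketched with the standard library of present-day Python 3. This is an illustration only, not the contracted code described above. It makes several assumptions: the small TreeBuilder class stands in for a real HTML-to-XML converter, the limited XPath subset in xml.etree.ElementTree stands in for full XPath, and the function names (html_to_tree, find_message_link, fetch_message) are made up for this example. It also guesses that keeping cookies and sending the ad page as the Referer is what puts the link "in the context of the ad page"; the thread does not confirm the actual mechanism.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Forgiving HTML -> ElementTree converter (stand-in for a real tool)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the element, do not descend.
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray close tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        elem = self.stack[-1]
        if len(elem):  # text following a child element belongs in its tail
            elem[-1].tail = (elem[-1].tail or "") + data
        else:
            elem.text = (elem.text or "") + data


def html_to_tree(html):
    """Parse an HTML string into an ElementTree element."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_message_link(tree, message_path):
    """Return the href of the first <a> ending in message_path, else None."""
    for anchor in tree.iterfind(".//a[@href]"):  # XPath subset query
        if anchor.get("href").endswith(message_path):
            return anchor.get("href")
    return None


def fetch_message(opener, base_url, message_path, max_hops=3):
    """Fetch a message page, following the link on an ad page if one appears.

    Assumption: a response whose final URL does not end in message_path
    is treated as the advertisement page.
    """
    target = urljoin(base_url, message_path)
    for _ in range(max_hops):
        with opener.open(target) as response:
            final_url = response.geturl()
            html = response.read().decode("latin-1")
        if final_url.endswith(message_path):
            return html
        link = find_message_link(html_to_tree(html), message_path)
        if link is None:
            raise RuntimeError("no link to %s on %s" % (message_path, final_url))
        # Retry via the ad page's own link, with cookies and a Referer header.
        target = urllib.request.Request(urljoin(final_url, link),
                                        headers={"Referer": final_url})
    raise RuntimeError("still redirected after %d attempts" % max_hops)
```

To use it, build one cookie-aware opener with urllib.request.build_opener(urllib.request.HTTPCookieProcessor(CookieJar())) and pass it to every fetch_message call, so that any cookies set by the ad page are sent back on the follow-up request. The parsing and link-finding functions can be tried on a saved ad page without touching the network.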
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk