Boost Users :

Subject: [Boost-users] Non-Boost question: HTML parser with hpricot
From: Boost lzw (boostlzw_at_[hidden])
Date: 2010-01-08 04:55:19


Hi folks,

Sorry for posting a question unrelated to Boost. I know Boost has good
solutions for this, but I am working on a legacy system that uses Hpricot
with Ruby on Rails, so Hpricot-specific suggestions only, please. Thank
you.

In my HTML parser, I can parse an HTML file with the following Hpricot
commands:
(1) doc = open( "MyFileToParse.html" ) { |f| Hpricot(f) }
(2) elements = (doc.search("/html/body/table/tr/td/table/tr/td/font") )
(3) puts (elements[13]).inner_html

to get the following output:

Giaever G, et al (2002). Functional profiling of the Saccharomyces
cerevisiae genome. Nature, 418:387-91. [<a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12140549&dopt=Abstract"
target="_blank">PubMed</a>]

How can I proceed to get the following results (3) and (4) respectively?
(3) Giaever G, et al (2002). Functional profiling of the Saccharomyces
cerevisiae genome. Nature, 418:387-91.

(4) http://www.ncbi.nlm.nih.gov/pubmed/12140549?dopt=Abstract

NOTE: To get (4) I need two more steps in addition to "normal" HTML
parsing: (5) replace "&" with "?" and (6) replace "PubMed" with "pubmed"
(this might be trivial, but how?).

Thanks a lot in advance.
Robert
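
One possible way to derive results (3) and (4) from the inner_html string
shown in the question is sketched below. It is not an Hpricot-specific
answer: it uses only Ruby's standard library and works on the string that
inner_html returns, so it makes no claims about Hpricot's own API. The
helper names citation_text and pubmed_url are hypothetical, and the target
URL layout (host/db/list_uids?dopt=...) is inferred only from the single
example in (4). Note that a plain "&" to "?" substitution would not be
enough here, so the sketch rebuilds the URL from the query parameters.

```ruby
require "uri"

# Hypothetical helper: returns the citation text (result 3) by dropping the
# trailing "[<a ...>PubMed</a>]" link from the inner_html string.
def citation_text(inner_html)
  inner_html.sub(/\s*\[\s*<a\b.*?<\/a>\s*\]\s*\z/m, "").strip
end

# Hypothetical helper: returns the rewritten link (result 4), or nil if the
# string contains no href attribute.
def pubmed_url(inner_html)
  href = inner_html[/<a\b[^>]*?\bhref="([^"]*)"/m, 1]
  return nil if href.nil?

  # Remove whitespace left by line wrapping and undo "&amp;" escaping.
  href = href.gsub(/\s+/, "").gsub("&amp;", "&")

  uri    = URI.parse(href)
  params = URI.decode_www_form(uri.query.to_s).to_h

  # Assumed layout, taken from example (4): /<db in lowercase>/<list_uids>?dopt=<dopt>
  "#{uri.scheme}://#{uri.host}/#{params["db"].to_s.downcase}/" \
    "#{params["list_uids"]}?dopt=#{params["dopt"]}"
end

# Sample input copied from the output in the question; in the real parser
# this would be elements[13].inner_html.
inner = 'Giaever G, et al (2002). Functional profiling of the Saccharomyces ' \
        'cerevisiae genome. Nature, 418:387-91. [<a href="http://www.ncbi.nlm.nih.gov/' \
        'entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12140549&dopt=Abstract" ' \
        'target="_blank">PubMed</a>]'

puts citation_text(inner)
puts pubmed_url(inner)
```

If Hpricot can hand back the href attribute of the anchor element directly,
the regular expression in pubmed_url could be dropped and only the
URL-rebuilding part kept.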


