Boost logo

Boost Users :

Subject: Re: [Boost-users] [BGL] SPPRC variant
From: Jeremiah Willcock (jewillco_at_[hidden])
Date: 2010-08-05 12:30:40


On Thu, 5 Aug 2010, Adam Spargo wrote:

> ok, I'll try to give a better description.
>
> When you sequence a genome, you have lots of short fragments coming off the
> sequencing machine in a random order, the genome is covered with overlapping
> fragments to a considerable depth. The assembly task is then to put them
> together again in the correct order. It's kind of like doing a big jigsaw
> puzzle.
>
> I'm following the overlap-layout-consensus model. You find all overlaps
> between fragments and draw a graph with fragments as vertices and overlaps as
> edges. Each edge is bidirectional, labelled with the sequences that overhang
> each end of the overlap. The bidirectionality reflects the double stranded
> nature of DNA. The correct assembly is then a path through this graph.
>
> For example if you have overlapping fragments A and B, with overlap O(A,B),
> the edge between A and B has (A-B) in one direction and (B-A) in the other
> direction. Then A.(B-A) or B.(A-B) are valid assemblies of A and B.
>
> The problem is that there are many spurious overlaps - edges which look like
> true overlaps but which are due to repetitive sequences in the genome. There
> are also sequencing errors, but I'll deal with them in phase 2, I'm just on
> simulated data at present.
>
> Traditional algorithms look for Hamiltonian Path in this graph, in practice
> what happens is that you output the sequence on simple walks, leaving gaps
> between the 'contigs' at every cycle.
>
> So the sequencing guys worked out a way to produce read-pairs - each fragment
> has a pair, the distribution of the insert size between the pairs is not
> really known, for example when they aim for 3000 nucleotides the mean could
> be something like 2000 or 4000, with a sd of something like 600.
>
> In general you have several libraries of read-pairs with different insert
> size.
>
> What is currently done is to approximate the insert size distribution on the
> contigs and then use this information to join contigs into scaffold and
> attempt to fill in the gaps, similarly contigs which contradict the read-pair
> information are broken or discarded.
>
> What I want to do is to use the read-pair data on the graph, by removing
> infeasible edges which hopefully correspond to spurious overlaps. If I have
> insert sizes longer than the repeats in the genome I should be able to
> disambiguate all the tangles and be left with an acyclic path from which I
> can just read off the assembly without the need for a layout algorithm at
> all.
>
> I plan to do what is called transitive reduction on the graph before looking
> at the read pairs, that is if X->Y, Y->Z, X->Z - remove X->Z - just retain
> the maximal overlaps and kick out the shorter ones with the same information.

There is a transitive reduction algorithm in BGL, and a better version on
the way. I do not believe it handles weights, though, which it looks like
you need. You might want to contact Eric Wolf <eric_at_boese-wolf.eu>, who
is working on the new transitive reduction code.

> I was just wondering if this problem fits with a well known SPPRC or am I in
> unvisited territory?
>
> It's quite a worthwhile project, because the 'finishing' stage of sequencing
> a genome is the most expensive these days, but we had the necessary
> information on the graph already, keeping the easy bits and throwing away all
> the tangles. The finishers never even get to look at the tangles.
>
> Maybe that was an even worse description ...
>
> At the moment I'm just seeing what the graph looks like for different classes
> of repeat which I insert into my simulation data and trying to get my head
> around the problem. I'm unclear as to whether I should find the fundemental
> cycle basis and attack each one in turn with the pair information or whether
> I could formulate this as a SPPRC and just call the BGL function.
>
> I'm not expecting the answer to come out of the ether, but if anyone has any
> insights I'd be glad to hear them.

I'll think about this further and see if I have any insights.

-- Jeremiah Willcock


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net