Boost logo

Boost Users :

Subject: Re: [Boost-users] [BGL] SPPRC variant
From: Adam Spargo (aws_at_[hidden])
Date: 2010-08-05 04:47:32


ok, I'll try to give a better description.

When you sequence a genome, you have lots of short fragments coming off
the sequencing machine in a random order, the genome is covered with
overlapping fragments to a considerable depth. The assembly task is then
to put them together again in the correct order. It's kind of like doing a
big jigsaw puzzle.

I'm following the overlap-layout-consensus model. You find all overlaps
between fragments and draw a graph with fragments as vertices and overlaps
as edges. Each edge is bidirectional, labelled with the sequences that
overhang each end of the overlap. The bidirectionality reflects the
double stranded nature of DNA. The correct assembly is then a path through
this graph.

For example if you have overlapping fragments A and B, with overlap
O(A,B), the edge between A and B has (A-B) in one direction and (B-A) in
the other direction. Then A.(B-A) or B.(A-B) are valid assemblies of A and
B.

The problem is that there are many spurious overlaps - edges which look
like true overlaps but which are due to repetitive sequences in the
genome. There are also sequencing errors, but I'll deal with them in phase
2, I'm just on simulated data at present.

Traditional algorithms look for Hamiltonian Path in this graph, in
practice what happens is that you output the sequence on simple walks,
leaving gaps between the 'contigs' at every cycle.

So the sequencing guys worked out a way to produce read-pairs - each
fragment has a pair, the distribution of the insert size between the pairs
is not really known, for example when they aim for 3000 nucleotides the
mean could be something like 2000 or 4000, with a sd of something like
600.

In general you have several libraries of read-pairs with different insert
size.

What is currently done is to approximate the insert size distribution on
the contigs and then use this information to join contigs into scaffold
and attempt to fill in the gaps, similarly contigs which contradict the
read-pair information are broken or discarded.

What I want to do is to use the read-pair data on the graph, by removing
infeasible edges which hopefully correspond to spurious overlaps. If I
have insert sizes longer than the repeats in the genome I should be able
to disambiguate all the tangles and be left with an acyclic path from
which I can just read off the assembly without the need for a layout
algorithm at all.

I plan to do what is called transitive reduction on the graph before
looking at the read pairs, that is if X->Y, Y->Z, X->Z - remove X->Z -
just retain the maximal overlaps and kick out the shorter ones with the
same information.

I was just wondering if this problem fits with a well known SPPRC or am I
in unvisited territory?

It's quite a worthwhile project, because the 'finishing' stage of
sequencing a genome is the most expensive these days, but we had the
necessary information on the graph already, keeping the easy bits and
throwing away all the tangles. The finishers never even get to look at the
tangles.

Maybe that was an even worse description ...

At the moment I'm just seeing what the graph looks like for different
classes of repeat which I insert into my simulation data and trying to get
my head around the problem. I'm unclear as to whether I should find the
fundemental cycle basis and attack each one in turn with the pair
information or whether I could formulate this as a SPPRC and just call the
BGL function.

I'm not expecting the answer to come out of the ether, but if anyone has
any insights I'd be glad to hear them.

Adam.

--
Dr Adam Spargo
High Performance Assembly Group   email: aws_at_[hidden]
Wellcome Trust Sanger Institute   Tel: +44 (0)1223 834244 x7728
Hinxton, Cambridge CB10 1SA       Fax: +44 (0)1223 494919
On Wed, 4 Aug 2010, Jeremiah Willcock wrote:
> On Wed, 4 Aug 2010, Adam Spargo wrote:
>
>> Hi,
>> Are there any experts on the list that would give me some advice on a 
>> shortest path with resource constraints type problem.
>> 
>> In brief: I have a graph with tangles. The tangles are due to spurious 
>> edges which arise due to repeats in genomes. If the underlying genome 
>> didn't have any repeats, then I could just find the Euler path (which would 
>> coincide with the Hamilton path) and I'd be done. I have extra data which 
>> can be interpretted as the distance between pairs of nodes, which I hope to 
>> use to remove as many of the spurious edges as possible, but having read 
>> some papers this sounds like the SPPRC.
>> 
>> The trouble is that the distances aren't exact, they come from a 
>> distribution, which is only approximately known. However my graph will
>> have lots of untangled regions from which I can make a better 
>> approximation, iteratively maybe.
>
> There is a resource-constrained shortest paths algorithm in BGL, although I 
> have never used it.  There is someone working on a newer version, but I don't 
> know about the recent progress of that.  Do you have a more precise 
> description of what you are trying to accomplish, especially in graph terms?
>
> -- Jeremiah Willcock
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users
>
-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net