Subject: Re: [boost] Genetics library: Volunteers needed
From: Antony Polukhin (antoshkka_at_[hidden])
Date: 2015-07-23 14:38:35


2015-07-21 15:11 GMT+03:00 Andy Thomason <a.thomason_at_[hidden]>:

> Hi All,
>
> I am recruiting users for the putative genetics library.
>

Hi,

I like the idea of a genetics library in Boost!

However, the code misses some essential optimizations and suffers from
premature ones.

* dna_string does not call reserve() in assignment. This makes some of the
push_back()s slow, because they have to reallocate.
* Attempting to understand the exact search rewarded me with a headache
(cool hack, I enjoyed it!). There are too many magic constants and
variables, which makes the algorithm hard to maintain. I also doubt that
the algorithm is optimal:
You are comparing 4 nucleotides at a time. There are 4^4 = 256 possible
combinations of 4 nucleotides. Let's assume for simplicity that nucleotides
are uniformly distributed. The algorithm will often give false positives:
it will be triggered roughly once every 256 comparisons. You're doing some
kind of vectorization, so the algorithm will give a false positive every ~8
loop bodies.

Comparing longer nucleotide chains will trigger compare_inexact less
often. For example, comparing 8 nucleotides at a time will trigger a false
positive once per ~65500 (4^8 = 65536) comparisons.
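
To make the arithmetic above concrete, here is a small sketch. It only
restates the uniform-distribution assumption from the paragraph above; the
function name kmer_count is mine and is not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct chains of k nucleotides over the 4-letter alphabet
// {A, C, G, T}: 4^k, computed as 2^(2k). Valid for k < 32.
constexpr std::uint64_t kmer_count(unsigned k) {
    return std::uint64_t{1} << (2 * k);
}

// Under the uniform-distribution assumption, a random window matches a
// given k-nucleotide pattern about once per kmer_count(k) comparisons.
static_assert(kmer_count(4) == 256, "4 nucleotides: 1 in 256");
static_assert(kmer_count(8) == 65536, "8 nucleotides: 1 in 65536");
```

So doubling the compared length from 4 to 8 makes a false positive 256
times rarer under that assumption.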

* The comparison operators need improvement. Compare sizes first (it's
cheap!). Use memcmp in cases like `values < rhs.values || values ==
rhs.values`. memcmp returns an integer that already tells you whether the
value is bigger, smaller or equal, without the need to iterate over the
data a second time.
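
A minimal sketch of the single-pass comparison, assuming the data sits in a
contiguous byte buffer. The packed_seq type and the compare() helper are
hypothetical stand-ins, not the library's classes. Note that I use the
size-first shortcut only for equality here; for `<` it would change the
ordering to length-first, and I am assuming plain lexicographic order is
what is wanted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the library's packed storage.
struct packed_seq {
    std::vector<unsigned char> values;
};

// Returns <0, 0 or >0 (like memcmp) after a single pass over the data.
inline int compare(const packed_seq& a, const packed_seq& b) {
    const std::size_t n = std::min(a.values.size(), b.values.size());
    // Guard n == 0 so memcmp is never called with a possibly null pointer.
    const int r = n ? std::memcmp(a.values.data(), b.values.data(), n) : 0;
    if (r != 0) return r;
    // Common prefix is equal: the shorter sequence orders first.
    if (a.values.size() == b.values.size()) return 0;
    return a.values.size() < b.values.size() ? -1 : 1;
}

inline bool operator==(const packed_seq& a, const packed_seq& b) {
    // The size check is cheap and rejects most unequal pairs immediately.
    return a.values.size() == b.values.size() && compare(a, b) == 0;
}
inline bool operator<(const packed_seq& a, const packed_seq& b) {
    return compare(a, b) < 0;
}
inline bool operator<=(const packed_seq& a, const packed_seq& b) {
    return compare(a, b) <= 0;  // one pass instead of `<` followed by `==`
}
```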

* `const auto str_values = str.get_values();` - must be `const auto&
str_values = str.get_values();`
* Provide an enum for nucleotides { nA = 0, nT = ...}. This would make the
library more user-friendly.
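
To illustrate the `const auto&` point, here is a sketch assuming that
get_values() returns a const reference to the internal storage (if it
returns by value, the copy is unavoidable either way). The seq class and
the two helper functions are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container; only the name get_values() comes from the code
// under review, and returning a const reference is my assumption.
class seq {
public:
    explicit seq(std::vector<std::uint64_t> v) : values_(std::move(v)) {}
    const std::vector<std::uint64_t>& get_values() const { return values_; }
private:
    std::vector<std::uint64_t> values_;
};

// `const auto&` binds to the existing vector: same address, no copy.
inline bool binds_without_copy(const seq& s) {
    const auto& ref = s.get_values();
    return &ref == &s.get_values();
}

// Plain `const auto` drops the reference, so the whole vector is copied
// into a new object with a different address.
inline bool makes_a_copy(const seq& s) {
    const auto copy = s.get_values();
    return &copy != &s.get_values();
}
```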

There's more. If you're interested, I can investigate further.

-- 
Best regards,
Antony Polukhin

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk