From: Daryle Walker (darylew_at_[hidden])
Date: 2008-08-13 19:04:03


If no one has noticed, I'm trying out an MD5 system on our Subversion
server under "$ROOT/sandbox/md5/", hopefully to succeed Boost.CRC.
(Let's use the skills & experience gained over the past... 7 years.)
I'm sharing my design ideas to make sure I'm not missing anything.

1. The results of an MD5 run are encapsulated in the md5_digest
class. It's actually a POD type, and the only supported operations
are equality and streaming standard I/O. I couldn't think of
anything else you would/could want to do with this type.

2. Although "everyone" writes MD5 functions as byte-running
algorithms, MD5 is actually a bit-running one. I've constructed my
classes around bit-oriented operation, so maybe this'll bring a new
perspective. I also use 64-bit integer types directly, instead of a
pair of 32-bit integers like the "standard" implementations. (All
the recent work on Boost.Integer was to enable this library.)
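
To illustrate the bit-oriented view: RFC 1321 defines padding in bits
(a single 1 bit, then 0 bits until the length is congruent to 448 modulo
512, then a 64-bit length field), so messages whose length is not a
multiple of 8 are legal. The sketch below keeps the length in a single
64-bit integer. The function name is hypothetical, not taken from the
sandbox code.

```cpp
#include <cstdint>

// Number of padding bits (the leading 1 bit plus the 0 bits) that come
// before the 64-bit length field, for a message of message_bits bits.
// The result is always in the range 1..512.
inline unsigned md5_padding_bits(std::uint64_t message_bits) {
    unsigned used = static_cast<unsigned>(message_bits % 512u);
    return used < 448u ? 448u - used : 960u - used;
}

// Total number of 512-bit blocks hashed for such a message.
inline std::uint64_t md5_block_count(std::uint64_t message_bits) {
    return (message_bits + md5_padding_bits(message_bits) + 64u) / 512u;
}
```

A 3-bit message, for example, gets 445 padding bits and fits in one block,
something a byte-only interface cannot express.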

3. The computation of MD5 runs was originally encapsulated entirely in
the md5_computer class. Like Boost.CRC's design, this lets a run be
done piecemeal. There's also a function that does a single run over a
buffer by internally calling the computation class. Besides the
actual computation work, the class had accessors to the current state
of computation. I noticed that a lot of this code could be reused
for other coding schemes, so I tried to create a hierarchy that
spread the functionality, but I felt that it got too unwieldy. I
determined that the problem was that the presentation and computation
parts of the class burdened each other. So I divided the class into
the current md5_computer (presentation) and md5_context (computation)
classes. The presentation class has a different hierarchy behind it,
favoring generics over OOP, but still contains a computation
object. The computation class publicly has a producer-generator &
consumer-functor interface, and its attributes cannot be accessed
except by the associated presentation class, which is a friend. The
system is like the I/O-streams' separation into stream and
stream-buffer classes. Is this separation a good idea?
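
The split described above can be sketched roughly as follows. To keep the
example short it uses a toy checksum (a count of set bits) instead of MD5,
and the names toy_context and toy_computer are hypothetical stand-ins for
md5_context and md5_computer, not the real interfaces.

```cpp
#include <climits>
#include <cstdint>

class toy_computer;  // presentation class, declared early so it can be a friend

// Computation half: consumes bits and produces a result. Its state is
// private and visible only to the presentation class.
class toy_context {
public:
    toy_context() : bit_count_(0), ones_(0) {}

    // Consumer-functor interface: feed one bit.
    void operator()(bool bit) {
        ++bit_count_;
        if (bit) ++ones_;
    }

    // Producer-generator interface: yield the current result.
    std::uint64_t operator()() const { return ones_; }

private:
    friend class toy_computer;
    std::uint64_t bit_count_;
    std::uint64_t ones_;
};

// Presentation half: owns a context and adds user-facing conveniences
// (byte submission, state accessors) on top of it.
class toy_computer {
public:
    void process_bit(bool bit) { context_(bit); }

    // Submits a byte as CHAR_BIT single-bit calls, high-order bit first.
    void process_byte(unsigned char byte) {
        for (int i = CHAR_BIT - 1; i >= 0; --i)
            context_(((byte >> i) & 1u) != 0u);
    }

    // Accessor that reads the context's private state through friendship.
    std::uint64_t bits_read() const { return context_.bit_count_; }

    std::uint64_t checksum() const { return context_(); }

private:
    toy_context context_;
};
```

Note that process_byte makes CHAR_BIT calls into the context per byte,
which is the per-byte cost that any bit-at-a-time consumer interface incurs.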

4. The presentation and computation classes have Boost.Serialization
functionality. I originally planned to have serialization for the
back-up classes, but I read a thread from May 2007 suggesting that
the serialization model should match the user's model, not the
implementation details, so those were skipped.

5. Note that the presentation and computation types only support
Boost.S11n while the digest type only supports standard streams.
Besides keeping the digest type POD, I didn't see a need to make the
p/c types printable. (Now I just realized that you may want to save
an MD5 in a data file. Maybe I'll work on serializing digests.)

6. Should the serialization routines be in the classes' headers, or
be moved to a separate implementation somewhere within
Boost.Serialization (assuming this gets accepted, of course)? I think
that Boost.Multi-Index puts s11n in its own headers, but that's
because the design must be intrusive. It could be non-intrusive for
the digest and presentation classes, but probably not the computation
class.

7. I'll probably try out the framework on at least one more coding
type. (I've read that a framework with only one concrete class is
probably locked to just that class due to the programmer never having
to confirm separation of concerns during testing.)

8. The usual byte-wise MD5 implementation probably does have a speed
advantage over this library, since it can just dump bytes directly
into a buffer until hashing time, then compute everything a byte at a
time. This library forces CHAR_BIT calls for each byte submission,
getting worse for buffer submissions. The library also currently
wastes space in the computation class, since it stores a 512-Boolean
array. (A "bool" could be implemented as an unsigned char, wasting
CHAR_BIT - 1 bits, or as an "int," wasting even more space! This is
probably why std::vector<bool> was invented.) Maybe switching to an
unsigned-char array can fix both problems.
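
As a rough sketch of the unsigned-char alternative, the 512 bits of a
block can be packed CHAR_BIT to a byte and still be addressed one bit at a
time. The type and member names are hypothetical. Bits are stored
high-order first within each byte, matching the bit-to-byte convention in
RFC 1321.

```cpp
#include <climits>
#include <cstddef>

constexpr std::size_t block_bits = 512;
// Round up, in case 512 is not a multiple of CHAR_BIT.
constexpr std::size_t block_bytes = (block_bits + CHAR_BIT - 1) / CHAR_BIT;

// Aggregate with no constructor; zero it with "packed_block b = {};".
struct packed_block {
    unsigned char bytes[block_bytes];

    // Reads bit i (0 <= i < block_bits).
    bool get(std::size_t i) const {
        return ((bytes[i / CHAR_BIT] >> (CHAR_BIT - 1 - i % CHAR_BIT)) & 1u) != 0u;
    }

    // Writes bit i (0 <= i < block_bits).
    void set(std::size_t i, bool value) {
        unsigned char mask =
            static_cast<unsigned char>(1u << (CHAR_BIT - 1 - i % CHAR_BIT));
        if (value)
            bytes[i / CHAR_BIT] |= mask;
        else
            bytes[i / CHAR_BIT] &= static_cast<unsigned char>(~mask);
    }
};
```

Since sizeof(bool) is at least 1, a bool[512] occupies at least 512 bytes,
while packed_block needs only block_bytes of them (64 when CHAR_BIT is 8).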

8a. After the switch at the end of [8], a byte-oriented wrapping
variant of md5_context could be made that copies bytes directly into
the inner object's buffer, and calls the hash-updater as needed.
Note that the wrapping type would still do hash updates bit-wise;
could/should they be byte-wise too? A byte-wise hashing optimization
only works if CHAR_BIT is 8 (not even for higher integral powers of
two), so is the potential effort worth it? How would I test both
cases (octet-sized bytes and not)? Should I test both cases? Note
that the direct-byte-copying part also has problems if 512 %
CHAR_BIT != 0.
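
One possible way to approach the testing question, offered only as a
sketch: express the two conditions as compile-time predicates that take
the byte width as a parameter rather than reading CHAR_BIT directly. The
non-octet branches can then be exercised on an ordinary octet machine. The
function names are hypothetical.

```cpp
#include <climits>

// True if a 512-bit block divides evenly into bytes of the given width,
// which direct byte copying into the block buffer relies on.
constexpr bool block_is_whole_bytes(unsigned bits_per_byte) {
    return 512u % bits_per_byte == 0u;
}

// True if the byte-wise hashing shortcut applies. Per the text above,
// that is only the case for octet-sized bytes.
constexpr bool can_hash_bytewise(unsigned bits_per_byte) {
    return bits_per_byte == 8u;
}

// The values for the platform actually being compiled for.
constexpr bool native_whole_bytes = block_is_whole_bytes(CHAR_BIT);
constexpr bool native_bytewise = can_hash_bytewise(CHAR_BIT);
```

For instance, a 16-bit byte still divides 512 evenly but does not qualify
for the byte-wise shortcut, while a 9-bit byte fails both checks.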

9. There are only Doxygen comments (which take up most of each file)
for documentation. I guess that any Quickbook files would be more
like general user guides.

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
