Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-24 09:15:09


On Mon, Jan 24, 2011 at 9:48 PM, Stewart, Robert <Robert.Stewart_at_[hidden]> wrote:
> Dean Michael Berris wrote:
>>
>> Consider the following:
>>
>>   template <class String>
>>   void needs_utf8(String const & s) {
>>     view<utf8_encoded> utf8_string(s);
>>     if (!valid(utf8_string))
>>       throw invalid_string("I need a UTF-8 string.");
>>   }
>>
>>   template <class String>
>>   void needs_utf16(String const & s) {
>>     view<utf16_encoded> utf16_string(s);
>>     if (!valid(utf16_string))
>>       throw invalid_string("I need a UTF-16 string.");
>>   }
>>
>> I would say you have four choices when implementing `view`
>> and `valid`:
>>
>> 1. view converts, and valid is a no-op.
>> 2. view doesn't convert, and valid does the validation on the
>> underlying string.
>> 3. view converts, and valid does the validation on the
>> underlying string.
>> 4. view doesn't convert, but valid checks the validation on the view.
>>
>> I'm leaning towards #2.
>
> #1 and #3 would be wasteful for cases when the string is already known to have the desired encoding, so they are non-starters.
>
> I'm not sure I understand the distinction or reason for the distinction you imply by #2 versus #4.  #2's wording suggests that you mean valid() accesses the underlying string through the view, but why is that better or worse than just using the view as in #4?
>

In #2, you can have valid be implemented like this:

  template <template <class> class View, class Encoding>
  bool valid(View<Encoding> const & encoded_view) {
    // use static tag-dispatch
    if (!valid_length(encoded_view.raw(), Encoding()))
      return false;
    // ... do other validity checking based on just the raw data
    // like BOM checking, character-by-character check on whether
    // there are invalid characters not within range, consider Base64
    // and/or hex-encodings aside from just Unicode, etc.
    return true;
  }

You really want this for performance reasons. Case in point: if the
underlying string doesn't have a valid length for UTF-16 or UTF-32,
you can reject it with some simple math on the length, before looking
at a single character. Some libraries even compile these parts to
vectorized code, use OpenMP, or do GPU-assisted validation.
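
To make the length check concrete, here is a minimal sketch of what
tag-dispatched valid_length overloads could look like. The tag structs
and the use of std::string for the raw data are assumptions made for
illustration, not an existing interface:

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding tags; the names only mirror those used in this thread.
struct utf8_encoded {};
struct utf16_encoded {};
struct utf32_encoded {};

// Any byte count can hold well-formed UTF-8, so length alone decides nothing.
inline bool valid_length(std::string const &, utf8_encoded) { return true; }

// UTF-16 code units are 2 bytes wide, so an odd byte count is never valid.
inline bool valid_length(std::string const & raw, utf16_encoded) {
  return raw.size() % 2 == 0;
}

// UTF-32 code units are 4 bytes wide.
inline bool valid_length(std::string const & raw, utf32_encoded) {
  return raw.size() % 4 == 0;
}
```

The overload is picked at compile time from the tag type, so the
UTF-16 and UTF-32 cases cost one modulo and never touch the string
contents.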

With #4, though, validation would be unnecessarily limited by the
interface the view provides. The only way to write a validator might
then be to get an iterator from the view and essentially wait for a
dereference to fail through some mechanism, such as throwing on
dereference.

With the #2 approach you can write a general validation routine that
can also be specialized for a specific encoding. You get the
tag-dispatch goodness whenever you have, for example, a specialized
validation routine for a given encoding, and you keep room for
partial/full specialization, etc.
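
As a rough sketch of that layering: one generic valid() forwards the
raw data to an implementation chosen by the encoding tag. The view
class, the hex_encoded tag, and the detail::valid_impl helper below
are all hypothetical, and the accept-everything fallback is only a
placeholder for encodings without a dedicated routine:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical encoding tags.
struct utf8_encoded {};
struct hex_encoded {};

// Hypothetical non-converting view: it only refers to the underlying string,
// which must outlive the view.
template <class Encoding>
class view {
 public:
  explicit view(std::string const & s) : raw_(s) {}
  std::string const & raw() const { return raw_; }
 private:
  std::string const & raw_;
};

namespace detail {
// Placeholder fallback for encodings without a specialized routine.
template <class Encoding>
bool valid_impl(std::string const &, Encoding) { return true; }

// Specialized routine: hex needs an even length and only hex digits.
inline bool valid_impl(std::string const & raw, hex_encoded) {
  if (raw.size() % 2 != 0) return false;
  for (std::size_t i = 0; i < raw.size(); ++i)
    if (!std::isxdigit(static_cast<unsigned char>(raw[i]))) return false;
  return true;
}
}  // namespace detail

// General entry point: works on the raw data and dispatches on the tag.
template <template <class> class View, class Encoding>
bool valid(View<Encoding> const & encoded_view) {
  return detail::valid_impl(encoded_view.raw(), Encoding());
}
```

Adding a new encoding then means adding one overload; valid() itself
and the view stay untouched.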

-- 
Dean Michael Berris
about.me/deanberris

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk