Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-08-12 05:00:36


On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms_at_[hidden]> wrote:
> On 11 August 2011 12:57, Artyom Beilis <artyomtnk_at_[hidden]> wrote:
>>
>>> There's a lot of existing code which is not based on that assumption -
>>> we can't just wish it out of existence and boost should be compatible
>>> with it.
>>
>> Then cross platform, Unicode aware programming will always
>> (I'm sorry) suck with Boost :-)
>>
>>
>> Thats it...
>
> Unless a different solution can be found.

I see the old flam .. er discussion on text handling is back :)

From the previous debate(s) I now accept that it would
be a bad idea just to force the encoding of std::string to be utf8.
So a (nearly) ideal text handling class should IMO look like this
(see usage below):

// text encoding tag types for conversion function dispatching
namespace /*or struct */ textenc
{
  struct utf8 {};
  struct utf16 {};
  struct utf32 {};
  struct winapi {};
  struct posix {};
  struct stdlib {};
  struct sqlite {};
  struct libpq {};
  ...
  struct libxyz {};

#if WE_ARE_ON_WINDOWS
  typedef winapi os;
#elif WE_ARE_ON_POSIX
  typedef posix os;
#elif ...
#endif

  struct gcc {};
  struct msvc {};
  struct icc {};
  struct clang {};

#if COMPILING_WITH_GCC
  typedef gcc compiler;
#elif COMPILING_WITH_MSVC
  typedef msvc compiler;
#elif ...
#endif
};

class text
{
public:
  // *** construction ***
  // by default expect UTF8
  text(const char* cstr)
  {
     assert(is_utf8(cstr));
     store(cstr);
  }

  // by default expect UTF8
  text(const std::string& str)
  {
     assert(is_utf8(str.begin(), str.end()));
     store(str);
  }

  // otherwise use the tag type to
  // do any necessary conversions
  template <typename Char, typename EncodingTag>
  text(const Char* cstr, EncodingTag encoding)
  {
     // use an overload to convert from the encoding
     // basically if the tag is textenc::winapi then use
     // the winapi-supplied functions and convert to utf8
     // if it's posix look at the locale and convert with the posix function
     // if the tag is textenc::msvc convert the msvc literal from
     // whatever crazy encoding it uses to utf8, ...etc.
     convert_and_store(cstr, encoding);
  }

  template <typename Char, typename EncodingTag>
  text(const std::basic_string<Char>& str, EncodingTag encoding)
  {
     convert_and_store(str.begin(), str.end(), encoding);
  }

  // *** conversion ***
  // by default output in utf8
  const char* c_str(void) const;

  // by default in utf8 (could be a friend fn instead of member)
  std::string str(void) const;

  // (could be a friend fn instead of member)
  template <typename EncodingTag>
  std::string str(EncodingTag encoding) const
  {
     return convert_from(encoding);
  }

  // wide char string output
  template <typename EncodingTag>
  std::wstring wstr(EncodingTag encoding) const
  {
     return wconvert_from(encoding);
  }

  // implement whatever functionality
  // making sense for utf8-encoded-text
};

// usage

text t1 = "blahblah"; // must be utf8

// whatever encoding the compiler uses for wide literals
text t2(L"blablablabl", textenc::compiler());

text t3(some_posix_function(), textenc::posix());

text t4(SomeWinapiFunc(), textenc::winapi());
text t5(SomeWinapiFuncW(), textenc::winapi());

text t6(pq_some_func(), textenc::libpq());

text t7 = concat(t1, t2, t3, t4, t5, t6);

std::ostream& out = get_outs();
out << t7; // output in utf8

text t8;
std::istream& in = get_ins();
in.read_line(t8);

text t9;
in.read(t9, 1024);

some_function_expecting_utf8(t9.c_str());

SomeWinapiFunction(t8.str(textenc::winapi()).c_str());
SomeWinapiFunctionW(
  concat(t9, text::newline(), t8).wstr(textenc::winapi()).c_str());

some_posix_function(
  transform(concat(t4, t7, t9)).str(textenc::posix()).c_str());

some_wrapped_os_function(str(t8, textenc::os()));

some_stdlib_function(str(head(substring_after(t9, t2), 10), textenc::stdlib()));

I.e., besides the fact that the string "uses utf8" (there is already
a whole heap of such strings), it must also handle all the conversions
between utf8 and whatever the OS and the major libraries and
APIs expect and use, conveniently (and effectively).
Otherwise the effort is IMHO wasted.

Boost libraries (at the very least those wrapping OS functionality)
should adopt this text class and do the conversions "just-in-time",
when making the OS API call.
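
To make the tag-dispatch and "just-in-time" ideas concrete, here is a
minimal sketch, not a proposed implementation. It assumes C++11
(char32_t, std::u32string), covers only the utf8 and utf32 tags, and
assumes well-formed input. The names to_utf8, from_utf8, u32str,
fake_utf32_api and wrapped_api are hypothetical and exist only for
this illustration; a real version would add the winapi, posix, etc.
overloads on top of the OS-supplied conversion functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encoding tags as in the proposal (only two are sketched here).
namespace textenc {
struct utf8 {};
struct utf32 {};
}

// Overloads selected by tag type; each converts to the internal UTF-8 form.
inline std::string to_utf8(const std::string& s, textenc::utf8) { return s; }

inline std::string to_utf8(const std::u32string& s, textenc::utf32)
{
    std::string out;
    for (char32_t cp : s) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Conversion out of the internal form, again selected by tag.
inline std::u32string from_utf8(const std::string& s, textenc::utf32)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i++]);
        char32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        for (; extra > 0 && i < s.size(); --extra)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        out += cp;
    }
    return out;
}

class text
{
public:
    // by default expect UTF-8 (validation omitted in this sketch)
    text(const std::string& s) : utf8_(s) {}

    // otherwise let the tag pick the to_utf8 overload
    template <typename Char, typename EncodingTag>
    text(const std::basic_string<Char>& s, EncodingTag tag)
        : utf8_(to_utf8(s, tag)) {}

    const std::string& str() const { return utf8_; }

    std::u32string u32str(textenc::utf32 tag) const
    {
        return from_utf8(utf8_, tag);
    }

private:
    std::string utf8_; // always UTF-8 internally
};

// Stand-in for an external API that expects NUL-terminated UTF-32;
// it just counts code units.
inline std::size_t fake_utf32_api(const char32_t* s)
{
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Wrapper that converts "just-in-time", only at the API boundary.
inline std::size_t wrapped_api(const text& t)
{
    return fake_utf32_api(t.u32str(textenc::utf32()).c_str());
}
```

The point of the design is that callers only ever pass text around;
the encoding of the external API appears in exactly one place, inside
the wrapper.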

My 0.02Euro

Best,

Matus

