
Subject: [boost] [codecvt] UTF-8 codecvt
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2012-01-24 06:33:01


Hello,

There is an implementation of a UTF-8 codecvt facet that is used for various purposes. Unfortunately it has two problems:

1. It does not implement UTF-16 properly (only UCS-2).
2. It requires linking with a library.

Boost.Locale provides codecvt facets, and I'm planning to add a template for a codecvt converter in its header files. Basically, Boost.Locale will have a header-only version of the UTF-8 codecvt facet and a very simple pattern for implementing a codecvt facet for any stateless encoding.

I thought about two options:

1. Put it in the boost::locale namespace as part of the Boost.Locale library.
2. Create a small "codecvt" Boost library that would provide a simple
   framework for generating codecvt facets for Boost in general,
   and that would include the UTF-8 codecvt as well.

Generally it would look like this.

Converter concept:

    class Converter {
    public:
        // Copyable
        Converter(Converter const &);

        // Maximum multibyte length of a single Unicode code point
        int max_length() const;

        // Convert a single multibyte sequence to a code point,
        // returning the constant illegal for an invalid sequence
        // or incomplete for an incomplete sequence
        uint32_t to_unicode(char const *&begin,char const *end) const;

        // Convert code point u to [begin,end), returning the multibyte length.
        // Returns illegal if u is invalid or cannot be converted to a
        // multibyte sequence, or incomplete if [begin,end) does not have
        // enough space
        uint32_t from_unicode(uint32_t u,char *begin,char const *end) const;
    };

Note that the converter works only for stateless encodings (which are 99% of all encodings in use).

The basic facet would be defined as follows:

    template<typename CharType,typename EncoderType,int size = sizeof(CharType)>
    class code_converter;

So, given a utf8_converter class that models the Converter concept, I can create a UTF-8 codecvt facet as

    code_converter<wchar_t,utf8_converter>

which is a facet that can be imbued into a locale object.
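To make the Converter concept more concrete, here is a minimal sketch of what a utf8_converter could look like. It is not the Boost.Locale implementation: the sentinel values chosen for illegal and incomplete, and the decision to leave begin untouched when decoding fails, are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical UTF-8 model of the Converter concept (illustration only).
class utf8_converter {
public:
    // Assumed sentinels: both lie outside the Unicode range 0..0x10FFFF.
    static const std::uint32_t illegal    = 0xFFFFFFFFu;
    static const std::uint32_t incomplete = 0xFFFFFFFEu;

    // A code point takes at most 4 bytes in UTF-8.
    int max_length() const { return 4; }

    // Decode one sequence; advance begin only on success (an assumption).
    std::uint32_t to_unicode(char const *&begin, char const *end) const
    {
        if (begin == end) return incomplete;
        char const *p = begin;
        unsigned char lead = static_cast<unsigned char>(*p++);
        if (lead < 0x80) { begin = p; return lead; }

        int trail;               // number of continuation bytes expected
        std::uint32_t cp, min;   // accumulated value, smallest non-overlong value
        if (lead >= 0xC2 && lead <= 0xDF)      { trail = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0)        { trail = 2; cp = lead & 0x0F; min = 0x800; }
        else if (lead >= 0xF0 && lead <= 0xF4) { trail = 3; cp = lead & 0x07; min = 0x10000; }
        else return illegal;     // stray continuation byte or invalid lead

        for (int i = 0; i < trail; ++i) {
            if (p == end) return incomplete;
            unsigned char c = static_cast<unsigned char>(*p++);
            if ((c & 0xC0) != 0x80) return illegal;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, out-of-range values and surrogates.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return illegal;
        begin = p;
        return cp;
    }

    // Encode u into [begin,end); returns the number of bytes written.
    std::uint32_t from_unicode(std::uint32_t u, char *begin, char const *end) const
    {
        if (u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) return illegal;
        int len = u < 0x80 ? 1 : u < 0x800 ? 2 : u < 0x10000 ? 3 : 4;
        if (end - begin < len) return incomplete;

        static const unsigned char lead_bits[5] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
        // Fill continuation bytes from the end, 6 bits at a time.
        for (int i = len - 1; i > 0; --i) {
            begin[i] = static_cast<char>(static_cast<unsigned char>(0x80 | (u & 0x3F)));
            u >>= 6;
        }
        begin[0] = static_cast<char>(static_cast<unsigned char>(lead_bits[len] | u));
        return static_cast<std::uint32_t>(len);
    }
};
```

With this sketch, decoding the three bytes E2 82 AC yields U+20AC, and passing only the first two bytes yields incomplete, which is the signal a codecvt facet needs in order to return partial and wait for more input.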
The biggest "trick" is actually implementing the codecvt facet so that it handles UTF-16 properly, and that is already implemented in Boost.Locale.

So, Boosters: do you want this as a separate Boost.Codecvt library, or is it fine to have it as part of Boost.Locale? In either case it would be a header-only "codecvt" framework.

  Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/

