Subject: [boost] [codecvt] UTF-8 codecvt
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2012-01-24 06:33:01
Hello,
There is an existing implementation of a UTF-8 codecvt facet that is used for various purposes.
Unfortunately it has two problems:
1. It does not implement UTF-16 properly (only UCS-2)
2. It requires linking with some library.
Boost.Locale already provides codecvt facets, and I'm planning to add a
codecvt converter template that lives entirely in header files. Basically,
Boost.Locale would have a header-only version of the UTF-8 codecvt
facet, plus a very simple pattern for implementing
a codecvt facet for any stateless encoding.
I thought about two options:
1. Put it in boost::locale namespace as part of Boost.Locale library
2. Create a small Boost "codecvt" library that provides a simple
framework for generating codecvt facets for Boost in general,
and that includes the UTF-8 codecvt as well.
Generally it would look like this:
Converter concept:
class Converter {
public:
    // Copyable
    Converter(Converter const &);
    // Maximum multibyte length of a single Unicode code point
    int max_length() const;
    // Convert a single multibyte sequence to a code point,
    // returning the constant illegal for an invalid sequence
    // or incomplete for a truncated one
    uint32_t to_unicode(char const *&begin,char const *end) const;
    // Convert code point u into [begin,end), returning the multibyte length,
    // or illegal if u is invalid or cannot be represented in this encoding,
    // or incomplete if [begin,end) does not have enough space
    uint32_t from_unicode(uint32_t u,char *begin,char const *end) const;
};
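To make the concept concrete, here is a minimal sketch of a UTF-8 class that models it. This is only an illustration, not the Boost.Locale code: the numeric values of illegal and incomplete are assumptions (the concept names the constants but not their values), and the class relies on the implicit copy constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sentinel values. Both lie above 0x10FFFF, so they cannot be
// confused with a real code point or with a byte count.
const std::uint32_t illegal    = 0xFFFFFFFFu;
const std::uint32_t incomplete = 0xFFFFFFFEu;

class utf8_converter {
public:
    // One code point takes at most 4 bytes in UTF-8.
    int max_length() const { return 4; }

    // Decode one code point; begin is advanced only on success.
    std::uint32_t to_unicode(char const *&begin, char const *end) const
    {
        if (begin == end) return incomplete;
        char const *p = begin;
        unsigned char lead = static_cast<unsigned char>(*p++);
        std::uint32_t c;
        int trail;
        if (lead < 0x80)      { c = lead;        trail = 0; }
        else if (lead < 0xC2) { return illegal; }  // stray trail byte or overlong lead
        else if (lead < 0xE0) { c = lead & 0x1F; trail = 1; }
        else if (lead < 0xF0) { c = lead & 0x0F; trail = 2; }
        else if (lead < 0xF5) { c = lead & 0x07; trail = 3; }
        else                  { return illegal; }
        for (int i = 0; i < trail; ++i) {
            if (p == end) return incomplete;
            unsigned char t = static_cast<unsigned char>(*p++);
            if ((t & 0xC0) != 0x80) return illegal;
            c = (c << 6) | (t & 0x3F);
        }
        // Reject overlong forms, surrogates and values past U+10FFFF.
        static const std::uint32_t min_value[4] = { 0, 0x80, 0x800, 0x10000 };
        if (c < min_value[trail] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return illegal;
        begin = p;
        return c;
    }

    // Encode u into [begin,end); returns the number of bytes written.
    std::uint32_t from_unicode(std::uint32_t u, char *begin, char const *end) const
    {
        if (u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) return illegal;
        int len = u < 0x80 ? 1 : u < 0x800 ? 2 : u < 0x10000 ? 3 : 4;
        if (end - begin < len) return incomplete;
        if (len == 1) {
            begin[0] = static_cast<char>(u);
            return 1;
        }
        // Lead byte carries the length marker plus the top bits of u;
        // each trail byte carries the next 6 bits.
        static const unsigned char lead_mark[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
        begin[0] = static_cast<char>(lead_mark[len] | (u >> (6 * (len - 1))));
        for (int i = 1; i < len; ++i)
            begin[i] = static_cast<char>(0x80 | ((u >> (6 * (len - 1 - i))) & 0x3F));
        return static_cast<std::uint32_t>(len);
    }
};
```

Because the class keeps no state between calls, every call can be judged from its arguments alone, which is what restricts the pattern to stateless encodings.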
Note that the converter works only for stateless encodings (which covers
nearly all encodings in practical use).
The basic facet would be declared as follows:
template<typename CharType,typename EncoderType,int size = sizeof(CharType)>
class code_converter;
Then, for a utf8_converter class that models the Converter concept, I can create
a UTF-8 codecvt facet as code_converter<wchar_t,utf8_converter>, which is a facet that
can be installed into a locale object.
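One plausible reading of the defaulted size parameter is that it lets the facet be specialized on the width of CharType, so that a 2-byte wchar_t gets UTF-16 handling and a 4-byte one gets UTF-32. The skeleton below sketches only that dispatch and the installation into a locale. The conversion overrides (do_in, do_out and so on) are deliberately omitted, so this facet behaves like the default std::codecvt; uses_surrogates() and dummy_encoder are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>

template<typename CharType, typename EncoderType, int size = sizeof(CharType)>
class code_converter;

// 2-byte characters: the wide side would be UTF-16.
template<typename CharType, typename EncoderType>
class code_converter<CharType, EncoderType, 2>
    : public std::codecvt<CharType, char, std::mbstate_t> {
public:
    explicit code_converter(std::size_t refs = 0)
        : std::codecvt<CharType, char, std::mbstate_t>(refs) {}
    static bool uses_surrogates() { return true; }
};

// 4-byte characters: the wide side would be UTF-32, one unit per code point.
template<typename CharType, typename EncoderType>
class code_converter<CharType, EncoderType, 4>
    : public std::codecvt<CharType, char, std::mbstate_t> {
public:
    explicit code_converter(std::size_t refs = 0)
        : std::codecvt<CharType, char, std::mbstate_t>(refs) {}
    static bool uses_surrogates() { return false; }
};

// Placeholder standing in for a real converter such as utf8_converter.
struct dummy_encoder {};

typedef code_converter<wchar_t, dummy_encoder> wide_facet;

// Build a locale that uses the facet in place of the default wchar_t codecvt.
// The locale takes ownership of the facet because refs is 0.
inline std::locale make_locale()
{
    return std::locale(std::locale::classic(), new wide_facet());
}
```

A stream would then pick the facet up through something like stream.imbue(make_locale()).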
The biggest "trick" is actually implementing the codecvt facet so that it handles UTF-16 properly,
and that is already implemented in Boost.Locale.
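For readers wondering why UTF-16 is harder than UCS-2: code points above U+FFFF take two 16-bit units (a surrogate pair), so a single multibyte sequence can produce two output characters, and the output buffer may have room for only one of them. The sketch below shows just the surrogate arithmetic; to_utf16 and from_surrogates are hypothetical helper names, and how Boost.Locale carries the pending unit between calls is not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Encode cp as UTF-16. Returns the number of 16-bit units written
// (1 or 2), or 0 if cp is not a valid Unicode scalar value.
int to_utf16(std::uint32_t cp, std::uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) {
        out[0] = static_cast<std::uint16_t>(cp);  // also valid UCS-2
        return 1;
    }
    cp -= 0x10000;                                // 20 bits remain
    out[0] = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}

// Combine a valid high/low surrogate pair back into one code point.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

A UCS-2-only facet effectively stops at the first branch, which is why it cannot represent anything outside the Basic Multilingual Plane.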
So Boosters:
Do you want this as a separate Boost.Codecvt, or is it fine to have it as part of Boost.Locale?
In either case it would be a header-only "codecvt" framework.
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/