Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String
From: Soares Chen (crf_at_[hidden])
Date: 2011-03-16 07:30:00


Just to add one more proposal that I missed. I remember Mathias
Gaunard suggested somewhere in the discussion using a range of two
iterators to represent code points, characters, and strings. I don't
remember the exact details, but I think that is at least how
Boost.Unicode represents Unicode strings, using boost::iterator_range.
I see that it has similarities with the other proposals, but I'll
leave that for later discussion. Mathias Gaunard, please correct me if
I have misunderstood. Thanks.
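
To illustrate what I mean by a range of two iterators, here is a
minimal sketch. It is only my assumption of the general idea, not
Boost.Unicode's actual interface: the `range` struct is a hypothetical
stand-in for boost::iterator_range (so the example needs only the
standard library), and count_code_points is a made-up helper that
assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for boost::iterator_range: a string is just a
// pair of iterators over code units, with no ownership of the data.
template <typename Iterator>
struct range {
    Iterator first;
    Iterator last;
    Iterator begin() const { return first; }
    Iterator end() const { return last; }
};

// Counts code points in a range of UTF-8 code units by counting every
// byte that is not a continuation byte (10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated.
template <typename Iterator>
std::size_t count_code_points(range<Iterator> r) {
    std::size_t n = 0;
    for (Iterator it = r.begin(); it != r.end(); ++it) {
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

The same function would work over any container of bytes, since it
only depends on the two iterators and not on std::string itself.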

On Wed, Mar 16, 2011 at 1:20 AM, Soares Chen <crf_at_[hidden]> wrote:
> Hi all,
>
> I am an undergraduate student from the National University of
> Singapore and I am interested in taking part in this year's GSoC with
> the Boost community. Currently I am preparing my proposal to add
> support for encoding awareness in strings through a new or existing
> Boost project, but there are some questions I would like to ask to
> clarify the community's interest in such a project.
>
> My inspiration came from several lengthy discussions I found in
> the Boost mailing list archive, which took place recently, from mid
> January to mid February. [5][6][7] As a brief overview, according to
> my own understanding, the heated debates were mainly about different
> ways to ensure consistency between the encoding expected when library
> code accepts strings and the actual encoding of strings passed by
> users. The problem arises because a small minority of developers use
> std::string with an encoding other than UTF-8, and the implicit
> assumption of UTF-8 encoding for std::string brings inconsistency and
> causes numerous bugs that are outside the scope of the Boost library.
>
> From the discussion, I found several proposals that have been made to
> solve the inconsistencies of std::string encoding:
> 1. Create new classes that wrap around std::string, std::u16string,
> and std::u32string for each encoding and ensure encoding correctness
> simply through C++ type safety features. The classes are tentatively
> called the utf*_t classes (a name that many disliked) - Proposed by
> Chad Nelson with a working prototype available. [1]
> 2. Continue to strongly enforce the assumption that all std::strings
> are UTF-8 encoded. Deprecate other encodings in std::string or make
> them hard to use.
> 3. Reinvent std::string and introduce boost::string. The new string
> class is proposed to be immutable and to delegate the encoding
> awareness to templated view<> classes that wrap around boost::string.
> IMHO the view<> classes are similar to proposal (1). [7]
>
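
To make proposal (1) a bit more concrete, here is a minimal sketch of
the type-safety idea. It is not Chad Nelson's code or API; all the
names here (encoded_string, utf8_tag, latin1_tag, to_utf8,
utf8_byte_length) are hypothetical and only show how the compiler can
reject a string passed in the wrong encoding.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical encoding tags; they carry no data and exist only so
// that each encoding gets a distinct C++ type.
struct utf8_tag {};
struct latin1_tag {};

// Thin wrapper around std::string. The constructor is explicit, so a
// raw std::string never silently becomes an encoded string.
// Assumption: the caller vouches for the encoding; nothing is validated.
template <typename Encoding>
class encoded_string {
public:
    explicit encoded_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

typedef encoded_string<utf8_tag> utf8_string;
typedef encoded_string<latin1_tag> latin1_string;

// Changing encoding must be spelled out. Every Latin-1 byte is the
// code point of the same value, so bytes >= 0x80 become two UTF-8 bytes.
inline utf8_string to_utf8(const latin1_string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.bytes().size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in.bytes()[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return utf8_string(std::move(out));
}

// A library function that states its expected encoding in its signature.
inline std::size_t utf8_byte_length(const utf8_string& s) {
    return s.bytes().size();
}

// Passing a latin1_string or a raw std::string where a utf8_string is
// expected does not compile, because no implicit conversion exists.
static_assert(!std::is_convertible<latin1_string, utf8_string>::value,
              "encodings must not convert implicitly");
static_assert(!std::is_convertible<std::string, utf8_string>::value,
              "raw strings must be wrapped explicitly");
```

With these types, utf8_byte_length(to_utf8(some_latin1_string)) is
accepted, while utf8_byte_length(some_latin1_string) is a compile
error, which is the consistency check this proposal is after.
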
> I will try not to go into the details and pros & cons of each
> proposal, to avoid turning this thread into yet another string
> discussion. The original discussions have 542 messages in total and
> spanned a whole month, and I failed to find any conclusion that
> everyone could agree on. What I noticed in the discussions is that
> there are several groups of people who have strong opinions on
> different ways to solve the problem and generally could not agree
> with each other. I also found that the discussions often drifted and
> lost focus on the original problem, but every now and then someone
> would mention the problem again and propose a solution similar to
> earlier proposals.
>
> Nevertheless, the discussion was extremely informative and insightful.
> I learned a lot just by reading through these discussions. But since I
> intend to start a GSoC project based on this subject, I am unsure
> what I should really do in this project, as I feel that there is no
> general agreement on how to solve this problem in the Boost community.
> Although I have some ideas and I personally lean towards proposal (1)
> by Chad Nelson, I think it would be best if my project fits the
> interests of the majority of the Boost community members. I think it
> is also best to avoid further open-ended discussion on this topic and
> instead actively make the design decisions, as the time period for
> GSoC is limited and the discussion tends to be never-ending.
>
> I think there are several options that I can choose for my project:
> 1. Use Chad Nelson's code as a base, try to incorporate other ideas
> proposed on the mailing list, integrate with Boost.Locale, and bring
> it up to Boost quality to submit for review. If this option is
> chosen, I hope that Chad Nelson can be my mentor.
> 2. Start a new code base, gather and compile ideas suggested on the
> mailing list, with final design decisions made by me and my mentor
> rather than the community (to keep the project moving quickly), bring
> it up to Boost quality and submit it for review.
> 3. Start the boost::string project, in which a better string class
> is designed that fixes the weaknesses of std::string.
> 4. Adopt a different proposal and improve an existing project such as
> Boost.Unicode [2] or Boost.Locale [3] so that it really solves the
> encoding awareness problem.
> 5. Any other suggestion?
>
> I hope to get feedback from you on what I should really focus on in
> this project. Of course I also hope that this subject is mature enough
> to be accepted as a GSoC project, as I can see great interest in the
> community in solving this problem. I would also like to clarify again
> that I do not intend to solve actual Unicode handling problems in this
> project - there are already excellent libraries such as Boost.Locale
> designed for that. My main objective is to design a set of interfaces
> that help to ensure encoding correctness and consistency when strings
> are passed between different functions. I look forward to hearing
> from anyone who is interested in this project and is willing to be my
> mentor.
>
> Lastly, I apologize for my grammar and for any misunderstanding
> caused by my writing. Please do correct me if I have missed
> anything or misunderstood some aspects. I will write a complete and
> formal proposal once I have heard your feedback.
>
> Thank you very much. I hope that I can start contributing to the
> Boost community!
>
> Best Regards,
>
> Soares Chen
> National University of Singapore
>
>
> References:
> [1] The Oak Circle C++ (Unicode) Toolkit, by Chad Nelson.
> http://www.oakcircle.com/toolkit.html
> [2] Boost.Unicode, by Mathias Gaunard.
> http://mathias.gaunard.com/unicode/doc/html/
> [3] Boost.Locale. http://cppcms.sourceforge.net/boost_locale/html/index.html
> [4] Should UTF-16 be considered harmful?, Stack Overflow.
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
> [5] Always treat std::strings as UTF-8?, Boost Mailing List
> Discussion. http://groups.google.com/group/boost-developers-archive/browse_thread/thread/13966c1a3d4ceadd/1be0173d252deb62
> [6] What will string handling in C++ look like in the future, Boost
> Mailing List Discussion.
> http://groups.google.com/group/boost-developers-archive/browse_thread/thread/deed8f95125dce02/c6e517b77f403eda
> [7] [string] proposal, Boost Mailing List Discussion.
> http://groups.google.com/group/boost-devel-archive/browse_thread/thread/f8516df28af22c4b/400f2e616de10ef0
>
> (Sorry for linking the mailing list archive to Google Groups, but I
> feel that Google Groups provides a better interface for reading the
> archives for those who haven't read the discussions)
>

