
Boost Users :

From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2019-11-01 21:06:25


Mathias pointed out that I sent this just to him. So I'm replying again to
get this onto the list. Sorry for the noise.

On Fri, Nov 1, 2019 at 11:09 AM Zach Laine <whatwasthataddress_at_[hidden]>
wrote:

> On Fri, Nov 1, 2019 at 6:35 AM Mathias Gaunard <
> mathias.gaunard_at_[hidden]> wrote:
>
>> On Sat, 26 Oct 2019 at 02:11, Zach Laine via Boost-users
>> <boost-users_at_[hidden]> wrote:
>> >
>> > About 14 months ago I posted the same thing. There was significant
>> work that needed to be done to Boost.Text (the proposed library), and I was
>> a bit burned out.
>> >
>> > Now I've managed to make the necessary changes, and I feel the library
>> is ready for review, if there is interest.
>> >
>> > This library, in part, is something I want to standardize.
>> >
>> > It started as a better string library for namespace "std2", with
>> minimal Unicode support. Though "std2" will almost certainly never happen
>> now, those string types are still in there, and the library has grown to
>> also include all the Unicode features most users will ever need.
>> >
>> > Github: https://github.com/tzlaine/text
>> > Online docs: https://tzlaine.github.io/text
>>
>> I would start by removing the superlative statements about Unicode
>> being "hard" or "crazy".
>> It's not that complicated compared to the actual hard problems that
>> software engineers solve every day. The only thing is that people
>> misunderstand what the scope of Unicode is: it's not just an encoding,
>> it's a database and a set of algorithms (relying on said database)
>> to facilitate natural text processing of arbitrary scripts, and it makes
>> compromises to integrate with existing industry practices prior to all
>> those scripts being brought together under the same umbrella.
>>
>
> Right. Unicode encodes all natural languages that anyone has taken the
> time to put into Unicode. I stand by the implication that natural
> languages are crazy.
>
>
>> Now the string/container/memory management, this is quite irrelevant.
>> That sort of stuff has nothing to do with Unicode and I certainly do
>> not want some Unicode library to mess with the way I am organizing how
>> my data is stored in memory.
>> Your rope etc. containers belong in a completely independent library.
>>
>
> So then maybe don't use those parts? They're independent; you don't have
> to use them to use the Unicode algorithms.
>
>
>> What's important is providing an efficient Unicode character database,
>> and implementing the algorithms in a way that is generic, working for
>> arbitrary ranges and being able to be lazily evaluated (i.e. range
>> adaptors).
>> I already did all that work more than 10 years ago as a two-month GSoC
>> project, though there are some limitations since at that time ranges
>> and range adaptors were still fairly new ideas for C++. It does
>> however provide a generic framework to define arbitrary algorithms
>> that can be evaluated either lazily or eagerly.
>>
>
> Clearly you are more capable than I am. It took me a lot longer to do
> than 2 months. Why did you never submit this for a Boost review? You were
> thinking about it, ~10 years ago, but you never did....
>
>
>> To be honest I can't say I find your library to be much of an
>> improvement, at least in terms of usability, since the programming
>> interface seems more constrained (why don't things work with arbitrary
>> ranges rather than these "text" containers)
>
>
> They do, of course. I'm not sure why it is you think otherwise.
>
>
>> and verbose (just look at
>> the code to do transcoding with iterators),
>
>
> Are you referring to the verbosity of:
>
> char const * some_utf8 = /* ... */ ;
> out = std::ranges::copy(boost::text::as_utf32(some_utf8), out).out;
>
> , or:
>
> out = boost::text::transcode_utf_8_to_32(utf8_first, utf8_last, out);
>
> , or something else?
>
>
>> the set of features is
>> quite small,
>
>
> That is quite intentional. I want to standardize *basic* Unicode
> support. I feel that what I have in Boost.Text is the basic set that users
> will need, just to support languages or formatting conventions that are not
> common in their favorite environment. For instance, today there is no
> standard way of taking UTF-8 and turning it into UTF-16, or vice versa;
> this library is intended to work at that level. That is, it is intended to
> fill in needless gaps in Unicode support that exist in C++ -- gaps that no
> other major language besides C has. It is specifically not intended to
> replace all ICU functionality. Do you have specific things in mind that
> you think ~90% of Unicode-aware C++ users will need? Note that I did not
> say 100%.
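To make the UTF-8 to UTF-16 gap mentioned above concrete, here is a minimal hand-rolled sketch of that conversion. It does not use Boost.Text; the names `decode_utf8` and `utf8_to_utf16` are hypothetical, and the code assumes well-formed input (no handling of truncated sequences, overlong forms, or encoded surrogates), which a real library would have to validate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one code point from well-formed UTF-8 starting at s[i], advancing i
// past it. Assumption: the input is valid UTF-8; nothing is validated here.
inline char32_t decode_utf8(const std::string& s, std::size_t& i) {
    const auto byte = [&s](std::size_t k) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(s[k]));
    };
    const std::uint32_t b0 = byte(i);
    std::uint32_t cp = 0;
    if (b0 < 0x80) {            // 1 byte: 0xxxxxxx
        cp = b0;
        i += 1;
    } else if (b0 < 0xE0) {     // 2 bytes: 110xxxxx 10xxxxxx
        cp = ((b0 & 0x1F) << 6) | (byte(i + 1) & 0x3F);
        i += 2;
    } else if (b0 < 0xF0) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        cp = ((b0 & 0x0F) << 12) | ((byte(i + 1) & 0x3F) << 6) |
             (byte(i + 2) & 0x3F);
        i += 3;
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        cp = ((b0 & 0x07) << 18) | ((byte(i + 1) & 0x3F) << 12) |
             ((byte(i + 2) & 0x3F) << 6) | (byte(i + 3) & 0x3F);
        i += 4;
    }
    return static_cast<char32_t>(cp);
}

// Transcode well-formed UTF-8 to UTF-16, emitting a surrogate pair for any
// code point above U+FFFF.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        std::uint32_t cp = decode_utf8(in, i);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low
        }
    }
    return out;
}
```

Even this small sketch leaves out error handling entirely, which is part of why a vetted library facility is preferable to each project writing its own.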
>
>
>> and that the database itself is not even accessible,
>
>
> That's also intentional. Another goal of the library is to make Unicode
> as simple as possible for naive users who just want to do the basics. If I
> find requests for any new feature that has a compelling use case, I'll add
> that.
>
>
>> and
>> last I remember your implementation was ridiculously bloated in size.
>>
>
> I don't consider 1.5MB for a database containing all human languages in
> widespread use on computers to be a ridiculous size, but YMMV.
>
>
>> It also doesn't provide the ability to do fast substring search, which
>> you'd typically do by searching for a substring at the character
>> encoding level and then eliminating matches that do not fall on a
>> satisfying boundary, instead suggesting to do the search at the
>> grapheme level which is much slower, and the facility to test for
>> boundary isn't provided anyway.
>>
>
> I honestly don't know what you mean here. If you use the text::text or
> text::string types, those are just contiguous sequences of bits, like a
> std::vector or std::string. text::text exposes iterators to those bits
> which can be used to get grapheme, code point, and/or UTF-8 byte views of
> the underlying data. If you are using something other than text::text or
> text::string, you presumably have access to your own bits in your
> own representation. What prevents you from doing whatever substring search
> you like, via std::search(), std::ranges::includes(), or something else?
> Boost.Text is not intended as a string algorithms library.
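The search strategy discussed in this exchange (search at the code-unit level, then discard matches that do not fall on an acceptable boundary) can be sketched with `std::search` over raw UTF-8 bytes. This is an illustration only: `is_boundary` and `find_on_boundaries` are hypothetical names, not Boost.Text APIs, and the boundary test is a deliberately simplified stand-in that only rejects positions followed by a code point in U+0300..U+036F (Combining Diacritical Marks). Real grapheme boundaries follow the full rules of UAX #29.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a grapheme-boundary test (assumption, see above):
// pos is treated as a boundary unless the two bytes at pos encode a code
// point in U+0300..U+036F, i.e. CC 80..CC BF or CD 80..CD AF in UTF-8.
inline bool is_boundary(const std::string& s, std::size_t pos) {
    if (pos + 1 >= s.size()) return true;  // too few bytes left for such a mark
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    const unsigned char b1 = static_cast<unsigned char>(s[pos + 1]);
    const bool combining = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
    return !combining;
}

// Byte-level substring search over UTF-8, keeping only matches whose start
// and end both pass the boundary test. Returns byte offsets into hay.
inline std::vector<std::size_t> find_on_boundaries(const std::string& hay,
                                                   const std::string& needle) {
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;
    auto it = hay.begin();
    while ((it = std::search(it, hay.end(), needle.begin(), needle.end())) !=
           hay.end()) {
        const std::size_t start = static_cast<std::size_t>(it - hay.begin());
        if (is_boundary(hay, start) &&
            is_boundary(hay, start + needle.size())) {
            hits.push_back(start);
        }
        ++it;  // continue searching after the start of this match
    }
    return hits;
}
```

For example, searching for "cafe" in the bytes of "cafe" followed by U+0301 and then " cafe" finds two byte-level matches, but only the second survives the filter, because the first is followed by a combining accent.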
>
>> I'm pretty sure I made similar comments in the past, but I don't feel
>> like any of them has been addressed.
>>
>
> I think you're referring to this email you sent in the Boost.Text
> interest thread from 14 months ago:
>
> """
> The Unicode library I did as a SoC project in 2009 was significantly
> smaller than that and if I recall correctly it has more data than the one
> in your library.
> Clearly some work can be done here to better optimize the database size.
> """
>
> I did make it a bit smaller. The other comments are new.
>
> Zach
>
>


