tokenizer and wstring with VC7.1


Douglas G. Hanley

20 Feb 2004

2:40 p.m.

Has anybody managed to get tokenizer working for wide characters with VC7.1 (boost version 1.31.0)? The following example works fine...

    typedef tokenizer<char_separator<std::string::value_type>,
                      std::string::const_iterator, std::string> MyTokenizer;
    const char_separator<std::string::value_type> sep("a");
    MyTokenizer token(std::string("abacadaeafag"), sep);
    for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
    {
        std::cout << *it;
    }

...while the following example produces no output...

    typedef tokenizer<char_separator<std::wstring::value_type>,
                      std::wstring::const_iterator, std::wstring> MyTokenizer;
    const char_separator<std::wstring::value_type> sep(L"a");
    MyTokenizer token(std::wstring(L"abacadaeafag"), sep);
    for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
    {
        std::wcout << *it;
    }

Cheers,
Douglas.


Keith MacDonald

21 Feb 2004

9:10 a.m.

tokenizer worked in Unicode for me, so I experimented with your example to try to find out what made the difference. To simplify building in different modes, I changed it to the following:

    // ==== BEGIN CODE ====
    // Unicode Build: cl /D_UNICODE /EHsc /IF:\Dev\boost_1_31_0 tok.cpp
    // DBCS Build:    cl /EHsc /IF:\Dev\boost_1_31_0 tok.cpp
    //
    #include <string>
    #include <iostream>
    #include <boost/tokenizer.hpp>

    #ifdef _UNICODE
    typedef std::basic_string<wchar_t> string_t;
    #define _T(x) L##x
    #define STDOUT std::wcout
    #else
    typedef std::basic_string<char> string_t;
    #define _T(x) x
    #define STDOUT std::cout
    #endif

    typedef string_t::value_type char_t;

    typedef boost::tokenizer
    <
        boost::char_separator<char_t>,
        string_t::const_iterator,
        string_t
    >
    MyTokenizer;

    const boost::char_separator<char_t> sep(_T("a"));

    int main()
    {
    #ifdef _BUG
        MyTokenizer token(string_t(_T("abacadaeafag")), sep);
    #else
        string_t s(_T("abacadaeafag"));
        MyTokenizer token(s, sep);
    #endif

        for (MyTokenizer::const_iterator it = token.begin(); it != token.end(); ++it)
            STDOUT << *it;

        return 0;
    }
    // ==== END CODE ====

The following table shows the output when _UNICODE and _BUG are defined:

    _UNICODE  _BUG    Output
    -----------------------------
    undef     def     " bcdefg"
    def       def     ""
    undef     undef   "bcdefg"
    def       undef   "bcdefg"

It seems that the tokenizer constructor is handling both Unicode and MBCS temporary strings incorrectly with VC7.1.

Keith MacDonald

Keith MacDonald

22 Feb 2004

8:46 a.m.

I've run the same tests with gcc 3.2.2 on RH9, without any problems, so I'll post this on the microsoft.public.vc.language newsgroup.

Keith MacDonald

Bronek Kozicki

12:23 p.m.

On Sat, 21 Feb 2004 09:10:43 -0000, Keith MacDonald wrote:

> MyTokenizer token(string_t(_T("abacadaeafag")), sep);

If you take a look at the tokenizer constructor

    template <typename Container>
    tokenizer(const Container& c, const TokenizerFunc& f)
      : first_(c.begin()), last_(c.end()), f_(f) { }

you will notice that it does not store a copy of its string argument; it stores only the begin and end iterators. When the string is destroyed (in your example it is a temporary, so it is destroyed at the end of the full expression), these iterators are no longer valid. What you are seeing is an unusual manifestation of undefined behaviour: you are working with invalid iterators. I think a program crash would be a better indication that you have a serious problem, but undefined behaviour may manifest in any other way; this time it looks as if the tokenizer is empty. Of course, a slight change to the program or the compilation options may result in a crash (or anything else), until you remove the undefined behaviour:

> string_t s(_T("abacadaeafag"));
> MyTokenizer token(s, sep);

B.
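The lifetime issue described in this message can be illustrated without Boost. What follows is a minimal sketch, not the library's code: `RangeView` and `strip_a_safely` are hypothetical names, and the class only mimics the quoted constructor by keeping two iterators into the caller's string.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in (not Boost code): like the quoted constructor,
// it stores only the begin/end iterators of the string it is given.
template <typename String>
class RangeView {
public:
    typedef typename String::const_iterator iterator;
    typedef typename String::value_type char_type;

    explicit RangeView(const String& s) : first_(s.begin()), last_(s.end()) {}

    // Copies every character except `sep` into a new string.
    String without(char_type sep) const {
        String out;
        for (iterator it = first_; it != last_; ++it) {
            if (*it != sep) out.push_back(*it);
        }
        return out;
    }

private:
    iterator first_;
    iterator last_;
};

// Safe: the named string `s` outlives the view that refers to it.
std::wstring strip_a_safely() {
    const std::wstring s(L"abacadaeafag");
    RangeView<std::wstring> view(s);
    std::wstring result = view.without(L'a');
    assert(result.size() <= s.size());
    return result;
}

// Unsafe, shown only as a comment: the temporary dies at the end of the
// declaration, so the stored iterators dangle and any use is undefined.
//   RangeView<std::wstring> view(std::wstring(L"abacadaeafag"));
//   view.without(L'a');
```

Calling `strip_a_safely()` walks a string that is still alive, which is the same change as moving the string into a named variable before constructing the tokenizer.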

Keith MacDonald

8:36 p.m.

Hmmm. I've been trying to use various members of the boost library as black boxes, but this issue highlights the danger of doing so. I suppose a language keyword is needed to specify when a non-temporary object is required as an actual parameter. Given that there's no such thing, perhaps it would be safer to eliminate such convenience constructors from the library?

Keith MacDonald
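One alternative to removing a convenience constructor is for the object to own a copy of the text, so that a temporary argument is harmless. The following is a rough sketch under that assumption: `OwningSplitter` and `split_temporary` are hypothetical and are not part of Boost, and the price is one copy of the input.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical owning splitter (not Boost code): it copies the input,
// so it never refers to storage owned by the caller.
template <typename String>
class OwningSplitter {
public:
    typedef typename String::value_type char_type;

    OwningSplitter(const String& text, char_type sep)
        : text_(text), sep_(sep) {}

    // Returns the non-empty runs of characters between separators.
    std::vector<String> tokens() const {
        std::vector<String> out;
        String current;
        for (typename String::const_iterator it = text_.begin();
             it != text_.end(); ++it) {
            if (*it == sep_) {
                if (!current.empty()) {
                    out.push_back(current);
                    current.clear();
                }
            } else {
                current.push_back(*it);
            }
        }
        if (!current.empty()) out.push_back(current);
        return out;
    }

private:
    String text_;    // owned copy of the input
    char_type sep_;
};

// Passing a temporary is fine here because the splitter holds its own copy.
std::vector<std::wstring> split_temporary() {
    OwningSplitter<std::wstring> splitter(std::wstring(L"abacadaeafag"), L'a');
    std::vector<std::wstring> result = splitter.tokens();
    assert(!result.empty());
    return result;
}
```

Whether the extra copy is acceptable depends on the input size; an iterator-based design avoids it at the cost of the lifetime rule discussed in this thread.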

Bronek Kozicki

9:09 p.m.

On Sun, 22 Feb 2004 20:36:54 -0000, Keith MacDonald wrote:

> Hmmm. I've been trying to use various members of the boost library as black boxes, but this issue highlights the danger of doing so. I suppose a language keyword is needed to specify when a non-temporary object is required as an actual parameter. Given that there's no such thing, perhaps it would be safer to eliminate such convenience constructors from the library?

I think the simplest thing to do would be to explain the problem in the tokenizer documentation.

B.

PS. There is a chance that C++ will be enriched with syntax that allows detecting an rvalue (temporary value) used as a function argument; see: http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2002/n1377.htm
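The rvalue detection mentioned in the PS later became available as rvalue references in C++11. Assuming a C++11 or later compiler (which postdates this thread), a class that stores iterators could reject temporaries at compile time by deleting the rvalue overload. This is only a sketch with a hypothetical class, `BorrowedRange`; it does not describe what Boost.Tokenizer does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical iterator-storing class (not Boost code).
class BorrowedRange {
public:
    // Lvalues are accepted: the caller keeps the string alive.
    explicit BorrowedRange(const std::wstring& s)
        : first_(s.begin()), last_(s.end()) {}

    // Rvalues (temporaries) prefer this overload; because it is deleted,
    // passing a temporary becomes a compile-time error.
    explicit BorrowedRange(const std::wstring&&) = delete;

    // Number of characters in the borrowed range.
    std::size_t size() const {
        return static_cast<std::size_t>(last_ - first_);
    }

private:
    std::wstring::const_iterator first_;
    std::wstring::const_iterator last_;
};

// Compile-time checks of the intended behaviour.
static_assert(std::is_constructible<BorrowedRange, const std::wstring&>::value,
              "a named string must be accepted");
static_assert(!std::is_constructible<BorrowedRange, std::wstring>::value,
              "a temporary string must be rejected");

// Typical use with a named string that outlives the range.
std::size_t borrowed_length() {
    const std::wstring s(L"abacadaeafag");
    BorrowedRange range(s);
    assert(range.size() == s.size());
    return range.size();
}
```

With this overload in place, a line such as `BorrowedRange r(std::wstring(L"abacadaeafag"));` would fail to compile instead of silently leaving dangling iterators.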
