Boost Users
Subject: [Boost-users] (no subject)
From: Сергей Филиппов (rstain_at_[hidden])
Date: 2011-06-10 10:53:46
Hi all,
I recently found a bug in Boost.Regex that can lead to an access violation (AV).
 std::string s(".*?");
 boost::regex regEx(s.begin(), s.end()); // Potential AV here
The basic_regex constructor creates a local variable of type traits::string_type and passes the address of its first element and the one-past-the-last element address to the assign() function.
  template <class InputIterator>
  basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
  {
     typedef typename traits::string_type seq_type;
     seq_type a(arg_first, arg_last);
     if(a.size())
        assign(&*a.begin(), &*a.begin() + a.size(), f);
     else
        assign(static_cast<const charT*>(0), static_cast<const charT*>(0), f);
  }
Calling assign() eventually creates a basic_regex_parser object and calls its parse_repeat() function. The parser's m_begin and m_end members are initialized with the values passed to assign().
template <class charT, class traits>
bool basic_regex_parser<charT, traits>::parse_repeat(std::size_t low, std::size_t high)
{
...
     // OK we have a perl or emacs regex, check for a '?':
     if(this->m_traits.syntax_type(*m_position) == regex_constants::syntax_question)
     {
        greedy = false;
        ++m_position;
     }
     // for perl regexes only check for pocessive ++ repeats.
     if((0 == (this->flags() & regbase::main_option_type))
        && (this->m_traits.syntax_type(*m_position) == regex_constants::syntax_plus))
     {
        pocessive = true;
        ++m_position;
     }
...
}
In parse_repeat(), m_position points to '?', so the condition in the first if is true and ++m_position advances it. It is now equal to m_end.
The next if statement evaluates *m_position, which may cause an access violation.
In practice, the sample code above is unlikely to cause an AV, because seq_type in the basic_regex constructor is basic_string, and basic_string's underlying buffer is usually null-terminated. So m_end points to the null terminator, which is always readable, and no AV can happen.
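To illustrate the difference between the two container types, here is a minimal sketch using only the standard library. The helper names char_at_end and readable_elements are made up for this illustration and are not part of Boost. It reads the terminator through c_str(), which is guaranteed to be null-terminated, and shows that a vector offers no such extra element.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: reads the character at index size() of a std::string.
// This is well-defined through c_str(), which always yields a
// null-terminated array.
char char_at_end(const std::string& s)
{
    const char* first = s.c_str();
    const char* last = first + s.size();   // plays the role of m_end
    return *last;                          // reads the terminator
}

// Hypothetical helper: for std::vector the range [first, last) is all the
// storage we may touch, so this only measures the range and never reads *last.
std::size_t readable_elements(const std::vector<int>& v)
{
    if (v.empty())
        return 0;
    const int* first = &v[0];
    const int* last = first + v.size();    // one past the end: must not be dereferenced
    return static_cast<std::size_t>(last - first);
}
```

For the pattern ".*?", char_at_end returns '\0', while a std::vector<int> built from the same three characters has exactly three readable elements and nothing after them.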
The actual code I used is slightly different but the concept is the same:
boost::u32regex regEx = boost::make_u32regex(L".*?");
u32regex and make_u32regex come from Boost.Regex's ICU support (boost/regex/icu.hpp). That is a way to support Unicode strings.
With these, seq_type becomes std::vector<int,std::allocator<int> >. Sometimes the underlying buffer of the "a" object was allocated right at the end of a page, so basic_regex_parser::m_end pointed to the start of the next page, which was not allocated. Dereferencing m_position then caused the access violation.
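One possible way to avoid the out-of-range read is to compare the position against the end before every dereference. The following is only a sketch of that idea, not Boost code and not necessarily how the library should be fixed: repeat_suffix and parse_repeat_suffix are hypothetical names, and the syntax_type() lookups are replaced by plain comparisons against '?' and '+'.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of the two checks quoted from parse_repeat().
struct repeat_suffix
{
    bool greedy;
    bool possessive;
    std::size_t consumed;   // elements consumed after the repeat operator
};

// position points just past the repeat operator ('*' in ".*?");
// end is one past the last element of the pattern.
template <class CharT>
repeat_suffix parse_repeat_suffix(const CharT* position, const CharT* end)
{
    repeat_suffix r = { true, false, 0 };
    // Guard: never dereference position once it has reached end.
    if (position != end && *position == CharT('?'))
    {
        r.greedy = false;
        ++position;
        ++r.consumed;
    }
    // Same guard for the possessive '+' check, which is the read that
    // faulted in the report above.
    if (position != end && *position == CharT('+'))
    {
        r.possessive = true;
        ++position;
        ++r.consumed;
    }
    return r;
}
```

With the three-element pattern { '.', '*', '?' }, calling parse_repeat_suffix(pat + 2, pat + 3) consumes the '?' and then stops at the end instead of reading past the buffer.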
My environment is Visual C++ 9.0, Boost 1.42.0 (as far as I can see, the code in Boost 1.46.1 is the same).
By the way, when traits::string_type is basic_string, this code in the basic_regex constructor is not portable:
assign(&*a.begin(), &*a.begin() + a.size(), f);
If I remember correctly, the standard (C++03) does not guarantee that basic_string's controlled sequence is contiguous.
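A portable alternative, sketched here under the assumption that an extra copy is acceptable, is to copy the input range into a std::vector, whose storage is required to be contiguous, and take pointers from that. The names make_contiguous and begin_ptr are hypothetical and do not exist in Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copies [first, last) into a std::vector so that
// the elements are guaranteed to be stored contiguously.
template <class CharT, class InputIterator>
std::vector<CharT> make_contiguous(InputIterator first, InputIterator last)
{
    return std::vector<CharT>(first, last);
}

// Hypothetical helper: pointer to the first element, or a null pointer for
// an empty buffer (mirroring the empty branch of the constructor quoted above).
template <class CharT>
const CharT* begin_ptr(const std::vector<CharT>& buf)
{
    return buf.empty() ? static_cast<const CharT*>(0) : &buf[0];
}
```

The range [begin_ptr(buf), begin_ptr(buf) + buf.size()) can then be passed on as a pair of pointers. Note that this only addresses contiguity: the one-past-the-end pointer still must not be dereferenced.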
Just in case, here is the call stack when the AV occurred:
boost::re_detail::basic_regex_parser<int,boost::icu_regex_traits>::parse_repeat+0x87
boost::re_detail::basic_regex_parser<int,boost::icu_regex_traits>::parse_extended+0x182
boost::re_detail::basic_regex_parser<int,boost::icu_regex_traits>::parse+0x133
boost::re_detail::basic_regex_implementation<int,boost::icu_regex_traits>::assign+0x98
boost::basic_regex<int,boost::icu_regex_traits>::do_assign+0x14c
boost::basic_regex<int,boost::icu_regex_traits>::assign+0x16
boost::basic_regex<int,boost::icu_regex_traits>::assign<int *>+0xa2
boost::basic_regex<int,boost::icu_regex_traits>::basic_regex<int,boost::icu_regex_traits><boost::u16_to_u32_iterator<wchar_t const *,int> >+0xb8
boost::re_detail::do_make_u32regex<wchar_t const *>+0x25
boost::make_u32regex+0x2c
Sergey