Boost Users :

Subject: [Boost-users] (no subject)
From: Сергей Филиппов (rstain_at_[hidden])
Date: 2011-06-10 10:53:46


Hi all,
Recently I found a bug in Boost.Regex that can lead to an access violation.
 std::string s(".*?");
 boost::regex regEx(s.begin(), s.end());  // Potential AV here
 
The basic_regex constructor creates a local variable of type traits::string_type and passes the address of its first element and the one-past-the-last address to the assign() function.
   template <class InputIterator>
   basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
   {
      typedef typename traits::string_type seq_type;
      seq_type a(arg_first, arg_last);
      if(a.size())
         assign(&*a.begin(), &*a.begin() + a.size(), f);
      else
         assign(static_cast<const charT*>(0), static_cast<const charT*>(0), f);
   }
Calling assign() eventually leads to the creation of a basic_regex_parser object and a call to its parse_repeat() function. The parser's m_begin and m_end members are initialized with the values passed to assign().
template <class charT, class traits>
bool basic_regex_parser<charT, traits>::parse_repeat(std::size_t low, std::size_t high)
{
...
      // OK we have a perl or emacs regex, check for a '?':
      if(this->m_traits.syntax_type(*m_position) == regex_constants::syntax_question)
      {
         greedy = false;
         ++m_position;
      }
      // for perl regexes only check for pocessive ++ repeats.
      if((0 == (this->flags() & regbase::main_option_type))
         && (this->m_traits.syntax_type(*m_position) == regex_constants::syntax_plus))
      {
         pocessive = true;
         ++m_position;
      }
...
}
In parse_repeat() the m_position member points to '?', so the condition in the first if is true and m_position is advanced by ++m_position. Now it is equal to m_end.
In the next if statement, *m_position is evaluated without checking it against m_end, which may cause an access violation.
Actually, it is unlikely that the sample code above will cause an AV. That is because seq_type in the basic_regex constructor is basic_string, and basic_string's underlying buffer is usually null-terminated. So m_end points to the null terminator, which is always readable, and no AV can happen.
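To make the missing check concrete, here is a minimal sketch of the guard I have in mind. It is not the actual Boost code and not a proposed patch: RepeatSuffix and read_repeat_suffix are hypothetical names, and the characters are compared directly instead of going through m_traits.syntax_type(). The only point is that every dereference is preceded by a comparison with the end pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the relevant parser state; not a Boost type.
struct RepeatSuffix {
    bool greedy;          // false once a trailing '?' has been seen
    bool possessive;      // true once a trailing '+' has been seen
    std::size_t consumed; // number of suffix characters read
};

// Reads an optional '?' and then an optional '+' that follow a repeat
// operator. `position` is compared with `end` before each dereference,
// so the function never reads past the end of the pattern.
template <class CharT>
RepeatSuffix read_repeat_suffix(const CharT* position, const CharT* end)
{
    RepeatSuffix r = { true, false, 0 };
    if (position != end && *position == CharT('?')) {
        r.greedy = false;
        ++position;
        ++r.consumed;
    }
    // Without the `position != end` test this is the read that can fault.
    if (position != end && *position == CharT('+')) {
        r.possessive = true;
        ++position;
        ++r.consumed;
    }
    return r;
}
```

With the pattern ".*?" stored in a std::vector<int>, calling read_repeat_suffix on the range that starts at '?' consumes exactly one character and stops at the end pointer without touching the memory behind it.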
The actual code I used is slightly different but the concept is the same:
boost::u32regex regEx = boost::make_u32regex(L".*?");
u32regex and make_u32regex are provided by Boost.Regex's ICU support (boost/regex/icu.hpp), which is built on the ICU library. That is the way to support Unicode strings.
Here seq_type becomes std::vector<int,std::allocator<int> >. Occasionally the underlying buffer of the "a" object was allocated right at the end of a page, so basic_regex_parser::m_end pointed to the start of the next page, which was not allocated. Dereferencing m_position then caused the access violation.
My environment is Visual C++ 9.0 with Boost 1.42.0 (as far as I can see, the code in Boost 1.46.1 is the same).
By the way, when traits::string_type is basic_string, the code in the basic_regex constructor is not portable:
assign(&*a.begin(), &*a.begin() + a.size(), f);
If I remember correctly, the standard (C++03) does not guarantee that basic_string's controlled sequence is stored contiguously.
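One way to avoid relying on &*a.begin() would be to take the pointer range from data(), which even in C++03 is specified to return a pointer to an array holding the size() elements of the string. The following is only a sketch of that idea: consume_range is a hypothetical stand-in for assign(const charT*, const charT*, flag_type), and copy_via_data is a hypothetical helper, neither of which exists in Boost.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for basic_regex::assign(first, last, flags):
// it simply copies the pointer range so the result can be inspected.
template <class CharT>
std::basic_string<CharT> consume_range(const CharT* first, const CharT* last)
{
    return std::basic_string<CharT>(first, last);
}

// Obtains the pointer range through data() instead of &*a.begin().
// data() also returns a usable (non-null) pointer for an empty string,
// so no separate a.size() == 0 branch is needed in this sketch.
template <class CharT>
std::basic_string<CharT> copy_via_data(const std::basic_string<CharT>& a)
{
    const CharT* p = a.data();
    return consume_range(p, p + a.size());
}
```

This only addresses the portability remark. It does not change the fact that the parser must not dereference the end pointer.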

Just in case, here is the call stack when the AV occurred:
boost::re_detail::basic_regex_parser<int,boost::icu_regex_traits>::parse_repeat+0x87
boost::re_detail::basic_regex_parser<int,boost::icu_regex_traits>::parse_extended+0x182
boost::re_detail::basic_regex_parser<int,boost::icu_regex_traits>::parse+0x133
boost::re_detail::basic_regex_implementation<int,boost::icu_regex_traits>::assign+0x98
boost::basic_regex<int,boost::icu_regex_traits>::do_assign+0x14c
boost::basic_regex<int,boost::icu_regex_traits>::assign+0x16
boost::basic_regex<int,boost::icu_regex_traits>::assign<int *>+0xa2
boost::basic_regex<int,boost::icu_regex_traits>::basic_regex<int,boost::icu_regex_traits><boost::u16_to_u32_iterator<wchar_t const *,int> >+0xb8
boost::re_detail::do_make_u32regex<wchar_t const *>+0x25
boost::make_u32regex+0x2c

Sergey


