If explicit simd should not be used, by now, then you should help the compiler to generate more optimized code, by aligning properly the buffers.There is the boost.align library, which provides an aligned allocator (http://www.boost.org/doc/libs/1_60_0/doc/html/align.html)
void * malloc_(std::size_t alignment, std::size_t size) { alignment = std::max(alignment, alignof(void *)); size += alignment; void *ptr = std::malloc(size); void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1)); void **vp = (void**) ptr2 - 1; *vp = ptr; return ptr2; } void free_(void *ptr) { std::free(*((void**)ptr-1)); }