Boost logo

Ublas :

From: Gunter Winkler (guwi17_at_[hidden])
Date: 2005-06-02 05:35:19


On Wednesday 01 June 2005 23:00, Michael Stevens wrote:
> On Wednesday 01 June 2005 22:24, Max Weinberg wrote:
> > Hi,
> > using Stefan's [corrected] code, I also see that valarray is up to two
> > times faster than ublas vector. It is hard to say where the inefficiency
> > comes from. I think it has to do with indirect adressing. ublas uses
> > indexed access when doing the vector addition. On each element access,
> > first the base address data () of the vector is fetched, then the index[]
> > is added.
> > When using the code variant below, you can get faster than valarray. [if
> > the code is correct, not thoroughly checked]. It uses sequential access
> > to vector elements. For unknown reasons, valarray also seems to use
> > indexed access (at least in the VC implementation).
>
> Oddly, and something I still find hard to believe, most modern processor
> compiler combinations execute indexed array access faster then using
> sequential pointers. uBLAS can use both. It defaults to indexing as this
> seems to work best. Not sure why VC should have problems in this case it
> certainly used to prefer indexing.
>
> You can influence uBLAS default by defining
> BOOST_UBLAS_USE_ITERATING
> when you compile.

This is a bad idea if I use gcc:
$ g++ -I /home/a11aguwi/include/ -o vector_add vector_add.cpp -DNDEBUG -O2
$ ./vector_add
3.23
3.3
5.69
5.7
$ g++ -I /home/a11aguwi/include/ -o vector_add vector_add.cpp -DNDEBUG -O3
$ ./vector_add
3.23
3.42
5.85
5.77
$ g++ -I /home/a11aguwi/include/ -o vector_add vector_add.cpp -DNDEBUG -O3
-funroll-all-loops
$ ./vector_add
2.9
2.91
4.7
4.69
$ g++ -I /home/a11aguwi/include/ -o vector_add vector_add.cpp -DNDEBUG -O3
-funroll-all-loops -DBOOST_UBLAS_USE_ITERATING
$ ./vector_add
2.92
2.91
48.72
49.02

the details: using valarray gcc creates this code
0x08048e10:
 fldl 0xffffffe8(%ecx)
 add $0x20,%esi
 faddl 0xffffffe8(%edx)
 fstpl 0xffffffe8(%edx)
 fldl 0xfffffff0(%ecx)
 faddl 0xfffffff0(%edx)
 fstpl 0xfffffff0(%edx)
 fldl 0xfffffff8(%ecx)
 faddl 0xfffffff8(%edx)
 fstpl 0xfffffff8(%edx)
 fldl (%ecx)
 add $0x20,%ecx
 faddl (%edx)
 fstpl (%edx)
 add $0x20,%edx
 cmp %ebx,%esi
 jb 0x8048e10

Here we see the very nice loop unrolling and mostly floating point operations.
The addressing is: base + displacement

for ublas vectors the generated code look like this:

    0x080496e0 fldl 0xffffffe8(%eax,%edx,1)
    0x080496e4 add $0x4,%esi
    0x080496e7 faddl 0xffffffe8(%ecx,%edx,1)
    0x080496eb fstpl 0xffffffe8(%ebx,%edx,1)
    0x080496ef fldl 0xfffffff0(%eax,%edx,1)
    0x080496f3 faddl 0xfffffff0(%ecx,%edx,1)
    0x080496f7 fstpl 0xfffffff0(%ebx,%edx,1)
    0x080496fb fldl 0xfffffff8(%eax,%edx,1)
    0x080496ff faddl 0xfffffff8(%ecx,%edx,1)
    0x08049703 fstpl 0xfffffff8(%ebx,%edx,1)
    0x08049707 fldl (%eax,%edx,1)
    0x0804970a faddl (%ecx,%edx,1)
    0x0804970d fstpl (%ebx,%edx,1)
    0x08049710 add $0x20,%edx
    0x08049713 cmp %edi,%esi
    0x08049715 jb 0x80496e0

again we see unrolled loops with a different adressing: base + 1*index +
displacement.
_But_ there a three different base pointers: %eax, %ecx and %ebx. This
indicates that a temporary is created.

Now I can draw 3 conclusions:
1. I have to use the noalias() or plus_assign() function in order to have best
performance
  (and indeed noalias( tmp ) += v or tmp.plus_assign(v) solves the problem)
2. Valarray does optimizations that may be dangerous, and ublas does not do
this.
3. ublas does not recognize that no temporary is needed
  
Assume you have a vector of size 10 and add the first 8 elements to the last
8:
  range(v, 2, 10) += range(v, 0, 8)
here you need either need a temporary or you have to start computing the last
element. (the STL seems to check if there is aliasing and use a special
assignment)

I think the problem is that ublas always assumes aliasing and the developer
has to explicitly state where aliasing can not be. May be we should make a
big note in the documentation.

The results:
$ g++ -g -I /home/a11aguwi/include/ -o vector_add vector_add.cpp -DNDEBUG -O2
-funroll-loops
$ ./vector_add
2.9 (valarray)
2.97 (noalias)
2.93 (plus/minus_assign)

mfg
Gunter