Subject: Re: [ublas] GSOC 2013
From: Hoang Giang Bui (hgbk2008_at_[hidden])
Date: 2013-03-27 11:18:08
I support Riccardo's idea. I would like ublas to be templated as much as
possible, since it runs efficiently with small matrices. For medium-size
matrices (100 < n < 1000), I would go to a blas/lapack binding. For bigger
sizes (n > 1000), I would go for a separate package (petsc/trilinos). Making
ublas efficient at every size is, I personally think, not a good idea,
given that other linear algebra packages have a long history of
development and perform stably and efficiently.
On 03/26/13 22:06, Riccardo Rossi wrote:
> Dear Oswin,
> while you are for sure right on the performance side, you
> should recognise that linking to blas is the proverbial "pain in the
> ass". ever tried to do it in windows 64 without having a commercial
> fortran compiler? try, enjoy and report...
> on my side ublas is really good as it allows u doing operations with
> small matrices (either of fixed size or of variable size but still
> small) in a simple and portable way.
> such operations should use effectively the cache and hence NOT be
> memory bound. for this reason eigen, blitz++, etc can be faster than
> i really wish there was some work in having ublas to be competitive
> with eigen for such small matrices. my only point is that i would NOT
> mix the concept of vectors and of matrices.
> furthermore if i was to wish something then i would really love to
> have the possibility to have such small matrices as elements of a CSR
> matrix... then you would have a computation bound spmv (or spmm) which
> would be very nice for many applications.
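To make the wish above concrete, here is a minimal sketch of a CSR matrix whose stored entries are small fixed-size dense blocks, together with the matching spmv. The names BlockCsr and spmv and the row-major block layout are assumptions made for illustration; nothing like this is claimed to exist in uBLAS. Each index lookup is followed by N*N multiply-adds on contiguous data, which is where the higher arithmetic intensity comes from.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block-CSR layout (not a uBLAS type): every stored entry is a
// dense N x N block, kept row-major in a std::array.
template <std::size_t N>
struct BlockCsr {
    std::size_t block_rows = 0;                     // number of block rows
    std::vector<std::size_t> row_ptr;               // size block_rows + 1
    std::vector<std::size_t> col_idx;               // block column of each block
    std::vector<std::array<double, N * N>> blocks;  // one dense block per entry
};

// y = A * x, where x and y hold N scalars per block column / block row.
template <std::size_t N>
std::vector<double> spmv(const BlockCsr<N>& A, const std::vector<double>& x)
{
    assert(A.row_ptr.size() == A.block_rows + 1);
    std::vector<double> y(A.block_rows * N, 0.0);
    for (std::size_t br = 0; br < A.block_rows; ++br) {
        for (std::size_t k = A.row_ptr[br]; k < A.row_ptr[br + 1]; ++k) {
            const std::array<double, N * N>& blk = A.blocks[k];
            const double* xb = &x[A.col_idx[k] * N];
            // Dense N x N block times N-vector; the trip counts are known
            // at compile time, so the compiler is free to unroll them.
            for (std::size_t i = 0; i < N; ++i) {
                double acc = 0.0;
                for (std::size_t j = 0; j < N; ++j)
                    acc += blk[i * N + j] * xb[j];
                y[br * N + i] += acc;
            }
        }
    }
    return y;
}
```

With N = 3, for example, one column index is read per nine multiply-adds instead of one per multiply-add as in scalar CSR. Whether that is enough to make the product computation bound depends on the block size and the machine.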
> greetings to everyone
> On Tue, Mar 26, 2013 at 8:56 PM, oswin krause
> <oswin.krause_at_[hidden]> wrote:
> there is one more thing i want to comment, and this is on the more
> serious side:
> On 23.03.2013 16:15, Nasos Iliopoulos wrote:
>> Since mdsd:array is a generic multi-dimensional container it is
>> not bound to algebraic operations. I expect that with proper
>> aligned memory allocation and SSE algorithms (It is easy to add a
>> custom storage container that supports that) it will be as fast
>> as MKL, GotoBLAS, Eigen or armadillo. I believe that within that
>> context, a GSOC project will need to include both the matrix
>> container and the SSE algorithms tasks, or even AVX.
> and also the starting post from David itself:
> On 23.03.2013 13:47, David Bellot wrote:
>> OK, the idea behind this is to have a clean framework to enable
>> optimization based on SSE, Neon, multi-core, ... you name it.
> Just to make this clear: in the current state of the library, SSE,
> AVX, multi core computation etc won't cut it as soon as the
> arguments involved are bigger than ~32KB. In this case, uBLAS
> performance is memory bound. Thus we will only wait more efficiently
> for the next block of memory. And even if it were not, the way
> ublas is designed makes it impossible to use vectorization aside
> from the c-style functions like axpy_prod, which can in 99% of all
> relevant cases be mapped on BLAS2/BLAS3 calls of the optimized C
> libraries (which give you AVX/SSE and OpenMP for free). If you
> expect that SSE helps you when computing your
> then you will be desperately disappointed in the current design.
> Now maybe some of you are thinking: "But all fast linear algebra
> libraries are using SSE, so you must be wrong". Simple answer:
> these libraries are not memory bound as they optimize for that
> (you can experience this yourself by comparing the performance of
> copying a big matrix to transposing it. Then try the transposition
> block-wise: allocate a small buffer, say 16x16 elements, and then
> read 16x16 blocks from the matrix, write them transposed into the
> buffer and then copy the buffer to the correct spot in the target
> matrix. this gives a factor 7 speed-up on my machine. no SSE, no
> Don't trust me, trust the writer of the gotoblas library:
> Goto, Kazushige, and Robert A. van de Geijn. "Anatomy of high-performance
> matrix multiplication." /ACM Transactions on Mathematical Software
> (TOMS)/ 34.3 (2008): 12.
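The block-wise transposition experiment described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the matrix is a row-major std::vector<double>, the function name transpose_blocked is made up here, and the default tile size of 16 follows the 16x16 buffer suggested in the text. It uses plain scalar code with no SIMD intrinsics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of uBLAS): transposes the row-major
// rows x cols matrix `src` into the row-major cols x rows matrix `dst`,
// one B x B tile at a time.
template <std::size_t B = 16>
void transpose_blocked(const std::vector<double>& src, std::vector<double>& dst,
                       std::size_t rows, std::size_t cols)
{
    assert(src.size() == rows * cols);
    dst.resize(rows * cols);
    double buf[B * B];  // small buffer that fits comfortably in cache

    for (std::size_t i0 = 0; i0 < rows; i0 += B) {
        const std::size_t bi = std::min(B, rows - i0);  // tile height
        for (std::size_t j0 = 0; j0 < cols; j0 += B) {
            const std::size_t bj = std::min(B, cols - j0);  // tile width
            // Read a bi x bj tile and write it transposed into the buffer.
            for (std::size_t i = 0; i < bi; ++i)
                for (std::size_t j = 0; j < bj; ++j)
                    buf[j * B + i] = src[(i0 + i) * cols + (j0 + j)];
            // Copy each buffer row to the matching spot in the target.
            for (std::size_t j = 0; j < bj; ++j)
                std::copy(buf + j * B, buf + j * B + bi,
                          dst.begin() + (j0 + j) * rows + i0);
        }
    }
}
```

Timing this against a plain double loop over a matrix much larger than the cache is one way to repeat the comparison. The size of any speed-up depends on the machine, the compiler and the matrix dimensions.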
> We all don't have enough time to implement fast linear algebra
> algorithms. Instead we should fall back to the numeric bindings as
> often as possible and use the power of expression templates to
> generate an optimal sequence of BLAS2/BLAS3 calls.
> I would also like to take part in that if it happens.
> ublas mailing list
> ublas_at_[hidden] <mailto:ublas_at_[hidden]>
> Sent to: rrossi_at_[hidden] <mailto:rrossi_at_[hidden]>
> Dr. Riccardo Rossi, Civil Engineer
> Member of Kratos Team
> International Center for Numerical Methods in Engineering - CIMNE
> Campus Norte, Edificio C1
> c/ Gran Capitán s/n
> 08034 Barcelona, España
> Tel: (+34) 93 401 56 96
> Fax: (+34) 93.401.6517
> web:www.cimne.com <http://www.cimne.com/>