Vardan,

I tried matrix<float>. In that same sample, with matrix I get 48 ms/step whether or not I run the element-wise ops section of the code. With compressed_matrix I get 25 ms/step with the element-wise ops and 2 ms/step without them.
