Open Source and information security mailing list archives
 

Date: Sun, 9 Feb 2014 07:49:03 +0400
From: Solar Designer <solar@...nwall.com>
To: discussions@...swordhashing.net
Subject: Re: [PHC] multiply-hardening (Re: NoelKDF ready for submission)

On Sat, Feb 08, 2014 at 09:52:14PM -0500, Bill Cox wrote:
> On Sat, Feb 8, 2014 at 3:31 AM, Solar Designer <solar@...nwall.com> wrote:
> > value = ((value ^ (value >> 30)) * (mem[prevAddr + i] | 3))
> >     + mem[fromAddr + i] + i;
> >
> > (totally untested).
> >
> > Alexander
>
> I like this last one.  This should get rid of the modulo 4
> simplifications.  It adds another operation in the inner loop, and
> it's the worst kind: the right-shift requires nothing but wires in an
> ASIC.  Still, this is what I'll try next if needed.

Yes, the worst kind.  Also, the XOR is only 2-bit.  Unfortunately.  The
only reason to choose this is that it's not entirely arbitrary, but is
also used in MT and thus has been a target for attacks for a while.

Also, regarding latencies and throughput for various integer multiply
instructions on x86 and x86-64, including several SIMD forms, it is
really non-obvious whether to choose 16x16->hi16(32) or 32x32->32 or
32x32->64 or 64x64->64 or even 64x64->128 for the primitive, especially
if we're considering supporting SIMD (tunable).  Up until Haswell and
Avoton (8-core Atom for servers), things were much clearer (32x32->32
had at most 5 cycles latency in 4x SIMD form on CPUs supporting such
SIMD form), but those two microarchitectures, which we can't disregard
now since both will be found on new servers, have messed the timings
up - or maybe we're lucky to see this change (trend?) now rather than
when it'd be too late.

Based on data from http://users.atw.hu/instlatx64/ for currently most
relevant CPUs, which I think are Intel's Core 2 through Haswell and
Avoton, and AMD's Bulldozer/Piledriver, I got this:

PMULHW    16x16->hi16 signed      MMX(x4) - AVX2(x16)     L: 3-5   T: 1-2
PMULHUW   16x16->hi16 unsigned    SSE(x4), SSE2(x8), ...  L: 3-5   T: 1-2
PMULHRSW  16x16->16 fixed-point   SSSE3(x8), AVX2(x16)    L: 3-5   T: 1-2
PMULDQ    32x32->64 signed        SSE4(x2), AVX2(x4)      L: 3-5   T: 1-2
PMULUDQ   32x32->64 unsigned      SSE2(x2), AVX2(x4)      L: 3-5   T: 1-2
PMULLD    32x32->32               SSE4(x4), AVX2(x8)      L: 5-11  T: 1-11
IMUL      32x32->32               x86                     L: 3-4   T: 1-2
IMUL      64x64->64               x86-64                  L: 3-6   T: 1-4
[I]MUL    32x32->64               x86                     L: 4-5   T: 4-5
[I]MUL    64x64->128              x86-64                  L: 4-7   T: 3-7

These also appear to hold true for Pentium Pro family (PPro, P2, P3, and
P2/P3 Celeron), for the instructions that exist on those CPUs, but not
for P4 (which is slower, with its worst case appearing to be at L: 21
T: 11 for 64x64->128), nor for 1990's original Pentium P54C (L: 9 T: 9).

As you can see, my old favorite PMULLD has really bad worst-case latency
and even worse throughput, all due to Haswell and Avoton.  On Haswell,
it has 10 cycles latency (not only for the new 8x AVX2 form, but also
for the old 4x SSE4.1/AVX form) and 2 cycles throughput (I take it to
mean we'll have 5 SIMD vectors go through the pipeline in the 10
cycles).  In terms of 32-bit MULs/second throughput, this is fully
compensated for by the doubling of vector width with AVX2 (when we're
able to use that extra width), but 10 cycles latency is just
inappropriate for our purpose, because we know an ASIC will cut it to
3 cycles.  On Avoton, it's even worse: 11 cycles for both latency and
throughput, and it only has SSE4.1, so we'll get only 4 32-bit MULs per
11 cycles on it.

On all CPUs in the Core 2 to Ivy Bridge range, PMULLD is 5 cycles
latency (and 1 or 2 cycles throughput), so it's only moderately worse in
terms of latency than non-SIMD 32-bit multiplies (which are 3 or 4
cycles latency).  So it would be within consideration.

As a workaround, if we favor lower latency over higher throughput, maybe
we should combine two PMULUDQ's to produce a lower-latency equivalent of
PMULLD when running on Haswell and Avoton (for Avoton, this would
produce much higher throughput too).
I've experimented with such emulation briefly, albeit in context of
adding SSE2 support to php_mt_seed (which previously required SSE4.1+
for any SIMD at all), where plenty of parallelism is available (so I did
not measure the effect on latency).  Here are my emulated
_mm_mullo_epi32()'s (normally SSE4.1+) with SSE2 intrinsics:

#ifdef __GNUC__
#if 0
#define _mm_mullo_epi32(a, b) ({ \
	__m128i _a = (a), _b = (b); \
	__m128i p = _mm_mul_epu32(_a, _b); \
	__m128i q = _mm_mul_epu32(_mm_srli_epi64(_a, 32), \
	    _mm_srli_epi64(_b, 32)); \
	_mm_unpacklo_epi64( \
	    _mm_unpacklo_epi32(p, q), \
	    _mm_unpackhi_epi32(p, q)); \
})
#else
#define _mm_mullo_epi32(a, b) ({ \
	__m128i _a = (a), _b = (b); \
	_mm_unpacklo_epi32( \
	    _mm_shuffle_epi32(_mm_mul_epu32(_a, _b), 0x08), \
	    _mm_shuffle_epi32(_mm_mul_epu32(_mm_srli_epi64(_a, 32), \
	        _mm_srli_epi64(_b, 32)), 0x08)); \
})
#endif
#else
#define _mm_mullo_epi32(a, b) \
	_mm_unpacklo_epi32( \
	    _mm_shuffle_epi32(_mm_mul_epu32((a), (b)), 0x08), \
	    _mm_shuffle_epi32(_mm_mul_epu32(_mm_srli_epi64((a), 32), \
	        _mm_srli_epi64((b), 32)), 0x08))
#endif

Instead of _mm_srli_epi64(..., 32), we may use _mm_srli_si128(..., 4) -
which of these is faster and which is slower varies by CPU type.

I think this can be optimized (manually) when put in specific context -
there's not always a need to have the vector elements in the same order
(and even in the same vectors) as with _mm_mullo_epi32() proper.

I think this adds at best 4 cycles of latency on top of PMULUDQ's 5
cycles (on Haswell and Avoton), bringing the total latency to at least
9, which is only 1 or 2 cycles better than those archs' native PMULLD.
Maybe the context-specific optimization I mentioned can bring this
closer to 5 cycles latency.

This makes me think that maybe we actually want to depend on a 32x32->64
expanding multiply.
It's 3 to 5 cycles latency with SIMD, and also 3 to 5 without (3 is
reached by using x86-64's 64x64->64 with 32-bit inputs; otherwise it's
minimum 4 cycles for non-SIMD) - so no harm from SIMD, as long as our
algorithm is chosen such that no vector element shuffling is needed.

Some other forms look somewhat attractive as well.

Alexander