From: Stephen Nuchia (snuchia_at_[hidden])
Date: 2008-08-22 09:24:02
> Correct, it only prevents the compiler from reordering -
> it doesn't emit any fence instruction itself. You must
> use _mm_lfence/_mm_sfence/_mm_mfence if you want that.
> It just happens that for many things, fence instructions
> aren't needed on current x86 hardware.
This is correct as far as the take-home message is concerned but it is
not exactly right. The "Wintel" way of doing shared-memory
multiprocessing, as reflected in both the x86 instruction set
architecture and the implementation of all but the most exotic x86-based
systems, is based on a cache-coherent, write-through memory model.
Current x86 hardware realizes this model -- the instruction set
architecture -- using a number of tricks at the microarchitecture level.
Those details change from one generation to the next but the ISA stays
(mostly) the same.
Under the coherent-cache model, changes to memory in one instruction
stream are visible to all other instruction streams on the system
immediately. There is no need for a fence *instruction*; it could have
no effect in this model. But there is still a need to prevent the
compiler from reordering the memory operations in an instruction stream.
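
To make the compiler-versus-hardware distinction concrete, here is a minimal sketch. It assumes a C++11 or later compiler and uses std::atomic, which is not part of the original discussion; the names payload, ready, publish and try_consume are hypothetical. On x86 a release store and an acquire load like these typically compile to ordinary MOV instructions with no fence instruction, yet the compiler may not move the payload accesses across them.

```cpp
#include <atomic>
#include <cassert>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

// Writer side: store the data, then set the flag.
void publish(int value) {
    // Illustrative precondition: this sketch is a one-shot handoff.
    assert(!ready.load(std::memory_order_relaxed));
    payload = value;
    // Release store: the compiler may not sink the payload store below
    // this line. On x86 this is typically a plain MOV, no fence emitted.
    ready.store(true, std::memory_order_release);
}

// Reader side: returns true and fills `out` once the flag has been seen.
bool try_consume(int& out) {
    // Acquire load: the compiler may not hoist the payload load above
    // this line. On x86 this is typically a plain MOV as well.
    if (!ready.load(std::memory_order_acquire)) {
        return false;
    }
    out = payload;
    return true;
}
```

One thread would call publish() and another would poll try_consume(). The same source built for a more weakly ordered architecture would generally get real barrier instructions from the compiler, which is the point of expressing the ordering in the language rather than by hand.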
Now, there are extensions to the instruction set, notably with SSE2+,
that provide an escape from the cache-coherent memory model. With those
come fence instructions, because you need them once you leave the
cache-coherent model.
Following quoted from
Streaming instructions include the non-temporal stores MOVNTDQ, MOVNTI,
MOVNTPS, MOVNTPD, MOVNTSD, MOVNTSS and the MMX instruction MOVNTQ.
However, unlike regular stores, non-temporal stores are weakly ordered
relative to other loads and stores. If strong ordering of stores is
required, an SFENCE instruction should be used between the non-temporal
stores and any succeeding normal stores. See Section 11.4, "Memory
Barrier Operations" on page 196 for further recommendations on memory
Streaming instructions can dramatically improve memory-write
performance. They write data directly to memory through write-combining
buffers, bypassing the cache. This is faster than PREFETCHW because data
does not need to be initially read from memory to fill the cache lines,
only to be completely overwritten shortly thereafter. The new data is
simply written to memory, replacing the old data in memory, so no memory
read is performed.
One application where streaming is useful, often in conjunction with
prefetch instructions, is in copying large blocks of memory.
Note: The streaming instructions are not recommended or necessary for
write-combined memory regions since the processor automatically combines
writes for those regions. Write-combine memory types are indicated
through the MTRRs and the page-attribute table (PAT).
Note: For best performance, do not mix streaming instructions on a cache
line with non-streaming store instructions.
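
As a rough illustration of the quoted advice (my sketch, not part of the quoted text), here is a block copy that uses non-temporal stores and then issues SFENCE before returning, so that any normal store the caller makes afterwards is ordered after the streamed data. It assumes an x86 target with SSE2 and the standard intrinsics _mm_loadu_si128, _mm_stream_si128 (MOVNTDQ) and _mm_sfence. The function name stream_copy and its alignment and size preconditions are hypothetical choices made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_sfence
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128

// Hypothetical helper: copy `bytes` bytes using non-temporal stores.
// Assumes dst is 16-byte aligned and bytes is a multiple of 16.
void stream_copy(void* dst, const void* src, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(bytes % 16 == 0);

    auto* d = static_cast<__m128i*>(dst);
    const auto* s = static_cast<const __m128i*>(src);

    for (std::size_t i = 0; i < bytes / 16; ++i) {
        __m128i v = _mm_loadu_si128(s + i);  // source may be unaligned
        _mm_stream_si128(d + i, v);          // MOVNTDQ: weakly ordered store
    }

    // Order the streamed stores before any store that follows, for
    // example a "buffer is ready" flag written by the caller.
    _mm_sfence();
}
```

The loop writes every byte of each destination cache line with streaming stores only, in keeping with the note about not mixing them with normal stores on the same line. Prefetching the source, which the quoted text mentions, is left out here.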