I was reading this very interesting question on Stack Overflow:
Is integer multiplication really done at the same speed as addition on a modern CPU?
One of the comments said:
"It's worth noting that on Haswell, the FP multiply throughput is double that of FP add. That's because both ports 0 and 1 can be used for multiply, but only port 1 can be used for add. That said, you can cheat with fused multiply-adds, since both ports can do them."
Why would they allow twice as many simultaneous multiplications as additions?
cpu
computer-architecture
alu
floating-point
intel
user1271772
Answers:
This possibly answers the title of the question, if not the body:
Floating point addition requires aligning the two mantissas before adding them (depending on the difference between the two exponents), potentially requiring a large variable amount of shift before the adder. Then renormalizing the result of the mantissa addition might be needed, potentially requiring another large variable amount of shift in order to properly format the floating point result. The two mantissa barrel shifters thus potentially require more gate delays, greater wire delays, or extra cycles that exceed the delay of a well compacted carry-save-adder-tree multiplier front end.
Added for the OP: Note that adding lengths of 2 millimetres and 2 kilometres does not give 4 of either unit. That's because one measurement or the other needs to be converted to the same scale or unit representation before the addition. That conversion essentially requires a multiplication by some power of 10. The same thing usually has to happen during floating-point addition, because floating-point numbers are a form of variably scaled integers (i.e. there is a unit or scale factor, an exponent, associated with each number). So you may need to scale one of the numbers by a power of 2 before adding raw mantissa bits, so that both represent the same units or scale. This scaling is essentially a simple form of multiplication by a power of 2. Hence, floating-point addition requires multiplication (which, being by a power of 2, can be done with a variable bit shift or barrel shifter, which can require relatively long wires relative to the transistor sizes, and which can be relatively slow in deep-submicron-lithography circuits).

If the two numbers mostly cancel (because one is nearly the negative of the other), then there may be a need to rescale the result of the addition as well, to suitably format the result. So addition can be slow if it furthermore requires 2 multiplication steps (pre and post) surrounding the binary addition of a fixed (finite) number of mantissa bits representing equivalent units or scales, due to the nature of the number format (IEEE floating point).
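To make that concrete, here is a minimal C++ sketch of the pre-shift / add / post-shift sequence, using the standard frexp/ldexp functions to expose the exponent and mantissa (real hardware does this with barrel shifters on raw mantissa bits; this only illustrates the dataflow):

    #include <cstdio>
    #include <cmath>

    // Mimic FP addition's align -> add -> renormalize steps.
    // std::frexp splits x into m * 2^e with m in [0.5, 1); std::ldexp rebuilds it.
    double fp_add_sketch(double a, double b) {
        int ea, eb;
        double ma = std::frexp(a, &ea);
        double mb = std::frexp(b, &eb);
        // Pre-shift: scale the smaller operand up to the larger exponent
        // (the first "multiplication by a power of 2").
        if (ea >= eb) { mb = std::ldexp(mb, eb - ea); eb = ea; }
        else          { ma = std::ldexp(ma, ea - eb); ea = eb; }
        double m = ma + mb;   // the fixed-point mantissa add
        // Post-shift: renormalize the result (the second power-of-2 scaling).
        int en;
        m = std::frexp(m, &en);
        return std::ldexp(m, ea + en);
    }

    int main() {
        std::printf("%.9g\n", fp_add_sketch(2e-3, 2e3));  // prints 2000.002
        return 0;
    }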
Added #2: Also, many benchmarks weight FMACs (multiply-accumulates) more heavily than bare adds. In a fused MAC, the alignment (shift) of the addend can often be mostly done in parallel with the multiply, and the mantissa add can often be included in the CSA tree before the final carry propagation.
In FP multiplication, exponent processing turns out to be simple addition, for exactly the same reason that multiplication in the log domain is merely addition: (m1 · 2^e1) × (m2 · 2^e2) = (m1 · m2) · 2^(e1 + e2), so the exponents simply add while the mantissas multiply. You have come across logarithms, I hope.
Now consider how difficult it is to add two numbers in logarithmic form...
Floating point inhabits a grey area between the linear and log domains, with aspects of both. Each FP number comprises a mantissa (which is linear) and a (logarithmic) exponent. To determine the meaning of each bit in the mantissa, you first have to look at the exponent (which is just a scale factor).
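For concreteness, here's a small C++ sketch (the helper name is mine) that pulls those two fields out of an IEEE-754 single:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // An IEEE-754 float: 1 sign bit, 8 exponent bits (biased by 127),
    // and 23 stored mantissa bits with an implicit leading 1 for normals.
    void decode(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);                  // safe type-pun
        unsigned      sign = bits >> 31;
        int           exp  = int((bits >> 23) & 0xFF) - 127;  // scale factor
        std::uint32_t frac = bits & 0x7FFFFF;
        // Counting from the top, mantissa bit i is worth 2^(exp - 1 - i):
        // no mantissa bit has a meaning until you know exp.
        std::printf("sign=%u exp=%d fraction=0x%06X\n", sign, exp, frac);
    }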
In FP addition, exponent processing in the general case requires barrel shifting the mantissa twice, where each barrel shift is effectively a special case of a slightly simplified multiplication.

(The first shift aligns both inputs to the same power of 2, so that a mantissa bit has the same binary weight in each operand. A decimal example will suffice (though binary is obviously used): 1.234e4 + 5.678e3 must first be aligned to 1.234e4 + 0.5678e4, and only then can the mantissas be added to give 1.8018e4.

The second re-scales the output: 1.000e0 - 9.999e-1 aligns to 1.0000e0 - 0.9999e0 = 0.0001e0, which must be renormalised to 1.000e-4.)
So paradoxically, a FP addition involves something very much like two multiplications which have to be performed sequentially, with the mantissa addition between them. In that light, the reported performance is not so surprising.
fuente
TL:DR: because Intel thought SSE/AVX FP add latency was more important than throughput, they chose not to run it on the FMA units in Haswell/Broadwell.
Haswell runs (SIMD) FP multiply on the same execution units as FMA (Fused Multiply-Add), of which it has two because some FP-intensive code can use mostly FMAs to do 2 FLOPs per instruction. That's the same 5 cycle latency as FMA, and as mulps on earlier CPUs (Sandybridge/IvyBridge). Haswell wanted 2 FMA units, and there's no downside to letting multiply run on either because they have the same latency as the dedicated multiply unit in earlier CPUs.

But it keeps the dedicated SIMD FP add unit from earlier CPUs to still run addps/addpd with 3 cycle latency. I've read that the possible reasoning might be that code which does a lot of FP add tends to bottleneck on its latency, not throughput. That's certainly true for a naive sum of an array with only one (vector) accumulator, like you often get from GCC auto-vectorizing. But I don't know if Intel has publicly confirmed that was their reasoning.

Broadwell is the same (but sped up mulps/mulpd to 3c latency while FMA stayed at 5c). Perhaps they were able to shortcut the FMA unit and get the multiply result out before doing a dummy add of 0.0, or maybe something completely different and that's way too simplistic. BDW is mostly a die-shrink of HSW with most changes being minor.

In Skylake everything FP (including addition) runs on the FMA unit with 4 cycle latency and 0.5c throughput, except of course div/sqrt and bitwise booleans (e.g. for absolute value or negation). Intel apparently decided that it wasn't worth extra silicon for lower-latency FP add, or that the unbalanced addps throughput was problematic. Standardizing latencies also makes write-back conflicts (when 2 results are ready in the same cycle) easier to avoid in uop scheduling, i.e. it simplifies scheduling and/or completion ports.

So yes, Intel did change it in their next major microarchitecture revision (Skylake). Reducing FMA latency by 1 cycle made the benefit of a dedicated SIMD FP add unit a lot smaller, for cases that were latency bound.
Skylake also shows signs of Intel getting ready for AVX512, where extending a separate SIMD-FP adder to 512 bits wide would have taken even more die area. Skylake-X (with AVX512) reportedly has an almost-identical core to regular Skylake-client, except for larger L2 cache and (in some models) an extra 512-bit FMA unit "bolted on" to port 5.
SKX shuts down the port 1 SIMD ALUs when 512-bit uops are in flight, but it needs a way to execute vaddps xmm/ymm/zmm at any point. This made having a dedicated FP ADD unit on port 1 a problem, and is a separate motivation for change from performance of existing code.

Fun fact: everything from Skylake, Kaby Lake, Coffee Lake and even Cascade Lake has been microarchitecturally identical to Skylake, except for Cascade Lake adding some new AVX512 instructions. IPC hasn't changed otherwise. Newer CPUs have better iGPUs, though. Ice Lake (Sunny Cove microarchitecture) is the first time in several years that we've seen an actual new microarchitecture (except the never-widely-released Cannon Lake).
Arguments based on the complexity of an FMUL unit vs. an FADD unit are interesting but not relevant in this case. An FMA unit includes all the necessary shifting hardware to do FP addition as part of an FMA (see footnote 1).
Note: I don't mean the x87 fmul instruction, I mean an SSE/AVX SIMD/scalar FP multiply ALU that supports 32-bit single-precision / float and 64-bit double precision (53-bit significand aka mantissa), e.g. instructions like mulps or mulsd. Actual 80-bit x87 fmul is still only 1/clock throughput on Haswell, on port 0.

Modern CPUs have more than enough transistors to throw at problems when it's worth it, and when it doesn't cause physical-distance propagation delay problems. Especially for execution units that are only active some of the time. See https://en.wikipedia.org/wiki/Dark_silicon and this 2011 conference paper: Dark Silicon and the End of Multicore Scaling. This is what makes it possible for CPUs to have massive FPU throughput, and massive integer throughput, but not both at the same time (because those different execution units are on the same dispatch ports, so they compete with each other). In a lot of carefully-tuned code that doesn't bottleneck on mem bandwidth, it's not back-end execution units that are the limiting factor, but instead front-end instruction throughput (wide cores are very expensive). See also http://www.lighterra.com/papers/modernmicroprocessors/.
Before Haswell
Before HSW, Intel CPUs like Nehalem and Sandybridge had SIMD FP multiply on port 0 and SIMD FP add on port 1. So there were separate execution units and throughput was balanced. (https://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle)
Haswell introduced FMA support into Intel CPUs (a couple years after AMD introduced FMA4 in Bulldozer, after Intel faked them out by waiting as late as they could to make it public that they were going to implement 3-operand FMA, not 4-operand non-destructive-destination FMA4). Fun fact: AMD Piledriver was still the first x86 CPU with FMA3, about a year before Haswell's June 2013 launch.
This required some major hacking of the internals to even support a single uop with 3 inputs. But anyway, Intel went all-in and took advantage of ever-shrinking transistors to put in two 256-bit SIMD FMA units, making Haswell (and its successors) beasts for FP math.
A performance target Intel might have had in mind was BLAS dense matmul and vector dot product. Both of those can use mostly FMAs and have little need for plain add on its own.
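As an illustration (my sketch, not anything from Intel): a dot product written with AVX+FMA intrinsics, where nearly every arithmetic instruction is an FMA doing 2 FLOPs. Two accumulators keep both FMA ports busy; per the footnote further down, fully hiding the latency would take more of them.

    #include <immintrin.h>
    #include <cstddef>

    // Dot product built almost entirely from FMAs (compile with -mfma).
    // Assumes n is a multiple of 16; a sketch, not a tuned kernel.
    float dot_fma(const float* a, const float* b, std::size_t n) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 16) {
            // acc = a[i..] * b[i..] + acc: a multiply and an add per instruction
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                   _mm256_loadu_ps(b + i), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                                   _mm256_loadu_ps(b + i + 8), acc1);
        }
        // Horizontal sum of the two accumulators.
        __m256 acc = _mm256_add_ps(acc0, acc1);
        __m128 v = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        v = _mm_add_ps(v, _mm_movehl_ps(v, v));
        v = _mm_add_ss(v, _mm_shuffle_ps(v, v, 1));
        return _mm_cvtss_f32(v);
    }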
As I mentioned earlier, some workloads that do mostly or just FP addition are bottlenecked on add latency, (mostly) not throughput.
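A minimal sketch of that bottleneck (scalar for brevity; the same applies per vector lane): the first loop is one serial dependency chain of adds, so it runs at one add per 3 cycles on Haswell no matter how many add ports exist. The second uses four independent accumulators, which a compiler won't do for you without -ffast-math because FP addition isn't associative.

    #include <cstddef>

    // Latency-bound: each += must wait for the previous result.
    float sum_naive(const float* a, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Throughput-bound: four independent chains keep more adds in flight.
    float sum_unrolled(const float* a, std::size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];     s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; ++i) s0 += a[i];   // leftover elements
        return (s0 + s1) + (s2 + s3);
    }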
Footnote 1: And with a multiplier of 1.0, FMA literally can be used for addition, but with worse latency than an addps instruction. This is potentially useful for workloads like summing an array that's hot in L1d cache, where FP add throughput matters more than latency. This only helps if you use multiple vector accumulators to hide the latency, of course, and keep 10 FMA operations in flight in the FP execution units (5c latency / 0.5c throughput = 10 operations, the latency × bandwidth product). You need to do that when using FMA for a vector dot product, too.

See David Kanter's write-up of the Sandybridge microarchitecture, which has a block diagram of which EUs are on which port for NHM, SnB, and AMD Bulldozer-family. (See also Agner Fog's instruction tables and asm optimization microarch guide, and also https://uops.info/, which also has experimental testing of uops, ports, and latency/throughput of nearly every instruction on many generations of Intel microarchitectures.)
Also related: https://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle
(More detail in my answers under the [cpu-architecture], [performance], [x86-64], [assembly], and [sse] tags. I wrote an answer on C++ code for testing the Collatz conjecture faster than hand-written assembly - why? that a lot of people think is good. Also this about OoO pipelined execution.)

I'm going to look at this part:
"Why is it that they would allow"...
TL;DR - because they designed it that way. It is a management decision. Sure, there are answers about mantissas and bit shifters, but those are inputs to the management decision.
Why did they design it that way? The answer is that the specs are made to meet certain goals. Those goals include performance and cost. Performance is geared not toward the individual operations, but rather toward benchmarks like FLOPS or FPS in Crysis.
These benchmarks will have a mix of functions, some of which can be processed at the same time.
If the designers figure that having two of widget A makes the chip much faster than having two of widget B, then they will go with widget A. Implementing two of A and two of B would cost more.
Looking back to when superscalar execution and superpipelining (before multi-core) first became common on commercial chips, these were there to increase performance. The Pentium had two pipes, and no vector units. Haswell has more pipes, vector units, a deeper pipeline, dedicated functions, and more. Why aren't there two of everything? Because they designed it that way.
This diagram from Intel may help:
It appears they've given each unit an FMA (fused multiply-add) as well as a multiply and a single adder. They may or may not share hardware underneath.
The question of why is a lot harder to answer without internal design rationales, but the text in the purple box gives us a hint with "doubles peak FLOPs": the processor will be targeting a set of benchmarks, derived from actual use cases. FMA is very popular in these since it is the basic unit of matrix multiplication. Bare addition is less popular.
You can, as has been pointed out, use both ports for addition with an FMA instruction where the multiplication parameter is 1, computing (A × 1) + B. This will be slightly slower than a bare addition.
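A minimal sketch of that trick with AVX+FMA intrinsics (compile with -mfma; the function name is mine):

    #include <immintrin.h>

    // a + b computed as (a * 1.0f) + b on the FMA units.
    // On Haswell this trades addps's 3c latency for FMA's 5c,
    // but gains throughput: 2 ports instead of 1.
    __m256 add_via_fma(__m256 a, __m256 b) {
        return _mm256_fmadd_ps(a, _mm256_set1_ps(1.0f), b);
    }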
Let's take a look at the time consuming steps:
Addition: Align the exponents (may be a massive shift operation). One 53 bit adder. Normalisation (by up to 53 bits).
Multiplication: One massive adder network to reduce 53 x 53 one bit products to the sum of two 106 bit numbers. One 106 bit adder. Normalisation. I would say reducing the bit products to two numbers can be done about as fast as the final adder.
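A sketch of that significand multiply in C++, using a compiler-specific 128-bit type since 53 × 53 bits needs a 106-bit result (hardware reduces the partial-product bits with a carry-save adder tree instead of a wide integer multiply):

    #include <cstdint>

    // Multiply two 53-bit significands (implicit leading 1 included).
    // The product has 105 or 106 significant bits; if bit 105 is set,
    // normalisation is a single 1-bit shift.
    unsigned __int128 mul_significands(std::uint64_t ma, std::uint64_t mb) {
        unsigned __int128 p = (unsigned __int128)ma * mb;  // GCC/Clang extension
        bool shift_by_one = (p >> 105) & 1;  // the common, fast case
        (void)shift_by_one;                  // (normalisation not shown)
        return p;
    }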
If you can make multiplication take variable time, then you have the advantage that normalisation will only shift by one bit most of the time, and you can detect the other cases very quickly (denormalised inputs, or the sum of exponents being too small).
For addition, needing normalisation steps is very common (adding numbers that are not of equal size, subtracting numbers that are close). So for multiplication you can afford to have a fast path and take a massive hit for the slow path; for addition you can't.
PS. Reading the comments: It makes sense that adding denormalised numbers doesn't cause a penalty: it only means that among the bits that are shifted to align the exponents, many are zeroes. And a denormalised result means that you stop shifting to remove leading zeroes if doing so would make the exponent too small.
fuente
-ffast-math sets FTZ / DAZ (flush denormals to zero) to do that instead of taking an FP assist.
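In code, that is effectively what the startup object linked by -ffast-math does: set the FTZ and DAZ bits in MXCSR via the SSE intrinsics:

    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

    // Denormal results flush to zero (FTZ) and denormal inputs are
    // treated as zero (DAZ), so no microcode assist is ever taken.
    void enable_ftz_daz() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }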