FMA4 on Matisse?

Melzzzzz

2019-10-20 08:18:15 UTC

FMA4 works on my 2700X. The question is: is it still supported
on Zen2?
The instructions are not reported by cpuid at all, but they work.
I got a nice speedup with FMA4 on Zen.
Why did FMA3 win when FMA4 is clearly superior?
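
For anyone who wants to check what their CPU advertises: on Linux the flags line in /proc/cpuinfo is derived from CPUID, so it shows what is reported, not what actually executes. A minimal sketch, assuming Linux and the kernel's flag names "fma" (FMA3) and "fma4"; the helper names cpu_flags and reported_fma are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 or non-Linux input)


def reported_fma(cpuinfo_text):
    """Report whether FMA3 ('fma') and FMA4 ('fma4') are advertised."""
    flags = cpu_flags(cpuinfo_text)
    return {"fma3": "fma" in flags, "fma4": "fma4" in flags}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(reported_fma(f.read()))
    except OSError:
        print("cannot read /proc/cpuinfo (not Linux?)")
```

A result with fma4 False only says the feature bit is not set; whether the instructions still execute has to be tested separately.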

--
press any key to continue or any other to quit...
I enjoy nothing as much as my status as an INVALID -- Zli Zec
In the Wild West there wasn't actually that much violence, precisely because
everyone was armed. -- Mladen Gogala

Bonita Montero

2019-10-20 18:55:35 UTC

Post by Melzzzzz
FMA4 works on my 2700X. The question is: is it still supported
on Zen2?
The instructions are not reported by cpuid at all, but they work.
I got a nice speedup with FMA4 on Zen.
Why did FMA3 win when FMA4 is clearly superior?

Floating-point code contains a lot of long-latency instructions. Even
when there are parallel dependency chains that can be pipelined, there
is usually room to hide the movs needed to avoid overwriting a register.
And often the overwritten value doesn't need to be kept anyway. So the
speedup may be only slight, and sometimes there is none at all.

Anton Ertl

2019-10-21 08:36:49 UTC

Intel supports FMA3, but not FMA4.

If your question is why they do that, I can only speculate:

1) Well-placed register-register moves can have latency 0 and cost no
execution unit resources, only some front-end resources, because the
register renamer eliminates them. So the advantage of FMA4 on Intel
may be minuscule (not sure if the Zen register renamer works the same
way; if so, maybe you can get similar performance with FMA3 if you
arrange the move appropriately).

2) FMA4 may require complications in the instruction decoder and
register renamer that the Intel engineers were not prepared to pay
for, given the small benefit.
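
To make point 1 concrete, here is a toy model (not real machine code) of computing d = a*b + c while a, b and c all have to stay live. The destructive three-operand form overwrites one of its sources, so it needs a copy first; the four-operand form does not. The register names and the functions run_fma3 and run_fma4 are invented for the illustration; the mnemonics in the comments are the real ones. Plain Python arithmetic rounds twice, unlike a real fused multiply-add, which does not matter for counting instructions.

```python
def run_fma3(regs):
    """d = a*b + c using a destructive 3-operand FMA; a, b, c must survive."""
    trace = []
    regs["d"] = regs["c"]                           # vmovapd d, c  (extra copy)
    trace.append("mov d, c")
    regs["d"] = regs["a"] * regs["b"] + regs["d"]   # vfmadd231pd d, a, b
    trace.append("fma3 d, a, b")
    return trace


def run_fma4(regs):
    """d = a*b + c using a non-destructive 4-operand FMA."""
    trace = []
    regs["d"] = regs["a"] * regs["b"] + regs["c"]   # vfmaddpd d, a, b, c
    trace.append("fma4 d, a, b, c")
    return trace


if __name__ == "__main__":
    for run in (run_fma3, run_fma4):
        regs = {"a": 2.0, "b": 3.0, "c": 4.0}
        trace = run(regs)
        print(run.__name__, len(trace), "instruction(s):", "; ".join(trace))
```

If the renamer eliminates the copy, the FMA3 version pays only for the extra front-end slot; and when c is dead after the operation, the copy is not needed at all.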

- anton

--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Bonita Montero

2019-10-21 11:56:39 UTC

Post by Anton Ertl
1) Well-placed register-register moves can have latency 0 and cost no
execution unit resources, only some front-end resources, because the
register renamer eliminates them.

It still takes one decoder slot and thereby delays the other
instructions, which are pushed into the next decoded bundle.