Split-Path Fused Floating Point Multiply Accumulate (FPMAC)

Computer Arithmetic (2013)

Abstract
The floating point multiply-accumulate (FPMAC) unit is the backbone of modern processors and a key circuit determining the frequency, power and area of microprocessors. FPMAC units are used extensively in contemporary client microprocessors, a use further proliferated by ISA support for extensions such as AVX and SSE, and they are also used extensively in server processors employed for engineering and scientific applications. Consequently, the design of the FPMAC is a vital consideration, since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design that focuses on optimal computation in the critical path, making it the fastest FPMAC design reported in the literature to date. The design is based on the premise of isolating and optimizing the critical path computation in the FPMAC operation. We present three key innovations that yield a novel double precision FPMAC with the fewest gate stages to date in the timing-critical path: a) splitting near and far paths based on the exponent difference (d = Exy - Ez ∈ {-2, -1, 0, 1} is the near path and all other values are the far path); b) early injection of the accumulate add for the near path into the Wallace tree, eliminating a 3:2 compressor from the near path critical logic by exploiting the small alignment shifts in the near path and the sparse Wallace tree for 53-bit mantissa multiplication; c) a combined round and accumulate add that eliminates the completion adder from the multiplier, giving both timing and power benefits. Because of the split, our design consumes less power per operation, since only the logic required for each case switches. Splitting the paths also provides substantial opportunities for clock or power gating the unused portion of the logic gates (nearly 15-20%) purely on the basis of the exponent difference signals.
We also demonstrate support for all rounding modes, adhering to the IEEE standard for double precision FPMAC, which is critical for employing this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation, that of the IBM Power6 [6], by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared with the best known timing design, from Lang et al. [5], and outperforms it by 7% while being 30% smaller in area.
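The near/far split described in innovation (a) can be illustrated with a minimal sketch. It assumes only what the abstract states: the path is chosen from the exponent difference d = Exy - Ez, with d in {-2, -1, 0, 1} selecting the near path. The names `select_path`, `exy`, and `ez` are hypothetical, and reading Exy as the product exponent and Ez as the addend exponent is an assumption; this is not the authors' circuit.

```python
# Exponent differences that the abstract assigns to the near path.
NEAR_PATH_DIFFS = frozenset({-2, -1, 0, 1})


def select_path(exy: int, ez: int) -> str:
    """Return "near" or "far" for the exponent difference d = exy - ez.

    exy and ez are hypothetical integer exponents (assumed here to be the
    product exponent and the addend exponent, respectively).
    """
    d = exy - ez
    return "near" if d in NEAR_PATH_DIFFS else "far"
```

In hardware, a signal of this kind could also drive the clock or power gating of the unused path that the abstract mentions; the sketch models only the classification.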
Keywords
far path,avx isa,ieee standard,timing design,instruction set architecture,wallace tree,near path,microprocessor chips,ieee rounding,point multiply accumulate,isa support,double precision floating point multiply-accumulate,server processor,microprocessor area,fpmac unit,fastest fpmac design,novel fpmac design,logic design,critical path,ibm power6,near path critical logic,normalization,double precision fpmac,ieee standards,exponent difference,fpmac operation,instruction sets,microprocessor frequency,split-path fused floating point multiply accumulate,contemporary client microprocessor,fpmac design,sse isa,microprocessor power,floating point arithmetic,critical path computation,adders,logic gates,hardware