Algorithms for Fast Cubic Evaluation
2 July 2025

It's been a while; a lot's happened. I got accepted to Stanford's MS CS program, and I even graduated from there last month. During my last quarter there, I took EE 372: Design Projects in VLSI Systems II. In the iteration of the course I took, Priyanka essentially gave us the source code for MINOTAUR, and asked us to improve it however we saw fit. I mainly focused on improving the vector unit — the part of the accelerator that handles activations, element-wise operations, and other low-arithmetic-intensity tasks.
I was not the only one working on the vector unit though. Another group looked at changing the strategy it used to compute activation functions. Ultimately, they settled on piecewise-cubic activations, with programmable coefficients and interval bounds. I interacted with them, and I investigated ways to make the computation of these cubic polynomials more efficient.
Let's say we have some
%% p(x) = c_3 x^3 + c_2 x^2 + c_1 x + c_0. %%
Naïvely implementing this in hardware, by evaluating all the multiplications before computing the additions, gives a relatively poor result. It requires six multipliers and three adders, and its critical path consists of two multipliers and two adders.
flowchart TB
    x3[$$x$$]
    x2[$$x$$]
    x1[$$x$$]
    c0[$$c_0$$]
    c1[$$c_1$$]
    c2[$$c_2$$]
    c3[$$c_3$$]
    c3m0[$$\times$$]
    c3m1[$$\times$$]
    c3m2[$$\times$$]
    c3 --> c3m0
    x3 --> c3m0
    x3 --> c3m1
    x3 --> c3m1
    c3m0 --> c3m2
    c3m1 --> c3m2
    c2m0[$$\times$$]
    c2m1[$$\times$$]
    c2 --> c2m0
    x2 --> c2m0
    c2m0 --> c2m1
    x2 --> c2m1
    c1m0[$$\times$$]
    c1 --> c1m0
    x1 --> c1m0
    a0[$$+$$]
    a1[$$+$$]
    a2[$$+$$]
    c3m2 --> a0
    c2m1 --> a0
    c1m0 --> a1
    c0 --> a1
    a0 --> a2
    a1 --> a2
    a2 --> Output
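As a sketch, here's the naïve dataflow in Python (the function name and the `c = [c0, c1, c2, c3]` coefficient ordering are my choices):

```python
# A minimal sketch of the naïve dataflow: every power of x is rebuilt from
# scratch, so evaluating c3*x^3 + c2*x^2 + c1*x + c0 costs six multiplies
# and three adds.
def naive(x: float, c: list[float]) -> float:
    term3 = c[3] * x * x * x  # three multipliers
    term2 = c[2] * x * x      # two multipliers
    term1 = c[1] * x          # one multiplier
    # Three adders, arranged as a tree so the adder depth is only two.
    return (term3 + term2) + (term1 + c[0])
```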
A better idea is to use Horner's Scheme, which decomposes @@p@@ as
%% p(x) = ((c_3 \cdot x + c_2) \cdot x + c_1) \cdot x + c_0. %%
Horner's Scheme has a longer critical path: three multipliers and three adders. But it uses less area, needing only those same three multipliers and three adders in total. Possibly for that reason, this was the initial scheme used in MINOTAUR. Area is particularly important for its vector unit: most of its operations are performed on 32-wide vectors, pipelined and in parallel, so any area savings are multiplied by 32.
flowchart TB
    x3[$$x$$]
    x2[$$x$$]
    x1[$$x$$]
    c0[$$c_0$$]
    c1[$$c_1$$]
    c2[$$c_2$$]
    c3[$$c_3$$]
    c3m[$$\times$$]
    c2a[$$+$$]
    c3 --> c3m
    x3 --> c3m
    c3m --> c2a
    c2 --> c2a
    c2m[$$\times$$]
    c1a[$$+$$]
    c2a --> c2m
    x2 --> c2m
    c2m --> c1a
    c1 --> c1a
    c1m[$$\times$$]
    c0a[$$+$$]
    c1a --> c1m
    x1 --> c1m
    c1m --> c0a
    c0 --> c0a
    c0a --> Output
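In Python, Horner's Scheme is a chain of multiply-accumulates (the function name is mine):

```python
# Horner's Scheme as a chain of fused multiply-adds: three multipliers and
# three adders in total, and all of them sit on the critical path.
def horner(x: float, c: list[float]) -> float:
    acc = c[3]
    for coeff in (c[2], c[1], c[0]):
        acc = acc * x + coeff  # one MAC per step; steps cannot overlap
    return acc
```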
Another improvement over the naïve approach is to use Estrin's Scheme, which instead recursively factorizes @@p@@ as
%% p(x) = x^2 \cdot (c_3 x + c_2) + (c_1 x + c_0). %%
In total, Estrin's Scheme uses four multipliers and three adders. Its critical path consists of two multipliers and two adders. In other words, for just an additional multiplier compared to Horner's Scheme, this algorithm improves on its critical path by a full Multiply-Accumulate (MAC). And in fact, when this approach was implemented in MINOTAUR, it saved area over Horner's Scheme. Its shorter critical path allowed the pipeline depth to be reduced by one stage, eliminating an entire set of pipeline registers.
flowchart TB
    xsq[$$x$$]
    xl[$$x$$]
    xr[$$x$$]
    c0[$$c_0$$]
    c1[$$c_1$$]
    c2[$$c_2$$]
    c3[$$c_3$$]
    sq[$$\times$$]
    xsq --> sq
    xsq --> sq
    ml[$$\times$$]
    al[$$+$$]
    c3 --> ml
    xl --> ml
    ml --> al
    c2 --> al
    mr[$$\times$$]
    ar[$$+$$]
    c1 --> mr
    xr --> mr
    mr --> ar
    c0 --> ar
    mt[$$\times$$]
    at[$$+$$]
    sq --> mt
    al --> mt
    mt --> at
    ar --> at
    at --> Output
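A Python sketch of Estrin's Scheme, mirroring the diagram above (again, the names are mine):

```python
# Estrin's Scheme: the square x*x and the two inner MACs are independent,
# so they can all run in parallel, leaving a critical path of just two
# multipliers and two adders.
def estrin(x: float, c: list[float]) -> float:
    x_sq = x * x           # runs in parallel with the two MACs below
    hi = c[3] * x + c[2]   # high half
    lo = c[1] * x + c[0]   # low half
    return x_sq * hi + lo  # final MAC combines the halves
```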
The above approaches were actually synthesized in MINOTAUR. It's possible that they leave performance on the table though. Specifically, note that all the algorithms given above take the "raw" coefficients @@c_3@@, …, @@c_0@@ as input. But Wikipedia's page on Polynomial Evaluation points out that pre-processing these coefficients can decrease the number of multipliers and adders required. Knuth's Algorithm¹ provides a concrete way to do that.
Knuth's Algorithm points out that, by applying polynomial long-division, we can write
%% p(x) = (x^2 + \alpha) (k_1 x + k_0) + \beta x + \gamma, %%
for some set of constants. Expanding the right-hand side and matching coefficients gives @@k_1 = c_3@@, @@k_0 = c_2@@, @@c_1 = \alpha k_1 + \beta@@, and @@c_0 = \alpha k_0 + \gamma@@. The only knob we have is @@\alpha@@; once it's fixed, the divisor @@x^2 + \alpha@@ is set and the rest of the constants are determined. The key idea is to judiciously set @@\alpha := \alpha^*@@ such that @@\beta = 0@@. This can be done by picking
%% \begin{align*} k_1^* &= c_3 \nl k_0^* &= c_2 \nl \alpha^* &= \frac{c_1}{k_1^*} \nl \gamma^* &= c_0 - \alpha^* k_0^*, \end{align*} %%
which works so long as @@c_3 \neq 0@@. In MINOTAUR, the @@c_3 = 0@@ case can be handled with a few multiplexers that reconfigure the multipliers and adders used by Knuth's Algorithm to run Horner's Scheme on the remaining quadratic. In the end, Knuth's Algorithm prescribes
def preprocess(c: list[float]) -> tuple[bool, float, float, float, float]:
    """Derive Knuth's constants from the raw coefficients c = [c0, c1, c2, c3]."""
    cubic = c[3] != 0
    if cubic:
        k1 = c[3]
        k0 = c[2]
        α = c[1] / k1
        γ = c[0] - α * k0
    else:
        # Degenerate case: fall back to Horner's Scheme on the quadratic.
        k1 = c[2]
        k0 = c[1]
        α = float('nan')  # Don't care
        γ = c[0]
    return (cubic, k1, k0, α, γ)

def hardware(
    x: float,
    cubic: bool,
    k1: float, k0: float, α: float, γ: float,
) -> float:
    quotient = k1 * x + k0
    divisor = x * x + α
    # A MUX on the top multiplier selects between the cubic configuration
    # (multiply by the divisor) and the quadratic Horner fallback
    # (multiply by x).
    whole = quotient * (divisor if cubic else x)
    return whole + γ

def evaluate(x: float, c: list[float]) -> float:
    return hardware(x, *preprocess(c))
Ignoring MUX overhead, it requires three multipliers and three adders, and it has a critical path of two multipliers and two adders. Thus, it is strictly better than both Horner's and Estrin's Schemes. It does require preprocessing, but that's okay for MINOTAUR.
flowchart TB
    xq[$$x$$]
    xd[$$x$$]
    alpha[$$\alpha$$]
    gamma[$$\gamma$$]
    k0[$$k_0$$]
    k1[$$k_1$$]
    mq[$$\times$$]
    aq[$$+$$]
    k1 --> mq
    xq --> mq
    mq --> aq
    k0 --> aq
    md[$$\times$$]
    ad[$$+$$]
    xd --> md
    xd --> md
    md --> ad
    alpha --> ad
    mt[$$\times$$]
    at[$$+$$]
    aq --> mt
    ad --> mt
    mt --> at
    gamma --> at
    at --> Output
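As a sanity check, here's a self-contained Python sketch (the helper names are mine) confirming that the preprocessing identity, with @@\beta@@ forced to zero, reproduces a reference Horner evaluation on a sample cubic:

```python
# Verify p(x) = (x^2 + α)(k1·x + k0) + γ against Horner's Scheme, for the
# non-degenerate case c3 != 0. Coefficients are ordered c = [c0, c1, c2, c3].
def knuth_eval(x: float, c: list[float]) -> float:
    k1, k0 = c[3], c[2]
    alpha = c[1] / k1            # chosen so the βx term vanishes
    gamma = c[0] - alpha * k0
    return (x * x + alpha) * (k1 * x + k0) + gamma

def horner_ref(x: float, c: list[float]) -> float:
    return ((c[3] * x + c[2]) * x + c[1]) * x + c[0]

coeffs = [1.0, -2.0, 0.5, 3.0]   # p(x) = 3x^3 + 0.5x^2 - 2x + 1
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # Agreement up to floating-point rounding in the preprocessing division.
    assert abs(knuth_eval(x, coeffs) - horner_ref(x, coeffs)) < 1e-9
```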
To close, even though none of the algorithms described here are entirely new, they don't seem to be widely known. For instance, I independently rediscovered Estrin's Scheme, and I came to Knuth's Algorithm myself after seeing a different algorithm inspired by it in a source I have since lost. Furthermore, in my experience with MINOTAUR, Horner's Scheme is often treated as the "default" approach for polynomial evaluation in hardware, even when other approaches might be better. Either way, it took some work to find these algorithms, so hopefully this post can save someone else from redoing it.
One question that remains is whether Knuth's Algorithm is "optimal". According to CS 4972 at UIUC, Knuth's Algorithm is known to use the lowest possible number of multiplications and additions (or subtractions). But that result says nothing about the critical path. As Estrin's Scheme showed in MINOTAUR, it may be better to optimize the critical path instead of total area.
-
¹ There are multiple sources for Knuth's Algorithm. It seems this paper introduced it, but Sec. 2 of this one has a better exposition of it, in my opinion.