Algorithms for Fast Cubic Evaluation
2 July 2025

It's been a while; a lot's happened. I got accepted to Stanford's MS CS program, and I even graduated from there last month. During my last quarter there, I took EE 372: Design Projects in VLSI Systems II. In the iteration of the course I took, Priyanka essentially gave us the source code for MINOTAUR, and asked us to improve it however we saw fit. I mainly focused on improving the vector unit — the part of the accelerator that handles activations, element-wise operations, and other low-arithmetic-intensity tasks.
I was not the only one working on the vector unit though. Another group looked at changing the strategy it used to compute activation functions. Ultimately, they settled on piecewise-cubic activations, with programmable coefficients and interval bounds. I interacted with them, and I investigated ways to make the computation of these cubic polynomials more efficient.
Let's say we have some
%% p(x) = c_3 x^3 + c_2 x^2 + c_1 x + c_0. %%
Naïvely implementing this in hardware, by evaluating all the multiplications before computing the additions, gives a relatively poor result. It requires six multipliers and three adders, and its critical path consists of two multipliers and two adders.
flowchart TB
    x3[$$x$$]
    x2[$$x$$]
    x1[$$x$$]
    c0[$$c_0$$]
    c1[$$c_1$$]
    c2[$$c_2$$]
    c3[$$c_3$$]
    c3m0[$$\times$$]
    c3m1[$$\times$$]
    c3m2[$$\times$$]
    c3 --> c3m0
    x3 --> c3m0
    x3 --> c3m1
    x3 --> c3m1
    c3m0 --> c3m2
    c3m1 --> c3m2
    c2m0[$$\times$$]
    c2m1[$$\times$$]
    c2 --> c2m0
    x2 --> c2m0
    c2m0 --> c2m1
    x2 --> c2m1
    c1m0[$$\times$$]
    c1 --> c1m0
    x1 --> c1m0
    a0[$$+$$]
    a1[$$+$$]
    a2[$$+$$]
    c3m2 --> a0
    c2m1 --> a0
    c1m0 --> a1
    c0 --> a1
    a0 --> a2
    a1 --> a2
    a2 --> Output
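As a sketch, here's the naïve dataflow in Python (the function name and the `c = [c0, c1, c2, c3]` coefficient ordering are my choices):

```python
# A minimal sketch of the naïve dataflow: every power of x is rebuilt from
# scratch, so evaluating c3*x^3 + c2*x^2 + c1*x + c0 costs six multiplies
# and three adds.
def naive(x: float, c: list[float]) -> float:
    term3 = c[3] * x * x * x  # three multipliers
    term2 = c[2] * x * x      # two multipliers
    term1 = c[1] * x          # one multiplier
    # Three adders, arranged as a tree so the adder depth is only two.
    return (term3 + term2) + (term1 + c[0])
```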
A better idea is to use Horner's Scheme, which decomposes @@p@@ as
%% p(x) = ((c_3 \cdot x + c_2) \cdot x + c_1) \cdot x + c_0. %%
Horner's Scheme has a longer critical path: three multipliers and three adders. But it uses less area, needing only those same three multipliers and three adders in total. Possibly for that reason, this was the initial scheme used in MINOTAUR. Area is particularly important for its vector unit: most of its operations are performed on 32-wide vectors, pipelined and in parallel, so any area savings are multiplied by 32.
flowchart TB
    x3[$$x$$]
    x2[$$x$$]
    x1[$$x$$]
    c0[$$c_0$$]
    c1[$$c_1$$]
    c2[$$c_2$$]
    c3[$$c_3$$]
    c3m[$$\times$$]
    c2a[$$+$$]
    c3 --> c3m
    x3 --> c3m
    c3m --> c2a
    c2 --> c2a
    c2m[$$\times$$]
    c1a[$$+$$]
    c2a --> c2m
    x2 --> c2m
    c2m --> c1a
    c1 --> c1a
    c1m[$$\times$$]
    c0a[$$+$$]
    c1a --> c1m
    x1 --> c1m
    c1m --> c0a
    c0 --> c0a
    c0a --> Output
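In Python, Horner's Scheme is a chain of multiply-accumulates (the function name is mine):

```python
# Horner's Scheme as a chain of fused multiply-adds: three multipliers and
# three adders in total, and all of them sit on the critical path.
def horner(x: float, c: list[float]) -> float:
    acc = c[3]
    for coeff in (c[2], c[1], c[0]):
        acc = acc * x + coeff  # one MAC per step; steps cannot overlap
    return acc
```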
Another improvement over the naïve approach is to use Estrin's Scheme, which instead recursively factorizes @@p@@ as
%% p(x) = x^2 \cdot (c_3 x + c_2) + (c_1 x + c_0). %%
In total, Estrin's Scheme uses four multipliers and three adders. Its critical path consists of two multipliers and two adders. In other words, for just an additional multiplier compared to Horner's Scheme, this algorithm improves on its critical path by a full Multiply-Accumulate (MAC). And in fact, when this approach was implemented in MINOTAUR, it saved area over Horner's Scheme. Its shorter critical path allowed the pipeline depth to be reduced by one stage, eliminating an entire set of pipeline registers.
flowchart TB
    xsq[$$x$$]
    xl[$$x$$]
    xr[$$x$$]
    c0[$$c_0$$]
    c1[$$c_1$$]
    c2[$$c_2$$]
    c3[$$c_3$$]
    sq[$$\times$$]
    xsq --> sq
    xsq --> sq
    ml[$$\times$$]
    al[$$+$$]
    c3 --> ml
    xl --> ml
    ml --> al
    c2 --> al
    mr[$$\times$$]
    ar[$$+$$]
    c1 --> mr
    xr --> mr
    mr --> ar
    c0 --> ar
    mt[$$\times$$]
    at[$$+$$]
    sq --> mt
    al --> mt
    mt --> at
    ar --> at
    at --> Output
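A Python sketch of Estrin's Scheme, mirroring the diagram above (again, the names are mine):

```python
# Estrin's Scheme: the square x*x and the two inner MACs are independent,
# so they can all run in parallel, leaving a critical path of just two
# multipliers and two adders.
def estrin(x: float, c: list[float]) -> float:
    x_sq = x * x           # runs in parallel with the two MACs below
    hi = c[3] * x + c[2]   # high half
    lo = c[1] * x + c[0]   # low half
    return x_sq * hi + lo  # final MAC combines the halves
```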
The above approaches were actually synthesized in MINOTAUR. It's possible that they leave performance on the table though. Specifically, note that all the algorithms given above take the "raw" coefficients @@c_3@@, …, @@c_0@@ as input. But Wikipedia's page on Polynomial Evaluation points out that pre-processing these coefficients can decrease the number of multipliers and adders required. Knuth's Algorithm¹ provides a concrete way to do that.
Knuth's Algorithm points out that, by applying polynomial long-division, we can write
%% p(x) = (x^2 + \alpha) (k_1 x + k_0) + \beta x + \gamma, %%
for some set of constants. Expanding the right-hand side and matching coefficients gives @@k_1 = c_3@@, @@k_0 = c_2@@, @@c_1 = \alpha k_1 + \beta@@, and @@c_0 = \alpha k_0 + \gamma@@. The only knob we have is @@\alpha@@; once it's fixed, the divisor @@x^2 + \alpha@@ is set and the rest of the constants are determined. The key idea is to judiciously set @@\alpha := \alpha^*@@ such that @@\beta = 0@@. This can be done by picking
%% \begin{align*} k_1^* &= c_3 \nl k_0^* &= c_2 \nl \alpha^* &= \frac{c_1}{k_1^*} \nl \gamma^* &= c_0 - \alpha^* k_0^*, \end{align*} %%
which works so long as @@c_3 \neq 0@@. In MINOTAUR, the @@c_3 = 0@@ case can be handled with a few multiplexers that reconfigure the multipliers and adders used by Knuth's Algorithm to run Horner's Scheme on the remaining quadratic. In the end, Knuth's Algorithm prescribes
def preprocess(c: list[float]) -> tuple[bool, float, float, float, float]:
    """Derive Knuth's constants from the raw coefficients c = [c0, c1, c2, c3]."""
    cubic = c[3] != 0
    if cubic:
        k1 = c[3]
        k0 = c[2]
        α = c[1] / k1
        γ = c[0] - α * k0
    else:
        # Degenerate case: fall back to Horner's Scheme on the quadratic.
        k1 = c[2]
        k0 = c[1]
        α = float('nan')  # Don't care
        γ = c[0]
    return (cubic, k1, k0, α, γ)

def hardware(
    x: float,
    cubic: bool,
    k1: float, k0: float, α: float, γ: float,
) -> float:
    quotient = k1 * x + k0
    divisor = x * x + α
    # A MUX on the top multiplier selects between the cubic configuration
    # (multiply by the divisor) and the quadratic Horner fallback
    # (multiply by x).
    whole = quotient * (divisor if cubic else x)
    return whole + γ

def evaluate(x: float, c: list[float]) -> float:
    return hardware(x, *preprocess(c))
Ignoring MUX overhead, it requires three multipliers and three adders, and it has a critical path of two multipliers and two adders. Thus, it is strictly better than both Horner's and Estrin's Schemes. It does require preprocessing, but that's okay for MINOTAUR.
flowchart TB
    xq[$$x$$]
    xd[$$x$$]
    alpha[$$\alpha$$]
    gamma[$$\gamma$$]
    k0[$$k_0$$]
    k1[$$k_1$$]
    mq[$$\times$$]
    aq[$$+$$]
    k1 --> mq
    xq --> mq
    mq --> aq
    k0 --> aq
    md[$$\times$$]
    ad[$$+$$]
    xd --> md
    xd --> md
    md --> ad
    alpha --> ad
    mt[$$\times$$]
    at[$$+$$]
    aq --> mt
    ad --> mt
    mt --> at
    gamma --> at
    at --> Output
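As a sanity check, here's a self-contained Python sketch (the helper names are mine) confirming that the preprocessing identity, with @@\beta@@ forced to zero, reproduces a reference Horner evaluation on a sample cubic:

```python
# Verify p(x) = (x^2 + α)(k1·x + k0) + γ against Horner's Scheme, for the
# non-degenerate case c3 != 0. Coefficients are ordered c = [c0, c1, c2, c3].
def knuth_eval(x: float, c: list[float]) -> float:
    k1, k0 = c[3], c[2]
    alpha = c[1] / k1            # chosen so the βx term vanishes
    gamma = c[0] - alpha * k0
    return (x * x + alpha) * (k1 * x + k0) + gamma

def horner_ref(x: float, c: list[float]) -> float:
    return ((c[3] * x + c[2]) * x + c[1]) * x + c[0]

coeffs = [1.0, -2.0, 0.5, 3.0]   # p(x) = 3x^3 + 0.5x^2 - 2x + 1
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # Agreement up to floating-point rounding in the preprocessing division.
    assert abs(knuth_eval(x, coeffs) - horner_ref(x, coeffs)) < 1e-9
```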
To close, even though none of the algorithms described here are entirely new, they don't seem to be widely known. For instance, I independently rediscovered Estrin's Scheme, and I came to Knuth's Algorithm myself after seeing a different algorithm inspired by it in a source I have since lost. Furthermore, in my experience with MINOTAUR, Horner's Scheme is often treated as the "default" approach for polynomial evaluation in hardware, even when other approaches might be better. Either way, it took some work to find these algorithms, so hopefully this post can save someone else from redoing it.
One question that remains is whether Knuth's Algorithm is "optimal". According to CS 4972 at UIUC, Knuth's Algorithm is known to use the lowest possible number of multiplications and additions (or subtractions). But that result says nothing about the critical path. As Estrin's Scheme showed in MINOTAUR, it may be better to optimize the critical path instead of total area.
-
¹ There are multiple sources for Knuth's Algorithm. It seems this paper introduced it, but Sec. 2 of this one has a better exposition of it, in my opinion.