Solving the Mystery of the ARM7TDMI Multiply Carry Flag

This blog post presumes a base level of knowledge – comfort with the C programming language and bitwise math is recommended. Also, if you ever have any questions, any at all, while reading this blog post, feel free to reach out to me here.

The Game Boy Advance has a pretty neat CPU – the ARM7TDMI. And by neat, I mean a chaotic and
sadistic bundle of questionable design decisions. Seriously, they decided that the program counter should
be a general purpose register. Why??? That's like allowing a drunk driver to change their tires while going 30 over the speed limit near a school. I'm not even joking, you can use the program
counter as the output of, say, an XOR instruction. Or an AND instruction.

Or a multiply instruction.

Multiplication on the ARM7TDMI has a few neat features. You can multiply two 32-bit operands together to produce a 64-bit result. You can also optionally choose to do a multiply-add and add a third 64-bit operand to the 64-bit result, within the same instruction. Additionally, you can choose to treat the two 32-bit operands as either signed or unsigned.

Why are we talking about the multiplication instruction? Well, the ARM7TDMI's multiplication instructions have a pretty interesting side effect. Here the manual says that
after a multiplication instruction executes, the carry flag is set to a "meaningless value".

A short description of the carry and overflow flags after a multiplication instruction, from the ARM7TDMI manual. [1]

What this means is that software cannot and
should not rely on the value of the carry flag after multiplication executes. It can be set to anything. Any
value. 0, 1, a horse, whatever. This has been a source of memes in the emulator development community for a few years –
people would commonly joke about how the implementation of the carry flag may as well be cpu.flags.c = rand() & 1;. And they had a point – the carry flag seemed to defy all patterns; nobody understood why it
behaves the way it does. But the one thing we did know was that the carry flag seemed to be
deterministic. That is, under the same set of inputs to a multiply instruction, the flag would be set to the
same value. This was huge news, because it meant that understanding the carry flag could give us key
insight into how this CPU implements multiplication.

And just to get this out of the way, the carry flag's behavior after multiplication isn't an important detail to
emulate at all. Software doesn't rely on it. And if software did rely on it, then screw the developers who wrote that software. But the carry flag is a meme, and it's a really hard puzzle, and
that was motivation enough for me to give it a go. Little did I know it'd take 3 years of on and off work.

What's the simplest, most basic multiplication algorithm you can think of to multiply a multiplier with a multiplicand? One really simple way is to
leverage the distributive property of multiplication like so:

$$
123 \cdot 4 = 100 \cdot 4 + 20 \cdot 4 + 3 \cdot 4
$$

There are two steps here – first compute the addends, then sum them. This
is the basic two-step process you'll find in lots of multiplication algorithms – most of them simply differ in
how they compute the addends, or how they add the addends together. We can generalize this algorithm
to binary pretty easily too:

$$
1101 \cdot 11 = 1000 \cdot 11 + 100 \cdot 11 + 0 \cdot 11 + 1 \cdot 11
$$

The handy thing about binary is that it's all ones and zeros, meaning the addends are only ever 0, or
the multiplicand left shifted by some amount. This makes the addends simple to compute, and means that for
an N-bit number, we need to produce N different addends, and add them all up to get the result.
That's a lot of addends, which is slow. We can do better.
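
A minimal sketch of that naive shift-and-add approach, using the post's u32/u64 typedefs (the function name is mine, and this is purely illustrative – it is not how the ARM7TDMI multiplies):

u64 naive_multiply(u32 multiplicand, u32 multiplier) {
    u64 result = 0;
    for (int i = 0; i < 32; i++) {
        // One addend per multiplier bit: either 0 or the multiplicand
        // left shifted by i. That's 32 addends for a 32-bit multiplier.
        if ((multiplier >> i) & 1) {
            result += (u64) multiplicand << i;
        }
    }
    return result;
}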

The main slowness of the Standard Algorithm is that it requires you to add a lot of numbers together.
Modified Booth's algorithm is an improvement on the Standard Algorithm that cuts the number of addends in half. Let's start with the standard definition of multiplication, written as a summation. Note that m[i] is defined as the bit at index i of m when 0 <= i < n.

$$
\begin{aligned}
m \cdot \alpha &= \sum_{i=0}^{n-1} (2^i \cdot m[i] \cdot \alpha)
\end{aligned}
$$

Now we apply the following transformations. Yes, I know this looks scary; you can skip to the final equation if you want.

$$
\begin{align}
m \cdot \alpha &= \sum_{i=0}^{n-1} (2^i \cdot m[i] \cdot \alpha) \\
&\quad\text{Separate the summation into even and odd elements:} \\
m \cdot \alpha &= \sum_{i=0}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) \\
&\quad\text{Split the second summation into two more summations:} \\
m \cdot \alpha &= \sum_{i=0}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + (2 - 1) \cdot \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) \\
m \cdot \alpha &= \sum_{i=0}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + 2 \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) - \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) \\
m \cdot \alpha &= \sum_{i=0}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+2} \cdot m[2i+1] \cdot \alpha) - \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) \\
&\quad\text{Pull one element out of each summation, one at a time:} \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+2} \cdot m[2i+1] \cdot \alpha) - \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) + (m[0] \cdot \alpha) \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + \sum_{i=0}^{\frac{n}{2}-2} (2^{2i+2} \cdot m[2i+1] \cdot \alpha) - \sum_{i=0}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) + (m[0] \cdot \alpha + 2^{n} \cdot m[n-1] \cdot \alpha) \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + \sum_{i=0}^{\frac{n}{2}-2} (2^{2i+2} \cdot m[2i+1] \cdot \alpha) - \sum_{i=1}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) + (m[0] \cdot \alpha + 2^{n} \cdot m[n-1] \cdot \alpha - 2 \cdot m[1] \cdot \alpha) \\
&\quad\text{Manipulate the range of the second summation to align with the other two:} \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha) + \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i-1] \cdot \alpha) - \sum_{i=1}^{\frac{n}{2}-1} (2^{2i+1} \cdot m[2i+1] \cdot \alpha) + (m[0] \cdot \alpha + 2^{n} \cdot m[n-1] \cdot \alpha - 2 \cdot m[1] \cdot \alpha) \\
&\quad\text{So that we can combine the summations:} \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha + 2^{2i} \cdot m[2i-1] \cdot \alpha - 2^{2i+1} \cdot m[2i+1] \cdot \alpha) + (m[0] \cdot \alpha + 2^{n} \cdot m[n-1] \cdot \alpha - 2 \cdot m[1] \cdot \alpha) \\
&\quad\text{Some cleanup:} \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} (2^{2i} \cdot m[2i] \cdot \alpha + 2^{2i} \cdot m[2i-1] \cdot \alpha - 2 \cdot 2^{2i} \cdot m[2i+1] \cdot \alpha) + (m[0] \cdot \alpha + 2^{n} \cdot m[n-1] \cdot \alpha - 2 \cdot m[1] \cdot \alpha) \\
m \cdot \alpha &= \sum_{i=1}^{\frac{n}{2}-1} ((2^{2i} \cdot \alpha) \cdot (m[2i] + m[2i-1] - 2 \cdot m[2i+1])) + (m[0] \cdot \alpha + 2^{n} \cdot m[n-1] \cdot \alpha - 2 \cdot m[1] \cdot \alpha)
\end{align}
$$

Whew. Did you get all of that? Why did we do all this? Well, notice this part of the summation:

$$
\begin{aligned}
(m[2i] + m[2i-1] - 2 \cdot m[2i+1])
\end{aligned}
$$

This is always one of (-2, -1, 0, 1, 2).

Multiplication by those five numbers is simple to calculate in hardware (well, negation is tricky – the algorithm implements negation as bitwise inversion, with an additional 1 added at a later stage. More information about this is given later).

Note also that if we define:

$$
m[-1] = 0
$$

and

$$
\begin{aligned}
\text{For}&\text{ unsigned multiplication:} \\
&x \geq n: m[x] = 0 \\
\text{For}&\text{ signed multiplication:} \\
&x \geq n: m[x] = m[n-1]
\end{aligned}
$$

Then the leftover three terms outside the summation can be merged into the summation, by enlarging the summation's range by one on both boundaries. And so we have:

$$
\begin{aligned}
m \cdot \alpha &= \sum_{i=0}^{\frac{n}{2}} ((2^{2i} \cdot \alpha) \cdot (m[2i] + m[2i-1] - 2 \cdot m[2i+1]))
\end{aligned}
$$

Before all this mathematical mess, we used to have n addends. Now we have just over half that many addends. We can model the generation of an addend, sans the left shift, using the following C code:

// represents a 3-bit chunk that is used to determine an addend's value
typedef u8 BoothChunk;

struct BoothRecodingOutput {
    u64  recoded_output;
    bool carry;
};

// booth_chunk is a 3-bit number representing bits [2i - 1 .. 2i + 1]
// of the multiplier
struct BoothRecodingOutput booth_recode(u64 input, BoothChunk booth_chunk) {
    switch (booth_chunk) {
        case 0: return (struct BoothRecodingOutput) {            0, 0 };
        case 1: return (struct BoothRecodingOutput) {        input, 0 };
        case 2: return (struct BoothRecodingOutput) {        input, 0 };
        case 3: return (struct BoothRecodingOutput) {    2 * input, 0 };
        case 4: return (struct BoothRecodingOutput) { ~(2 * input), 1 };
        case 5: return (struct BoothRecodingOutput) {       ~input, 1 };
        case 6: return (struct BoothRecodingOutput) {       ~input, 1 };
        case 7: return (struct BoothRecodingOutput) {            0, 0 };
    }

    // Note that case 4 can *not* be implemented as 2 * (~input). The reason
    // is that the real value of the addend as represented by the struct is
    // recoded_output + carry. Doing the inversion after the multiplication by 2
    // puts a 1 in the LSB of recoded_output, allowing the carry to be
    // added correctly.
}
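
As a quick sanity check on the recoding table, summing all 17 recoded addends reconstructs an unsigned 32x32 -> 64-bit product. This helper is mine, purely for illustration; it works modulo 2^64, which is exactly why the inversion-plus-carry trick implements negation:

u64 multiply_via_booth(u64 multiplicand, u64 multiplier) {
    u64 result = 0;
    for (int i = 0; i <= 16; i++) {
        // Chunk i covers bits [2i + 1 .. 2i - 1] of the multiplier;
        // the left shift by one supplies m[-1] = 0.
        BoothChunk chunk = ((multiplier << 1) >> (2 * i)) & 7;
        struct BoothRecodingOutput addend = booth_recode(multiplicand, chunk);

        // The addend's true value is recoded_output + carry, weighted by 2^(2i).
        result += (addend.recoded_output + addend.carry) << (2 * i);
    }
    return result;
}

// e.g. multiply_via_booth(0xDEADBEEF, 0xCAFEBABE) == 0xDEADBEEFULL * 0xCAFEBABEULL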

For the curious, more information about Booth recoding can be found in this resource. [2]

Now that we have the addends, it's time to actually add them up to produce the result. However, using a
conventional full adder, the ARM7TDMI is only fast enough to add two numbers per cycle. Which means
you'd have to spend 16 cycles to add all 17 addends, which is way too slow. The reason full adders are so
slow is the carry propagation – bit N of the result can't be resolved until bit N - 1 is
resolved. Can we get rid of this issue?

Introducing… drum roll… carry save adders (CSAs)! These are genius – instead of outputting a single N-bit result, CSAs output one N-bit result without carry propagation, and one N-bit list of carries computed from each bit. At first this seems kind of silly – are CSAs really just adding two N-bit operands and
producing two N-bit results? What's the point? The point is that you can actually fit in an extra operand,
and turn three N-bit operands into two N-bit results. [3] Like so:

struct CSAOutput {
    u64 output;
    u64 carry;
};

struct CSAOutput perform_csa(u64 a, u64 b, u64 c) {
    // Bit i in output should be set if there is either 1 set bit in
    // a/b/c at index i, or 3 set bits in a/b/c at index i. Similarly,
    // bit i in carry should be set if there are 2 or 3 set bits in
    // a/b/c at index i. See if you can convince yourself why this is
    // correct.

    u64 output = a ^ b ^ c;
    u64 carry  = (a & b) | (b & c) | (c & a);
    return (struct CSAOutput) { output, carry };
}

So you can chain a bunch of CSAs to get yourself down to two addends, and then you can shove those two
N-bit results into a regular adder, like so:

u64 add_csa_results(u64 result, u64 carries) {
    // Exercise for the reader: Why do you suppose we multiply
    // carries by 2? Think about how a full adder is implemented,
    // and what the variable "carries" in the perform_csa function
    // above actually represents. The answer is given after this
    // code block.

    return result + carries * 2;
}

The reason we multiply carries by two is because, if we think about how a full adder works, the carry out
from bit i is added to bit i + 1 of the addends. So, bit i of carries has double the "weight" of bit i of
result. This is a very important detail that will come in handy later, so do make sure you understand
this.
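
A tiny sanity check of my own (not from the post's sources): compressing three operands with the CSA and then resolving with add_csa_results gives the same answer as an ordinary three-way add, modulo 2^64.

bool csa_matches_plain_add(u64 a, u64 b, u64 c) {
    struct CSAOutput csa = perform_csa(a, b, c);
    // csa.carry carries twice the weight of csa.output, which is why
    // add_csa_results doubles it before the final add.
    return add_csa_results(csa.output, csa.carry) == a + b + c;   // always true
}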

Using CSAs, the ARM7TDMI can sum up the addends much faster. [4, p. 94]

Until now, we've mostly treated "generate the addends" and "add the addends" as two separate, entirely
discrete steps of the algorithm. Can we do them at the same time? Turns out, yes! We can generate some number of addends per cycle, and add them together using CSAs in the same cycle. We repeat this process until we've added up all our addends, and then we can send the results from the CSAs to the ALU to be added together.

This is what the ARM7TDMI does – it generates four addends per cycle, and compresses
them using four CSAs to produce only two addends.

Each cycle, we read 8 bits from the multiplier, and with them, we generate 4 addends. We then
feed them into 4 of the 6 inputs of this CSA array, and when we have our 2 results, we feed those
2 results back to the very top of the CSA array for the next cycle. On the first cycle of the algorithm, we can initialize those 2 inputs to the
CSA array with 0s.
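
The code later in this post indexes addends.m[i] but never shows how the four per-cycle addends are produced, so here is one plausible shape for that step (the struct layout and function name are my guesses; the incoming multiplier value is assumed to already hold m[2i - 1] in its LSB, which is zero on the very first cycle):

struct RecodedMultiplicands {
    struct BoothRecodingOutput m[4];
};

struct RecodedMultiplicands get_addends(u64 multiplicand, u64 multiplier) {
    struct RecodedMultiplicands addends;
    for (int i = 0; i < 4; i++) {
        // Chunk i covers bits [2i + 1 .. 2i - 1] of the multiplier.
        BoothChunk chunk = (multiplier >> (2 * i)) & 7;
        addends.m[i] = booth_recode(multiplicand, chunk);
    }
    return addends;
}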

A clever trick can be done here. The ARM7TDMI supports multiply accumulates, which perform multiplication and addition in one instruction. We can implement multiply accumulate by initializing one
of those two inputs with the accumulate value, and get multiply accumulate without extra cycles. This trick is what the
ARM7TDMI uses to do multiply accumulate. (This ends up being a moot point, because the CPU is dumb and can only read two register values at a time per cycle. So, using an accumulate causes the CPU to take
an extra cycle anyway). [4, p.95]

The ARM7TDMI does something really clever here. In our current model of the algorithm, there are 4
cycles of CSA compression, where each cycle i processes bits 8 * i to 8 * i + 7 of the multiplier. The observation is that if the remaining upper bits of the multiplier are all
zeros, then we can skip that cycle, since the addends produced will be all zeros, which cannot possibly
affect the values of the partial result + partial carry. We can do the same trick if the remaining upper bits
are all ones (assuming we are performing a signed multiplication), as those also produce addends that
are all zeros. [4, p.95]
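
One way to model this cycle-skipping rule in isolation might look like the sketch below (the function name is mine; the actual check, shown later, operates on the running 33-bit multiplier as it gets shifted right each cycle):

int count_csa_cycles(u32 multiplier, bool signed_multiply) {
    int cycles = 1;
    u32 remaining = multiplier >> 8;   // bits not yet consumed after cycle 1
    while (cycles < 4) {
        if (remaining == 0) break;                                           // all zeros
        if (signed_multiply && remaining == (0xFFFFFFFFu >> (8 * cycles))) break; // all ones
        remaining >>= 8;
        cycles++;
    }
    return cycles;
}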

Here's a crude diagram, provided by Steve Furber in his book, ARM System-on-Chip Architecture:


An image of the high level overview of the multiplier's organization, provided by Steve Furber in his book, ARM System-on-Chip Architecture. [4, p.95]

Partial Sum / Partial Carry hold the results obtained by the CSAs, and are rotated right by 8 on each cycle. Rm is recoded using Booth's algorithm to produce the addends for the CSA array. [4, p.95]

Ok, but remember when I said that there would be an elegant way to handle Booth's negation of the addends? The way the algorithm gets around this is kind of genius. Remember how the carry output of a CSA has to be left shifted by 1? Well, this left shift produces a zero in the LSB of the carry output of the CSA, so why don't we just put the carry in that bit? [5, p. 12] Like so:

struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
                                   struct RecodedMultiplicands addends) {
    struct CSAOutput csa_output = { partial_sum, partial_carry };
    struct CSAOutput final_csa_output = { 0, 0 };

    for (int i = 0; i < 4; i++) {
        csa_output.output &= 0x1FFFFFFFFULL;
        csa_output.carry  &= 0x1FFFFFFFFULL;

        struct CSAOutput result = perform_csa(csa_output.output,
            addends.m[i].recoded_output & 0x3FFFFFFFFULL, csa_output.carry);

        // Inject the carry caused by booth recoding
        result.carry <<= 1;
        result.carry |= addends.m[i].carry;

        // Take the bottom two bits and inject them into the final output.
        // The value of the bottom two bits will not be changed by future
        // addends, because those addends must be at least 4 times as big
        // as the current addend. By directly injecting these two bits, the
        // hardware saves some space on the chip.
        final_csa_output.output |= (result.output & 3) << (2 * i);
        final_csa_output.carry  |= (result.carry  & 3) << (2 * i);

        // The next CSA will only operate on the upper bits - as explained
        // in the previous comment.
        result.output >>= 2;
        result.carry  >>= 2;

        csa_output = result;
    }

    final_csa_output.output |= csa_output.output << 8;
    final_csa_output.carry  |= csa_output.carry  << 8;

    return final_csa_output;
}

(Yes, this insanity is indeed done by the actual CPU.)

Didn't we just finish the section titled "Putting it all Together"? Why then is the scroll bar still halfway down the page?

Because I lied to you all. There's a small, but very meaningful difference between the algorithm I described and
the ARM7TDMI's algorithm. Let's consider the following multiplication:

$$
\text{0x000000FF} \cdot \text{0x00000001}
$$

How many cycles should this take? 1, right? Because the upper 24 bits of the multiplier are zeros, the
second, third, and fourth cycles of addends will all be zeros… right?
Right?
Well, that's how long it takes the ARM7TDMI to do it. So what's the issue? Let's take a closer look. The first cycle of the algorithm should have the following four chunks:

$$
\begin{aligned}
&\text{Chunk \#1: } \text{0b110 (obtained from m[1..-1])} \\
&\text{Chunk \#2: } \text{0b111 (obtained from m[3..1])} \\
&\text{Chunk \#3: } \text{0b111 (obtained from m[5..3])} \\
&\text{Chunk \#4: } \text{0b111 (obtained from m[7..5])}
\end{aligned}
$$

Turns out, in our current version of the algorithm, the second cycle does have a single non-zero addend:

$$
\begin{aligned}
&\text{Chunk \#1: } \text{0b001 (obtained from m[9..7])} \\
&\text{Chunk \#2: } \text{0b000 (obtained from m[11..9])} \\
&\text{Chunk \#3: } \text{0b000 (obtained from m[13..11])} \\
&\text{Chunk \#4: } \text{0b000 (obtained from m[15..13])}
\end{aligned}
$$

Because the LSB of Chunk #1 of Cycle #2 uses the MSB of Chunk #4 of Cycle #1, our algorithm would be forced to
take 2 cycles of CSAs. And yet, on the ARM7TDMI, this multiplication would terminate early, after only 1 cycle of CSAs. And there doesn't seem to be a good way around this. And so I sat there thinking of
workarounds.

Proposed solution #1: What if the ARM7TDMI actually processes 5 chunks per cycle?

Rebuttal: If that were the case, then the algorithm would be able to process 9 bits on the first cycle,
which it cannot do.

Proposed solution #2: Ok, but what if the ARM7TDMI has some way of processing chunk #1 of cycle
n on cycle n - 1, but only if cycle n - 1 is the last cycle of the algorithm?

Rebuttal: Sure, maybe this is possible, but it feels like any solution that would allow the algorithm to do this would also
be capable of allowing the CPU to do 5 chunks per cycle.

Proposed solution #3: Fine, what if the CPU actually leverages the power of Cthulhu and evil warlock
magic to pull this off?

Rebuttal: Yeah, that's attributing too much credit to this god forsaken bundle of wires that somehow
obtained the title of "CPU" (actually, this solution ends up being closest to the right answer).

I was kind of out of ideas. I was pretty much ready to give up – my current algorithm was nowhere near
explaining the behavior of the CPU carry flag. And so I took a break, only looking at this problem every
once in a while.

Congrats on getting this far, now comes the tricky stuff. I require anyone who wants to continue reading to
put on
this music in the background, as it most accurately models the trek into insanity we are about to endure.

So fast forward about a year, I'm out for a walk and I decide to give this problem some thought again. And so I considered something that, at the outset, sounds really, really dumb.

"What if we left shifted the multiplier by 1 at the beginning of the algorithm?"

I mean, it's kind of stupid, right? The entire issue is that the multiplier is too big. Left shifting it would only exacerbate the issue. Congrats, we went from being able to process 7 bits on the first cycle to 6.

But pay attention to the first addend that would be produced. The corresponding chunk would either be 000 or 100. Two options, both of which are really simple to compute. This is a behavior that would only exist on the first cycle of the algorithm. Coincidentally, if you refer to the diagram up above, you'll notice that, in the first cycle of the algorithm, we have an extra input to the CSA array that we initialized to zero. What if, instead, we initialize it to the addend produced by this mythical chunk? Allowing us to process one additional bit on the first cycle only? [5, p. 14]

It'd fix the issue. It'd get us the extra bit we needed, and make our cycle counts match the ARM7TDMI's exactly.

But that's not all. Remember the carry flag from earlier? With this simple change, we go from matching hardware about 50% of the time (no better than randomly guessing) to matching hardware 85% of the time. This sudden increase was something no other theory was able to do, and it made me really confident that I was on to something. However, this percentage only happens if we set the carry flag to bit 30 of the partial carry result, which seems super arbitrary. It turns out that bit of the partial carry result had a special meaning I did not realize at the time, and I would only find out that meaning much, much later.

(Obviously, shifting the multiplier left by 1 means the result is now twice what it should be. This is handled later.)
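
To make the idea concrete, here is a small sketch (the helper name is mine, and this models the idea rather than the exact hardware):

struct BoothRecodingOutput get_seed_addend(u64 multiplicand, u64 multiplier) {
    // The newly exposed chunk spans bits [1 .. -1] of the left-shifted
    // multiplier, i.e. just bit 0 of the original multiplier followed by
    // two zeros - so it is always 0b000 or 0b100.
    BoothChunk seed_chunk = (multiplier & 1) << 2;
    return booth_recode(multiplicand, seed_chunk);
}

// This seed addend goes into the CSA-array input that used to be initialized
// to zero, buying one extra recoded bit on the first cycle.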

We have a few remaining issues with our implementation of perform_csa_array; let's discuss them one at a time.

Handling 64-bit Accumulates

First of all, we don't know how to handle 64-bit accumulates yet. Thankfully, it was around this time that I found two patents [5], [6] that ended up being incredibly illuminating.

We know how to handle 32-bit accumulates – just initialize the partial sum with the value of the accumulator. We can use a similar trick for 64-bit ones. First, we can initialize the partial sum with the bottom 33 bits of the 64-bit accumulate. Why 33? I thought the partial sum was 32 bits wide? Well, if we make the width of the partial sum 33 bits, we'd also be able to handle unsigned and signed multiplication by zero- / sign-extending appropriately. This way, our algorithm itself only needs to be able to perform signed multiplication, and our choice of zero-extension or sign-extension at initialization will handle the rest. More on this in the next section.

We take the remaining 31 bits of the acc and drip-feed them, 2 bits per CSA, like so:

// Contains the current high 31 bits of the acc.
// This is shifted by 2 after each CSA.
u64 acc_shift_register = 0;

struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
                                   struct RecodedMultiplicands addends) {
    struct CSAOutput csa_output = { partial_sum, partial_carry };
    struct CSAOutput final_csa_output = { 0, 0 };

    for (int i = 0; i < 4; i++) {
        // ... omitted

        // result.output is guaranteed to have bits 31/32 = 0,
        // so we can safely put whatever we want in them.
        result.output |= (acc_shift_register & 3) << 31;
        acc_shift_register >>= 2;
    }

    final_csa_output.output |= csa_output.output << 8;
    final_csa_output.carry  |= csa_output.carry  << 8;

    return final_csa_output;
}

You can think of this trick conceptually as initializing all 64 bits of csa_output.output to the acc, instead of just the bottom 32 bits. [5, p. 14]
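
For concreteness, here is one way the latches might be seeded for a 64-bit multiply-accumulate before the first CSA cycle (a sketch; the function name is mine):

void init_long_accumulate(u64 accumulator, u64 *partial_sum, u64 *partial_carry) {
    *partial_sum       = accumulator & 0x1FFFFFFFFULL; // bottom 33 bits
    *partial_carry     = 0;
    acc_shift_register = accumulator >> 33;            // high 31 bits, drip-fed
                                                        // into the CSAs 2 at a time
}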

Handling Signed Multiplication

Turns out this algorithm doesn't support signed multiplication yet either. To implement this, we need to take a closer look at the CSA.

The CSA in its current form takes in three 33-bit inputs, and outputs two 33-bit outputs. One of these inputs, however, is actually supposed to be 34 bits (ha, lied to you all again). Specifically, addends.m[i].recoded_output. The recoded output is derived from a 32-bit multiplicand, which, when Booth recoded, can be multiplied by at most 2, giving it a size of 33 bits. However, because we want to support both signed and unsigned multiplies, this value needs to be 34 bits – the extra bit, as mentioned earlier, allows us to choose to either zero-extend or sign-extend the number to handle both signed and unsigned multiplication elegantly.

Let's take a look at the other two of the CSA's addends as well. csa_output.carry, a 33-bit number, also needs to be properly sign extended. However, csa_output.output does not need to be sign extended, since csa_output.output is technically already a 65-bit number that was fully initialized with the acc.

Let's summarize the bit widths so far:

  • csa_output.output: 65
  • csa_output.carry: 33
  • addends.m[i].recoded_output: 34

In order to implement signed multiplication, we need to sign-extend all 3 of these numbers to the full 65 bits. How can we do so? Well, csa_output.output is already 65 bits, so that one is done for us. What about the other two? For now, I will use the following shortened forms for readability:

  • csa_output.output will be referred to as S
  • csa_output.carry will be referred to as C
  • addends.m[i].recoded_output will be referred to as X

Here's a helpful visualization of these desired 65-bit numbers, after they've been sign extended:

| addend                       | bits 65-35        | bit 34 | bit 33 | bit 32 | bits 31-0 |
|------------------------------|-------------------|--------|--------|--------|-----------|
| csa_output.output            | S[65..35]         | S[34]  | S[33]  | S[32]  | S[31..0]  |
| csa_output.carry             | C[32], …, C[32]   | C[32]  | C[32]  | C[32]  | C[31..0]  |
| addends.m[i].recoded_output  | X[33], …, X[33]   | X[33]  | X[33]  | X[32]  | X[31..0]  |

We can do a magic trick here. We can replace the csa_output.carry row with a row of ones, and !C[32]. Convince yourself that this is mathematically okay:

| addend                       | bits 65-35        | bit 34 | bit 33  | bit 32 | bits 31-0 |
|------------------------------|-------------------|--------|---------|--------|-----------|
| csa_output.output            | S[65..35]         | S[34]  | S[33]   | S[32]  | S[31..0]  |
| csa_output.carry             | 0, …, 0           | 0      | !C[32]  | C[32]  | C[31..0]  |
| magic trick                  | 1, …, 1           | 1      | 1       | 0      | 0         |
| addends.m[i].recoded_output  | X[33], …, X[33]   | X[33]  | X[33]   | X[32]  | X[31..0]  |

Let's do it again, this time to X:

| addend                       | bits 65-35  | bit 34 | bit 33  | bit 32 | bits 31-0 |
|------------------------------|-------------|--------|---------|--------|-----------|
| csa_output.output            | S[65..35]   | S[34]  | S[33]   | S[32]  | S[31..0]  |
| csa_output.carry             | 0, …, 0     | 0      | !C[32]  | C[32]  | C[31..0]  |
| magic trick                  | 1, …, 1     | 1      | 1       | 0      | 0         |
| addends.m[i].recoded_output  | 0, …, 0     | 0      | !X[33]  | X[32]  | X[31..0]  |
| another magic trick          | 1, …, 1     | 1      | 1       | 0      | 0         |

Now we add the magic tricks together:

| addend                       | bits 65-35  | bit 34 | bit 33  | bit 32 | bits 31-0 |
|------------------------------|-------------|--------|---------|--------|-----------|
| csa_output.output            | S[65..35]   | S[34]  | S[33]   | S[32]  | S[31..0]  |
| csa_output.carry             | 0, …, 0     | 0      | !C[32]  | C[32]  | C[31..0]  |
| addends.m[i].recoded_output  | 0, …, 0     | 0      | !X[33]  | X[32]  | X[31..0]  |
| combined magic tricks        | 1, …, 1     | 1      | 0       | 0      | 0         |

And we've done it – we erased all the repeated instances of C[32] and X[33], using some mathematical black magic. [5, pp. 14-17] This means that all we need to do to handle sign extension is the following two operations:

  • result.output |= (S[33] + !C[32] + !X[33]) << 31;
  • result.carry |= (!S[34]) << 32;

The resulting code:

// Contains the current high 31 bits of the acc.
// This is shifted by 2 after each CSA.
u64 acc_shift_register = 0;

struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
                                   struct RecodedMultiplicands addends) {
    struct CSAOutput csa_output = { partial_sum, partial_carry };
    struct CSAOutput final_csa_output = { 0, 0 };

    for (int i = 0; i < 4; i++) {
        csa_output.output &= 0x1FFFFFFFFULL;
        csa_output.carry  &= 0x1FFFFFFFFULL;

        struct CSAOutput result = perform_csa(csa_output.output,
            addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry);

        // Inject the carry caused by booth recoding
        result.carry <<= 1;
        result.carry |= addends.m[i].carry;

        // Take the bottom two bits and inject them into the final output.
        // The value of the bottom two bits will not be changed by future
        // addends, because those addends must be at least 4 times as big
        // as the current addend. By directly injecting these two bits, the
        // hardware saves some space on the chip.
        final_csa_output.output |= (result.output & 3) << (2 * i);
        final_csa_output.carry  |= (result.carry  & 3) << (2 * i);

        // The next CSA will only operate on the upper bits - as explained
        // in the previous comment.
        result.output >>= 2;
        result.carry  >>= 2;

        // Perform the magic described in the tables for the sign extension
        // of csa_output.carry and the recoded addend. Remember that bits 0-1
        // of acc_shift_register are bits 33-34 of S.
        u64 magic = bit(acc_shift_register, 0) +
            !bit(csa_output.carry, 32) + !bit(addends.m[i].recoded_output, 33);
        result.output |= magic << 31;
        result.carry |= (u64) !bit(acc_shift_register, 1) << 32;
        acc_shift_register >>= 2;

        csa_output = result;
    }

    final_csa_output.output |= csa_output.output << 8;
    final_csa_output.carry  |= csa_output.carry  << 8;

    return final_csa_output;
}

We already touched on early termination briefly, but it turns out it gets a bit more complicated. The patents don't exactly explain how early termination works in much detail, besides some cryptic references to shift types / shift values. But, I gave it my best guess. We know that we have the following condition for early termination:

bool should_terminate(u64 multiplier, enum MultiplicationFlavor flavor) {
    if (is_signed(flavor)) {
        return multiplier == 0x1FFFFFFFF || multiplier == 0;
    } else {
        return multiplier == 0;
    }
}

Note that multiplier is a signed 33-bit number. After every cycle of Booth's algorithm, the bottom eight bits are fed into a result register, since the next cycle of Booth's algorithm cannot change the value of those bottom eight bits. The remaining upper bits become the input into the next cycle of Booth's algorithm. Something like this:

// I'm using this over a __uint128_t since the latter isn't available
// on a GBA, and I needed this code to compile on a GBA so I can fuzz the
// outputs.
struct u128 {
    u64 lo;
    u64 hi;
};

// Latches that hold the final results of the algorithm.
u128 partial_sum;
u128 partial_carry;

do {
    csa_output = perform_one_cycle_of_booths_multiplication(
        csa_output, multiplicand, multiplier);

    // The bottom 8 bits of this cycle cannot be changed by future
    // addends, since those addends will be at least 256 times as
    // big as this cycle's addends. So, put them into the result
    // latches now.
    partial_sum.lo   |= csa_output.output & 0xFF;
    partial_carry.lo |= csa_output.carry  & 0xFF;

    // Get csa_output ready to be fed back into the CSAs on the next
    // cycle
    csa_output.output >>= 8;
    csa_output.carry  >>= 8;

    // ROR == ROtate Right
    partial_sum   = u128_ror(partial_sum, 8);
    partial_carry = u128_ror(partial_carry, 8);

    // ASR == Arithmetic Shift Right for 33-bit numbers
    multiplier = asr_33(multiplier, 8);
} while (!should_terminate(multiplier, flavor));

partial_sum.lo   |= csa_output.output;
partial_carry.lo |= csa_output.carry;

Since partial_sum and partial_carry are shift registers that get rotated with each iteration of Booth's algorithm, we need to rotate them again after the algorithm ends in order to correct them to their proper values. Thankfully, the ARM7TDMI has something called a barrel shifter. The barrel shifter is a nifty piece of hardware that allows the CPU to perform an arbitrary shift/rotate before an ALU operation, all in one cycle. Since we plan to add partial_sum and partial_carry in the ALU, we may as well use the barrel shifter to rotate one of those two operands, at no additional cost. The other operand ends up requiring special hardware to rotate, since the barrel shifter only operates on one value per cycle.

For long (64-bit) multiplies, two right rotations (known on the CPU as RORs) occur, since the ALU can only add 32 bits at a time and so the ALU / barrel shifter must be used twice.

Spoiler alert: the value of the carry flag after a multiply instruction comes from the carry-out of this barrel shifter.

So, what rotation values does the ARM7TDMI use? According to one of the patents, for an unsigned multiply, all (1 for 32-bit multiplies or 2 for 64-bit ones) uses of the barrel shifter do this: [6, p. 9]

| # Iterations of Booth's | Type | Rotation |
|-------------------------|------|----------|
| 1                       | ROR  | 22       |
| 2                       | ROR  | 14       |
| 3                       | ROR  | 6        |
| 4                       | ROR  | 30       |

Signed multiplies differ from unsigned multiplies in their second barrel shift. The second one for signed multiplies uses Arithmetic Shift Rights (ASRs) and looks like this: [6, p. 9]

| # Iterations of Booth's | Type | Rotation |
|-------------------------|------|----------|
| 1                       | ASR  | 22       |
| 2                       | ASR  | 14       |
| 3                       | ASR  | 6        |
| 4                       | ROR  | 30       |

I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since the patents already had a couple of blatant errors at this point. No idea. Turns out it doesn't really matter for calculating the carry flag of a multiply instruction. Why? Well, look at what happens when the ARM7TDMI does a ROR or ASR:

Code from fleroviux’s wonderful NanoBoyAdvance. [7]

void ROR(u32& operand, u8 amount, int& carry, bool immediate) {
  // Note that in booth's algorithm, the immediate argument will be true, and
  // amount will be non-zero

  if (amount != 0 || !immediate) {
    if (amount == 0) return;
    // We end up going down this codepath

    amount %= 32;
    operand = (operand >> amount) | (operand << (32 - amount));
    carry = operand >> 31;
  } else {
    auto lsb = operand & 1;
    operand = (operand >> 1) | (carry << 31);
    carry = lsb;
  }
}

void ASR(u32& operand, u8 amount, int& carry, bool immediate) {
  // Note that in booth's algorithm, the immediate argument will be true, and
  // amount will be non-zero and less than 32.

  if (amount == 0) {
    // ASR #0 is equivalent to ASR #32
    if (immediate) {
      amount = 32;
    } else {
      return;
    }
  }

  int msb = operand >> 31;

  if (amount >= 32) {
    carry = msb;
    operand = 0xFFFFFFFF * msb;
    return;
  }

  // We end up going down this codepath
  carry = (operand >> (amount - 1)) & 1;
  operand = (operand >> amount) | ((0xFFFFFFFF * msb) << (32 - amount));
}

Note that in both ROR and ASR, the carry will always be set to the last bit of the operand to be shifted out. i.e., if I rotate a value by n, then the carry will always be bit n - 1 of the operand before rotation, since that was the last bit to be rotated out. The same goes for ASR.

So, it doesn't matter if I don't use the same rotation values as the patents. Since, no matter the rotation value, as long as the output from my barrel shifter is the same as the output from the ARM7TDMI's barrel shifter, then the last bit to be shifted out must be the same, and therefore the carry flag must also be the same.
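
A tiny standalone illustration of that property (plain C of my own, independent of the helpers above):

u32 rotate_right_32(u32 value, int n) {      // 0 < n < 32
    return (value >> n) | (value << (32 - n));
}

// For any such n, bit 31 of rotate_right_32(value, n) equals
// (value >> (n - 1)) & 1 - the last bit rotated out, i.e. the carry.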

So, here's my implementation. I tried to somewhat mimic the table from above at the cost of code readability, but I admittedly didn't do a very good job. But hey, it works, so fuck it.

// I'm using this over a __uint128_t since the latter isn't available
// on a GBA, and I needed this code to compile on a GBA so I can fuzz the
// outputs.
struct u128 {
    u64 lo;
    u64 hi;
};

// The final output of multiplication
struct MultiplicationOutput {
    u64 output;
    bool carry;
};

// We have ror'd partial_sum and partial_carry by 8 * num_iterations + 1.
// We now need to ror backwards (rol). I tried my best to mimic the tables, but
// I'm off by one for whatever reason.
int correction_ror;
if (num_iterations == 1) correction_ror = 23;
if (num_iterations == 2) correction_ror = 15;
if (num_iterations == 3) correction_ror = 7;
if (num_iterations == 4) correction_ror = 31;

partial_sum   = u128_ror(partial_sum, correction_ror);
partial_carry = u128_ror(partial_carry, correction_ror);

int alu_carry_in = bit(multiplier, 0);

if (is_long(flavor)) {
    // Did we not early-terminate?
    if (num_iterations == 4) {
        struct AdderOutput adder_output_lo =
            adder(partial_sum.hi, partial_carry.hi, alu_carry_in);
        struct AdderOutput adder_output_hi =
            adder(partial_sum.hi >> 32, partial_carry.hi >> 32,
                  adder_output_lo.carry);

        return (struct MultiplicationOutput) {
            ((u64) adder_output_hi.output << 32) | adder_output_lo.output,
            (partial_carry.hi >> 63) & 1
        };
    } else {
        struct AdderOutput adder_output_lo =
            adder(partial_sum.hi >> 32, partial_carry.hi >> 32, alu_carry_in);

        int shift_amount = 1 + 8 * num_iterations;

        // Why this is needed is unknown, but the multiplication doesn't work
        // without it
        shift_amount++;

        // Sign extend partial_carry.lo from shift_amount to 64 bits
        partial_carry.lo = sign_extend(partial_carry.lo, shift_amount, 64);
        partial_sum.lo |= acc_shift_register << (shift_amount);

        struct AdderOutput adder_output_hi =
            adder(partial_sum.lo, partial_carry.lo, adder_output_lo.carry);
        return (struct MultiplicationOutput) {
            ((u64) adder_output_hi.output << 32) | adder_output_lo.output,
            (partial_carry.hi >> 63) & 1
        };
    }
} else {
    // Did we not early-terminate?
    if (num_iterations == 4) {
        struct AdderOutput adder_output =
            adder(partial_sum.hi, partial_carry.hi, alu_carry_in);
        return (struct MultiplicationOutput) {
            adder_output.output,
            (partial_carry.hi >> 31) & 1
        };
    } else {
        struct AdderOutput adder_output =
            adder(partial_sum.hi >> 32, partial_carry.hi >> 32, alu_carry_in);
        return (struct MultiplicationOutput) {
            adder_output.output,
            (partial_carry.hi >> 63) & 1
        };
    }
}

Anyway, that's basically it. What a meme. If you're interested in the full code, take a look here.

[1] "Advanced RISC Machines ARM7TDMI Data Sheet," 1995. Accessed: Oct. 21, 2024. [Online]. Available: https://www.dwedit.org/files/ARM7TDMI.pdf

[2] "ASIC Design for Signal Processing," Geoffknagge.com, 2024. https://www.geoffknagge.com/fyp/booth.shtml

[3] Wikipedia Contributors, "Carry-save adder," Wikipedia, Sep. 17, 2024. https://en.wikipedia.org/wiki/Carry-save_adder

[4] S. Furber, ARM System-on-Chip Architecture, 2/E. Pearson Education India, 2001.

[5] D. J. Seal, G. Larri, and D. V. Jaggar, "Data Processing Using Multiply-Accumulate Instructions," Jul. 14, 1994.

[6] G. Larri, "Data Processing Method And Apparatus Including Iterative Multiplier," Mar. 11, 1994.

[7] fleroviux, "NanoBoyAdvance," GitHub. Available: https://github.com/nba-emu/NanoBoyAdvance
