Agnostic Math APU For the X16


I’ve been wanting to scratch an FPGA and RISC-V itch, including for some music (but not X16) applications, such as my Waveboy Eurorack oscillator. While looking into all that, I realized some of these inexpensive FPGA dev solutions use the same FPGA as the VERA (the Lattice iCE40UP5K) and have enough pins to see the entire X16 bus. My original idea was to make a RISC-V co-processor card, but while that might be interesting, it would be far more complicated to implement, test, and perhaps use. So instead, I came up with this.

This is an agnostic math APU/API design. Why agnostic? It avoids platform-specific details and functions, so it can be implemented in several ways. I plan on exploring an FPGA and building up the logic elements to make a bespoke, super simplistic APU. But it could also be implemented using an RP2040, or a RISC-V core on an FPGA running special software, etc.

There are multiple paths to victory, each with pros and cons. The constant is how the user/programmer will interact with whatever hardware is chosen.

Memory Addresses

Though it is an agnostic API, there is an assumption that most hardware will be sitting on an expansion card or cartridge. The register set was kept to 16 bytes for this reason. A hardware implementation could use one of the IO# lines along with the lower 5 bits of the address (really the lower 4 bits as configured at present), which simplifies decoding and reduces the number of pins used.

At present, there are 16 memory addresses which serve specific functions:

$9Fx0: Command / "Opcode"
$9Fx1: Width
$9Fx2: Control
$9Fx3: Status
$9Fx4-$9Fx7: I0-I3 (Operand(s) 1)
$9Fx8-$9FxB: J0-J3 (Operand(s) 2)
$9FxC-$9FxF: R0-R3 (Result(s))
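To make the map concrete, here is a minimal Python sketch of the register layout and of the 4-bit decode described above. It is a behavioral model only, not firmware. The names `APU_REGS`, `reg_address` and `decode`, and the example base address used in the usage note, are assumptions for illustration; the `x` in `$9Fx0` is left open by the design.

```python
# Register offsets inside the 16-byte window, taken from the map above.
APU_REGS = {
    "CMD": 0x0,
    "WIDTH": 0x1,
    "CONTROL": 0x2,
    "STATUS": 0x3,
    "I": 0x4,  # I0-I3
    "J": 0x8,  # J0-J3
    "R": 0xC,  # R0-R3
}


def reg_address(base, name, index=0):
    """Absolute address of a register, given the card's base address.

    `base` is whatever $9Fx0 turns out to be for the chosen IO# line.
    """
    span = 4 if name in ("I", "J", "R") else 1
    if not 0 <= index < span:
        raise IndexError(f"{name} has no byte {index}")
    return base + APU_REGS[name] + index


def decode(address):
    """What the card sees: only the low 4 address bits pick a register."""
    return address & 0x0F
```

For example, with a hypothetical base of `0x9F60`, `reg_address(0x9F60, "J", 2)` gives `0x9F6A`, and `decode(0x9F6A)` gives `0xA`, which is J2.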


Width

Width controls either the precision of a single operation or the size of each element used for simultaneous operations.

Single Operations:

$00:     TBD (0 means nothing here so?)
$01:     Single 8-bit operation over I0 and J0 (should/how to handle carry?)
$02:     Single 16-bit operation over I[0:1] and J[0:1], results in R[0:1]
$03:     Single 24-bit operation over I[0:2] and J[0:2], results in R[0:2]
$04:     Single 32-bit operation over I[0:3] and J[0:3], results in R[0:3]

Simultaneous (SIMD/SSE/MMX Style) Operations:

$10:     TBD
$11:     8-bit operation carried out simultaneously on each of the 4 8-bit values in I and J
$12:     16-bit operation carried out simultaneously on each of the 2 16-bit values in I and J

If the operation is simultaneous, it operates independently on each value of the given width. So if Width = $11 and the instruction is ADD, it means do I0 + J0 = R0, I1 + J1 = R1, I2 + J2 = R2, I3 + J3 = R3.
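The Width rules can be sketched as a small Python model of ADD. This is only an illustration under stated assumptions: bit 4 of Width is read as the "simultaneous" flag (which is how `$11` and `$12` look), bytes are little-endian (I0 is the low byte), and results wrap on overflow. None of those are fixed by the design yet, and `decode_width` and `apu_add` are made-up names.

```python
def decode_width(width):
    """Return (lane_bytes, lane_count) for a Width value.

    Assumption: bit 4 marks a simultaneous operation and the low nibble
    is the element size in bytes. Bit 7 (all banks) is ignored here.
    """
    simultaneous = bool(width & 0x10)
    size = width & 0x0F
    if simultaneous:
        if size not in (1, 2):
            raise ValueError(f"unsupported simultaneous width ${width:02X}")
        return size, 4 // size
    if size not in (1, 2, 3, 4):
        raise ValueError(f"unsupported single width ${width:02X}")
    return size, 1


def apu_add(width, i_bytes, j_bytes):
    """Model ADD over the 4-byte I and J registers.

    Returns (r_bytes, overflow_flags) with one flag per lane.
    Assumes little-endian lanes and wrap-around on overflow.
    """
    lane_bytes, lanes = decode_width(width)
    mask = (1 << (8 * lane_bytes)) - 1
    result = [0, 0, 0, 0]
    overflow = []
    for lane in range(lanes):
        lo = lane * lane_bytes
        hi = lo + lane_bytes
        a = int.from_bytes(bytes(i_bytes[lo:hi]), "little")
        b = int.from_bytes(bytes(j_bytes[lo:hi]), "little")
        total = a + b
        overflow.append(total > mask)
        result[lo:hi] = (total & mask).to_bytes(lane_bytes, "little")
    return result, overflow
```

With Width = `$11`, each of the four byte pairs is added on its own, so a carry out of I1 + J1 never spills into R2. With Width = `$02`, the same bytes are treated as one 16-bit value and the carry propagates from byte 0 into byte 1.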

Operate on single bank vs all banks:

Bit 7: Operate on all banks

If set to 1, the instruction is carried out on the current I, J and R as well as on those in the other register banks.

Commands / Opcodes

In this design, Width controls the precision and whether the instruction is simultaneously carried out on the sub-parts of I and J.

Thus, the instructions are much simpler. These are being treated as named instructions, though in 6502 assembly there would be a table of constants (e.g. APU_ADD) which map to specific values:

ADD: Add I and J, store into R
SUB: Subtract I from J, store into R
MUL: Multiply I and J, store into R
DIV: Divide I by J, store into R
AND: And I and J, store into R
IOR: Inclusive OR I and J, store into R
XOR: Exclusive OR I and J, store into R
SWP: Swap top half of value in I with bottom half, store in R (J is unused)
ROL: Same as 6502
ROR: Same as 6502
ASL: Same as 6502, but no carry (maybe use overflow?)
ASR: Shift right, like the 6502's LSR (the 6502 has no ASR), but no carry (maybe use overflow?)
RND: Store random number in R (requires external hardware, e.g. a transistor)
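As a rough illustration of the constants table and of what the simpler instructions do, here is a Python sketch. The numeric opcode values are placeholders, since the design does not assign them yet. The model also assumes unsigned values, wrap-around on overflow, and an error on divide by zero, none of which the design has settled. `Op` and `execute` are hypothetical names.

```python
from enum import IntEnum


class Op(IntEnum):
    # Placeholder values: the design has not assigned opcode numbers yet.
    ADD = 0x01
    SUB = 0x02
    MUL = 0x03
    DIV = 0x04
    AND = 0x05
    IOR = 0x06
    XOR = 0x07
    SWP = 0x08


def execute(op, i, j, width_bytes):
    """Model one single-width operation and return R as an unsigned int."""
    bits = 8 * width_bytes
    mask = (1 << bits) - 1
    i &= mask
    j &= mask
    if op == Op.ADD:
        return (i + j) & mask
    if op == Op.SUB:
        return (j - i) & mask  # "Subtract I from J"
    if op == Op.MUL:
        return (i * j) & mask
    if op == Op.DIV:
        if j == 0:
            raise ValueError("divide by zero is not defined by the design")
        return i // j  # "Divide I by J"
    if op == Op.AND:
        return i & j
    if op == Op.IOR:
        return i | j
    if op == Op.XOR:
        return i ^ j
    if op == Op.SWP:
        half = bits // 2
        return ((i >> half) | (i << half)) & mask  # J is unused
    raise ValueError(f"unknown opcode {op!r}")
```

For instance, `execute(Op.SWP, 0x1234, 0, 2)` swaps the two bytes of a 16-bit value and returns `0x3412`.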

Not sure about these yet, as they seem a lot more complicated, whereas the above operations don’t really even need a clock. These might be best implemented on a RISC-V core?

FPA: Floating Point Add
FPS: Floating Point Subtract
FPM: Floating Point Multiply
FPD: Floating Point Divide

Placeholder: a chunk of the instruction space should be left for app-specific user instructions. The FPGA/uC would have to be programmed as such, so this might be best used as part of a cartridge, but nonetheless it is good to leave room to grow.



Control

Sets whether to assert the interrupt on completion of an operation (how to do this without a clock?), clears the interrupt, and selects the register bank.

Bit 7: Enable Interrupt
Bit 6: Clear Interrupt Status
Bit 5: Unused
Bit 4: Unused
Bit 3-0: Bank Selection
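A small Python sketch of how a Control byte could be assembled from those fields. The constant and function names are hypothetical; only the bit positions come from the list above.

```python
CTRL_IRQ_ENABLE = 0x80  # Bit 7: Enable Interrupt
CTRL_IRQ_CLEAR = 0x40   # Bit 6: Clear Interrupt Status
CTRL_BANK_MASK = 0x0F   # Bits 3-0: Bank Selection


def make_control(bank, irq_enable=False, irq_clear=False):
    """Build the byte to store in the Control register."""
    if not 0 <= bank <= CTRL_BANK_MASK:
        raise ValueError("bank must fit in 4 bits (0-15)")
    value = bank
    if irq_enable:
        value |= CTRL_IRQ_ENABLE
    if irq_clear:
        value |= CTRL_IRQ_CLEAR
    return value
```

So selecting bank 3 with interrupts enabled would mean writing `make_control(3, irq_enable=True)`, which is `0x83`.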


Status

Bit 7: Interrupt Status
Bit 6: Operation Complete
Bit 5: Unused
Bit 4: Single-Op Overflow R (WIDTH <= $04)
Bit 3: Multi-Op Overflow R0 (WIDTH > $10)
Bit 2: Multi-Op Overflow R1 (WIDTH > $10)
Bit 1: Multi-Op Overflow R2 (WIDTH > $10)
Bit 0: Multi-Op Overflow R3 (WIDTH > $10)
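Reading the status byte back could look something like this Python sketch. The function name and dictionary keys are made up for illustration; the bit positions are the ones listed above, with bit 3 mapping to R0 and bit 0 to R3.

```python
def decode_status(status):
    """Split the status byte into named flags."""
    return {
        "irq": bool(status & 0x80),              # Bit 7: Interrupt Status
        "done": bool(status & 0x40),             # Bit 6: Operation Complete
        "single_overflow": bool(status & 0x10),  # Bit 4: single-op overflow
        # Bits 3..0 are the per-lane overflow flags for R0..R3.
        "lane_overflow": [bool(status & (0x08 >> n)) for n in range(4)],
    }
```

A status of `0x4A` would then mean the operation is complete and lanes R0 and R2 overflowed.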

Operands and Result Registers

All registers are multi-sized depending on the width. If Width is set to $11 for example, then the registers are divided up into 4 independent sub-registers (I[0:3], J[0:3], R[0:3]). The desired instruction then operates on the pairs simultaneously.

I and J are inputs for the instructions. They are always inputs, meaning the hardware will never change these. They are only changed by issuing a STA from your program though they can be read from if desired.

R is the result register. It cannot be written to, only read from. In some implementations, R may change in value as an operation is being carried out.

Code Example

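; We want to multiply 2 numbers.
; APU_CMD, APU_WIDTH and APU_MUL are assumed constant names, following
; the same pattern as APU_ADD, APU_I, APU_J, APU_RES and APU_STATUS.

; Select a single 16-bit (2-byte) operation
lda #$02
sta APU_WIDTH

; Load I
lda #$12
sta APU_I
lda #$34
sta APU_I + 1

; Load J
lda #$56
sta APU_J
lda #$67
sta APU_J + 1

; Issue the multiply (assumes writing the command starts the operation)
lda #APU_MUL
sta APU_CMD

; Wait for the results. This is a loop but interrupts could be used instead.
; BIT copies bit 6 of APU_STATUS (Operation Complete) into the V flag.
@wait_for_result:
    bit APU_STATUS
    bvc @wait_for_result

; Do something with the results
lda APU_RES + 1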