# Fabric Logic

Date: 2025-05-13

## Goals and expectations

Not much in the way of expectations for this devlog. I fell ill,
probably with the spring flu. But hopefully, I can write down what
I've been planning.

I've decided to skip the part that sends back the reversed string the
fabric received - there's not much use for it in asynchronous
communication.

## Reworked design

This is what I've been doing for the past few days - reworking some of
the design, or rather, solidifying what I have in my plan into
concrete designs.

### Rethinking the header

Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory to
prevent cross-block BRAM access and simplify the logic, I figured I
only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even allow ultra-small packet sizes like 16 bytes when
the network needs them.

That decision was made with the existence of management ports in mind,
i.e. I expect that devices involved in controlling the workload on a
ROSE network will also have a management interface (e.g. Ethernet,
Wi-Fi, etc.). So there's no practical need for control packets a few
dozen bytes in size to exist on a ROSE network; that traffic can be
left to more mature network stacks, even if they have higher latency.

And the size selection can directly eat up 2 bits of the header's
command byte: I'm confident that ROSE doesn't need 256 different
commands for the network - 64 is probably more than enough.
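
To make the bit budget concrete, here is a minimal sketch of how that
command byte could be packed, assuming the two size bits sit in the
top of the byte and the four fixed sizes map to codes 0-3 - the exact
placement is my own placeholder, not a finalized ROSE layout.

```python
# Hypothetical packing of the ROSE command byte: 2 bits select one of
# the four fixed packet sizes, the remaining 6 bits carry the command.
SIZE_TO_CODE = {128: 0b00, 256: 0b01, 512: 0b10, 1024: 0b11}
CODE_TO_SIZE = {code: size for size, code in SIZE_TO_CODE.items()}

def pack_command_byte(command: int, packet_size: int) -> int:
    assert 0 <= command < 64, "only 64 command values are available"
    return (SIZE_TO_CODE[packet_size] << 6) | command

def unpack_command_byte(value: int) -> tuple[int, int]:
    """Returns (command, packet_size)."""
    return value & 0x3F, CODE_TO_SIZE[(value >> 6) & 0b11]

assert unpack_command_byte(pack_command_byte(5, 512)) == (5, 512)
```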

When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view: for
a given workload, the packet sizes will be almost uniform across the
entire job.
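
The exact negotiation rule is still open; one possible sketch - an
assumption for illustration, not a decision - is to have every device
announce the size it wants at boot and settle on the largest request,
so that every device's packets fit.

```python
# One possible (assumed) negotiation rule: adopt the largest requested
# size among the devices, so every workload's packets fit.
ALLOWED_SIZES = (128, 256, 512, 1024)

def negotiate_packet_size(requested: list[int]) -> int:
    if not all(size in ALLOWED_SIZES for size in requested):
        raise ValueError("requested size is not a supported ROSE size")
    return max(requested)

assert negotiate_packet_size([128, 512, 256, 512]) == 512
```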

To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want each device to have
16KB of TX queue. That means at least 16 slots, and up to 128 slots
per queue. In a well-designed system, a device should only receive one
size category of packets (e.g. in a trading system, there are specific
devices that handle order book digestion and expect larger packets,
while the devices that take in orders may only expect smaller
packets). Of course, this idea can be rethought in the future when
THORN actually generates different loads targeting different systems.
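
A quick back-of-the-envelope check of those slot counts, using the
fixed packet sizes from above:

```python
# Slot counts for a 16 KB TX queue per interface at each fixed size.
TX_QUEUE_BYTES = 16 * 1024

for packet_size in (128, 256, 512, 1024):
    slots = TX_QUEUE_BYTES // packet_size
    print(f"{packet_size:>5} B packets -> {slots:>3} slots per TX queue")

# 1024 B packets give 16 slots and 128 B packets give 128 slots,
# matching the "at least 16, up to 128" range. Four such queues use
# 64 KB of the ~103 KB of BRAM, leaving room for RX buffers and the
# rest of the logic.
```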

The above design would dramatically shrink the ROSE header: 1 byte for
command + size, 1 byte for the destination address, 1 byte for the
source address, 4 bytes for a possible sequence number, and one byte
at the end for a CRC-8 checksum.

After some more thought, the sequence number can be moved into
"feature extensions" by utilizing the 64 commands I have. Even the CRC
byte can be encoded the same way, which brings the total overhead of
the standard ROSE protocol to 4 bytes.
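
To make that concrete, here is a small sketch that packs the fields
listed above, with the sequence number and CRC-8 treated as optional
feature extensions. The field order and the CRC-8 polynomial (the
common 0x07) are placeholders of mine, not a finalized spec.

```python
# Sketch of the shrunken ROSE header plus optional extensions. Field
# order and the CRC-8 polynomial are illustrative assumptions.
SIZE_TO_CODE = {128: 0b00, 256: 0b01, 512: 0b10, 1024: 0b11}

def crc8(data: bytes, poly: int = 0x07) -> int:
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def build_header(command: int, packet_size: int, dst: int, src: int,
                 seq: int | None = None, with_crc: bool = False) -> bytes:
    header = bytes([(SIZE_TO_CODE[packet_size] << 6) | (command & 0x3F), dst, src])
    if seq is not None:        # "feature extension": sequence number
        header += seq.to_bytes(4, "big")
    if with_crc:               # "feature extension": CRC-8 over the header
        header += bytes([crc8(header)])
    return header

hdr = build_header(command=1, packet_size=256, dst=2, src=7, seq=42, with_crc=True)
assert crc8(hdr[:-1]) == hdr[-1]   # the CRC extension covers the rest of the header
```
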
### Designing the router logic

The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to the
same one. So there has to be some internal packet queue.

I want the internal routing logic to run at 135MHz, with the SPI
interfaces capped at 50MHz, and to have the logic take in a byte at a
time instead of a bit. This means the internal logic runs much faster
than the interfaces, which enables the fabric to handle simultaneous
inputs from different devices.
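
A quick arithmetic check of that headroom, assuming four attached SPI
interfaces at 50Mbps each and one byte consumed per 135MHz cycle:

```python
# Rough headroom check: the routing logic consumes one byte per cycle
# at 135 MHz, while each SPI interface delivers at most 50 Mbps.
ROUTER_CAPACITY_BPS = 135_000_000 * 8      # one byte per cycle
SPI_LINK_BPS = 50_000_000
NUM_INTERFACES = 4                         # assumed device count

aggregate_ingress = NUM_INTERFACES * SPI_LINK_BPS
print(f"router capacity  : {ROUTER_CAPACITY_BPS / 1e6:.0f} Mbps")   # 1080 Mbps
print(f"aggregate ingress: {aggregate_ingress / 1e6:.0f} Mbps")     # 200 Mbps
print(f"headroom         : {ROUTER_CAPACITY_BPS / aggregate_ingress:.1f}x")
```

So even with all four devices sending at full rate, the routing logic
has roughly 5x headroom.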

The first piece of the design came right off the bat - multi-headed
queues. I plan to use 3-4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.

The first idea was to have a shared pool of memory for the queues,
which would handle congestion beautifully, since it means all TX
queues are dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly mean fairness.

Then I thought of the second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
performs wonderfully in fairness and latency when combined with
multi-headed queues and routing logic that is faster than the
interfaces.
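
Here is a toy model of what I mean by a per-interface, multi-headed TX
queue - several sources filling slots concurrently while the (slower)
interface drains completed packets in FIFO order. The slot and head
counts are placeholders, and the real thing lives in FPGA logic, not
Python.

```python
from collections import deque

class TxQueue:
    """Toy model of a per-interface TX queue with multiple write heads."""

    def __init__(self, slot_size: int, num_slots: int, num_heads: int = 4):
        self.slot_size = slot_size
        self.free_slots = num_slots
        self.num_heads = num_heads
        self.heads = {}           # source id -> slot currently being filled
        self.ready = deque()      # completed packets awaiting the wire

    def open_head(self, source: int) -> bool:
        """A source claims a write head and a slot, if any are free."""
        if source in self.heads or len(self.heads) >= self.num_heads \
                or self.free_slots == 0:
            return False
        self.heads[source] = bytearray()
        self.free_slots -= 1
        return True

    def push_byte(self, source: int, byte: int) -> None:
        slot = self.heads[source]
        slot.append(byte)
        if len(slot) == self.slot_size:     # packet complete: commit the slot
            self.ready.append(bytes(slot))
            del self.heads[source]

    def pop_packet(self) -> bytes | None:
        """Called by the interface when it is ready to transmit."""
        if not self.ready:
            return None
        self.free_slots += 1
        return self.ready.popleft()

q = TxQueue(slot_size=4, num_slots=16, num_heads=4)
assert q.open_head(source=0) and q.open_head(source=1)
for b in (1, 2, 3, 4):
    q.push_byte(0, b)                       # source 0 finishes first
assert q.pop_packet() == bytes([1, 2, 3, 4])
```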

To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, big enough to hold one
packet while it gets collected by the routing logic.

Matching the RX buffer, the interface can directly tell the routing
logic where each collected byte should be sent. This means a dramatic
decrease in the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of an increase in the
interfaces' logic.
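
A tiny sketch of that split, reusing the header layout I assumed
earlier (destination address in the second header byte); illustrative
only:

```python
# The interface holds one packet in its RX buffer, reads the destination
# out of the header itself, and hands the routing logic (destination,
# byte) pairs, so the router never buffers or parses headers.
from typing import Iterator

def drain_rx_buffer(rx_buffer: bytes) -> Iterator[tuple[int, int]]:
    dst = rx_buffer[1]                    # destination address byte
    for byte in rx_buffer:
        yield dst, byte

def route(tx_queues: dict[int, list[int]], rx_buffer: bytes) -> None:
    """The routing logic: append each byte to the destination's TX queue."""
    for dst, byte in drain_rx_buffer(rx_buffer):
        tx_queues[dst].append(byte)

queues = {2: [], 3: []}
route(queues, bytes([0b01000001, 2, 7]) + b"payload")   # cmd/size, dst=2, src=7
assert bytes(queues[2]).endswith(b"payload")
```
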
### Leftover question

What kind of congestion control can be implemented on top of this
design, one that mainly hardens incast and burst resiliency?

I already have in mind using credit-based flow control or a
DCTCP-style ECN scheme, but this is still left for the future me to
decide.
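
For future reference, a minimal sketch of the credit-based direction
(the ECN alternative isn't modeled here); the granularity of one
credit per queue slot is an assumption:

```python
# Credit-based flow control sketch: the fabric grants a sender one
# credit per free slot in the destination's TX queue, and the sender
# may only transmit while it holds credits.
class CreditedLink:
    def __init__(self, initial_credits: int):
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def on_packet_sent(self) -> None:
        assert self.credits > 0, "sender violated flow control"
        self.credits -= 1

    def on_credit_returned(self) -> None:
        self.credits += 1      # the fabric freed a slot and returned the credit

link = CreditedLink(initial_credits=16)   # e.g. 16 slots of 1024 B packets
link.on_packet_sent()
assert link.can_send()
```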

For now, the focus is on making the logic happen and letting packets
drop.

## Reflections

> Even ideas should be reflected on and refined.

1. SPI is very limiting. 50MHz (i.e. 50Mbps of wire speed) is slow
   compared to gigabit Ethernet. Hopefully the reduction of latency in
   the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to industrial-grade
   FPGAs, starting with replacing SPI with SERDES and increasing the
   internal bus width from 8 bits (the bus through which packets are
   collected from the interfaces by the routing logic) to 256 bits.
   This would allow stable operation at the scale of 30Gbps SERDES
   connections and 400MHz FPGA clocks, which would make it comparable
   to modern-day Ethernet connections (a rough bandwidth check follows
   this list).
3. There were **a lot** of trade-offs when considering asymmetric
   interface clocks. I'd like direct streaming from one interface to
   another to be possible, but clearly that won't be the case. This
   means one copy of packet data within the fabric itself, effectively
   doubling the latency through the fabric. But this "trade-off" must
   be made, unless there's a magical way of syncing the interfaces
   (and that would mean a direct connection, not a network).
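
The rough bandwidth check for point 2, with the 256-bit bus width and
400MHz clock from above (this only shows the per-link comparison):

```python
# Back-of-the-envelope check for the scaled-up fabric: a 256-bit
# internal bus at 400 MHz against a single 30 Gbps SERDES link.
internal_bps = 256 * 400_000_000
link_bps = 30_000_000_000
print(f"internal bus: {internal_bps / 1e9:.1f} Gbps")    # 102.4 Gbps
print(f"one SERDES  : {link_bps / 1e9:.1f} Gbps")        # 30.0 Gbps
print(f"headroom    : {internal_bps / link_bps:.1f}x per link")
```
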
## Final thoughts

I've been thinking a lot about the initiative behind ROSE: why build
it on such limiting hardware? And I came to this conclusion:

> Design systems in thought, in code, in the garage.

The hardware for ROSE will stay on my desk, but the things I learned
by building it will stay in my mind and could potentially be put to
use in industry.

Ideas should scale both up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if an idea works in industry, it should also be applicable (though not
necessarily practical) on consumer-grade hardware.

I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.

## Next goals

Implement the logic in simulation. I've already started, but it's time
to actually get some results.
|