Fabric Logic
Date: 2025-05-13
Goals and expectations
Not much in the way of expectations for this devlog - I fell ill, probably with the spring flu. But hopefully, I can still write down what I've been planning.
I've decided to skip implementing the part that sends back the reversed string the fabric received - it's not much use in asynchronous communication.
Reworked design
This is what I've been doing for the past few days - reworking some design, or rather, solidifying what I have in my plan into designs.
Rethinking the header
Initially, I designed ROSE's header to be 20 bytes, which included a 32-bit integer for the size and some redundancy. However, after looking at some ways of organizing the FPGA's internal memory, I figured that to prevent cross-block BRAM access and simplify the logic, I only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or whatever lengths align with the BRAM partitioning of the FPGA). Note that this could even lead to ultra-small packet sizes like 16 bytes when the network needs them.
That decision was made with the existence of management ports in mind, i.e. I expect that devices involved in controlling the workload on a ROSE network will also have a management interface (e.g. Ethernet, Wi-Fi, etc.). So there's no practical need for control packets that are only a few dozen bytes large to exist in a ROSE network; that traffic can be left to more mature network stacks, even if they have higher latency.
And the size field can directly eat up 2 bits of the command byte in the header; I'm confident that ROSE doesn't need 256 different commands for the network - 64 is probably more than enough.
When booting up the network, the devices will send out a packet to negotiate how big the packets are - all of the devices must share the same packet size. This choice arose from a workload point of view: for a given workload, the packet sizes will be almost uniform across the entire job.
To take this further, I also have in mind letting devices declare the sizes of their packet buffer slots. With about 103KB of BRAM on the Tang Primer 20K and 4 connected devices, I want each device to have 16KB of TX queue. That means at least 16 slots, and up to 128 slots per queue. In a well-designed system, a device should only receive one size category of packets (e.g. in a trading system, specific devices handle order book digestion and expect larger packets, while the devices that take orders may only expect smaller packets). Of course, this idea can be rethought in the future when THORN actually generates different loads targeting different systems.
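To keep the numbers honest, here's the slot math spelled out (nothing more than division over the 16KB figure above):

```python
# Back-of-the-envelope slot counts for a 16KB TX queue per interface.
QUEUE_BYTES = 16 * 1024               # 16KB of BRAM reserved per TX queue
PACKET_SIZES = [128, 256, 512, 1024]  # the fixed packet size classes

for size in PACKET_SIZES:
    print(f"{size:4d}B packets -> {QUEUE_BYTES // size:3d} slots per queue")
# 1024B packets give 16 slots, 128B packets give 128 slots -
# exactly the "at least 16, up to 128 slots" range.
```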
The above design would also dramatically shrink the ROSE header: 1 byte for command + size, 1 byte for the destination address, 1 byte for the source address, 4 bytes for a possible sequence number, and one byte at the end for a CRC-8 checksum.
After some more thought, the sequence number can be moved into "feature extensions" by utilizing the 64 commands I have, and even the CRC byte can be encoded the same way. That brings the total overhead of a standard ROSE header down to 4 bytes.
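To convince myself the layout holds together, here's a minimal Python sketch of packing and unpacking that 4-byte header. The exact bit positions (size class in the top two bits of the command byte) and the CRC-8 polynomial are placeholder assumptions, not a finalized spec:

```python
# Minimal sketch of a 4-byte ROSE header: command+size, dst, src, CRC-8.
# Bit layout and CRC polynomial (0x07) are provisional assumptions.

SIZE_CLASSES = {0: 128, 1: 256, 2: 512, 3: 1024}  # 2-bit size field

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 over the first three header bytes."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def pack_header(command: int, size_class: int, dst: int, src: int) -> bytes:
    assert 0 <= command < 64 and 0 <= size_class < 4  # 6-bit command, 2-bit size
    first = (size_class << 6) | command               # size class in the top 2 bits
    body = bytes([first, dst, src])
    return body + bytes([crc8(body)])

def unpack_header(header: bytes):
    first, dst, src, crc = header
    assert crc8(header[:3]) == crc, "corrupted header"
    return first & 0x3F, SIZE_CLASSES[first >> 6], dst, src

hdr = pack_header(command=0x01, size_class=3, dst=0x02, src=0x01)
print(hdr.hex(), unpack_header(hdr))  # command 1, 1024B packet, dst 2, src 1
```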
Designing the router logic
The fabric needs some kind of internal buffer to handle a) asymmetric interface speeds and b) multiple devices trying to send packets to the same one. So, there has to be some internal packet queue.
I want the internal routing logic to run at 135MHz while the SPI interfaces are capped at 50MHz, and to have the logic take in a byte at a time instead of a bit. This means the internal logic runs at a much faster speed than the interfaces, which enables the fabric to handle simultaneous inputs from different devices.
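A rough sanity check on that clock budget (pure arithmetic, ignoring protocol overhead and clock-domain crossing stalls):

```python
# Byte-wide routing core vs. bit-serial SPI links.
SPI_CLOCK_HZ = 50e6       # each SPI link shifts 1 bit per clock -> 50Mbps
CORE_CLOCK_HZ = 135e6     # routing logic consumes 1 byte per clock
NUM_INTERFACES = 4

per_link = SPI_CLOCK_HZ / 8                  # 6.25 MB/s per interface
all_links = per_link * NUM_INTERFACES        # 25 MB/s with every link busy
core = CORE_CLOCK_HZ                         # 135 MB/s of routing capacity

print(f"per link : {per_link / 1e6:.2f} MB/s")
print(f"all links: {all_links / 1e6:.2f} MB/s")
print(f"core     : {core / 1e6:.0f} MB/s (~{core / all_links:.1f}x headroom)")
```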
The first piece of the design came right off the bat - multi-headed queues. I plan to use 3 or 4 heads for the TX queues, so that when receiving from the higher-speed internal logic, the interfaces can handle multiple input sources at once.
The first idea was to have a shared pool of memory for the queue, which would handle congestion beautifully, since it means all TX queues are dynamically allocated. However, it would be a disaster for latency, since a shared FIFO doesn't exactly guarantee fairness.
Then I thought of the second idea: separate TX queues for each interface. Although this means less incast and burst resiliency, it performs wonderfully in fairness and latency, combined with multi-headed queues and routing logic that runs faster than the interfaces.
To compensate for the loss of the central shared memory pool, each interface should also get its own RX buffer, large enough to hold one packet while it gets collected by the routing logic.
Matching the RX buffer, the interface can directly tell the routing logic where each collected byte should be sent. This dramatically decreases the complexity of the routing logic (it doesn't have to buffer or parse headers), at the cost of more logic in the interfaces.
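To pin the idea down before writing any HDL, here's a rough behavioral sketch in Python of that split: per-interface RX buffers, destination hints coming from the interfaces, and separate multi-headed TX queues. The class names and the 3-writer limit are illustrative assumptions, not the actual implementation:

```python
from collections import deque

NUM_PORTS = 4
WRITE_HEADS = 3  # multi-headed TX queue: up to 3 writers per tick (3 or 4 in the plan)

class Interface:
    """One SPI port: a single-packet RX buffer plus a destination hint
    that the interface itself extracts from the header."""
    def __init__(self):
        self.rx_buffer = deque()  # bytes of the packet currently arriving
        self.rx_dest = None       # filled in by the interface, not the router

class Fabric:
    def __init__(self):
        self.ports = [Interface() for _ in range(NUM_PORTS)]
        # Second option above: a separate TX queue per interface.
        self.tx_queues = [deque() for _ in range(NUM_PORTS)]

    def route_cycle(self):
        """One tick of the fast routing core. It only moves bytes: the
        destination comes from the interface, so the router never buffers
        or parses headers. Each TX queue accepts at most WRITE_HEADS bytes
        per tick; any extra writer simply stalls until the next tick."""
        writes = [0] * NUM_PORTS
        for port in self.ports:
            if port.rx_buffer and port.rx_dest is not None:
                dest = port.rx_dest
                if writes[dest] < WRITE_HEADS:
                    self.tx_queues[dest].append(port.rx_buffer.popleft())
                    writes[dest] += 1

    def drain_cycle(self):
        """The slower SPI side pops at most one byte per interface per SPI tick."""
        return [q.popleft() if q else None for q in self.tx_queues]
```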
Leftover question
What kind of congestion control can be implemented on top of this design, mainly to harden it against incast and bursts?
I already have in mind using credit-based flow control or a DCTCP-style ECN. But that decision is still left for future me.
For now, focus on making the logic happen and let the packets drop.
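For reference, the credit-based option could be modeled as simply as this (a sketch only - one credit per free packet slot is my assumption, and none of this is decided):

```python
# Sketch of credit-based flow control, one of the two candidates above.
# Assumption: one credit corresponds to one free packet slot at the receiver.

class CreditedLink:
    def __init__(self, slots: int):
        self.credits = slots   # sender starts with the receiver's free slot count

    def can_send(self) -> bool:
        return self.credits > 0

    def on_send(self):
        self.credits -= 1      # consumes a slot at the receiver

    def on_credit_return(self, n: int = 1):
        self.credits += n      # receiver freed n slots and said so

link = CreditedLink(slots=16)  # e.g. 16 slots of 1024B packets
if link.can_send():
    link.on_send()
```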
Reflections
Even ideas should be reflected and refined.
- SPI is very limiting. 50MHz (or 50Mbps of wire speed) is slow compared to gigabit Ethernet. Hopefully the reduction of latency in the network and transport layers can make up for this.
- I gave a lot of thought to how ROSE would scale to industrial-grade FPGAs, starting with replacing SPI with SERDES and widening the internal bus from 8 bits (the bus through which packets are collected from the interfaces by the routing logic) to 256 bits. That would allow stable operation at the scale of 30Gbps SERDES connections and 400MHz FPGA clocks, which would make it comparable to modern-day Ethernet connections (see the quick arithmetic after this list).
- There were a lot of trade-offs when considering asymmetric interface clocks. I'd like direct streaming from one interface to another to be possible, but clearly that won't be the case. This means one copy of the packet data within the fabric itself, effectively doubling the latency of the fabric. But this "trade-off" must be made, unless there's a magical way of syncing the interfaces (and that would mean a direct connection, not a network).
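The scaling numbers in the second bullet, spelled out (nothing beyond arithmetic on the figures above):

```python
# Internal bus bandwidth today vs. the imagined industrial-grade build.
def bus_gbps(width_bits: int, clock_hz: float) -> float:
    return width_bits * clock_hz / 1e9

current_bus = bus_gbps(8, 135e6)    # ~1.08 Gbps byte-wide bus at 135MHz
current_link = 50e6 / 1e9           # 0.05 Gbps per SPI link

scaled_bus = bus_gbps(256, 400e6)   # ~102.4 Gbps 256-bit bus at 400MHz
scaled_link = 30.0                  # 30 Gbps per SERDES link

print(f"today : bus {current_bus:.2f} Gbps vs {current_link:.2f} Gbps per link")
print(f"scaled: bus {scaled_bus:.1f} Gbps vs {scaled_link:.1f} Gbps per link")
```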
Final thoughts
I've been thinking a lot about the initiative behind ROSE: why build it on such limiting hardware? And I came to this conclusion:
Design systems in thought, in code, in the garage.
The hardware for ROSE would stay on my desk, but the things I learned by doing it would stay in my mind and potentially be put to use in the industry.
Ideas should be scalable up and down. If something works with cheap, off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones; if some idea works in the industry, it should also be applicable (not necessarily practical) on consumer-grade hardware.
I consider myself a scientist: I create ideas, and I'm not limited by hardware or software stacks.
Next goals
Implement the logic in sims. I've already started, but it's time to actually get some results.