# Fabric Logic

Date: 2025-05-13

## Goals and expectations

Not much in the way of expectations for this devlog. I fell ill,
probably with the spring flu. But hopefully, I can write down what
I've been planning.

I've decided to skip the part that sends back the reversed string the
fabric received - there's not much use for it in asynchronous
communication.

## Reworked design

This is what I've been doing for the past few days - reworking some of
the design, or rather, solidifying what I have in my plan into
concrete designs.

### Rethinking the header

Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory to
prevent cross-block BRAM access and simplify the logic, I figured I
only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even allow ultra-small packet sizes like 16 bytes when
the network needs them.

That decision was made with the existence of management ports in mind,
i.e. I expect that devices involved in controlling the workload on a
ROSE network will also have a management interface (e.g. Ethernet,
Wi-Fi, etc.). So there's no practical need for control packets a few
dozen bytes in size to exist on a ROSE network; that traffic can be
left to more mature network stacks, even if they have higher latency.

And the size selection can directly eat up 2 bits of the header's
command byte: I'm confident that ROSE doesn't need 256 different
commands for the network - 64 is probably more than enough.
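
To make the bit budget concrete, here is a minimal sketch of how that
command byte could be packed, assuming the two size bits sit in the
top of the byte and the four fixed sizes map to codes 0-3 - the exact
placement is my own placeholder, not a finalized ROSE layout.

```python
# Hypothetical packing of the ROSE command byte: 2 bits select one of
# the four fixed packet sizes, the remaining 6 bits carry the command.
SIZE_TO_CODE = {128: 0b00, 256: 0b01, 512: 0b10, 1024: 0b11}
CODE_TO_SIZE = {code: size for size, code in SIZE_TO_CODE.items()}

def pack_command_byte(command: int, packet_size: int) -> int:
    assert 0 <= command < 64, "only 64 command values are available"
    return (SIZE_TO_CODE[packet_size] << 6) | command

def unpack_command_byte(value: int) -> tuple[int, int]:
    """Returns (command, packet_size)."""
    return value & 0x3F, CODE_TO_SIZE[(value >> 6) & 0b11]

assert unpack_command_byte(pack_command_byte(5, 512)) == (5, 512)
```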

When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view: for
a given workload, the packet sizes will be almost uniform across the
entire job.
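
The exact negotiation rule is still open; one possible sketch - an
assumption for illustration, not a decision - is to have every device
announce the size it wants at boot and settle on the largest request,
so that every device's packets fit.

```python
# One possible (assumed) negotiation rule: adopt the largest requested
# size among the devices, so every workload's packets fit.
ALLOWED_SIZES = (128, 256, 512, 1024)

def negotiate_packet_size(requested: list[int]) -> int:
    if not all(size in ALLOWED_SIZES for size in requested):
        raise ValueError("requested size is not a supported ROSE size")
    return max(requested)

assert negotiate_packet_size([128, 512, 256, 512]) == 512
```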

To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want each device to have
16KB of TX queue. That means at least 16 slots, and up to 128 slots
per queue. In a well-designed system, a device should only receive one
size category of packets (e.g. in a trading system, there are specific
devices that handle order book digestion and expect larger packets,
while the devices that take in orders may only expect smaller
packets). Of course, this idea can be rethought in the future when
THORN actually generates different loads targeting different systems.
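
A quick back-of-the-envelope check of those slot counts, using the
fixed packet sizes from above:

```python
# Slot counts for a 16 KB TX queue per interface at each fixed size.
TX_QUEUE_BYTES = 16 * 1024

for packet_size in (128, 256, 512, 1024):
    slots = TX_QUEUE_BYTES // packet_size
    print(f"{packet_size:>5} B packets -> {slots:>3} slots per TX queue")

# 1024 B packets give 16 slots and 128 B packets give 128 slots,
# matching the "at least 16, up to 128" range. Four such queues use
# 64 KB of the ~103 KB of BRAM, leaving room for RX buffers and the
# rest of the logic.
```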

The above design would dramatically shrink the ROSE header: 1 byte for
command + size, 1 byte for the destination address, 1 byte for the
source address, 4 bytes for a possible sequence number, and one byte
at the end for a CRC-8 checksum.

After some more thought, the sequence number can be moved into
"feature extensions" by utilizing the 64 commands I have. Even the CRC
byte can be encoded the same way, which brings the total overhead of
the standard ROSE protocol to 4 bytes.
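
To make that concrete, here is a small sketch that packs the fields
listed above, with the sequence number and CRC-8 treated as optional
feature extensions. The field order and the CRC-8 polynomial (the
common 0x07) are placeholders of mine, not a finalized spec.

```python
# Sketch of the shrunken ROSE header plus optional extensions. Field
# order and the CRC-8 polynomial are illustrative assumptions.
SIZE_TO_CODE = {128: 0b00, 256: 0b01, 512: 0b10, 1024: 0b11}

def crc8(data: bytes, poly: int = 0x07) -> int:
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def build_header(command: int, packet_size: int, dst: int, src: int,
                 seq: int | None = None, with_crc: bool = False) -> bytes:
    header = bytes([(SIZE_TO_CODE[packet_size] << 6) | (command & 0x3F), dst, src])
    if seq is not None:        # "feature extension": sequence number
        header += seq.to_bytes(4, "big")
    if with_crc:               # "feature extension": CRC-8 over the header
        header += bytes([crc8(header)])
    return header

hdr = build_header(command=1, packet_size=256, dst=2, src=7, seq=42, with_crc=True)
assert crc8(hdr[:-1]) == hdr[-1]   # the CRC extension covers the rest of the header
```
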
### Designing the router logic

The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to the
same one. So there has to be some internal packet queue.

I want the internal routing logic to run at 135MHz, with the SPI
interfaces capped at 50MHz, and to have the logic take in a byte at a
time instead of a bit. This means the internal logic runs much faster
than the interfaces, which enables the fabric to handle simultaneous
inputs from different devices.
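
A quick arithmetic check of that headroom, assuming four attached SPI
interfaces at 50Mbps each and one byte consumed per 135MHz cycle:

```python
# Rough headroom check: the routing logic consumes one byte per cycle
# at 135 MHz, while each SPI interface delivers at most 50 Mbps.
ROUTER_CAPACITY_BPS = 135_000_000 * 8      # one byte per cycle
SPI_LINK_BPS = 50_000_000
NUM_INTERFACES = 4                         # assumed device count

aggregate_ingress = NUM_INTERFACES * SPI_LINK_BPS
print(f"router capacity  : {ROUTER_CAPACITY_BPS / 1e6:.0f} Mbps")   # 1080 Mbps
print(f"aggregate ingress: {aggregate_ingress / 1e6:.0f} Mbps")     # 200 Mbps
print(f"headroom         : {ROUTER_CAPACITY_BPS / aggregate_ingress:.1f}x")
```

So even with all four devices sending at full rate, the routing logic
has roughly 5x headroom.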

The first piece of the design came right off the bat - multi-headed
queues. I plan to use 3-4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.

The first idea was to have a shared pool of memory for the queues,
which would handle congestion beautifully, since it means all TX
queues are dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly mean fairness.

Then I thought of the second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
performs wonderfully in fairness and latency when combined with
multi-headed queues and routing logic that is faster than the
interfaces.
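
Here is a toy model of what I mean by a per-interface, multi-headed TX
queue - several sources filling slots concurrently while the (slower)
interface drains completed packets in FIFO order. The slot and head
counts are placeholders, and the real thing lives in FPGA logic, not
Python.

```python
from collections import deque

class TxQueue:
    """Toy model of a per-interface TX queue with multiple write heads."""

    def __init__(self, slot_size: int, num_slots: int, num_heads: int = 4):
        self.slot_size = slot_size
        self.free_slots = num_slots
        self.num_heads = num_heads
        self.heads = {}           # source id -> slot currently being filled
        self.ready = deque()      # completed packets awaiting the wire

    def open_head(self, source: int) -> bool:
        """A source claims a write head and a slot, if any are free."""
        if source in self.heads or len(self.heads) >= self.num_heads \
                or self.free_slots == 0:
            return False
        self.heads[source] = bytearray()
        self.free_slots -= 1
        return True

    def push_byte(self, source: int, byte: int) -> None:
        slot = self.heads[source]
        slot.append(byte)
        if len(slot) == self.slot_size:     # packet complete: commit the slot
            self.ready.append(bytes(slot))
            del self.heads[source]

    def pop_packet(self) -> bytes | None:
        """Called by the interface when it is ready to transmit."""
        if not self.ready:
            return None
        self.free_slots += 1
        return self.ready.popleft()

q = TxQueue(slot_size=4, num_slots=16, num_heads=4)
assert q.open_head(source=0) and q.open_head(source=1)
for b in (1, 2, 3, 4):
    q.push_byte(0, b)                       # source 0 finishes first
assert q.pop_packet() == bytes([1, 2, 3, 4])
```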

To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, big enough to hold one
packet while it gets collected by the routing logic.

Matching the RX buffer, the interface can directly tell the routing
logic where each collected byte should be sent. This means a dramatic
decrease in the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of an increase in the
interfaces' logic.
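
A tiny sketch of that split, reusing the header layout I assumed
earlier (destination address in the second header byte); illustrative
only:

```python
# The interface holds one packet in its RX buffer, reads the destination
# out of the header itself, and hands the routing logic (destination,
# byte) pairs, so the router never buffers or parses headers.
from typing import Iterator

def drain_rx_buffer(rx_buffer: bytes) -> Iterator[tuple[int, int]]:
    dst = rx_buffer[1]                    # destination address byte
    for byte in rx_buffer:
        yield dst, byte

def route(tx_queues: dict[int, list[int]], rx_buffer: bytes) -> None:
    """The routing logic: append each byte to the destination's TX queue."""
    for dst, byte in drain_rx_buffer(rx_buffer):
        tx_queues[dst].append(byte)

queues = {2: [], 3: []}
route(queues, bytes([0b01000001, 2, 7]) + b"payload")   # cmd/size, dst=2, src=7
assert bytes(queues[2]).endswith(b"payload")
```
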
### Leftover question

What kind of congestion control can be implemented on top of this
design, one that mainly hardens incast and burst resiliency?

I already have in mind using credit-based flow control or a
DCTCP-style ECN scheme, but this is still left for the future me to
decide.
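
For future reference, a minimal sketch of the credit-based direction
(the ECN alternative isn't modeled here); the granularity of one
credit per queue slot is an assumption:

```python
# Credit-based flow control sketch: the fabric grants a sender one
# credit per free slot in the destination's TX queue, and the sender
# may only transmit while it holds credits.
class CreditedLink:
    def __init__(self, initial_credits: int):
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def on_packet_sent(self) -> None:
        assert self.credits > 0, "sender violated flow control"
        self.credits -= 1

    def on_credit_returned(self) -> None:
        self.credits += 1      # the fabric freed a slot and returned the credit

link = CreditedLink(initial_credits=16)   # e.g. 16 slots of 1024 B packets
link.on_packet_sent()
assert link.can_send()
```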

For now, the focus is on making the logic happen and letting packets
drop.

## Reflections

> Even ideas should be reflected on and refined.

1. SPI is very limiting. 50MHz (i.e. 50Mbps of wire speed) is slow
   compared to gigabit Ethernet. Hopefully the reduction of latency in
   the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to industrial-grade
   FPGAs, starting with replacing SPI with SERDES and increasing the
   internal bus width from 8 bits (the bus through which packets are
   collected from the interfaces by the routing logic) to 256 bits.
   This would allow stable operation at the scale of 30Gbps SERDES
   connections and 400MHz FPGA clocks, which would make it comparable
   to modern-day Ethernet connections (a rough bandwidth check follows
   this list).
3. There were **a lot** of trade-offs when considering asymmetric
   interface clocks. I'd like direct streaming from one interface to
   another to be possible, but clearly that won't be the case. This
   means one copy of packet data within the fabric itself, effectively
   doubling the latency through the fabric. But this "trade-off" must
   be made, unless there's a magical way of syncing the interfaces
   (and that would mean a direct connection, not a network).
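
The rough bandwidth check for point 2, with the 256-bit bus width and
400MHz clock from above (this only shows the per-link comparison):

```python
# Back-of-the-envelope check for the scaled-up fabric: a 256-bit
# internal bus at 400 MHz against a single 30 Gbps SERDES link.
internal_bps = 256 * 400_000_000
link_bps = 30_000_000_000
print(f"internal bus: {internal_bps / 1e9:.1f} Gbps")    # 102.4 Gbps
print(f"one SERDES  : {link_bps / 1e9:.1f} Gbps")        # 30.0 Gbps
print(f"headroom    : {internal_bps / link_bps:.1f}x per link")
```
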
## Final thoughts

I've been thinking a lot about the initiative behind ROSE: why build
it on such limiting hardware? And I came to this conclusion:

> Design systems in thought, in code, in the garage.

The hardware for ROSE will stay on my desk, but the things I learned
by building it will stay in my mind and could potentially be put to
use in industry.

Ideas should scale both up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if an idea works in industry, it should also be applicable (though not
necessarily practical) on consumer-grade hardware.

I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.

## Next goals

Implement the logic in simulation. I've already started, but it's time
to actually get some results.
|