# Fabric Logic

Date: 2025-05-13

## Goals and expectations

Not much in the way of expectations for this devlog. I fell ill,
probably the spring flu. But hopefully, I can write down what I've
been planning.

I've decided to skip implementing sending back the reversed string
that the fabric received - there's not much use for it in asynchronous
communication.

## Reworked design

This is what I've been doing for the past few days - reworking some
design, or rather, solidifying what I have in my plan into designs.

### Rethinking the header

Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory, to
prevent cross-block BRAM access and to simplify the logic, I figured I
only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even lead to ultra-small packet sizes like 16 bytes
when the network needs them.

That decision was made with the existence of management ports in mind,
i.e. I expect that devices involved in controlling the workload on a
ROSE network would also have a management interface (e.g. Ethernet,
Wi-Fi, etc.). So there's no practical need for control packets that
are only a few dozen bytes to exist in a ROSE network; that traffic
can be left to more mature network stacks, even if they have higher
latency.

The size selection can directly eat up 2 bits of the command byte in
the header; I'm confident that ROSE doesn't need 256 different
commands for the network - 64 is probably more than enough.
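
To make the bit budget concrete, here is a minimal sketch of how that
command byte could be packed. The exact bit positions and the mapping
of the 2-bit code to size classes are my assumptions, not a settled
spec:

```python
# Hypothetical command-byte layout: top 2 bits select the packet size
# class, low 6 bits carry the command (64 possible commands).
SIZE_CLASSES = {0b00: 128, 0b01: 256, 0b10: 512, 0b11: 1024}  # bytes

def pack_command(size_class, command):
    assert 0 <= size_class < 4 and 0 <= command < 64
    return (size_class << 6) | command

def unpack_command(byte):
    return SIZE_CLASSES[(byte >> 6) & 0b11], byte & 0x3F

print(hex(pack_command(0b01, 5)))   # 0x45: a 256B packet, command 5
print(unpack_command(0x45))         # (256, 5)
```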

When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view:
for a given workload, the packet sizes will be almost consistent
across the entire job.
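
The negotiation rule itself isn't pinned down yet. As a placeholder,
here is one simple way it could resolve - every device proposes the
size class it prefers and the network settles on the largest, so that
every proposal still fits. Both the `negotiate` helper and the "take
the maximum" rule are assumptions for illustration:

```python
# Hypothetical boot-time size negotiation: each device proposes one of
# the fixed size classes and the whole network settles on one value.
def negotiate(proposals):
    """Pick the largest proposed size so every device's packets fit."""
    allowed = {128, 256, 512, 1024}
    assert all(p in allowed for p in proposals)
    return max(proposals)

print(negotiate([256, 256, 1024, 128]))  # -> 1024
```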

To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want the devices to each
have a 16KB TX queue. That means at least 16 slots, and up to 128
slots per queue. In a well-designed system, a device should only
receive one size category of packets (e.g. in a trading system, there
are specific devices that handle order book digestion and expect
larger packets, while the devices that handle orders may only expect
smaller packets). Of course, this idea can be rethought in the future
when THORN actually generates different loads targeting different
systems.
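
The slot counts fall straight out of the 16KB budget divided by the
chosen size class:

```python
# Slots per 16KB TX queue for each fixed packet size.
TX_QUEUE_BYTES = 16 * 1024
for size in (128, 256, 512, 1024):
    print(f"{size:4d}B packets -> {TX_QUEUE_BYTES // size:3d} slots")
# 128B -> 128 slots, 256B -> 64, 512B -> 32, 1024B -> 16
```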

The above design would dramatically shrink the ROSE header: 1 byte for
command + size, 1 byte for the destination address, 1 byte for the
source address, 4 bytes for a possible sequence number, and at the end
one byte for a CRC-8 checksum.

After some more thought, the sequence number can be put into "feature
extensions" by utilizing the 64 commands I have. Even the CRC byte can
be encoded the same way, which brings the total overhead of the
standard ROSE protocol down to 4 bytes.
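
As a sanity check on the field layout, here is a small sketch that
packs the full (pre-extension) header. The field order follows the
list above, but the byte order and how the CRC would actually be
computed are my assumptions, so treat it as illustrative rather than
the wire format:

```python
import struct

def pack_header(cmd_byte, dst, src, seq, crc8):
    """Pack the 8-byte header: command+size, dst, src, seq, CRC-8."""
    return struct.pack(">BBBIB", cmd_byte, dst, src, seq, crc8)

hdr = pack_header(0x45, dst=0x02, src=0x01, seq=7, crc8=0xAA)
print(len(hdr), hdr.hex())  # 8 45020100000007aa
```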

### Designing the router logic

The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to the
same destination. So, there has to be some internal packet queue.

I want the internal routing logic to run at 135MHz while the SPI
interfaces are capped at 50MHz, and to have the logic take in a byte
at a time instead of a bit. This means that the internal logic runs
much faster than the interfaces, which would enable the fabric to
handle simultaneous inputs from different devices.
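
A quick back-of-the-envelope check on why that clock ratio is
comfortable, assuming each SPI interface delivers one bit per 50MHz
cycle and the routing core can accept one byte per 135MHz cycle (the
per-cycle throughput figures are my assumptions):

```python
# Rough ingress vs. routing-core throughput for 4 SPI interfaces.
spi_hz, core_hz, interfaces = 50e6, 135e6, 4

bytes_per_sec_per_if = spi_hz / 8                  # 6.25 MB/s each
total_ingress = interfaces * bytes_per_sec_per_if  # 25 MB/s combined
core_capacity = core_hz * 1                        # 135 MB/s

print(total_ingress / 1e6, core_capacity / 1e6)    # 25.0 vs 135.0
```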

The first piece of the design came right off the bat - multi-headed
queues. I plan to use 3 or 4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.

The first idea was to have a shared pool of memory for the queue,
which would handle congestion beautifully since it means that all TX
queues are dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly guarantee fairness.

Then I thought of a second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
would perform wonderfully in fairness and latency, combined with
multi-headed queues and routing logic that is faster than the
interfaces.

To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, big enough to hold one
packet while it gets collected by the routing logic.

Paired with the RX buffer, the interface can directly tell the routing
logic where each collected byte should be sent. This means a dramatic
decrease in the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of an increase in the
interfaces' logic.
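
Putting the pieces together, here is a rough behavioural sketch of the
datapath: a per-interface RX buffer and destination hint, and a
routing core that moves one byte per cycle into the destination's TX
queue. The class names and the one-byte-per-cycle granularity are mine
and only illustrate the shape of the logic - this is not RTL:

```python
from collections import deque

class Interface:
    """One SPI port: an RX buffer big enough for a single packet, a
    destination index the interface parsed from the header, and its
    own per-interface TX queue."""
    def __init__(self):
        self.rx_buffer = deque()  # bytes of the packet being received
        self.rx_dest = None       # destination told directly to the router
        self.tx_queue = deque()   # bytes waiting to go out on the wire

def route_one_cycle(interfaces):
    """One routing-core cycle: pull at most one byte from each RX
    buffer and push it straight into the destination's TX queue."""
    for iface in interfaces:
        if iface.rx_buffer and iface.rx_dest is not None:
            dest = interfaces[iface.rx_dest]
            dest.tx_queue.append(iface.rx_buffer.popleft())

# e.g. interface 0 holds a byte destined for interface 2
fabric = [Interface() for _ in range(4)]
fabric[0].rx_buffer.append(0xAB)
fabric[0].rx_dest = 2
route_one_cycle(fabric)
print(list(fabric[2].tx_queue))  # [171]
```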

### Leftover question

What kind of congestion control can be built on top of this design,
mainly to harden incast and burst resiliency?

I already have in mind using credit-based flow control or a
DCTCP-style ECN scheme. But that is still left for the future me to
decide.

For now, focus on making the logic happen and let the packets drop.
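
Just to park the idea somewhere, a credit-based scheme could be as
simple as the sketch below: the receiver hands out one credit per free
TX-queue slot and the sender only transmits while it holds credits.
This is entirely speculative and not part of the current design:

```python
class CreditedSender:
    """Speculative credit-based flow control: one credit equals one
    free slot in the receiver's TX queue; no credit, no packet."""
    def __init__(self, initial_credits):
        self.credits = initial_credits

    def can_send(self):
        return self.credits > 0

    def on_send(self):
        assert self.can_send()
        self.credits -= 1          # consume a slot at the receiver

    def on_credit_return(self, n=1):
        self.credits += n          # receiver freed n slots
```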

## Reflections

> Even ideas should be reflected on and refined.

1. SPI is very limiting. 50MHz (or 50Mbps of wire speed) is slow
   compared against gigabit Ethernet. Hopefully the reduction of
   latency in the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to industrial-grade
   FPGAs, starting with replacing SPI with SERDES and increasing the
   internal bus width from 8 bits (the bus through which packets are
   collected from the interfaces by the routing logic) to 256 bits.
   This would allow stable operation at the scale of 30Gbps SERDES
   connections and 400MHz FPGA clocks, which would make it comparable
   to modern-day Ethernet connections (see the quick estimate after
   this list).
3. There were **a lot** of trade-offs when considering asymmetric
   interface clocks. I'd like direct streaming from one interface to
   another to be possible, but clearly that won't be the case. This
   means one copy of packet data within the fabric itself, effectively
   doubling the latency through the fabric. But this "trade-off" must
   be made, unless there's a magical way of syncing the interfaces
   (and that would mean a direct connection, not a network).
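
The quick estimate referenced in point 2, using the numbers above and
assuming one bus transfer per clock cycle:

```python
# Scaled-up internal bus vs. the current 8-bit @ 135MHz design.
current = 8 * 135e6          # 1.08 Gbps of internal raw bandwidth
scaled = 256 * 400e6         # 102.4 Gbps with a 256-bit bus at 400MHz
serdes = 30e9                # one 30Gbps SERDES link

print(current / 1e9, scaled / 1e9)  # 1.08 vs 102.4 Gbps
print(scaled / serdes)              # ~3.4 links' worth of headroom
```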

## Final thoughts

I've been thinking a lot about the motivation behind ROSE: why build
it on such limiting hardware? And I arrived at this conclusion:

> Design systems in thought, in code, in the garage.

The hardware for ROSE would stay on my desk, but the things I learned
by doing it would stay in my mind and potentially be put to use in the
industry.

Ideas should be scalable up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if some idea works in the industry, it should also be applicable (not
necessarily practical) on consumer-grade hardware.

I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.

## Next goals

Implement the logic in sims. I've already started, but it's time to
actually get some results.