began work on the central routing logic, updated some documentation

2025-05-14 22:27:40 -04:00
parent 24bf28db9d
commit b26a716ccf
5 changed files with 329 additions and 1 deletions

# Fabric Logic
Date: 2025-05-13
## Goals and expectations
Not many expectations for this devlog - I fell ill, probably with the
spring flu. But hopefully I can write down what I've been planning.
I've decided to skip implementing the part that sends back the reversed
string the fabric received - there's not much use for it in
asynchronous communication.
## Reworked design
This is what I've been doing for the past few days - reworking some of
the design, or rather, solidifying what I had in my plan into concrete
designs.
### Rethinking the header
Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory, to
prevent cross-block BRAM access and simplify the logic, I figured I
only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even lead to ultra-small packet sizes like 16 bytes
when the network needs them.
That decision was made with the existence of management ports in mind,
i.e. I expect that devices involved in controlling the workload on a
ROSE network will also have a management interface (e.g. Ethernet,
Wi-Fi, etc.). So there's no practical need for control packets that
are only a few dozen bytes to exist in a ROSE network; that traffic
can be left to more mature network stacks even if they have higher
latency. The size selection can then directly take up 2 bits of the
command byte in the header - I'm confident that ROSE doesn't need 256
different commands for the network; 64 is probably more than enough.
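To make that split concrete, here's a minimal sketch in Python; putting
the size bits at the top of the byte is my own assumption, nothing here
is a finalized layout:

```python
# Sketch of the command byte: 2 bits pick one of the four packet sizes and
# the remaining 6 bits carry the command (64 possible values). Placing the
# size bits in the top of the byte is an assumption, not a fixed layout.

SIZES = {0b00: 128, 0b01: 256, 0b10: 512, 0b11: 1024}

def pack_cmd_byte(size_code: int, command: int) -> int:
    assert size_code in SIZES and 0 <= command < 64
    return (size_code << 6) | command

def unpack_cmd_byte(byte: int) -> tuple[int, int]:
    return SIZES[(byte >> 6) & 0b11], byte & 0b00111111

# Example: a hypothetical command 0x01 inside a 256-byte packet.
b = pack_cmd_byte(0b01, 0x01)
print(hex(b), unpack_cmd_byte(b))  # 0x41 (256, 1)
```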
When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view:
for a given workload, the packet sizes will be almost uniform across
the entire job.
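Purely as a toy illustration - the actual negotiation policy isn't
decided here - one could settle on the largest size any device asks
for:

```python
# Toy negotiation sketch: every device requests a packet size at boot and the
# network settles on one shared value. "Largest request wins" is an assumed
# policy used only for illustration.

ALLOWED_SIZES = (128, 256, 512, 1024)

def negotiate(requests: dict[str, int]) -> int:
    assert all(size in ALLOWED_SIZES for size in requests.values())
    return max(requests.values())

print(negotiate({"ingest": 1024, "orders": 128, "control": 256}))  # 1024
```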
To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want each device to have a
16KB TX queue. That means at least 16 slots, and up to 128 slots per
queue. In a well-designed system, a device should only receive one
size category of packets (e.g. in a trading system, specific devices
handle order book digestion and expect larger packets, while the
devices that handle orders may only expect smaller packets). Of
course, this idea can be rethought in the future when THORN actually
generates different loads targeting different systems.
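The slot counts fall straight out of the arithmetic:

```python
# Slot-count arithmetic for a 16 KB TX queue at each fixed packet size.
TX_QUEUE_BYTES = 16 * 1024

for pkt_size in (128, 256, 512, 1024):
    print(f"{pkt_size:>4} B packets -> {TX_QUEUE_BYTES // pkt_size} slots")
# 1024 B -> 16 slots, down to 128 B -> 128 slots: the "at least 16, up to
# 128 slots per queue" range above.
```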
The above design dramatically shrinks the ROSE header: 1 byte for
command + size, 1 byte for the destination address, 1 byte for the
source address, 4 bytes for an optional sequence number, and one byte
at the end for a CRC-8 checksum.
After some more thought, the sequence number can be moved into
"feature extensions" by utilizing the 64 commands I have. Even the
CRC byte can be encoded the same way, which brings the total overhead
of a standard ROSE packet to 4 bytes.
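A rough sketch of header assembly under that scheme; the field order
and how the extensions are signalled are my assumptions, not a
finalized format:

```python
# Rough header-assembly sketch: command+size byte, destination, source, then
# optional extensions (sequence number, CRC-8) appended only when the command
# calls for them. Field order and the extension mechanism are assumptions.

def build_header(cmd_byte: int, dst: int, src: int,
                 seq: int | None = None, crc8: int | None = None) -> bytes:
    header = bytes([cmd_byte, dst, src])
    if seq is not None:                  # "sequence number" feature extension
        header += seq.to_bytes(4, "big")
    if crc8 is not None:                 # "CRC-8" feature extension
        header += bytes([crc8])
    return header

print(build_header(0x41, dst=2, src=1).hex())                    # 410201
print(build_header(0x41, dst=2, src=1, seq=7, crc8=0xAB).hex())  # with extensions
```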
### Designing the router logic
The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to the
same one. So there has to be some internal packet queue.
I want the internal routing logic to run at 135 MHz, with the SPI
interfaces capped at 50 MHz, and have the logic take in a byte at a
time instead of a bit. This means the internal logic runs much faster
than the interfaces, which enables the fabric to handle simultaneous
inputs from different devices.
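A quick check of the speed margin that makes this work (assuming all
four interfaces burst at once):

```python
# Speed-margin check: byte-wide routing at 135 MHz versus four bit-wide SPI
# interfaces at 50 MHz each.
routing_mbps = 135 * 8        # 1080 Mbps through the internal byte-wide bus
spi_total_mbps = 4 * 50       # 200 Mbps aggregate from the interfaces
print(routing_mbps / spi_total_mbps)  # 5.4x headroom
```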
The first piece of the design came right off the bat - multi-headed
queues. I plan to use 3 or 4 heads per TX queue, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.
The first idea was to have a shared pool of memory for the queues,
which would handle congestion beautifully since all TX queues would be
dynamically allocated. However, it would be a disaster for latency,
since a single shared FIFO doesn't exactly guarantee fairness.
Then I thought of the second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
performs wonderfully in fairness and latency when combined with
multi-headed queues and routing logic that runs faster than the
interfaces.
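Here's a behavioral sketch (Python, not RTL) of what a multi-headed
per-interface TX queue could look like; the slot bookkeeping and head
semantics are my own assumptions for illustration:

```python
# Behavioral sketch (not RTL) of a multi-headed per-interface TX queue:
# several write heads can each be filling a different slot at once (packets
# arriving from different source ports), while the interface drains completed
# slots in order. All bookkeeping details here are assumptions.
from collections import deque

class MultiHeadTxQueue:
    def __init__(self, num_heads=4, num_slots=16, pkt_size=1024):
        self.pkt_size = pkt_size
        self.num_heads = num_heads
        self.free_slots = deque(range(num_slots))
        self.filling = {}      # head id -> (slot id, bytes collected so far)
        self.ready = deque()   # completed slots waiting to go out on the wire

    def open_head(self, head: int) -> bool:
        """Claim a free slot for a new packet arriving on this write head."""
        if head >= self.num_heads or head in self.filling or not self.free_slots:
            return False       # head busy or queue full: caller stalls/drops
        self.filling[head] = (self.free_slots.popleft(), bytearray())
        return True

    def push_byte(self, head: int, byte: int) -> None:
        slot, buf = self.filling[head]
        buf.append(byte)
        if len(buf) == self.pkt_size:            # packet complete
            self.ready.append((slot, bytes(buf)))
            del self.filling[head]

    def pop_packet(self):
        """Drained at interface speed: emit the next completed packet."""
        if not self.ready:
            return None
        slot, pkt = self.ready.popleft()
        self.free_slots.append(slot)
        return pkt
```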
To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, large enough to hold one
packet while it gets collected by the routing logic.
Alongside the RX buffer, the interface can directly tell the routing
logic where each collected byte should be sent. This dramatically
decreases the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of more logic in the interfaces.
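A sketch of how thin the routing pass becomes under this split; the
`RxBuffer` shape and the round-robin polling order are hypothetical:

```python
# Sketch of the simplified routing pass: each interface exposes its RX bytes
# together with a destination it already parsed from the header, so the
# router only moves bytes and never touches headers. The RxBuffer shape and
# polling order are hypothetical.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class RxBuffer:
    port_id: int
    destination: int | None = None        # filled in by the interface logic
    data: deque = field(default_factory=deque)

def route_one_pass(rx_buffers, tx_queues):
    """One round of the routing logic over all interfaces."""
    for rx in rx_buffers:
        if rx.destination is None or not rx.data:
            continue                       # nothing collected on this port yet
        tx_queues[rx.destination].append((rx.port_id, rx.data.popleft()))
```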
### Leftover question
What kind of congestion control can be implemented on top of this
design, one that mainly hardens incast and burst resiliency?
I already have in mind using credit-based flow control or a
DCTCP-style ECN. But this is still left for future me to decide.
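For the record, a minimal sketch of the credit-based variant; nothing
here is decided and the numbers are placeholders:

```python
# Minimal credit-based flow-control sketch (undecided): a sender may only
# transmit while it holds credits for a destination, and the destination
# returns a credit whenever one of its TX-queue slots frees up.

class CreditGate:
    def __init__(self, initial_credits: int = 16):   # e.g. one per queue slot
        self.credits = initial_credits

    def try_send(self) -> bool:
        if self.credits == 0:
            return False          # hold the packet (or, for now, drop it)
        self.credits -= 1
        return True

    def credit_returned(self) -> None:
        self.credits += 1         # destination freed a slot

gate = CreditGate(initial_credits=2)
print(gate.try_send(), gate.try_send(), gate.try_send())  # True True False
```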
For now, focus on making the logic happen and let the packets drop.
## Reflections
> Even ideas should be reflected and refined.
1. SPI is very limiting. 50 MHz (or 50 Mbps of wire speed) is slow
compared to gigabit Ethernet. Hopefully the latency reduction in the
network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to industrial-grade
FPGAs, starting with replacing SPI with SERDES and widening the
internal bus (the bus through which the routing logic collects packets
from the interfaces) from 8 bits to 256 bits. This would allow stable
operation at the scale of 30 Gbps SERDES connections and 400 MHz FPGA
clocks, making it comparable to modern-day Ethernet connections (a
quick back-of-the-envelope check follows after this list).
3. There were **a lot** of trade-offs when considering asymmetric
interface clocks. I'd like direct streaming from one interface to
another to be possible, but clearly that won't be the case. This
means one copy of packet data within the fabric itself, effectively
doubling the fabric's latency. But this "trade-off" must be made,
unless there's a magical way of syncing the interfaces (and that would
mean a direct connection, not a network).
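The back-of-the-envelope check for point 2 (raw bus throughput only,
ignoring protocol overhead):

```python
# Back-of-the-envelope throughput for the scaled-up variant in point 2:
# a 256-bit internal bus at 400 MHz against a single 30 Gbps SERDES link.
bus_gbps = 256 * 400 / 1000        # 102.4 Gbps of raw internal bus throughput
serdes_gbps = 30
print(bus_gbps, bus_gbps / serdes_gbps)  # ~3.4x a single link's line rate
```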
## Final thoughts
I've been thinking a lot about the motivation for ROSE: why build it
on such limiting hardware? And I came to this conclusion:
> Design systems in thought, in code, in the garage.
The hardware for ROSE would stay on my desk, but the things I learned
by doing it would stay in my mind and potentially be put to use in the
industry.
Ideas should be scalable up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if some idea works in the industry, it should also be applicable (not
necessarily practical) on consumer-grade hardware.
I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.
## Next goals
Implement the logic in sims. I've already started, but it's time to
actually get some results.