From b26a716ccffff902bdb1d5a1f492426cf5dbe621 Mon Sep 17 00:00:00 2001
From: Peisong Xiao
Date: Wed, 14 May 2025 22:27:40 -0400
Subject: [PATCH] began work on the central routing logic, updated some
 documentation

---
 README.md                         |   3 +-
 devlog/2025-05-13-Fabric-logic.md | 152 ++++++++++++++++++++++++++++++
 fabric/src/mem_hub.sv             |  50 ++++++++++
 plan.md                           |  94 ++++++++++++++++++
 protocol.md                       |  31 ++++++
 5 files changed, 329 insertions(+), 1 deletion(-)
 create mode 100644 devlog/2025-05-13-Fabric-logic.md
 create mode 100644 fabric/src/mem_hub.sv
 create mode 100644 plan.md
 create mode 100644 protocol.md

diff --git a/README.md b/README.md
index 6d3b606..4a76ffc 100644
--- a/README.md
+++ b/README.md
@@ -43,7 +43,8 @@ See `protocol.md` for details.
 ROSE was designed to embrace newer possibilities as development continues.
 
 ## The planning
-See the plan in `plan.md`.
+See the plan in `plan.md`. This file also contains short summaries of
+what I did at each step.
 
 Most of ROSE's behaviors and features have been planned *before* the
 first source file was even created. A good plan serves both as a good
diff --git a/devlog/2025-05-13-Fabric-logic.md b/devlog/2025-05-13-Fabric-logic.md
new file mode 100644
index 0000000..8203816
--- /dev/null
+++ b/devlog/2025-05-13-Fabric-logic.md
@@ -0,0 +1,152 @@
+# Fabric Logic
+Date: 2025-05-13
+
+## Goals and expectations
+Not many expectations for this devlog. I fell ill, probably the
+spring flu. But hopefully, I can write down what I've been planning.
+
+I've decided to skip implementing the part that sends back the
+reversed string the fabric received - it's not much use in
+asynchronous communication.
+
+## Reworked design
+This is what I've been doing for the past few days - reworking some
+of the design, or rather, solidifying what I have in my plan into
+designs.
+
+### Rethinking the header
+Initially, I designed ROSE's header to be 20 bytes, which includes a
+32-bit integer for the size and some redundancy.
+However, after looking at some methods of organizing the FPGA's
+internal memory, to prevent cross-block BRAM access and simplify the
+logic, I figured I only need a few fixed packet sizes: 128B, 256B,
+512B, and 1024B (or whatever lengths align with the BRAM
+partitioning of the FPGA). Note that this could even allow
+ultra-small packet sizes like 16 bytes when the network needs them.
+
+That decision was made with the existence of management ports in
+mind, i.e. I expect that devices involved in the control of the
+workload on a ROSE network would also have a management interface
+(e.g. Ethernet, Wi-Fi, etc.). So, there's no practical need for
+control packets that are a few dozen bytes large to exist in a ROSE
+network; that can be left to more mature network stacks, even if
+they have higher latency.
+
+And this size field can directly take up 2 bits of the command byte
+in the header; I'm confident that ROSE doesn't need 256 different
+commands for the network - 64 is probably more than enough.
+
+When booting up the network, the devices will send out a packet to
+negotiate how big the packets are - all of the devices must share
+the same packet size. This choice arose from a workload point of
+view: for a given workload, the packet sizes will be almost uniform
+across the entire job.
+
+To take this further, I also have in mind letting devices declare
+the sizes of their packet buffer slots. With about 103KB of BRAM on
+the Tang Primer 20K and 4 connected devices, I want the devices to
+each have a 16KB TX queue. That means at least 16 slots, and up to
+128 slots per queue. In a well-designed system, a device should
+only receive one size category of packets (e.g. in a trading system,
+specific devices handle order book digestion and expect larger
+packets, while the devices that expect orders may only expect
+smaller ones).
+Of course, this idea can be rethought in the future when THORN
+actually generates different loads targeting different systems.
+
+The above design dramatically shrinks the ROSE header: 1 byte for
+command + size, 1 byte for the destination address, 1 byte for the
+source address, 4 bytes for a possible sequence number, and one byte
+at the end for a CRC-8 checksum.
+
+After some more thought, the sequence number can be moved into
+"feature extensions" by utilizing the 64 commands I have. Even the
+CRC byte could be encoded the same way. That brings the total
+overhead of a standard ROSE packet to 4 bytes: 3 header bytes plus
+the CRC-8 byte.
+
+### Designing the router logic
+The fabric needs some kind of internal buffer to handle a)
+asymmetric interface speeds and b) multiple devices trying to send
+packets to the same one. So, there has to be some internal packet
+queue.
+
+I want the internal routing logic to run at 135 MHz, with the SPI
+interfaces capped at 50 MHz, and have the logic take in a byte at a
+time instead of a bit. This means the internal logic runs much
+faster than the interfaces, which enables the fabric to handle
+simultaneous inputs from different devices.
+
+The first design decision came right off the bat - multi-headed
+queues. I plan to use 3 or 4 heads for the TX queues, so that when
+receiving from the higher-speed internal logic, the interfaces can
+handle multiple input sources.
+
+The first idea was to have a shared pool of memory for the queue,
+which would handle congestion beautifully, since it means that all
+TX queues are dynamically allocated. However, it would be a
+disaster for latency, since a FIFO queue doesn't exactly mean
+fairness.
+
+Then I thought of the second idea: separate TX queues for each
+interface. Although this means less incast and burst resilience, it
+performs wonderfully in fairness and latency, combined with
+multi-headed queues and routing logic that is faster than the
+interfaces.
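As a sanity check of this idea, here is a minimal Python model of one interface's TX queue with multiple write heads. It is a toy sketch, not the RTL: the class name `TxQueue`, the head count, and the slot depth are illustrative assumptions.

```python
from collections import deque

class TxQueue:
    """Toy model of one interface's TX queue with multiple write heads:
    the faster routing logic may commit packets from several RX sources
    into the same TX queue within one interface service interval."""

    def __init__(self, num_heads: int = 4, depth: int = 16):
        self.num_heads = num_heads
        self.depth = depth
        self.slots: deque = deque()

    def enqueue(self, packets):
        """Accept up to num_heads packets per routing cycle, dropping
        the rest (no congestion control yet - let the packets drop)."""
        accepted = []
        for pkt in packets[: self.num_heads]:
            if len(self.slots) < self.depth:
                self.slots.append(pkt)
                accepted.append(pkt)
        return accepted

    def drain(self):
        """The interface reads out one packet per service interval."""
        return self.slots.popleft() if self.slots else None

# Four RX sources target the same interface in one routing cycle:
q = TxQueue(num_heads=4)
accepted = q.enqueue([("src0", b"..."), ("src1", b"..."),
                      ("src2", b"..."), ("src3", b"...")])
```

With the routing logic handling a byte per 135 MHz cycle against SPI delivering one bit per 50 MHz cycle, there are roughly 21 routing cycles per interface byte - plenty of headroom to fill several heads per service interval.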
+To compensate for the loss of the central shared pool of memory,
+each interface should also get its own RX buffer, big enough to
+hold one packet while it gets collected by the routing logic.
+
+Matching the RX buffer, the interface can directly tell the routing
+logic where each collected byte should be sent. This means a
+dramatic decrease in the complexity of the routing logic (it doesn't
+have to buffer or parse headers), at the cost of an increase in the
+interfaces' logic.
+
+### Leftover question
+What kind of congestion control can be implemented on top of this
+design that mainly improves incast and burst resilience?
+
+I already have in mind using credit-based flow control or a
+DCTCP-style ECN scheme. But this is left for the future me to
+decide.
+
+For now, focus on making the logic happen and let the packets drop.
+
+## Reflections
+> Even ideas should be reflected on and refined.
+
+1. SPI is very limiting. 50 MHz (i.e. 50 Mbps of wire speed) is
+   slow compared to gigabit Ethernet. Hopefully the reduction of
+   latency in the network and transport layers can make up for this.
+2. I gave a lot of thought to how ROSE would scale to
+   industrial-grade FPGAs, starting with replacing SPI with SERDES
+   and increasing the internal bus width from 8 bits (the bus
+   through which packets are collected from the interfaces by the
+   routing logic) to 256 bits. This would allow stable operation at
+   the scale of 30 Gbps SERDES connections and 400 MHz FPGA clocks,
+   which would make it comparable to modern-day Ethernet
+   connections.
+3. There were **a lot** of trade-offs when considering asymmetric
+   interface clocks. I'd like direct streaming from one interface
+   to another to be possible, but clearly it won't be the case.
+   This means 1 copy of packet data within the fabric itself,
+   effectively doubling the latency through the fabric.
+   But this "trade-off" must be made, unless there's a magical way
+   of syncing the interfaces (and that would mean a direct
+   connection, not a network).
+
+## Final thoughts
+I've been thinking a lot about the initiative of ROSE: why build it
+on such limiting hardware? And I arrived at this conclusion:
+
+> Design systems in thought, in code, in the garage.
+
+The hardware for ROSE will stay on my desk, but the things I learned
+by building it will stay in my mind and can potentially be put to
+use in the industry.
+
+Ideas should be scalable up and down. If something works with
+cheap, off-the-shelf FPGAs, I'd expect it to work on
+industrial-grade ones; if some idea works in the industry, it should
+also be applicable (not necessarily practical) on consumer-grade
+hardware.
+
+I consider myself a scientist: I create ideas, and I'm not limited
+by hardware or software stacks.
+
+## Next goals
+Implement the logic in sims. I've already started, but it's time to
+actually get some results.
diff --git a/fabric/src/mem_hub.sv b/fabric/src/mem_hub.sv
new file mode 100644
index 0000000..45c6357
--- /dev/null
+++ b/fabric/src/mem_hub.sv
@@ -0,0 +1,50 @@
+module mem_hub (input logic rst,
+                input logic sys_clk,
+                input logic [3:0] connected_devices, // manually configured
+                input logic [3:0][7:0] rx_cmd, // for routing-related commands
+                input logic [3:0] rx_cmd_valid,
+                input logic [3:0][7:0] rx_byte,
+                input logic [3:0] rx_valid,
+                input logic [3:0][1:0] rx2tx_dest, // rx byte's destination
+                input logic [3:0] tx_read, // if tx_byte was read
+                output logic [3:0] rx_read, // if rx_byte was read
+                output logic [3:0][1:0] tx_src, // tell the tx where the stream is coming from
+                output logic [3:0][7:0] tx_byte,
+                output logic [3:0] tx_valid,
+                output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes
+  timeunit 1ns;
+  timeprecision 1ps;
+
+  // TBD: pre-agree on packet size
+
+  // [index][rx_src]
+  logic [3:0][1:0] service_queue;
+  logic [3:0] in_queue;
+
+  // [rx_src][tx_dest], might not be useful; 4 RX sources, each
+  // holding a 2-bit TX destination
+  logic [3:0][1:0] rx2tx_map;
+
+  always_ff @ (posedge sys_clk) begin
+    if (rst) begin
+      rx_read <= '0;
+      tx_src <= '0;
+      tx_byte <= '0;
+      tx_valid <= '0;
+      packet_size <= '0;
+      service_queue <= '0;
+      in_queue <= '0;
+      rx2tx_map <= '0;
+    end else begin
+      if (in_queue == 4'd0) begin // no one is in the queue yet
+        if (rx_valid != 4'd0) begin // some RX byte is pending
+          for (int i = 0; i < 4; i++) begin
+            // TODO: write the logic for enqueuing
+          end
+        end
+      end else begin
+        // TODO: service the queue
+      end
+    end
+  end
+endmodule // mem_hub
diff --git a/plan.md b/plan.md
new file mode 100644
index 0000000..e9eb07e
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,94 @@
+# The Plan/Roadmap for ROSE
+> Plans turn fear into focus, risk into reach, and steps into a path.
+
+This plan has been modified over the course of ROSE's development.
+And that was also part of the plan itself: you plan at every step.
+See the end for the changes made to the plan.
+
+## The roadmap
+This is a rough summary of what I did and what I plan to do.
+
+### [DONE] Learning RTL and HDL and getting familiar with SPI
+Implement a functional SPI slave on the FPGA. Add small logic to
+manipulate the data. Learn about cross-clock-domain design and
+implementation.
+
+### [TODO] Implement the routing logic along with the interfaces
+This would be the core part of implementing ROSE on the fabric side.
+This is a bare-minimum implementation disregarding any congestion
+control or inter-fabric routing.
+
+### [TODO] Test on an RPi-FPGA setup
+Getting the code to run in sims is one thing, getting it to run on
+actual hardware is another. This entire step will be to ship the
+code onto my setup and deal with any kind of synthesis and
+place-and-route problems that sims won't reveal.
+
+### [TODO] Implement logging to an external device via UART
+This would lay the foundations for THORN and PETAL and would also
+come in handy when analyzing congestion and other anomalies.
+### [TODO] Test on an RPi-FPGA-RPi setup
+This is where THORN would branch off from ROSE. ROSE should keep
+some minimal unit-test testbenches, and have fully functional test
+suites and toolchains be implemented in THORN.
+
+### [TODO] RPi's ROSE buffer implementation
+`mmap` memory for the SPI drivers on the RPis to *simulate*
+zero-copy on the RPis.
+
+### [TODO] Modify the SPI kernel drivers for explicit DMA control
+Allow ROSE's DMA to be implemented in the drivers.
+
+### [TODO] Abstract ROSE into APIs or kernel devices
+Note: This may be implemented as development of THORN goes into
+action, or be facilitated by it.
+
+### [TODO] Implement congestion control
+When the logic for the fabric is mature enough, it should be
+upgraded.
+
+### [TODO] Implement mesh networks allowing inter-fabric routing
+ROSE shouldn't be limited to only 1 fabric.
+
+## Changes to the plan
+The plan is always changing, but it's important to remember what I
+learned from every change.
+
+### Ditching dynamic routing
+In a datacenter or HFT setup, the set of connected devices is rarely
+expected to change. Hardcoded routing paths are perfectly
+acceptable and keep with the deterministic nature of ROSE.
+
+#### The lesson learned
+Figure out the exact target of ROSE - it's not meant for generic
+networks, so shave off any redundancy that it doesn't need.
+
+### Not reversing the input when testing out the SPI interface
+A few things to note:
+
+1. Sending back the bytes incremented by 1 is sufficient to prove a
+   stable connection.
+2. Reversing the input would require a double-ended queue,
+   increasing the complexity of the logic with little benefit to
+   later steps.
+
+So, I've decided to ditch this idea.
+
+#### The lesson learned
+Plan with the next step in mind, take actions with the next step in
+mind. Know what is enough and what is too far.
+### Postponing hardware deployment until later
+Originally, I planned to deploy the logic and test with real
+hardware as soon as I had a working SPI module. But that's not
+really practical - I'd be fixing the synthesis with every step
+thereafter. Better to finalize the design in sims first, and then
+solve the FPGA-specific problems as an entire step.
+
+I'd rather scratch my head over a bunch of problems at once than
+scratch my head every time I push an update to the logic.
+
+#### The lesson learned
+Weigh testing against the cost of time and efficiency. If testing
+hinders development, then it should be separated from the
+development cycle.
diff --git a/protocol.md b/protocol.md
new file mode 100644
index 0000000..2bcaf9a
--- /dev/null
+++ b/protocol.md
@@ -0,0 +1,31 @@
+# The Specifications for ROSE
+Extensions to the protocol may change the specifications; see the
+devlogs for specific decisions on changes.
+
+## Packet specifications
+### Header
+
+#### (1 byte) Command + packet size
+- Packet sizes are chosen out of 4 predetermined sizes, so only 2
+  bits of this byte are needed to represent them.
+- Commands are 6 bits with 64 possibilities, see the **Commands**
+  section for details.
+
+#### (1 byte) Destination address
+- This can refer to any end-device or fabric within the network.
+
+#### (1 byte) Source address
+- This can refer to any end-device or fabric within the network.
+
+### Payload
+Via commands, leading or trailing bytes in the payload can also be
+repurposed as timestamps or other feature extensions.
+
+### (1 byte) CRC-8
+To verify packet integrity.
+
+## Commands
+TBD.
+
+### Feature Extensions
+#### [CMD: TBD] Include timestamp
+#### [CMD: TBD] Include sequence number
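To make the 4-byte overhead concrete, here is a hedged Python sketch of packing and checking a ROSE packet. The bit layout (command in the upper 6 bits, size code in the lower 2), the helper names, and the CRC-8 polynomial 0x07 are assumptions - the spec above leaves all of them open:

```python
# Hypothetical encoding of the 4-byte ROSE overhead: 1 byte command+size,
# 1 byte destination, 1 byte source, and a trailing CRC-8.
SIZES = {0: 128, 1: 256, 2: 512, 3: 1024}  # the 4 predetermined packet sizes

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise MSB-first CRC-8; polynomial x^8 + x^2 + x + 1 is an
    assumption, since the spec only says CRC-8."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def pack_header(cmd: int, size_code: int, dst: int, src: int) -> bytes:
    assert 0 <= cmd < 64 and 0 <= size_code < 4
    return bytes([(cmd << 2) | size_code, dst, src])

def unpack_header(hdr: bytes):
    cmd_size, dst, src = hdr[0], hdr[1], hdr[2]
    return cmd_size >> 2, cmd_size & 0b11, dst, src

header = pack_header(cmd=0x15, size_code=1, dst=2, src=0)
payload = bytes(SIZES[1] - 4)        # 256B packet minus 4B overhead
frame = header + payload
frame += bytes([crc8(frame)])        # append CRC over header + payload
```

Appending the CRC this way means a receiver can simply check that `crc8` over the whole received frame comes out to zero.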