From b26a716ccffff902bdb1d5a1f492426cf5dbe621 Mon Sep 17 00:00:00 2001
From: Peisong Xiao
Date: Wed, 14 May 2025 22:27:40 -0400
Subject: [PATCH] began work on the central routing logic, updated some
 documentation

---
 README.md                         |   3 +-
 devlog/2025-05-13-Fabric-logic.md | 152 ++++++++++++++++++++++++++++++
 fabric/src/mem_hub.sv             |  50 ++++++++++
 plan.md                           |  94 ++++++++++++++++++
 protocol.md                       |  31 ++++++
 5 files changed, 329 insertions(+), 1 deletion(-)
 create mode 100644 devlog/2025-05-13-Fabric-logic.md
 create mode 100644 fabric/src/mem_hub.sv
 create mode 100644 plan.md
 create mode 100644 protocol.md

diff --git a/README.md b/README.md
index 6d3b606..4a76ffc 100644
--- a/README.md
+++ b/README.md
@@ -43,7 +43,8 @@ See `protocol.md` for details.
 ROSE was designed to embrace newer possibilities as development continues.
 
 ## The planning
-See the plan in `plan.md`.
+See the plan in `plan.md`. This file also contains short summaries of
+what I did at each step.
 
 Most of ROSE's behaviors and features have been planned *before* the
 first source file was even created. A good plan serves both as a good
diff --git a/devlog/2025-05-13-Fabric-logic.md b/devlog/2025-05-13-Fabric-logic.md
new file mode 100644
index 0000000..8203816
--- /dev/null
+++ b/devlog/2025-05-13-Fabric-logic.md
@@ -0,0 +1,152 @@
+# Fabric Logic
+Date: 2025-05-13
+
+## Goals and expectations
+Not many expectations for this devlog. I fell ill, probably the
+spring flu. But hopefully, I can write down what I've been planning.
+
+I've decided to skip implementing the part that sends back the
+reversed string the fabric received - it's not much use in
+asynchronous communication.
+
+## Reworked design
+This is what I've been doing for the past few days - reworking some
+of the design, or rather, solidifying what I have in my plan into
+designs.
+
+### Rethinking the header
+Initially, I designed ROSE's header to be 20 bytes, which includes a
+32-bit integer for the size and some redundancy.
+However, after looking at some methods of organizing the FPGA's
+internal memory, to prevent cross-block BRAM access and simplify the
+logic, I figured I only need a few fixed packet sizes: 128B, 256B,
+512B, and 1024B (or whatever lengths align with the BRAM
+partitioning of the FPGA). Note that this could even allow
+ultra-small packet sizes like 16 bytes when the network needs them.
+
+That decision was made with the existence of management ports in
+mind, i.e. I expect that devices involved in the control of the
+workload on a ROSE network would also have a management interface
+(e.g. Ethernet, Wi-Fi, etc.). So, there's no practical need for
+control packets that are a few dozen bytes large to exist in a ROSE
+network; that can be left to more mature network stacks, even if
+they have higher latency.
+
+And this size field can directly take up 2 bits of the command byte
+in the header; I'm confident that ROSE doesn't need 256 different
+commands for the network - 64 is probably more than enough.
+
+When booting up the network, the devices will send out a packet to
+negotiate how big the packets are - all of the devices must share
+the same packet size. This choice arose from a workload point of
+view: for a given workload, the packet sizes will be almost uniform
+across the entire job.
+
+To take this further, I also have in mind letting devices declare
+the sizes of their packet buffer slots. With about 103KB of BRAM on
+the Tang Primer 20K and 4 connected devices, I want the devices to
+each have a 16KB TX queue. That means at least 16 slots, and up to
+128 slots per queue. In a well-designed system, a device should
+only receive one size category of packets (e.g. in a trading system,
+specific devices handle order book digestion and expect larger
+packets, while the devices that expect orders may only expect
+smaller ones).
+Of course, this idea can be rethought in the future when THORN
+actually generates different loads targeting different systems.
+
+The above design dramatically shrinks the ROSE header: 1 byte for
+command + size, 1 byte for the destination address, 1 byte for the
+source address, 4 bytes for a possible sequence number, and one byte
+at the end for a CRC-8 checksum.
+
+After some more thought, the sequence number can be moved into
+"feature extensions" by utilizing the 64 commands I have. Even the
+CRC byte could be encoded the same way. That brings the total
+overhead of a standard ROSE packet to 4 bytes: 3 header bytes plus
+the CRC-8 byte.
+
+### Designing the router logic
+The fabric needs some kind of internal buffer to handle a)
+asymmetric interface speeds and b) multiple devices trying to send
+packets to the same one. So, there has to be some internal packet
+queue.
+
+I want the internal routing logic to run at 135 MHz, with the SPI
+interfaces capped at 50 MHz, and have the logic take in a byte at a
+time instead of a bit. This means the internal logic runs much
+faster than the interfaces, which enables the fabric to handle
+simultaneous inputs from different devices.
+
+The first design decision came right off the bat - multi-headed
+queues. I plan to use 3 or 4 heads for the TX queues, so that when
+receiving from the higher-speed internal logic, the interfaces can
+handle multiple input sources.
+
+The first idea was to have a shared pool of memory for the queue,
+which would handle congestion beautifully, since it means that all
+TX queues are dynamically allocated. However, it would be a
+disaster for latency, since a FIFO queue doesn't exactly mean
+fairness.
+
+Then I thought of the second idea: separate TX queues for each
+interface. Although this means less incast and burst resilience, it
+performs wonderfully in fairness and latency, combined with
+multi-headed queues and routing logic that is faster than the
+interfaces.
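As a sanity check of this idea, here is a minimal Python model of one interface's TX queue with multiple write heads. It is a toy sketch, not the RTL: the class name `TxQueue`, the head count, and the slot depth are illustrative assumptions.

```python
from collections import deque

class TxQueue:
    """Toy model of one interface's TX queue with multiple write heads:
    the faster routing logic may commit packets from several RX sources
    into the same TX queue within one interface service interval."""

    def __init__(self, num_heads: int = 4, depth: int = 16):
        self.num_heads = num_heads
        self.depth = depth
        self.slots: deque = deque()

    def enqueue(self, packets):
        """Accept up to num_heads packets per routing cycle, dropping
        the rest (no congestion control yet - let the packets drop)."""
        accepted = []
        for pkt in packets[: self.num_heads]:
            if len(self.slots) < self.depth:
                self.slots.append(pkt)
                accepted.append(pkt)
        return accepted

    def drain(self):
        """The interface reads out one packet per service interval."""
        return self.slots.popleft() if self.slots else None

# Four RX sources target the same interface in one routing cycle:
q = TxQueue(num_heads=4)
accepted = q.enqueue([("src0", b"..."), ("src1", b"..."),
                      ("src2", b"..."), ("src3", b"...")])
```

With the routing logic handling a byte per 135 MHz cycle against SPI delivering one bit per 50 MHz cycle, there are roughly 21 routing cycles per interface byte - plenty of headroom to fill several heads per service interval.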
+To compensate for the loss of the central shared pool of memory,
+each interface should also get its own RX buffer, big enough to
+hold one packet while it gets collected by the routing logic.
+
+Matching the RX buffer, the interface can directly tell the routing
+logic where each collected byte should be sent. This means a
+dramatic decrease in the complexity of the routing logic (it doesn't
+have to buffer or parse headers), at the cost of an increase in the
+interfaces' logic.
+
+### Leftover question
+What kind of congestion control can be implemented on top of this
+design that mainly improves incast and burst resilience?
+
+I already have in mind using credit-based flow control or a
+DCTCP-style ECN scheme. But this is left for the future me to
+decide.
+
+For now, focus on making the logic happen and let the packets drop.
+
+## Reflections
+> Even ideas should be reflected on and refined.
+
+1. SPI is very limiting. 50 MHz (i.e. 50 Mbps of wire speed) is
+   slow compared to gigabit Ethernet. Hopefully the reduction of
+   latency in the network and transport layers can make up for this.
+2. I gave a lot of thought to how ROSE would scale to
+   industrial-grade FPGAs, starting with replacing SPI with SERDES
+   and increasing the internal bus width from 8 bits (the bus
+   through which packets are collected from the interfaces by the
+   routing logic) to 256 bits. This would allow stable operation at
+   the scale of 30 Gbps SERDES connections and 400 MHz FPGA clocks,
+   which would make it comparable to modern-day Ethernet
+   connections.
+3. There were **a lot** of trade-offs when considering asymmetric
+   interface clocks. I'd like direct streaming from one interface
+   to another to be possible, but clearly it won't be the case.
+   This means 1 copy of packet data within the fabric itself,
+   effectively doubling the latency through the fabric.
+   But this "trade-off" must be made, unless there's a magical way
+   of syncing the interfaces (and that would mean a direct
+   connection, not a network).
+
+## Final thoughts
+I've been thinking a lot about the initiative of ROSE: why build it
+on such limiting hardware? And I arrived at this conclusion:
+
+> Design systems in thought, in code, in the garage.
+
+The hardware for ROSE will stay on my desk, but the things I learned
+by building it will stay in my mind and can potentially be put to
+use in the industry.
+
+Ideas should be scalable up and down. If something works with
+cheap, off-the-shelf FPGAs, I'd expect it to work on
+industrial-grade ones; if some idea works in the industry, it should
+also be applicable (not necessarily practical) on consumer-grade
+hardware.
+
+I consider myself a scientist: I create ideas, and I'm not limited
+by hardware or software stacks.
+
+## Next goals
+Implement the logic in sims. I've already started, but it's time to
+actually get some results.
diff --git a/fabric/src/mem_hub.sv b/fabric/src/mem_hub.sv
new file mode 100644
index 0000000..45c6357
--- /dev/null
+++ b/fabric/src/mem_hub.sv
@@ -0,0 +1,50 @@
+module mem_hub (input logic rst,
+                input logic sys_clk,
+                input logic [3:0] connected_devices, // manually configured
+                input logic [3:0][7:0] rx_cmd, // for routing-related commands
+                input logic [3:0] rx_cmd_valid,
+                input logic [3:0][7:0] rx_byte,
+                input logic [3:0] rx_valid,
+                input logic [3:0][1:0] rx2tx_dest, // rx byte's destination
+                input logic [3:0] tx_read, // if tx_byte was read
+                output logic [3:0] rx_read, // if rx_byte was read
+                output logic [3:0][1:0] tx_src, // tell the tx where the stream is coming from
+                output logic [3:0][7:0] tx_byte,
+                output logic [3:0] tx_valid,
+                output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes
+  timeunit 1ns;
+  timeprecision 1ps;
+
+  // TBD: pre-agree on packet size
+
+  // [index][rx_src]
+  logic [3:0][1:0] service_queue;
+  logic [3:0] in_queue;
+
+  // [rx_src][tx_dest], might not be useful; 4 RX sources, each
+  // holding a 2-bit TX destination
+  logic [3:0][1:0] rx2tx_map;
+
+  always_ff @ (posedge sys_clk) begin
+    if (rst) begin
+      rx_read <= '0;
+      tx_src <= '0;
+      tx_byte <= '0;
+      tx_valid <= '0;
+      packet_size <= '0;
+      service_queue <= '0;
+      in_queue <= '0;
+      rx2tx_map <= '0;
+    end else begin
+      if (in_queue == 4'd0) begin // no one is in the queue yet
+        if (rx_valid != 4'd0) begin // some RX byte is pending
+          for (int i = 0; i < 4; i++) begin
+            // TODO: write the logic for enqueuing
+          end
+        end
+      end else begin
+        // TODO: service the queue
+      end
+    end
+  end
+endmodule // mem_hub
diff --git a/plan.md b/plan.md
new file mode 100644
index 0000000..e9eb07e
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,94 @@
+# The Plan/Roadmap for ROSE
+> Plans turn fear into focus, risk into reach, and steps into a path.
+
+This plan has been modified over the course of ROSE's development.
+And that was also part of the plan itself: you plan at every step.
+See the end for the changes made to the plan.
+
+## The roadmap
+This is a rough summary of what I did and what I plan to do.
+
+### [DONE] Learning RTL and HDL and getting familiar with SPI
+Implement a functional SPI slave on the FPGA. Add small logic to
+manipulate the data. Learn about cross-clock-domain design and
+implementation.
+
+### [TODO] Implement the routing logic along with the interfaces
+This would be the core part of implementing ROSE on the fabric side.
+This is a bare-minimum implementation disregarding any congestion
+control or inter-fabric routing.
+
+### [TODO] Test on an RPi-FPGA setup
+Getting the code to run in sims is one thing, getting it to run on
+actual hardware is another. This entire step will be to ship the
+code onto my setup and deal with any kind of synthesis and
+place-and-route problems that sims won't reveal.
+
+### [TODO] Implement logging to an external device via UART
+This would lay the foundations for THORN and PETAL and would also
+come in handy when analyzing congestion and other anomalies.
+### [TODO] Test on an RPi-FPGA-RPi setup
+This is where THORN would branch off from ROSE. ROSE should keep
+some minimal unit-test testbenches, and have fully functional test
+suites and toolchains be implemented in THORN.
+
+### [TODO] RPi's ROSE buffer implementation
+`mmap` memory for the SPI drivers on the RPis to *simulate*
+zero-copy on the RPis.
+
+### [TODO] Modify the SPI kernel drivers for explicit DMA control
+Allow ROSE's DMA to be implemented in the drivers.
+
+### [TODO] Abstract ROSE into APIs or kernel devices
+Note: This may be implemented as development of THORN goes into
+action, or be facilitated by it.
+
+### [TODO] Implement congestion control
+When the logic for the fabric is mature enough, it should be
+upgraded.
+
+### [TODO] Implement mesh networks allowing inter-fabric routing
+ROSE shouldn't be limited to only 1 fabric.
+
+## Changes to the plan
+The plan is always changing, but it's important to remember what I
+learned from every change.
+
+### Ditching dynamic routing
+In a datacenter or HFT setup, the set of connected devices is rarely
+expected to change. Hardcoded routing paths are perfectly
+acceptable and keep with the deterministic nature of ROSE.
+
+#### The lesson learned
+Figure out the exact target of ROSE - it's not meant for generic
+networks, so shave off any redundancy that it doesn't need.
+
+### Not reversing the input when testing out the SPI interface
+A few things to note:
+
+1. Sending back the bytes incremented by 1 is sufficient to prove a
+   stable connection.
+2. Reversing the input would require a double-ended queue,
+   increasing the complexity of the logic with little benefit to
+   later steps.
+
+So, I've decided to ditch this idea.
+
+#### The lesson learned
+Plan with the next step in mind, take actions with the next step in
+mind. Know what is enough and what is too far.
+### Postponing hardware deployment until later
+Originally, I planned to deploy the logic and test with real
+hardware as soon as I had a working SPI module. But that's not
+really practical - I'd be fixing the synthesis with every step
+thereafter. Better to finalize the design in sims first, and then
+solve the FPGA-specific problems as an entire step.
+
+I'd rather scratch my head over a bunch of problems at once than
+scratch my head every time I push an update to the logic.
+
+#### The lesson learned
+Weigh testing against the cost of time and efficiency. If testing
+hinders development, then it should be separated from the
+development cycle.
diff --git a/protocol.md b/protocol.md
new file mode 100644
index 0000000..2bcaf9a
--- /dev/null
+++ b/protocol.md
@@ -0,0 +1,31 @@
+# The Specifications for ROSE
+Extensions to the protocol may change the specifications; see the
+devlogs for specific decisions on changes.
+
+## Packet specifications
+### Header
+
+#### (1 byte) Command + packet size
+- Packet sizes are chosen out of 4 predetermined sizes, so only 2
+  bits of this byte are needed to represent them.
+- Commands are 6 bits with 64 possibilities, see the **Commands**
+  section for details.
+
+#### (1 byte) Destination address
+- This can refer to any end-device or fabric within the network.
+
+#### (1 byte) Source address
+- This can refer to any end-device or fabric within the network.
+
+### Payload
+Via commands, leading or trailing bytes in the payload can also be
+repurposed as timestamps or other feature extensions.
+
+### (1 byte) CRC-8
+To verify packet integrity.
+
+## Commands
+TBD.
+
+### Feature Extensions
+#### [CMD: TBD] Include timestamp
+#### [CMD: TBD] Include sequence number
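To make the 4-byte overhead concrete, here is a hedged Python sketch of packing and checking a ROSE packet. The bit layout (command in the upper 6 bits, size code in the lower 2), the helper names, and the CRC-8 polynomial 0x07 are assumptions - the spec above leaves all of them open:

```python
# Hypothetical encoding of the 4-byte ROSE overhead: 1 byte command+size,
# 1 byte destination, 1 byte source, and a trailing CRC-8.
SIZES = {0: 128, 1: 256, 2: 512, 3: 1024}  # the 4 predetermined packet sizes

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise MSB-first CRC-8; polynomial x^8 + x^2 + x + 1 is an
    assumption, since the spec only says CRC-8."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def pack_header(cmd: int, size_code: int, dst: int, src: int) -> bytes:
    assert 0 <= cmd < 64 and 0 <= size_code < 4
    return bytes([(cmd << 2) | size_code, dst, src])

def unpack_header(hdr: bytes):
    cmd_size, dst, src = hdr[0], hdr[1], hdr[2]
    return cmd_size >> 2, cmd_size & 0b11, dst, src

header = pack_header(cmd=0x15, size_code=1, dst=2, src=0)
payload = bytes(SIZES[1] - 4)        # 256B packet minus 4B overhead
frame = header + payload
frame += bytes([crc8(frame)])        # append CRC over header + payload
```

Appending the CRC this way means a receiver can simply check that `crc8` over the whole received frame comes out to zero.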