began work on the central routing logic, updated some documentation
@@ -43,7 +43,8 @@ See `protocol.md` for details.
 ROSE was designed to embrace newer possibilities as development continues.
 
 ## The planning
 
-See the plan in `plan.md`.
+See the plan in `plan.md`. This file also contains short summaries of
+what I did at each step.
 
 Most of ROSE's behaviors and features have been planned *before* the
 first source file was even created. A good plan serves both as a good
152 devlog/2025-05-13-Fabric-logic.md (Normal file)
@@ -0,0 +1,152 @@
# Fabric Logic

Date: 2025-05-13

## Goals and expectations

Not much in the way of expectations for this devlog. I fell ill,
probably the spring flu. But hopefully, I can write down what I've
been planning.

I've decided to skip implementing sending back the reversed string
that the fabric received - there's not much use for it in
asynchronous communication.

## Reworked design

This is what I've been doing for the past few days - reworking some
of the design, or rather, solidifying what I have in my plan into
concrete designs.

### Rethinking the header

Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory, to
prevent cross-block BRAM access and to simplify the logic, I figured
I only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even allow ultra-small packet sizes like 16 bytes
when the network needs them.

That decision was made with the existence of management ports in
mind, i.e. I expect that devices involved in controlling the workload
on a ROSE network will also have a management interface (e.g.
Ethernet, Wi-Fi). So there's no practical need for control packets a
few dozen bytes large to exist on a ROSE network; those can be left
to more mature network stacks, even if they have higher latency.

This directly frees up 2 bits on the command byte of the header. I'm
confident that ROSE doesn't need 256 different commands for the
network; 64 is probably more than enough.

When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view:
for a given workload, the packet sizes will be almost consistent
across the entire job.
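
The negotiation step above can be sketched at the protocol level. A
minimal sketch, assuming 2-bit size codes and a largest-common-size
policy (both my assumptions; the actual rule is still undecided):

```python
# Hypothetical boot-time packet-size negotiation.
# Size codes: 0 -> 128B, 1 -> 256B, 2 -> 512B, 3 -> 1024B.
SIZE_CODES = {0: 128, 1: 256, 2: 512, 3: 1024}

def negotiate_packet_size(supported_codes_per_device):
    """Pick the largest size code that every device supports.

    `supported_codes_per_device`: one set of size codes per device.
    """
    common = set(SIZE_CODES)  # start from all four codes
    for codes in supported_codes_per_device:
        common &= codes
    if not common:
        raise ValueError("devices share no common packet size")
    return max(common)  # prefer the largest shared size

# Three devices: the only size all of them support is 256B (code 1).
code = negotiate_packet_size([{0, 1, 2}, {1, 2, 3}, {0, 1}])
assert SIZE_CODES[code] == 256
```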

To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want each device to have
16KB of TX queue. That means at least 16 slots (for 1024B packets)
and up to 128 slots (for 128B packets) per queue. In a well-designed
system, a device should only receive one size category of packets
(e.g. in a trading system, specific devices handle order book
digestion and expect larger packets, while the devices that take
orders may only expect smaller packets). Of course, this idea can be
rethought in the future when THORN actually generates different loads
targeting different systems.
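
The slot counts follow directly from the 16KB-per-device budget; a
quick check with the numbers from the text:

```python
# TX queue slot counts for a 16KB-per-device budget.
TX_QUEUE_BYTES = 16 * 1024

slots = {size: TX_QUEUE_BYTES // size for size in (128, 256, 512, 1024)}

# 1024B packets give the minimum of 16 slots per queue,
# 128B packets the maximum of 128.
assert slots[1024] == 16
assert slots[128] == 128
```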

The above design would dramatically shrink the ROSE header: 1 byte
for command + size, 1 byte for the destination address, 1 byte for
the source address, 4 bytes for a possible sequence number, and one
byte at the end for a CRC-8 checksum.

After some more thought, the sequence number can be moved into
"feature extensions" by utilizing the 64 commands I have. Even the
CRC byte can be encoded the same way. That brings the total overhead
of a standard ROSE packet to 4 bytes.
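
For illustration, the 4-byte layout could be packed like this (the
exact bit placement of the size code and command is my assumption -
only the 2-bit/6-bit split is settled):

```python
def pack_header(command, size_code, dst, src, crc=0):
    """Pack a 4-byte ROSE header: size (2 bits) + command (6 bits),
    destination, source, CRC-8. Bit placement is illustrative."""
    assert 0 <= command < 64 and 0 <= size_code < 4
    return bytes([(size_code << 6) | command, dst, src, crc])

def unpack_header(header):
    byte0, dst, src, crc = header
    return {"size_code": byte0 >> 6, "command": byte0 & 0x3F,
            "dst": dst, "src": src, "crc": crc}

h = pack_header(command=5, size_code=2, dst=1, src=3)
assert unpack_header(h) == {"size_code": 2, "command": 5,
                            "dst": 1, "src": 3, "crc": 0}
```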

### Designing the router logic

The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to
the same one. So there has to be some internal packet queue.

I want the internal routing logic to run at 135MHz, with the SPI
interfaces capped at 50MHz, and have the logic take in a byte at a
time instead of a bit. This means the internal logic runs much faster
than the interfaces, which enables the fabric to handle simultaneous
inputs from different devices.

The first piece of the design came right off the bat - multi-headed
queues. I plan to use 3/4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.

The first idea was to have a shared pool of memory for the queue,
which would handle congestion beautifully, since it means all TX
queues are dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly mean fairness.

Then I thought of the second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
performs wonderfully on fairness and latency when combined with
multi-headed queues and routing logic that is faster than the
interfaces.

To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, big enough to hold one
packet while it gets collected by the routing logic.

Matching the RX buffer, the interface can directly tell the routing
logic where the collected byte should be sent. This means a dramatic
decrease in the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of an increase in the
interfaces' logic.
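
A behavioral sketch of that division of labor, with Python standing
in for the eventual RTL (class and method names are mine):

```python
from collections import deque

class FabricHub:
    """The interface parses the header and hands the hub a
    (destination, byte) pair, so the hub only forwards."""
    def __init__(self, n_ports=4):
        # One TX queue per interface; no shared buffering in the hub.
        self.tx_queues = [deque() for _ in range(n_ports)]

    def collect(self, src_port, dest_port, byte):
        # The hub never buffers or parses headers itself.
        self.tx_queues[dest_port].append((src_port, byte))

hub = FabricHub()
hub.collect(src_port=0, dest_port=2, byte=0xAB)
hub.collect(src_port=1, dest_port=2, byte=0xCD)
assert list(hub.tx_queues[2]) == [(0, 0xAB), (1, 0xCD)]
```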

### Leftover question

What kind of congestion control can be implemented on top of this
design, one that mainly hardens incast and burst resiliency?

I already have in mind using credit-based flow control or a
DCTCP-style ECN. But that is still left for future me to decide.

For now, focus on making the logic happen and let the packets drop.
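
If credit-based flow control is the route eventually taken, the core
loop is small. A minimal sketch (entirely hypothetical at this
stage):

```python
class CreditSender:
    """Credit-based flow control: a sender may only transmit while it
    holds credits; the receiver returns one credit per freed slot."""
    def __init__(self, initial_credits):
        self.credits = initial_credits

    def try_send(self, packet, wire):
        if self.credits == 0:
            return False  # no credit: hold the packet, don't drop it
        self.credits -= 1
        wire.append(packet)
        return True

    def on_credit_return(self, n=1):
        self.credits += n

wire = []
sender = CreditSender(initial_credits=2)
assert sender.try_send("p1", wire) and sender.try_send("p2", wire)
assert not sender.try_send("p3", wire)  # out of credits: backpressure
sender.on_credit_return()               # receiver freed a slot
assert sender.try_send("p3", wire)
assert wire == ["p1", "p2", "p3"]
```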

## Reflections

> Even ideas should be reflected and refined.

1. SPI is very limiting. 50MHz (or 50Mbps of wire speed) is slow
   compared to gigabit Ethernet. Hopefully the reduction of latency
   in the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to
   industrial-grade FPGAs, starting with replacing SPI with SERDES
   and widening the internal bus (the bus through which packets are
   collected from the interfaces by the routing logic) from 8 bits to
   256 bits. This would allow stable operation with 30Gbps SERDES
   connections and 400MHz FPGA clocks, which would make it comparable
   to modern-day Ethernet connections.
3. There were **a lot** of trade-offs when considering asymmetric
   interface clocks. I'd like direct streaming from one interface to
   another to be possible, but clearly that won't be the case. This
   means one copy of packet data within the fabric itself,
   effectively doubling the latency through the fabric. But this
   "trade-off" must be made, unless there's a magical way of syncing
   the interfaces (and that would mean a direct connection, not a
   network).
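
The scaling estimate in point 2 checks out with quick arithmetic (raw
bus bandwidth only, ignoring protocol overhead):

```python
# Internal bus bandwidth = bus width * clock frequency.
def bus_gbps(width_bits, clock_mhz):
    return width_bits * clock_mhz * 1e6 / 1e9

current = bus_gbps(8, 135)    # today: 8-bit bus at 135MHz
scaled  = bus_gbps(256, 400)  # industrial: 256-bit bus at 400MHz

assert round(current, 2) == 1.08   # ~1Gbps internally
assert round(scaled, 1) == 102.4   # enough to keep 30Gbps SERDES fed
```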

## Final thoughts

I've been thinking a lot about the motivation behind ROSE: why build
it on such limiting hardware? And I came to this conclusion:

> Design systems in thought, in code, in the garage.

The hardware for ROSE will stay on my desk, but the things I learned
by building it will stay in my mind and can potentially be put to use
in industry.

Ideas should scale up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if an idea works in industry, it should also be applicable (if not
necessarily practical) on consumer-grade hardware.

I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.

## Next goals

Implement the logic in sims. I've already started, but it's time to
actually get some results.
50 fabric/src/mem_hub.sv (Normal file)
@@ -0,0 +1,50 @@
module mem_hub (input logic rst,
                input logic sys_clk,
                input logic [3:0] connected_devices, // manually configured
                input logic [3:0][7:0] rx_cmd, // for routing-related commands
                input logic [3:0] rx_cmd_valid,
                input logic [3:0][7:0] rx_byte,
                input logic [3:0] rx_valid,
                input logic [3:0][1:0] rx2tx_dest, // rx byte's destination
                input logic [3:0] tx_read, // if tx_byte was read
                output logic [3:0] rx_read, // if rx_byte was read
                output logic [3:0][1:0] tx_src, // tell the tx where the stream is coming from
                output logic [3:0][7:0] tx_byte,
                output logic [3:0] tx_valid,
                output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes
   timeunit 1ns;
   timeprecision 1ps;

   // TBD: pre-agree on packet size

   // [index][rx_src]
   logic [3:0][1:0] service_queue;
   logic [3:0]      in_queue;

   // [rx_src][tx_dest], might not be useful
   logic [1:0][1:0] rx2tx_map;

   always_ff @ (posedge sys_clk) begin
      if (rst) begin
         rx_read <= '0;
         tx_src <= '0;
         tx_valid <= '0;
         packet_size <= '0;
         service_queue <= '0;
         in_queue <= '0;
         rx2tx_map <= '0;
      end else begin // keep routing activity out of the reset cycle
         if (in_queue == 4'd0) begin // no one is in the queue yet
            if (tx_valid != 4'd0) begin
               for (int i = 0; i < 4; i++) begin
                  // TODO: write the logic for enqueuing
               end
            end
         end else begin
         end
      end
   end
endmodule // mem_hub
94 plan.md (Normal file)
@@ -0,0 +1,94 @@
# The Plan/Roadmap for ROSE

> Plans turn fear into focus, risk into reach, and steps into a path.

This plan has been modified over the course of ROSE's development.
That was also part of the plan itself: you plan at every step. See
the end for the changes made to the plan.

## The roadmap

This is a rough summary of what I did and what I plan to do.

### [DONE] Learning RTL and HDL and getting familiar with SPI

Implement a functional SPI slave on the FPGA. Add small logic to
manipulate the data. Learn about cross-clock-domain design and
implementation.

### [TODO] Implement the routing logic along with the interfaces

This is the core part of implementing ROSE on the fabric side: a
bare-minimum implementation disregarding any congestion control or
inter-fabric routing.

### [TODO] Test on an RPi-FPGA setup

Getting the code to run in sims is one thing; getting it to run on
actual hardware is another. This entire step is to ship the code onto
my setup and deal with any synthesis and place-and-route problems
that sims won't reveal.

### [TODO] Implement logging to an external device via UART

This would lay the foundations for THORN and PETAL, and would also
come in handy when analyzing congestion and other anomalies.

### [TODO] Test on an RPi-FPGA-RPi setup

This is where THORN would branch off from ROSE. ROSE should keep some
minimal unit-test testbenches, with fully functional test suites and
toolchains implemented in THORN.

### [TODO] RPi's ROSE buffer implementation

`mmap` memory for the SPI drivers on the RPis to *simulate* zero-copy
on the RPis.

### [TODO] Modify the SPI kernel drivers for explicit DMA control

Allow ROSE's DMA to be implemented in the drivers.

### [TODO] Abstract ROSE into APIs or kernel devices

Note: this may be implemented as development of THORN gets going, or
be facilitated by it.

### [TODO] Implement congestion control

When the logic for the fabric is mature enough, it should be upgraded
with congestion control.

### [TODO] Implement mesh networks allowing inter-fabric routing

ROSE shouldn't be limited to a single fabric.

## Changes to the plan

The plan is always changing, but it's important to remember what I
learned from every change.

### Ditching dynamic routing

In a datacenter or HFT setup, the connected devices are rarely
expected to change. Hardcoded routing paths are perfectly acceptable
and fit the deterministic nature of ROSE.

#### The lesson learned

Figure out the exact target of ROSE - it's not meant for generic
networks, so shave off any redundancy it doesn't need.

### Not reversing the input when testing out the SPI interface

A few things to note:

1. Sending the bytes back incremented by 1 is sufficient to prove a
   stable connection.
2. Reversing the input would require a double-ended queue, increasing
   the complexity of the logic with little benefit to later steps.

So, I've decided to ditch this idea.

#### The lesson learned

Plan with the next step in mind; take action with the next step in
mind. Know what is enough and what is too far.

### Postponing deployment onto hardware until later

Originally, I planned to deploy the logic and test with real hardware
as soon as I had a working SPI module. But that's not really
workable: I'd be fixing the synthesis with every step thereafter.
Better to finalize the design in sims first, and then solve the
FPGA-specific problems as one entire step.

I'd rather scratch my head over a bunch of problems at once than
scratch my head every time I push an update to the logic.

#### The lesson learned

Weigh testing against the cost in time and efficiency. If testing
hinders development, it should be separated from the development
cycle.
31 protocol.md (Normal file)
@@ -0,0 +1,31 @@
# The Specifications for ROSE

Extensions to the protocol may change the specifications; see the
devlogs for the specific decisions behind changes.

## Packet specifications

### Header

#### (1 byte) Command + packet size

- Packet sizes are chosen from 4 predetermined sizes, so only 2 bits
  of this byte are needed to represent them.
- Commands are 6 bits with 64 possibilities; see the **Commands**
  section for details.

#### (1 byte) Destination address

- This can refer to any end device or fabric within the network.

#### (1 byte) Source address

- This can refer to any end device or fabric within the network.

### Payload

Via commands, leading or trailing bytes of the payload can be
repurposed for timestamps or other feature extensions.

### (1 byte) CRC-8

To verify the integrity of the packet.
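
The CRC-8 polynomial isn't pinned down yet. As an illustration, a
bitwise CRC-8 using the common 0x07 polynomial (the polynomial, zero
init, and append-and-check convention are all assumptions, not part
of this spec):

```python
def crc8(data, poly=0x07, init=0x00):
    """Bitwise CRC-8 (polynomial 0x07 assumed; the spec is TBD)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 \
                  else (crc << 1) & 0xFF
    return crc

# A packet that carries crc8(rest) as its last byte verifies to zero
# when the CRC is recomputed over the whole packet.
packet = bytes([0x85, 0x01, 0x03])      # cmd+size, dst, src
full = packet + bytes([crc8(packet)])
assert crc8(full) == 0
```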

## Commands

TBD.

### Feature Extensions

#### [CMD: TBD] Include timestamp

#### [CMD: TBD] Include sequence number