work in progress, major overhaul for design, see devlogs for details. Also added the first version of the style guide

2025-05-29 00:18:06 -04:00
parent dac3140829
commit f61de84b4a
6 changed files with 384 additions and 90 deletions
--- a/devlog/2025-05-21-Rethink-routing.md
+++ b/devlog/2025-05-21-Rethink-routing.md
@@ -1,14 +1,14 @@
 # Rethinking the Routing Memory Pool
 Date: 2025-05-21
-## Goals and Expectations
+## Goals and expectations
 To finish the RX and TX queues.
 ## Results
 Nope.  I'm halfway through the TX queue and I'm gonna rework the
 entire thing.
-## Thought Train
+## Thought train
 Separating the TX queue to be per-interface is amazing.  But making it
 a multi-headed queue is a disaster.  In this case, it doesn't simplify
 the logic, while taking away one of the benefits of a shared memory
--- a/devlog/2025-05-28-Queuing.md
+++ b/devlog/2025-05-28-Queuing.md
@@ -0,0 +1,81 @@
 # Redesigned Internal Memory Pool
 Date: 2025-05-28
 ## Goals and expectations
 Lay the foundations in the hub for the new memory pool design.
 ## Thought train
 We're putting aside the support for AI clusters and focusing on the
 HFT side of things for the time being.  They have very different
 network workloads, and one system that combines them both is not
 really a good option.
 This also means we can remove the plans for congestion control.  In
 HFT infrastructure that's mostly done at the application layer, not
 the network layer.  This will speed up development and let me focus
 on getting the most essential parts to function as intended.
 A single block of BRAM typically only allows two simultaneous
 operations to non-conflicting addresses.  This means that servicing
 multiple interfaces at the same time is impractical.
 So, I decided to implement a round-robin approach for both reads and
 writes (feels like being back to the same spot a few weeks ago, but
 there are some differences).
 The approach is quite straightforward: every cycle, the hub selects
 one of the interfaces to service, and while servicing it, checks for
 both RX-side and TX-side transmissions.
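The rotation described above can be modeled in a few lines of Python.  This is only a behavioral sketch, not the hub's RTL: the `Interface` class, the `run_hub` function, and the shape of the log tuples are all hypothetical names made up for illustration.  It shows one interface being picked per cycle and both its RX and TX state being looked at during that visit.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    """Hypothetical stand-in for one port attached to the hub."""
    rx_pending: list = field(default_factory=list)  # bytes waiting to enter the pool
    tx_ready: bool = False                          # port can accept a byte this cycle


def run_hub(interfaces, cycles):
    """Service exactly one interface per cycle, in fixed rotation."""
    log = []
    curr_service = 0
    for cycle in range(cycles):
        port = interfaces[curr_service]
        # RX side: take at most one byte from the interface being serviced.
        rx_byte = port.rx_pending.pop(0) if port.rx_pending else None
        # TX side: checked during the same visit.
        log.append((cycle, curr_service, rx_byte, port.tx_ready))
        curr_service = (curr_service + 1) % len(interfaces)
    return log


ports = [Interface(rx_pending=[0xAA, 0xBB]), Interface(),
         Interface(tx_ready=True), Interface()]
schedule = run_hub(ports, cycles=6)
```

With four interfaces, each one is visited every fourth cycle, so the second byte of interface 0 is only picked up on cycle 4.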
 Note that it's up to the interfaces to keep track of when a packet
 has been completely received, but it's left to the hub to collect
 the free slots.
 The interfaces will also have their own packet address queues to keep
 track of their outgoing packets.  Furthermore, this queue can be
 limited to a fraction of the packet queue's size to allow control over
 the maximum number of packets per packet buffer.
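A rough Python model of such a capped per-interface address queue, assuming the sizes from `params.sv` in this commit.  The `TxAddrQueue` class and its method names are hypothetical; the real interface logic is not written yet.

```python
from collections import deque

QUEUE_SIZE = 1024            # slots in the shared pool (value from params.sv)
INTERFACE_QUEUE_SIZE = 512   # per-interface cap (value from params.sv)


class TxAddrQueue:
    """Hypothetical model of one interface's outgoing packet address queue."""

    def __init__(self, capacity=INTERFACE_QUEUE_SIZE):
        self.capacity = capacity
        self.slots = deque()

    @property
    def full(self):
        return len(self.slots) >= self.capacity

    def push(self, slot_addr):
        """Queue a slot for transmission; refuse it once the cap is reached."""
        if self.full:
            return False
        self.slots.append(slot_addr)
        return True

    def pop(self):
        """Return the next slot to transmit, or None when idle."""
        return self.slots.popleft() if self.slots else None


queue = TxAddrQueue()
accepted = sum(queue.push(addr) for addr in range(QUEUE_SIZE))
```

Even if every slot in the pool is offered to one interface, it only accepts half of them, so a single busy port cannot hold the whole pool.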
 This centralized, dynamic memory allocation strategy should handle
 bursts well and ensure that lightweight flows still get serviced
 during a burst event, which is good for HFT-like workloads.
 ## Results
 A very good evening of coding.  I finished the following:
 1. Reworked most of the hub's logic, implemented the RX side of things
   and left some TODO notes.
 2. Implemented the `free_queue` for allocating free queue slots for
    incoming packets and enqueuing slots freed by the TX side logic.
 3. Implemented the `memory_pool` for the packet memory.
 4. Wrote the first draft of the FLORA/ROSE coding style guide.
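The idea behind `free_queue` from item 2 can be sketched as a ring buffer of slot addresses.  This Python model is an illustration only: the class and method names are hypothetical, and it assumes the queue starts pre-loaded with every slot, which the RTL in this commit still lists as a TODO.

```python
class FreeQueue:
    """Hypothetical behavioral model of a free-slot ring buffer."""

    def __init__(self, size):
        self.slots = list(range(size))  # assumption: pre-loaded with every slot
        self.head = 0                   # next slot to hand out
        self.tail = 0                   # next position to store a freed slot
        self.count = size               # number of free slots available

    @property
    def empty(self):
        return self.count == 0

    def allocate(self):
        """Hand out a free slot for an incoming packet, or None if empty."""
        if self.empty:
            return None
        addr = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return addr

    def release(self, addr):
        """Return a slot freed by the TX side."""
        if self.count == len(self.slots):
            raise ValueError("free queue is already full")
        self.slots[self.tail] = addr
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1


fq = FreeQueue(4)
allocated = [fq.allocate() for _ in range(4)]
```

Once all four slots are handed out the queue reports empty, and a released slot becomes the next one allocated.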
 ## Reflections
 1. Focus.  Focus is the key to getting what you really want.
 2. Modularize.  Modularization will keep the work limited to more
   manageable chunks, which is much more important when developing
   alone.
 3. Write everything down.  Keep track of every thought, by handwritten
    notes, documentation, or even ChatGPT conversation history.  This
    will help when there are a few dozen things to keep in mind every
    day.
 4. Start doing things.  Start writing down thoughts, start discussions
   about future plans, start coding.  Start a momentum, and start
   keeping it alive.
 ## Final thoughts
 FPGAs are great tools.  And I've only begun to scratch the surface of
 them.  Take implementing BRAM-based queues: I have to think about
 how to sync all the components so that everything I need is ready
 exactly when I want it.
 I feel like I'm beginning the transformation from a sequential thinker
 that thinks in steps into a clock-aligned combinational thinker.  I
 think about when each step happens: not in what order, but at what
 time.
 Also, knowing explicitly that `logic` implies ownership helped me
 structure my code better.
 ## Next steps
 Complete the hub, then move on to the interfaces.
--- a/fabric/src/hub.sv
+++ b/fabric/src/hub.sv
@@ -1,96 +1,167 @@
 `include <params.sv>
 // IMPORTANT: interfaces are supposed to keep track of their own packet states
 module hub(
            input logic         rst,
            input logic                                               sys_clk,
-            input logic [31:0]  rx_cmd,       // for routing-related commands
+            input logic                                               rst,
-            input logic [3:0]   rx_cmd_valid,
+            input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0]  rx_pkt_addr,
-            input logic [31:0]  rx_byte,
+            input logic [INTERFACE_CNT - 1:0][7:0]                    rx_byte,
-            input logic [3:0]   rx_valid,
+            input logic [INTERFACE_CNT - 1:0]                         rx_valid,
-            input logic [31:0]  rx2tx_dest,   // rx byte's destination
+            input logic [INTERFACE_CNT - 1:0]                         tx_ready,
-            input logic [3:0]   tx_ready,     // if tx_byte is ready to be read
+            input logic [INTERFACE_CNT - 1:0]                         tx_full,
-            output logic [3:0]  rx_ready,     // if rx_byte is ready to be read
+            input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0]  tx_pkt_addr,
-            output logic [7:0]  tx_src,       // tell the tx where the stream is comming from
+            input logic [INTERFACE_CNT - 1:0]                         rx_new_packet,
-            output logic [31:0] tx_byte,
+            output logic [INTERFACE_CNT - 1:0]                        rx_ready,
-            output logic [3:0]  tx_valid,
+            output logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] tx_queue_addr,
-            output logic [1:0]  packet_size); // 4 states for 4 fixed packet sizes
+            output logic [INTERFACE_CNT - 1:0]                        tx_queue_addr_valid,
            output logic [INTERFACE_CNT - 1:0][7:0]                   tx_byte,
            output logic [INTERFACE_CNT - 1:0]                        tx_valid);
    timeunit 1ns;
    timeprecision 1ps;
-    // TBD: pre-agree on packet size
+    logic [INTERFACE_CNT - 1:0]                  curr_service;
    logic                                        request_new_slot;
    logic [QUEUE_ADDR_LEN - 1:0]                 new_slot_addr;
    logic                                        free_queue_empty;
    logic [QUEUE_ADDR_LEN - 1:0]                 empty_slot_addr;
    logic                                        empty_slot_enqueue;
-    // use the round-robin strat to poll since the routing is much faster
+    free_queue fqueue(.sys_clk(sys_clk),
-    // NOTE: To expand to more connected_devices, use a hierarchical design
+                      .rst(rst),
-    logic [1:0]           curr_service = 0;
+                      .request_new_slot(request_new_slot),
-    logic [1:0]           last_dest = 0;
+                      .empty_slot_addr(empty_slot_addr),
                      .empty_slot_enqueue(empty_slot_enqueue),
                      .new_slot_addr(new_slot_addr),
                      .queue_empty(free_queue_empty));
-    // src dest byte
+    logic [INTERFACE_CNT - 1:0][MEMORY_POOL_ADDR_LEN - 1:0] rx_mem_addr;
-    typedef struct {
+    logic [MEMORY_POOL_ADDR_LEN - 1:0]                 mem_read_addr;
-        logic [1:0] dest;
+    logic [7:0]                                        mem_read_byte;
-        logic [7:0] payload;
+    logic [MEMORY_POOL_ADDR_LEN - 1:0]                 mem_write_addr;
-    } svc_buffer;
+    logic                                              mem_write_enable;
-    svc_buffer service_buffer [3:0];
+    logic [7:0]                                        mem_write_byte;
    svc_buffer curr_buffer;
    assign curr_buffer = service_buffer[curr_service];
    logic [3:0]           in_buffer;
    assign rx_ready = ~in_buffer;
-    always_ff @ (posedge sys_clk) begin
+    memory_pool mpool(.sys_clk(sys_clk),
                      .rst(rst),
                      .read_addr(mem_read_addr),
                      .write_addr(mem_write_addr),
                      .write_byte(mem_write_byte),
                      .write_enable(mem_write_enable),
                      .read_byte(mem_read_byte));
    always_ff @ (posedge sys_clk or posedge rst) begin
        if (rst) begin
-            in_buffer <= '0;
+            tx_queue_addr <= '0;
-            tx_src <= '0;
+            tx_queue_addr_valid <= '0;
            tx_byte <= '0;
            tx_valid <= '0;
            packet_size <= '0;
            curr_service <= '0;
-            last_dest <= '0;
+            rx_ready <= '0;
-            for (int i = 0; i < 4; i++) begin
+            rx_mem_addr <= '0;
-                service_buffer[i] <= '0;
+            mem_read_addr <= '0;
-            end
+            mem_write_addr <= '0;
-        end else begin // if (rst)
+            mem_write_enable <= 0;
-            // Handle RX side logic
+            mem_write_byte <= '0;
-            for (int i = 0; i < 4; i++) begin
+        end else begin
-                if (rx_valid[i]) begin
+            // NOTE: signaled the servicing interface in the last cycle
-                    if (!in_buffer[i]) begin
+            rx_ready[curr_service] <= 0;
-                        service_buffer[i].dest <= get_hop(rx2tx_dest, i[1:0]);
+            rx_ready[curr_service + 1] <= 1;
                        service_buffer[i].payload <= get_byte(rx_byte, i[1:0]);
                        in_buffer[i] <= 1;
                    end
                end
            end
-            // Handle TX side logic
+            // IMPORTANT: interfaces should send the byte no matter what, rx_ready is to prevent sending a new byte
-            if (in_buffer[curr_service] && tx_ready[curr_buffer.dest]) begin
+            if (rx_valid[curr_service]) begin
-                tx_byte[{curr_buffer.dest, 3'b000} +: 8]
+                // IMPORTANT: memory_write_addr is ready on the next cycle
-                    <= curr_buffer.payload;
+                if (rx_new_packet[curr_service]) begin
-                tx_src[{curr_buffer.dest, 1'b0} +: 2]
+                    if (free_queue_empty) begin
-                    <= curr_service;
+                        // TODO: handle the drop logic
-                in_buffer[curr_service] <= 0;
+                    end else begin
-                tx_valid[curr_buffer.dest] <= 1;
+                        request_new_slot <= 1;
                        rx_mem_addr[{curr_service,
                                     MEMORY_POOL_ADDR_SHIFT'd0}
                                    +:MEMORY_POOL_ADDR_LEN
                                    ] <= {new_slot_addr, PACKET_ADDR_LEN'd0};
                        mem_write_addr <= {new_slot_addr, PACKET_ADDR_LEN'd0};
                    end
                end else begin // if (rx_new_packet[curr_service])
                    // NOTE: if memory 
                    mem_write_addr <= mem_write_addr + 1;
                    request_new_slot <= 0;
                end // else: !if(rx_new_packet[curr_service])
                mem_write_byte <= rx_byte[{curr_service, 3'd0}+:8];
                mem_write_enable <= 1;
            end else // if (rx_valid[curr_service])
                mem_write_enable <= 0;
        end
    end
            tx_valid[last_dest] <= 0;
            last_dest <= service_buffer[curr_service].dest;
            curr_service <= curr_service + 1;
        end // else: !if(rst)
    end // always_ff @ (posedge sys_clk)
 endmodule // hub
-function automatic logic [7:0] get_byte(input logic [31:0] byte_arr,
+// IMPORTANT: the current queue_addr is always valid unless queue_empty
-                                        input logic [1:0] idx);
+// REQUIRES: hub does not request a new slot when the queue is empty
-    return byte_arr[{idx, 3'b000} +: 8];
+module free_queue(input logic                         sys_clk,
-endfunction // get_byte
+                  input logic                         rst,
                  input logic                         request_new_slot,
                  input logic [QUEUE_ADDR_LEN - 1:0]  empty_slot_addr,
                  input logic                         empty_slot_enqueue,
                  output logic [QUEUE_ADDR_LEN - 1:0] new_slot_addr,
                  output logic                        queue_empty);
    timeunit 1ns;
    timeprecision 1ps;
-// NOTE: addr 0 is alway mapped to the fabric itself and caught before this
+    logic [QUEUE_ADDR_LEN - 1:0] fqueue [QUEUE_SIZE - 1:0];
-function automatic logic [1:0] get_hop(input logic [31:0] dest_map,
+    logic [QUEUE_ADDR_LEN - 1:0] head;
-                                       input logic [1:0]  idx);
+    logic [QUEUE_ADDR_LEN - 1:0] tail;
-    case (dest_map[{idx, 3'b000} +: 8])
+    shortint                     queue_size;
-        8'b00000001:
+
-            return 2'b00;
+    assign queue_empty = queue_size == 0;
-        8'b00000010:
+
-            return 2'b01;
+    initial begin
-        8'b00000011:
+        // TODO: pre-load the free queue with every slot possible
-            return 2'b10;
+    end
-        8'b00000100:
+
-            return 2'b11;
+    // IMPORTANT: rst must be held high for at least 2 sys_clk cycles
-        default:
+    always_ff @ (posedge sys_clk or posedge rst) begin
-            return 0;
+        if (rst) begin
-    endcase // case (dest_map[{idx, 3'b000} +: 8])
+            head <= '0;
-endfunction // get_hop
+            tail <= QUEUE_ADDR_LEN'd1;
            queue_size <= QUEUE_SIZE;
            new_slot_addr <= '0;
        end else begin
            if (request_new_slot) begin
                head <= head + 1;
                queue_size <= queue_size - 1;
            end
            new_slot_addr <= fqueue[head];
            if (empty_slot_enqueue) begin
                fqueue[tail] <= empty_slot_addr;
                tail <= tail + 1;
                queue_size <= queue_size + 1;
            end
        end
    end
 endmodule // free_queue
 module memory_pool(input logic                              sys_clk,
                   input logic                              rst,
                   input logic [MEMORY_POOL_ADDR_LEN - 1:0] read_addr,
                   input logic [MEMORY_POOL_ADDR_LEN - 1:0] write_addr,
                   input logic [7:0]                        write_byte,
                   input logic                              write_enable,
                   output logic [7:0]                       read_byte);
    timeunit 1ns;
    timeprecision 1ps;
    logic [7:0] mem_pool[MEMORY_POOL_SIZE - 1:0];
    always_ff @ (posedge sys_clk or posedge rst) begin
        if (rst) begin
            read_byte <= 8'hFF;
        end else begin
            if (write_enable)
                mem_pool[write_addr] <= write_byte;
            read_byte <= mem_pool[read_addr];
        end
    end
 endmodule // memory_pool
--- a/fabric/src/params.sv
+++ b/fabric/src/params.sv
@@ -0,0 +1,11 @@
 parameter int PACKET_SIZE = 64;
 parameter int PACKET_ADDR_LEN = 6;
 parameter int QUEUE_SIZE = 1024;
 parameter int QUEUE_ADDR_LEN = 10;
 parameter int MEMORY_POOL_SIZE = QUEUE_SIZE * PACKET_SIZE;
 parameter int MEMORY_POOL_ADDR_LEN = QUEUE_ADDR_LEN + PACKET_ADDR_LEN;
 parameter int MEMORY_POOL_ADDR_SHIFT = 4;
 parameter int INTERFACE_QUEUE_SIZE = 512;
 parameter int INTERFACE_QUEUE_ADDR_LEN = 9;
 parameter int INTERFACE_CNT = 4;
 parameter int CRC_BITS = 8;
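As a quick sanity check on how these sizes relate, the arithmetic can be replayed in Python.  The constants mirror the parameters above; the `pool_addr` and `split_addr` helpers are hypothetical and only model the `{slot, offset}` concatenation that `hub.sv` uses to build a memory pool address.

```python
PACKET_SIZE = 64
PACKET_ADDR_LEN = 6
QUEUE_SIZE = 1024
QUEUE_ADDR_LEN = 10
MEMORY_POOL_SIZE = QUEUE_SIZE * PACKET_SIZE
MEMORY_POOL_ADDR_LEN = QUEUE_ADDR_LEN + PACKET_ADDR_LEN


def pool_addr(slot, offset):
    """Concatenate a slot index and a byte offset into one pool address."""
    if not (0 <= slot < QUEUE_SIZE and 0 <= offset < PACKET_SIZE):
        raise ValueError("slot or offset out of range")
    return (slot << PACKET_ADDR_LEN) | offset


def split_addr(addr):
    """Recover (slot, offset) from a pool address."""
    return addr >> PACKET_ADDR_LEN, addr & (PACKET_SIZE - 1)


first_byte_of_slot_3 = pool_addr(3, 0)
last_byte_of_pool = pool_addr(QUEUE_SIZE - 1, PACKET_SIZE - 1)
```

Because both sizes are powers of two, the 10-bit slot index and the 6-bit byte offset tile the 16-bit address space exactly, with no gaps.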
--- a/plan.md
+++ b/plan.md
@@ -44,9 +44,6 @@ Allow ROSE's DMA to be implemented in the drivers.
 Note: This may be implemented as development of THORN goes into
 action, or be facilitated by it.
 ### [TODO] Implement congestion control
 When the logic for the fabric is mature enough, it should be upgraded.
 ### [TODO] Implement mesh networks allowing inter-fabric routing
 ROSE shouldn't be limited to only 1 fabric.
@@ -92,3 +89,12 @@ scratch my head every time I push an update to the logic.
 Weigh testing against the cost of time and efficiency.  If testing
 hinders development, then it should be separated from the development
 cycle.
 ### Ditching features
 I ditched the plans for supporting AI clusters, along with the plans
 for congestion control.  Focus on reducing latency and an
 implementation that's elegant and simple.
 #### The lesson learned
 Focus. Know what ROSE really stands for, and stop spending thoughts on
 unnecessary things like trying to dual-wield AI and HFT workloads.
--- a/style.md
+++ b/style.md
@@ -0,0 +1,125 @@
 # Style Guide for ROSE (and other FLORA projects)
 Coding style matters a lot. Good coding style makes the code easier
 on the eye, and can help mitigate some pitfalls and confusion.
 ## Indentation
 For all indentation, use **spaces**, not tabs.
 The rationale behind this is to avoid different indent width settings
 in different editors.  It's a great trade-off of making your source
 file a little bigger for portability to different editors.
 ### C
 Use 8 spaces.  This is not only to adhere to the Linux kernel's coding
 style, but also to prevent your indentation levels from getting too
 big.
 ### Verilog/SystemVerilog
 Use 4 spaces.  Unlike C, HDL is more on the combinational logic
 side, so we can expect more `if-else` clauses nested inside each other.
 **IMPORTANT: If the indentation is blowing lines off the 80-char
 width, you should probably consider refactoring the logic.**
 ### Python
 Use 4 spaces.  This is enough for scripts, and it's the choice made
 by the people behind Python.
 ### Shell Scripts
 Use 4 spaces.  There might be arguments to make it 2, but 4 is the
 minimum if you want to spot something appearing in an incorrect level
 when you've been staring at the screen for 15 hours.
 ### Line width
 80 characters is preferred, but it can be extended by 20 characters or
 so to accommodate longer identifiers.
 If it breaches 80 characters, consider breaking it into multiple lines.
 However, it is important to note that when passing many
 parameters/logic, it should always be broken into logical chunks for
 each line.
 ## Avoid magic numbers
 Unless it's the bit-length of a byte or something that's commonly
 known and obvious at first glance, use a constant to store it.
 ## Naming schemes
 Names are only meaningful to humans, and the rationale behind the
 following guidelines is to allow anyone reading the code to know what
 an identifier refers to without scrolling back to its definition or
 other references.
 ### Snake case or camel case?
 Snake case.
 ### Scoping
 For all identifiers, it's important to note the scope of their usage.
 Names are there to avoid confusion, not add to it, and the
 considerations about confusion should fall in the same scope as their
 usage.
 ### Abbreviating
 Using abbreviations is okay and a good idea under the right
 circumstances.
 As a general rule of thumb, the aggressiveness of abbreviating words
 is inversely proportional to the size of the scope.  But it's a **bad
 idea** to abbreviate global identifiers that are not commonly used.
 ### Constants
 For all constants, use **ALL_CAPS**.
 ### Global identifiers
 Use **FULL NAMES** unless it's something pre-agreed on or defined by
 specifications like `mosi` or `sys_clk`.
 ## Commenting
 Comments are great, but don't over-comment.  They are there
 for exactly two things:
 1. Tell people **what** the code does
 2. Give a signal for future development (e.g. implementation notes,
   usage warnings, required guarantees)
 If you need to explain how your code does something using comments,
 it's a better idea to re-write the code.
 ### Signals
 Comment signals should always be contained in the same line so that
 you can `grep` for them, the only exception to this is within
 documentation, where you usually search for them.
 1. `TODO`: something to be done in the future
 2. `NOTE`: keep note of something when using/running the code
 3. `IMPORTANT`: knowing this is crucial to using/running the code
 4. `REQUIRES`: guarantees for the code to run properly
 5. `GUARANTEES`: guarantees that the code has this feature when run
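Keeping each signal on one line also makes it easy to collect them with a small script.  The sketch below is a hypothetical helper, not part of the ROSE tooling; it assumes the signal is written in capitals and followed by a colon, as in the hub sources, and the sample text reuses two comments from `hub.sv`.

```python
import re

SIGNALS = ("TODO", "NOTE", "IMPORTANT", "REQUIRES", "GUARANTEES")
SIGNAL_RE = re.compile(r"\b(" + "|".join(SIGNALS) + r"):\s*(.*)")


def find_signals(source):
    """Return (line number, signal, message) for every comment signal."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = SIGNAL_RE.search(line)
        if match:
            hits.append((lineno, match.group(1), match.group(2).strip()))
    return hits


sample = """\
// IMPORTANT: rst must be held high for at least 2 sys_clk cycles
always_ff @ (posedge sys_clk) begin
    // TODO: handle the drop logic
end
"""
found = find_signals(sample)
```

A signal whose message wraps onto a second line would be reported with only half of its text, which is the practical reason for the one-line rule.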
 ## Output Messages
 Like comment signals, messages should also be in complete lines and
 `grep`-friendly.
 All messages should use capitalized signals denoting what type of
 message it is (e.g. `ERROR`, `WARNING`, `INFO`) and enclosed in square
 brackets ('[' and ']') so they can be easily processed by `sed` or
 `awk`.
 If there is the need for a timestamp, put the timestamp after the
 signal but within the closing bracket, leave no spaces between the
 signal and the timestamp, and separate the two parts with a colon ':'.
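One way to read the rules above, sketched in Python.  The `format_message` helper is hypothetical, and the ISO 8601 timestamp is an assumption, since the guide does not fix a timestamp format.

```python
from datetime import datetime


def format_message(signal, text, timestamp=None):
    """Build a single-line message such as '[ERROR] link down'."""
    tag = signal.upper()
    if timestamp is not None:
        # Signal and timestamp share the brackets, joined by a colon, no spaces.
        tag = f"{tag}:{timestamp.isoformat()}"
    return f"[{tag}] {text}"


plain = format_message("error", "link down")
stamped = format_message("warning", "free queue nearly empty",
                         datetime(2025, 5, 28, 21, 30, 0))
```

Here `plain` is `[ERROR] link down` and `stamped` is `[WARNING:2025-05-28T21:30:00] free queue nearly empty`.  An ISO timestamp contains colons of its own, so a `sed` or `awk` script should split on the first colon inside the brackets only.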
 ## Tricks and workarounds
 Don't try to write "smart" code.  Instead, write code that everyone
 can understand without too much explanation.
 ## Styles specific to Verilog/SystemVerilog
 ### Always use `logic`
 Unless absolutely necessary, use `logic` or types built on top of
 `logic`.  This is to incorporate the idea of ownership into the code.
 Every bit of data should have only one unique driver.
 ### Avoid inferring latches
 Every bit of data should be explicitly passed to other blocks of code.