From f61de84b4aa6df4702f520bd4da474d417f764d7 Mon Sep 17 00:00:00 2001
From: Peisong Xiao <peisong.xiao.xps@gmail.com>
Date: Thu, 29 May 2025 00:18:06 -0400
Subject: [PATCH] work in progress, major overhaul for design, see devlogs for
 details. Also added the first version of the style guide

---
 devlog/2025-05-21-Rethink-routing.md |   4 +-
 devlog/2025-05-28-Queuing.md         |  81 +++++++++
 fabric/src/hub.sv                    | 241 +++++++++++++++++----------
 fabric/src/params.sv                 |  11 ++
 plan.md                              |  12 +-
 style.md                             | 125 ++++++++++++++
 6 files changed, 384 insertions(+), 90 deletions(-)
 create mode 100644 devlog/2025-05-28-Queuing.md
 create mode 100644 fabric/src/params.sv
 create mode 100644 style.md

diff --git a/devlog/2025-05-21-Rethink-routing.md b/devlog/2025-05-21-Rethink-routing.md
index 1bd5eb8..ffb4fbb 100644
--- a/devlog/2025-05-21-Rethink-routing.md
+++ b/devlog/2025-05-21-Rethink-routing.md
@@ -1,14 +1,14 @@
 # Rethinking the Routing Memory Pool
 Date: 2025-05-21
 
-## Goals and Expectations
+## Goals and expectations
 To finish the RX and TX queues.
 
 ## Results
 Nope.  I'm half way through the TX queue and I'm gonna rework the
 entire thing.
 
-## Thought Train
+## Thought train
 Separating the TX queue to be per-interface is amazing.  But making it
 a multi-headed queue is a disaster.  In this case, it doesn't simplify
 the logic, while taking away one of the benefits of a shared memory
diff --git a/devlog/2025-05-28-Queuing.md b/devlog/2025-05-28-Queuing.md
new file mode 100644
index 0000000..74763cb
--- /dev/null
+++ b/devlog/2025-05-28-Queuing.md
@@ -0,0 +1,81 @@
+# Redesigned Internal Memory Pool
+Date: 2025-05-28
+
+## Goals and expectations
+Lay the foundations in the hub for the new memory pool design.
+
+## Thought train
+We're putting aside the support for AI clusters and focusing on the
+HFT side of things for the time being.  They have very different
+network workloads, and one system that combines them both is not
+really a good option.
+
+This also means we can drop the plans for congestion control, since
+in HFT infrastructure that's mostly done at the application layer,
+not the network layer.  This will speed up development and let me
+focus on getting the most essential parts to function as intended.
+
+A single block of BRAM typically only allows two simultaneous
+operations to non-conflicting addresses.  This means that servicing
+multiple interfaces at the same time is impractical.
+
+So, I decided to implement a round-robin approach for both reads and
+writes (feels like being back to the same spot a few weeks ago, but
+there are some differences).
+
+The approach is quite straightforward: every cycle, the hub selects
+one of the interfaces to service, and when servicing it, it checks
+for both RX and TX side transmissions.
+
+Note that it's up to the interfaces to keep track of when a packet
+has been fully received, but it's left to the hub to collect the
+freed slots.
+
+The interfaces will also have their own packet address queues to keep
+track of their outgoing packets.  Furthermore, this queue can be
+limited to a fraction of the packet queue's size to allow control over
+the maximum number of packets per packet buffer.
+
+This centralized, dynamic memory allocation strategy should handle
+bursts well and ensure that lightweight flows are still serviced
+during a burst event, which is good for handling HFT-like workloads.
+
+## Results
+A very good evening of coding.  I finished the following:
+
+1. Reworked most of the hub's logic, implemented the RX side of things
+   and left some TODO notes.
+2. Implemented the `free_queue` for allocating free queue slots to
+   incoming packets and enqueuing slots freed by the TX side logic.
+3. Implemented the `memory_pool` for the packet memory.
+4. Wrote the first draft of the FLORA/ROSE coding style guide.
+
+## Reflections
+1. Focus.  Focus is the key to getting what you really want.
+2. Modularize.  Modularization will keep the work limited to more
+   manageable chunks, which is much more important when developing
+   alone.
+3. Write everything down.  Keep track of every thought, by handwritten
+   notes, documentation, or even ChatGPT conversation history.  This
+   will help when there are a few dozen things to keep in mind every
+   day.
+4. Start doing things.  Start writing down thoughts, start discussions
+   about future plans, start coding.  Build momentum, and keep it
+   alive.
+
+## Final thoughts
+FPGAs are great tools, and I've only begun to scratch the surface of
+them.  Take implementing BRAM-based queues: I have to think about
+how to sync all the components so that everything I need is
+ready exactly when I want it.
+
+I feel like I'm beginning the transformation from a sequential thinker
+who thinks in steps into a clock-aligned combinational thinker: I
+think about when each step happens, not in what order, but at what
+time.
+
+Also, knowing explicitly that `logic` implies ownership, one driver
+per signal, helped me structure my code better.
+
+## Next steps
+Complete the hub, then move on to the interfaces.
diff --git a/fabric/src/hub.sv b/fabric/src/hub.sv
index d8585fb..ad19d87 100644
--- a/fabric/src/hub.sv
+++ b/fabric/src/hub.sv
@@ -1,96 +1,167 @@
-module hub (
-            input logic         rst,
-            input logic         sys_clk,
-            input logic [31:0]  rx_cmd,       // for routing-related commands
-            input logic [3:0]   rx_cmd_valid,
-            input logic [31:0]  rx_byte,
-            input logic [3:0]   rx_valid,
-            input logic [31:0]  rx2tx_dest,   // rx byte's destination
-            input logic [3:0]   tx_ready,     // if tx_byte is ready to be read
-            output logic [3:0]  rx_ready,     // if rx_byte is ready to be read
-            output logic [7:0]  tx_src,       // tell the tx where the stream is comming from
-            output logic [31:0] tx_byte,
-            output logic [3:0]  tx_valid,
-            output logic [1:0]  packet_size); // 4 states for 4 fixed packet sizes
+`include <params.sv>
+
+// IMPORTANT: interfaces are supposed to keep track of their own packet states
+module hub(
+            input logic                                               sys_clk,
+            input logic                                               rst,
+            input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0]  rx_pkt_addr,
+            input logic [INTERFACE_CNT - 1:0][7:0]                    rx_byte,
+            input logic [INTERFACE_CNT - 1:0]                         rx_valid,
+            input logic [INTERFACE_CNT - 1:0]                         tx_ready,
+            input logic [INTERFACE_CNT - 1:0]                         tx_full,
+            input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0]  tx_pkt_addr,
+            input logic [INTERFACE_CNT - 1:0]                         rx_new_packet,
+            output logic [INTERFACE_CNT - 1:0]                        rx_ready,
+            output logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] tx_queue_addr,
+            output logic [INTERFACE_CNT - 1:0]                        tx_queue_addr_valid,
+            output logic [INTERFACE_CNT - 1:0][7:0]                   tx_byte,
+            output logic [INTERFACE_CNT - 1:0]                        tx_valid);
     timeunit 1ns;
     timeprecision 1ps;
+
+    logic [$clog2(INTERFACE_CNT) - 1:0]          curr_service;
+    logic                                        request_new_slot;
+    logic [QUEUE_ADDR_LEN - 1:0]                 new_slot_addr;
+    logic                                        free_queue_empty;
+    logic [QUEUE_ADDR_LEN - 1:0]                 empty_slot_addr;
+    logic                                        empty_slot_enqueue;
+
+    free_queue fqueue(.sys_clk(sys_clk),
+                      .rst(rst),
+                      .request_new_slot(request_new_slot),
+                      .empty_slot_addr(empty_slot_addr),
+                      .empty_slot_enqueue(empty_slot_enqueue),
+                      .new_slot_addr(new_slot_addr),
+                      .queue_empty(free_queue_empty));
     
-    // TBD: pre-agree on packet size
+    logic [INTERFACE_CNT - 1:0][MEMORY_POOL_ADDR_LEN - 1:0] rx_mem_addr;
+    logic [MEMORY_POOL_ADDR_LEN - 1:0]                      mem_read_addr;
+    logic [7:0]                                             mem_read_byte;
+    logic [MEMORY_POOL_ADDR_LEN - 1:0]                      mem_write_addr;
+    logic                                                   mem_write_enable;
+    logic [7:0]                                             mem_write_byte;
 
-    // use the round-robin strat to poll since the routing is much faster
-    // NOTE: To expand to more connected_devices, use a hierarchical design
-    logic [1:0]           curr_service = 0;
-    logic [1:0]           last_dest = 0;
-
-    // src dest byte
-    typedef struct {
-        logic [1:0] dest;
-        logic [7:0] payload;
-    } svc_buffer;
-    svc_buffer service_buffer [3:0];
-    svc_buffer curr_buffer;
-    assign curr_buffer = service_buffer[curr_service];
-    logic [3:0]           in_buffer;
-    assign rx_ready = ~in_buffer;
-
-    always_ff @ (posedge sys_clk) begin
+    memory_pool mpool(.sys_clk(sys_clk),
+                      .rst(rst),
+                      .read_addr(mem_read_addr),
+                      .write_addr(mem_write_addr),
+                      .write_byte(mem_write_byte),
+                      .write_enable(mem_write_enable),
+                      .read_byte(mem_read_byte));
+    
+    
+    
+    always_ff @ (posedge sys_clk or posedge rst) begin
         if (rst) begin
-            in_buffer <= '0;
-            tx_src <= '0;
+            tx_queue_addr <= '0;
+            tx_queue_addr_valid <= '0;
+            tx_byte <= '0;
             tx_valid <= '0;
-            packet_size <= '0;
             curr_service <= '0;
-            last_dest <= '0;
-            for (int i = 0; i < 4; i++) begin
-                service_buffer[i] <= '0;
-            end
-        end else begin // if (rst)
-            // Handle RX side logic
-            for (int i = 0; i < 4; i++) begin
-                if (rx_valid[i]) begin
-                    if (!in_buffer[i]) begin
-                        service_buffer[i].dest <= get_hop(rx2tx_dest, i[1:0]);
-                        service_buffer[i].payload <= get_byte(rx_byte, i[1:0]);
-                        in_buffer[i] <= 1;
+            rx_ready <= '0;
+            rx_mem_addr <= '0;
+            mem_read_addr <= '0;
+            mem_write_addr <= '0;
+            mem_write_enable <= 0;
+            mem_write_byte <= '0;
+        end else begin
+            // NOTE: signaled the servicing interface in the last cycle
+            rx_ready[curr_service] <= 0;
+            rx_ready[curr_service + 1] <= 1;
+
+            // IMPORTANT: interfaces should send the byte no matter what, rx_ready is to prevent sending a new byte
+            if (rx_valid[curr_service]) begin
+                // IMPORTANT: memory_write_addr is ready on the next cycle
+                if (rx_new_packet[curr_service]) begin
+                    if (free_queue_empty) begin
+                        // TODO: handle the drop logic
+                    end else begin
+                        request_new_slot <= 1;
+                        // NOTE: size prefixes must be literals, hence the replication
+                        rx_mem_addr[curr_service]
+                            <= {new_slot_addr, {PACKET_ADDR_LEN{1'b0}}};
+                        mem_write_addr
+                            <= {new_slot_addr, {PACKET_ADDR_LEN{1'b0}}};
                     end
-                end
-            end
-
-            // Handle TX side logic
-            if (in_buffer[curr_service] && tx_ready[curr_buffer.dest]) begin
-                tx_byte[{curr_buffer.dest, 3'b000} +: 8]
-                    <= curr_buffer.payload;
-                tx_src[{curr_buffer.dest, 1'b0} +: 2]
-                    <= curr_service;
-                in_buffer[curr_service] <= 0;
-                tx_valid[curr_buffer.dest] <= 1;
-            end
-            tx_valid[last_dest] <= 0;
-            last_dest <= service_buffer[curr_service].dest;
-            curr_service <= curr_service + 1;
-        end // else: !if(rst)
-    end // always_ff @ (posedge sys_clk)
-
+                end else begin // if (rx_new_packet[curr_service])
+                    // NOTE: if memory 
+                    mem_write_addr <= mem_write_addr + 1;
+                    request_new_slot <= 0;
+                end // else: !if(rx_new_packet[curr_service])
+                mem_write_byte <= rx_byte[curr_service];
+                mem_write_enable <= 1;
+            end else // if (rx_valid[curr_service])
+                mem_write_enable <= 0;
+        end
+    end
 endmodule // hub
 
-function automatic logic [7:0] get_byte(input logic [31:0] byte_arr,
-                                        input logic [1:0] idx);
-    return byte_arr[{idx, 3'b000} +: 8];
-endfunction // get_byte
+// IMPORTANT: the current queue_addr is always valid unless queue_empty
+// REQUIRES: hub does not request a new slot when the queue is empty
+module free_queue(input logic                         sys_clk,
+                  input logic                         rst,
+                  input logic                         request_new_slot,
+                  input logic [QUEUE_ADDR_LEN - 1:0]  empty_slot_addr,
+                  input logic                         empty_slot_enqueue,
+                  output logic [QUEUE_ADDR_LEN - 1:0] new_slot_addr,
+                  output logic                        queue_empty);
+    timeunit 1ns;
+    timeprecision 1ps;
 
-// NOTE: addr 0 is alway mapped to the fabric itself and caught before this
-function automatic logic [1:0] get_hop(input logic [31:0] dest_map,
-                                       input logic [1:0]  idx);
-    case (dest_map[{idx, 3'b000} +: 8])
-        8'b00000001:
-            return 2'b00;
-        8'b00000010:
-            return 2'b01;
-        8'b00000011:
-            return 2'b10;
-        8'b00000100:
-            return 2'b11;
-        default:
-            return 0;
-    endcase // case (dest_map[{idx, 3'b000} +: 8])
-endfunction // get_hop
+    logic [QUEUE_ADDR_LEN - 1:0] fqueue [QUEUE_SIZE - 1:0];
+    logic [QUEUE_ADDR_LEN - 1:0] head;
+    logic [QUEUE_ADDR_LEN - 1:0] tail;
+    shortint                     queue_size;
+
+    assign queue_empty = queue_size == 0;
+
+    initial begin
+        // TODO: pre-load the free queue with every slot possible
+    end
+
+    // IMPORTANT: rst must be held high for at least 2 sys_clk cycles
+    always_ff @ (posedge sys_clk or posedge rst) begin
+        if (rst) begin
+            head <= '0;
+            tail <= QUEUE_ADDR_LEN'(1);
+            queue_size <= QUEUE_SIZE;
+            new_slot_addr <= '0;
+        end else begin
+            if (request_new_slot)
+                head <= head + 1;
+            new_slot_addr <= fqueue[head];
+
+            if (empty_slot_enqueue) begin
+                fqueue[tail] <= empty_slot_addr;
+                tail <= tail + 1;
+            end
+            // NOTE: a dequeue and an enqueue in the same cycle cancel out
+            if (request_new_slot != empty_slot_enqueue)
+                queue_size <= request_new_slot ? queue_size - 1 : queue_size + 1;
+        end
+    end
+    
+endmodule // free_queue
+
+module memory_pool(input logic                              sys_clk,
+                   input logic                              rst,
+                   input logic [MEMORY_POOL_ADDR_LEN - 1:0] read_addr,
+                   input logic [MEMORY_POOL_ADDR_LEN - 1:0] write_addr,
+                   input logic [7:0]                        write_byte,
+                   input logic                              write_enable,
+                   output logic [7:0]                       read_byte);
+    timeunit 1ns;
+    timeprecision 1ps;
+
+    logic [7:0] mem_pool[MEMORY_POOL_SIZE - 1:0];
+
+    always_ff @ (posedge sys_clk or posedge rst) begin
+        if (rst) begin
+            read_byte <= 8'hFF;
+        end else begin
+            if (write_enable)
+                mem_pool[write_addr] <= write_byte;
+            read_byte <= mem_pool[read_addr];
+        end
+    end
+endmodule // memory_pool
diff --git a/fabric/src/params.sv b/fabric/src/params.sv
new file mode 100644
index 0000000..3fa9909
--- /dev/null
+++ b/fabric/src/params.sv
@@ -0,0 +1,11 @@
+parameter int PACKET_SIZE = 64;
+parameter int PACKET_ADDR_LEN = 6;
+parameter int QUEUE_SIZE = 1024;
+parameter int QUEUE_ADDR_LEN = 10;
+parameter int MEMORY_POOL_SIZE = QUEUE_SIZE * PACKET_SIZE;
+parameter int MEMORY_POOL_ADDR_LEN = QUEUE_ADDR_LEN + PACKET_ADDR_LEN;
+parameter int MEMORY_POOL_ADDR_SHIFT = 4;
+parameter int INTERFACE_QUEUE_SIZE = 512;
+parameter int INTERFACE_QUEUE_ADDR_LEN = 9;
+parameter int INTERFACE_CNT = 4;
+parameter int CRC_BITS = 8;
diff --git a/plan.md b/plan.md
index e9eb07e..8f58f51 100644
--- a/plan.md
+++ b/plan.md
@@ -44,9 +44,6 @@ Allow ROSE's DMA to be implemented in the drivers.
 Note: This may be implemented as development of THORN goes into
 action, or be facilitated by it.
 
-### [TODO] Implement congestion control
-When the logic for the fabric is mature enough, it should be upgraded.
-
 ### [TODO] Implement mesh networks allowing inter-fabric routing
 ROSE shouldn't be limited to only 1 fabric.
 
@@ -92,3 +89,12 @@ scratch my head every time I push an update to the logic.
 Weight testing against the cost of time and efficiency.  If testing
 hinders development, then it should be separated from the development
 cycle.
+
+### Ditching features
+I ditched the plans for supporting AI clusters, along with the plans
+for congestion control.  The focus is now on reducing latency and on
+an implementation that's elegant and simple.
+
+#### The lesson learned
+Focus. Know what ROSE really stands for, and stop spending thoughts on
+unnecessary things like trying to dual-wield AI and HFT workloads.
diff --git a/style.md b/style.md
new file mode 100644
index 0000000..7aa39df
--- /dev/null
+++ b/style.md
@@ -0,0 +1,125 @@
+# Style Guide for ROSE (and other FLORA projects)
+Coding style matters a lot.  Good coding style makes the code easier
+on the eye, and can help mitigate some pitfalls and confusion.
+
+## Indentation
+For all indentation, use **spaces**, not tabs.
+
+The rationale behind this is to avoid different indent width settings
+in different editors.  It's a worthwhile trade-off: the source file
+gets a little bigger in exchange for portability across editors.
+
+### C
+Use 8 spaces.  This is not only to adhere to the Linux kernel's coding
+style, but also to prevent your indentation levels from getting too
+big.
+
+### Verilog/SystemVerilog
+Use 4 spaces.  Unlike C, HDL leans toward combinational logic, so we
+can expect more deeply nested `if-else` clauses.
+
+**IMPORTANT: If the indentation is blowing lines off the 80-char
+width, you should probably consider refactoring the logic.**
+
+### Python
+Use 4 spaces.  This is enough for scripts, and it is what PEP 8, the
+official Python style guide, recommends.
+
+### Shell Scripts
+Use 4 spaces.  There might be arguments to make it 2, but 4 is the
+minimum if you want to spot something appearing in an incorrect level
+when you've been staring at the screen for 15 hours.
+
+### Line width
+80 characters is preferred, but it can be extended by 20 characters or
+so to accommodate longer identifiers.
+
+If it breaches 80 characters, consider breaking it into multiple lines.
+
+However, it is important to note that when passing many
+parameters/logic, it should always be broken into logical chunks for
+each line.
+
+## Avoid magic numbers
+Unless it's the bit-length of a byte or something that's commonly
+known and obvious at first glance, use a constant to store it.
+
+## Naming schemes
+Names are only meaningful to humans, and the rationale behind the
+following guidelines is to allow anyone reading the code to know what
+an identifier refers to without scrolling back to its definition or
+other references.
+
+### Snake case or camel case?
+Snake case.
+
+### Scoping
+For all identifiers, it's important to note the scope of their usage.
+Names are there to avoid confusion, not add to it, and whether a
+name is confusing should be judged within the scope in which it is
+used.
+
+### Abbreviating
+Using abbreviations is okay and a good idea under the right
+circumstances.
+
+As a general rule of thumb, the aggressiveness of abbreviating words
+is inversely proportional to the size of the scope.  But it's a **bad
+idea** to abbreviate global identifiers that are not commonly used.
+
+### Constants
+For all constants, use **ALL_CAPS**.
+
+### Global identifiers
+Use **FULL NAMES** unless it's something pre-agreed on or fixed by a
+specification, like `mosi` or `sys_clk`.
+
+## Commenting
+Comments are great, but don't over-comment.  They are there for
+exactly two things:
+
+1. Tell people **what** the code does
+2. Give a signal for future development (e.g. implementation notes,
+   usage warnings, required guarantees)
+   
+If you need to explain how your code does something using comments,
+it's a better idea to re-write the code.
+
+### Signals
+Comment signals should always be contained in a single line so that
+you can `grep` for them.  The only exception is within
+documentation, where you usually search for them.
+
+1. `TODO`: something to be done in the future
+2. `NOTE`: keep note of something when using/running the code
+3. `IMPORTANT`: knowing this is crucial to using/running the code
+4. `REQUIRES`: guarantees for the code to run properly
+5. `GUARANTEES`: guarantees that the code has this feature when run
+
+## Output Messages
+Like comment signals, messages should also be in complete lines and
+`grep`-friendly.
+
+All messages should use capitalized signals denoting what type of
+message it is (e.g. `ERROR`, `WARNING`, `INFO`), enclosed in square
+brackets ('[' and ']') so they can be easily processed by `sed` or
+`awk`.
+
+If there is the need for a timestamp, put the timestamp after the
+signal but within the closing bracket, leave no spaces between the
+signal and the timestamp, and separate the two parts with a colon ':'.
+
+## Tricks and workarounds
+Don't try to write "smart" code; instead, write code that everyone can
+understand without too much explanation.
+
+## Styles specific to Verilog/SystemVerilog
+
+### Always use `logic`
+Unless absolutely necessary, use `logic` or types built on top of
+`logic`.  This is to incorporate the idea of ownership into the code.
+
+Every bit of data should have only one unique driver.
+
+### Avoid inferring latches
+Every bit of data should be verbosely passed to other blocks of code.