From f61de84b4aa6df4702f520bd4da474d417f764d7 Mon Sep 17 00:00:00 2001
From: Peisong Xiao
Date: Thu, 29 May 2025 00:18:06 -0400
Subject: [PATCH] work in progress, major overhaul for design, see devlogs for details. Also added the first version of the style guide

---
 devlog/2025-05-21-Rethink-routing.md |   4 +-
 devlog/2025-05-28-Queuing.md         |  81 +++++++++
 fabric/src/hub.sv                    | 241 +++++++++++++++++----------
 fabric/src/params.sv                 |  11 ++
 plan.md                              |  12 +-
 style.md                             | 125 ++++++++++++++
 6 files changed, 384 insertions(+), 90 deletions(-)
 create mode 100644 devlog/2025-05-28-Queuing.md
 create mode 100644 fabric/src/params.sv
 create mode 100644 style.md

diff --git a/devlog/2025-05-21-Rethink-routing.md b/devlog/2025-05-21-Rethink-routing.md
index 1bd5eb8..ffb4fbb 100644
--- a/devlog/2025-05-21-Rethink-routing.md
+++ b/devlog/2025-05-21-Rethink-routing.md
@@ -1,14 +1,14 @@
 # Rethinking the Routing Memory Pool
 Date: 2025-05-21
 
-## Goals and Expectations
+## Goals and expectations
 To finish the RX and TX queues.
 
 ## Results
 Nope. I'm half way through the TX queue and I'm gonna rework the
 entire thing.
 
-## Thought Train
+## Thought train
 Separating the TX queue to be per-interface is amazing. But making it
 a multi-headed queue is a disaster. In this case, it doesn't simplify
 the logic, while taking away one of the benefits of a shared memory
diff --git a/devlog/2025-05-28-Queuing.md b/devlog/2025-05-28-Queuing.md
new file mode 100644
index 0000000..74763cb
--- /dev/null
+++ b/devlog/2025-05-28-Queuing.md
@@ -0,0 +1,81 @@
+# Redesigned Internal Memory Pool
+Date: 2025-05-28
+
+## Goals and expectations
+Lay the foundations in the hub for the new memory pool design.
+
+## Thought train
+We're putting aside the support for AI clusters and focusing on the
+HFT side of things for the time being. They have very different
+network workloads and one system to combine them both is not really a
+good option.
+
+This also means we can remove the plans for congestion control; that's
+mostly done at the application layer, not the network layer in HFT
+infrastructure. This will speed up the development and let me focus on
+getting the most essential parts to function as intended.
+
+A single block of BRAM typically only allows two simultaneous
+operations to non-conflicting addresses. This means that servicing
+multiple interfaces at the same time is impractical.
+
+So, I decided to implement a round-robin approach for both reads and
+writes (feels like being back to the same spot a few weeks ago, but
+there are some differences).
+
+The approach is quite straightforward: every cycle, the hub selects
+one of the interfaces to service, and when servicing it, it checks
+for both RX and TX side transmissions.
+
+And note that it's up to the interfaces to keep track of the
+completion of receiving a packet, but left for the hub to collect
+the free slots.
+
+And the interfaces will have their own packet address queues to keep
+track of their outgoing packets. Furthermore, this queue can be
+limited to a fraction of the packet queue's size to allow control over
+the maximum number of packets per packet buffer.
+
+This centralized, dynamic memory allocation strategy should handle
+bursts well and ensure lightweight flows are still handled during a
+burst event, which is good for handling HFT-like workloads.
+
+## Results
+A very good evening of coding. I finished the following:
+
+1. Reworked most of the hub's logic, implemented the RX side of things
+   and left some TODO notes.
+2. Implemented the `free_queue` for allocating free queue slots for
+   incoming packets and enqueuing slots freed by the TX side logic.
+3. Implemented the `memory_pool` for the packet memory.
+4. Wrote the first draft of the FLORA/ROSE coding style guide.
+
+## Reflections
+1. Focus. Focus is the key to getting what you really want.
+2. Modularize.
Modularization will keep the work limited to more
+   manageable chunks, which is much more important when developing
+   alone.
+3. Write everything down. Keep track of every thought, by handwritten
+   notes, documentation, or even ChatGPT conversation history. This
+   will help when there are a few dozen things to keep in mind every
+   day.
+4. Start doing things. Start writing down thoughts, start discussions
+   about future plans, start coding. Start a momentum, and start
+   keeping it alive.
+
+## Final thoughts
+FPGAs are great tools, and I've only begun to scratch the surface of
+them. Take implementing BRAM-based queues: I have to think about
+how to sync all the components so that everything I need is
+ready exactly when I want it.
+
+I feel like I'm beginning the transformation from a sequential thinker
+that thinks in steps into a clock-aligned combinational thinker: I
+think about when each step happens, not in what order, but at what
+time.
+
+Also, explicitly knowing the hidden logic of `logic` implying
+ownership helped me structure my code better.
+
+## Next steps
+Complete the hub, then move on to the interfaces.
diff --git a/fabric/src/hub.sv b/fabric/src/hub.sv
index d8585fb..ad19d87 100644
--- a/fabric/src/hub.sv
+++ b/fabric/src/hub.sv
@@ -1,96 +1,167 @@
-module hub (
-    input logic rst,
-    input logic sys_clk,
-    input logic [31:0] rx_cmd, // for routing-related commands
-    input logic [3:0] rx_cmd_valid,
-    input logic [31:0] rx_byte,
-    input logic [3:0] rx_valid,
-    input logic [31:0] rx2tx_dest, // rx byte's destination
-    input logic [3:0] tx_ready, // if tx_byte is ready to be read
-    output logic [3:0] rx_ready, // if rx_byte is ready to be read
-    output logic [7:0] tx_src, // tell the tx where the stream is comming from
-    output logic [31:0] tx_byte,
-    output logic [3:0] tx_valid,
-    output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes
+`include
+
+// IMPORTANT: interfaces are supposed to keep track of their own packet states
+module hub(
+    input logic sys_clk,
+    input logic rst,
+    input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] rx_pkt_addr,
+    input logic [INTERFACE_CNT - 1:0][7:0] rx_byte,
+    input logic [INTERFACE_CNT - 1:0] rx_valid,
+    input logic [INTERFACE_CNT - 1:0] tx_ready,
+    input logic [INTERFACE_CNT - 1:0] tx_full,
+    input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] tx_pkt_addr,
+    input logic [INTERFACE_CNT - 1:0] rx_new_packet,
+    output logic [INTERFACE_CNT - 1:0] rx_ready,
+    output logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] tx_queue_addr,
+    output logic [INTERFACE_CNT - 1:0] tx_queue_addr_valid,
+    output logic [INTERFACE_CNT - 1:0][7:0] tx_byte,
+    output logic [INTERFACE_CNT - 1:0] tx_valid);
     timeunit 1ns;
     timeprecision 1ps;
+
+    logic [INTERFACE_CNT - 1:0] curr_service;
+    logic request_new_slot;
+    logic [QUEUE_ADDR_LEN - 1:0] new_slot_addr;
+    logic free_queue_empty;
+    logic [QUEUE_ADDR_LEN - 1:0] empty_slot_addr;
+    logic empty_slot_enqueue;
+
+    free_queue fqueue(.sys_clk(sys_clk),
+                      .rst(rst),
+                      .request_new_slot(request_new_slot),
+                      .empty_slot_addr(empty_slot_addr),
+
.empty_slot_enqueue(empty_slot_enqueue),
+                      .new_slot_addr(new_slot_addr),
+                      .queue_empty(free_queue_empty));
 
-    // TBD: pre-agree on packet size
+    logic [INTERFACE_CNT - 1:0][MEMORY_POOL_ADDR_LEN - 1:0] rx_mem_addr;
+    logic [MEMORY_POOL_ADDR_LEN - 1:0] mem_read_addr;
+    logic [7:0] mem_read_byte;
+    logic [MEMORY_POOL_ADDR_LEN - 1:0] mem_write_addr;
+    logic mem_write_enable;
+    logic [7:0] mem_write_byte;
 
-    // use the round-robin strat to poll since the routing is much faster
-    // NOTE: To expand to more connected_devices, use a hierarchical design
-    logic [1:0] curr_service = 0;
-    logic [1:0] last_dest = 0;
-
-    // src dest byte
-    typedef struct {
-        logic [1:0] dest;
-        logic [7:0] payload;
-    } svc_buffer;
-    svc_buffer service_buffer [3:0];
-    svc_buffer curr_buffer;
-    assign curr_buffer = service_buffer[curr_service];
-    logic [3:0] in_buffer;
-    assign rx_ready = ~in_buffer;
-
-    always_ff @ (posedge sys_clk) begin
+    memory_pool mpool(.sys_clk(sys_clk),
+                      .rst(rst),
+                      .read_addr(mem_read_addr),
+                      .write_addr(mem_write_addr),
+                      .write_byte(mem_write_byte),
+                      .write_enable(mem_write_enable),
+                      .read_byte(mem_read_byte));
+
+
+
+    always_ff @ (posedge sys_clk or posedge rst) begin
         if (rst) begin
-            in_buffer <= '0;
-            tx_src <= '0;
+            tx_queue_addr <= '0;
+            tx_queue_addr_valid <= '0;
+            tx_byte <= '0;
             tx_valid <= '0;
-            packet_size <= '0;
             curr_service <= '0;
-            last_dest <= '0;
-            for (int i = 0; i < 4; i++) begin
-                service_buffer[i] <= '0;
-            end
-        end else begin // if (rst)
-            // Handle RX side logic
-            for (int i = 0; i < 4; i++) begin
-                if (rx_valid[i]) begin
-                    if (!in_buffer[i]) begin
-                        service_buffer[i].dest <= get_hop(rx2tx_dest, i[1:0]);
-                        service_buffer[i].payload <= get_byte(rx_byte, i[1:0]);
-                        in_buffer[i] <= 1;
+            rx_ready <= '0;
+            rx_mem_addr <= '0;
+            mem_read_addr <= '0;
+            mem_write_addr <= '0;
+            mem_write_enable <= 0;
+            mem_write_byte <= '0;
+        end else begin
+            // NOTE: signaled the servicing interface in the last cycle
+            rx_ready[curr_service] <= 0;
+            rx_ready[curr_service + 1] <= 1;
+
+            // IMPORTANT: interfaces should send the byte no matter what, rx_ready is to prevent sending a new byte
+            if (rx_valid[curr_service]) begin
+                // IMPORTANT: memory_write_addr is ready on the next cycle
+                if (rx_new_packet[curr_service]) begin
+                    if (free_queue_empty) begin
+                        // TODO: handle the drop logic
+                    end else begin
+                        request_new_slot <= 1;
+                        rx_mem_addr[{curr_service,
+                                     MEMORY_POOL_ADDR_SHIFT'(0)}
+                                    +:MEMORY_POOL_ADDR_LEN
+                                    ] <= {new_slot_addr, PACKET_ADDR_LEN'(0)};
+                        mem_write_addr <= {new_slot_addr, PACKET_ADDR_LEN'(0)};
                     end
-                end
-            end
-
-            // Handle TX side logic
-            if (in_buffer[curr_service] && tx_ready[curr_buffer.dest]) begin
-                tx_byte[{curr_buffer.dest, 3'b000} +: 8]
-                    <= curr_buffer.payload;
-                tx_src[{curr_buffer.dest, 1'b0} +: 2]
-                    <= curr_service;
-                in_buffer[curr_service] <= 0;
-                tx_valid[curr_buffer.dest] <= 1;
-            end
-            tx_valid[last_dest] <= 0;
-            last_dest <= service_buffer[curr_service].dest;
-            curr_service <= curr_service + 1;
-        end // else: !if(rst)
-    end // always_ff @ (posedge sys_clk)
-
+                end else begin // if (rx_new_packet[curr_service])
+                    // NOTE: if memory
+                    mem_write_addr <= mem_write_addr + 1;
+                    request_new_slot <= 0;
+                end // else: !if(rx_new_packet[curr_service])
+                mem_write_byte <= rx_byte[{curr_service, 3'd0}+:8];
+                mem_write_enable <= 1;
+            end else // if (rx_valid[curr_service])
+                mem_write_enable <= 0;
+        end
+    end
 endmodule // hub
 
-function automatic logic [7:0] get_byte(input logic [31:0] byte_arr,
-                                        input logic [1:0] idx);
-    return byte_arr[{idx, 3'b000} +: 8];
-endfunction // get_byte
+// IMPORTANT: the current queue_addr is always valid unless queue_empty
+// REQUIRES: hub does not request a new slot when the queue is empty
+module free_queue(input logic sys_clk,
+                  input logic rst,
+                  input logic request_new_slot,
+                  input logic [QUEUE_ADDR_LEN - 1:0] empty_slot_addr,
+                  input logic empty_slot_enqueue,
+                  output logic [QUEUE_ADDR_LEN - 1:0] new_slot_addr,
+                  output logic queue_empty);
+    timeunit 1ns;
+    timeprecision 1ps;
 
-// NOTE:
addr 0 is alway mapped to the fabric itself and caught before this
-function automatic logic [1:0] get_hop(input logic [31:0] dest_map,
-                                       input logic [1:0] idx);
-    case (dest_map[{idx, 3'b000} +: 8])
-        8'b00000001:
-            return 2'b00;
-        8'b00000010:
-            return 2'b01;
-        8'b00000011:
-            return 2'b10;
-        8'b00000100:
-            return 2'b11;
-        default:
-            return 0;
-    endcase // case (dest_map[{idx, 3'b000} +: 8])
-endfunction // get_hop
+    logic [QUEUE_ADDR_LEN - 1:0] fqueue [QUEUE_SIZE - 1:0];
+    logic [QUEUE_ADDR_LEN - 1:0] head;
+    logic [QUEUE_ADDR_LEN - 1:0] tail;
+    shortint queue_size;
+
+    assign queue_empty = queue_size == 0;
+
+    initial begin
+        // TODO: pre-load the free queue with every slot possible
+    end
+
+    // IMPORTANT: rst must be held high for at least 2 sys_clk cycles
+    always_ff @ (posedge sys_clk or posedge rst) begin
+        if (rst) begin
+            head <= '0;
+            tail <= QUEUE_ADDR_LEN'(1);
+            queue_size <= QUEUE_SIZE;
+            new_slot_addr <= '0;
+        end else begin
+            if (request_new_slot) begin
+                head <= head + 1;
+                queue_size <= queue_size - 1;
+            end
+            new_slot_addr <= fqueue[head];
+
+            if (empty_slot_enqueue) begin
+                fqueue[tail] <= empty_slot_addr;
+                tail <= tail + 1;
+                queue_size <= queue_size + 1;
+            end
+        end
+    end
+
+endmodule // free_queue
+
+module memory_pool(input logic sys_clk,
+                   input logic rst,
+                   input logic [MEMORY_POOL_ADDR_LEN - 1:0] read_addr,
+                   input logic [MEMORY_POOL_ADDR_LEN - 1:0] write_addr,
+                   input logic [7:0] write_byte,
+                   input logic write_enable,
+                   output logic [7:0] read_byte);
+    timeunit 1ns;
+    timeprecision 1ps;
+
+    logic [7:0] mem_pool[MEMORY_POOL_SIZE - 1:0];
+
+    always_ff @ (posedge sys_clk or posedge rst) begin
+        if (rst) begin
+            read_byte <= 8'hFF;
+        end else begin
+            if (write_enable)
+                mem_pool[write_addr] <= write_byte;
+            read_byte <= mem_pool[read_addr];
+        end
+    end
+endmodule // memory_pool
diff --git a/fabric/src/params.sv b/fabric/src/params.sv
new file mode 100644
index 0000000..3fa9909
--- /dev/null
+++ b/fabric/src/params.sv
@@ -0,0 +1,11 @@
+parameter int
PACKET_SIZE = 64;
+parameter int PACKET_ADDR_LEN = 6;
+parameter int QUEUE_SIZE = 1024;
+parameter int QUEUE_ADDR_LEN = 10;
+parameter int MEMORY_POOL_SIZE = QUEUE_SIZE * PACKET_SIZE;
+parameter int MEMORY_POOL_ADDR_LEN = QUEUE_ADDR_LEN + PACKET_ADDR_LEN;
+parameter int MEMORY_POOL_ADDR_SHIFT = 4;
+parameter int INTERFACE_QUEUE_SIZE = 512;
+parameter int INTERFACE_QUEUE_ADDR_LEN = 9;
+parameter int INTERFACE_CNT = 4;
+parameter int CRC_BITS = 8;
diff --git a/plan.md b/plan.md
index e9eb07e..8f58f51 100644
--- a/plan.md
+++ b/plan.md
@@ -44,9 +44,6 @@ Allow ROSE's DMA to be implemented in the drivers.
 Note: This may be implemented as development of THORN goes into
 action, or be facilitated by it.
 
-### [TODO] Implement congestion control
-When the logic for the fabric is mature enough, it should be upgraded.
-
 ### [TODO] Implement mesh networks allowing inter-fabric routing
 ROSE shouldn't be limited to only 1 fabric.
 
@@ -92,3 +89,12 @@ scratch my head every time I push an update to the logic.
 Weight testing against the cost of time and efficiency. If testing
 hinders development, then it should be separated from the development
 cycle.
+
+### Ditching features
+I ditched the plans for supporting AI clusters, along with the plans
+for congestion control. Focus on reducing latency and an
+implementation that's elegant and simple.
+
+#### The lesson learned
+Focus. Know what ROSE really stands for, and stop spending thoughts on
+unnecessary things like trying to dual-wield AI and HFT workloads.
diff --git a/style.md b/style.md
new file mode 100644
index 0000000..7aa39df
--- /dev/null
+++ b/style.md
@@ -0,0 +1,125 @@
+# Style Guide for ROSE (and other FLORA projects)
+Coding style matters a lot. Good coding style makes the code easier
+on the eye and can help avoid some pitfalls and confusion.
+
+## Indentation
+For all indentation, use **spaces**, not tabs.
+
+The rationale behind this is to avoid different indent width settings
+in different editors.
It's a worthwhile trade-off: making your source
+file a little bigger buys portability across editors.
+
+### C
+Use 8 spaces. This is not only to adhere to the Linux kernel's coding
+style, but also to prevent your indentation levels from getting too
+big.
+
+### Verilog/SystemVerilog
+Use 4 spaces. Unlike C, HDL leans toward combinational logic,
+so we can expect more nested `if-else` clauses.
+
+**IMPORTANT: If the indentation is blowing lines off the 80-char
+width, you should probably consider refactoring the logic.**
+
+### Python
+Use 4 spaces. This is enough for scripts, and it's the choice of the
+people behind Python.
+
+### Shell Scripts
+Use 4 spaces. There might be arguments to make it 2, but 4 is the
+minimum if you want to spot something appearing in an incorrect level
+when you've been staring at the screen for 15 hours.
+
+### Line width
+80 characters is preferred, but it can be extended by 20 characters or
+so to accommodate longer identifiers.
+
+If a line breaches 80 characters, consider breaking it into multiple lines.
+
+However, it is important to note that when passing many
+parameters/logic, the list should always be broken into logical
+chunks, one per line.
+
+## Avoid magic numbers
+Unless it's the bit-length of a byte or something that's commonly
+known and obvious at first glance, use a constant to store it.
+
+## Naming schemes
+Names are only meaningful to humans, and the rationale behind the
+following guidelines is to allow anyone reading the code to know what
+an identifier refers to without scrolling back to its definition or
+other references.
+
+### Snake case or camel case?
+Snake case.
+
+### Scoping
+For all identifiers, it's important to note the scope of their usage.
+Names are there to avoid confusion, not add to it, and the
+considerations about confusion should fall in the same scope as their
+usage.
+
+### Abbreviating
+Using abbreviations is okay and a good idea under the right
+circumstances.
+
+As a general rule of thumb, the aggressiveness of abbreviating words
+is inversely proportional to the size of the scope. But it's a **bad
+idea** to abbreviate global identifiers that are not commonly used.
+
+### Constants
+For all constants, use **ALL_CAPS**.
+
+### Global identifiers
+Use **FULL NAMES** unless the name is pre-agreed on or set by a
+specification, like `mosi` or `sys_clk`.
+
+## Commenting
+Comments are great, but don't over-comment. They are there
+for exactly two things:
+
+1. Tell people **what** the code does
+2. Give a signal for future development (e.g. implementation notes,
+   usage warnings, required guarantees)
+
+If you need to explain how your code does something using comments,
+it's a better idea to re-write the code.
+
+### Signals
+Comment signals should always be contained in the same line so that
+you can `grep` for them. The only exception to this is within
+documentation, where you usually search for them.
+
+1. `TODO`: something to be done in the future
+2. `NOTE`: keep note of something when using/running the code
+3. `IMPORTANT`: knowing this is crucial to using/running the code
+4. `REQUIRES`: guarantees for the code to run properly
+5. `GUARANTEES`: guarantees that the code has this feature when run
+
+## Output Messages
+Like comment signals, messages should also be in complete lines and
+`grep`-friendly.
+
+All messages should use capitalized signals denoting the type of
+message (e.g. `ERROR`, `WARNING`, `INFO`), enclosed in square
+brackets ('[' and ']') so they can be easily processed by `sed` or
+`awk`.
+
+If there is the need for a timestamp, put the timestamp after the
+signal but within the closing bracket, leave no spaces between the
+signal and the timestamp, and separate the two parts with a colon ':'.
+
+## Tricks and workarounds
+Don't try to write "smart" code; instead, write code that everyone can
+understand without too much explanation.
+
+## Styles specific to Verilog/SystemVerilog
+
+### Always use `logic`
+Unless absolutely necessary, use `logic` or types built on top of
+`logic`. This is to incorporate the idea of ownership into the code.
+
+Every bit of data should have only one unique driver.
+
+### Avoid inferring latches
+Every bit of data should be explicitly passed to other blocks of code.
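
The latch rule that closes the style guide can be illustrated with a small sketch. This is not taken from the ROSE sources: the module and signal names (`latch_free_select`, `sel`, `a`, `b`, `y`) are hypothetical, and the pattern shown (a default assignment at the top of an `always_comb` block) is one common way to keep a tool from inferring a latch, assumed here to match the guide's intent.

```systemverilog
// Hypothetical example; names are illustrative and not part of ROSE.
module latch_free_select(
    input logic [1:0] sel,
    input logic [7:0] a,
    input logic [7:0] b,
    output logic [7:0] y);
    timeunit 1ns;
    timeprecision 1ps;

    always_comb begin
        // Default assignment: y gets a value on every path through the
        // block, so no storage element (latch) has to be inferred.
        y = '0;
        case (sel)
            2'b01: y = a;
            2'b10: y = b;
        endcase
    end
endmodule // latch_free_select
```

Because `y` is assigned only inside this one `always_comb` block, the sketch also respects the single-driver rule from the `logic` section.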