work in progress, major overhaul for design, see devlogs for details. Also added the first version of the style guide

2025-05-29 00:18:06 -04:00
parent dac3140829
commit f61de84b4a
6 changed files with 384 additions and 90 deletions

@ -1,14 +1,14 @@
# Rethinking the Routing Memory Pool # Rethinking the Routing Memory Pool
Date: 2025-05-21 Date: 2025-05-21
## Goals and Expectations ## Goals and expectations
To finish the RX and TX queues. To finish the RX and TX queues.
## Results ## Results
Nope. I'm half way through the TX queue and I'm gonna rework the Nope. I'm half way through the TX queue and I'm gonna rework the
entire thing. entire thing.
## Thought Train ## Thought train
Separating the TX queue to be per-interface is amazing. But making it Separating the TX queue to be per-interface is amazing. But making it
a multi-headed queue is a disaster. In this case, it doesn't simplify a multi-headed queue is a disaster. In this case, it doesn't simplify
the logic, while taking away one of the benefits of a shared memory the logic, while taking away one of the benefits of a shared memory

@ -0,0 +1,81 @@
# Redesigned Internal Memory Pool
Date: 2025-05-28
## Goals and expectations
Lay the foundations in the hub for the new memory pool design.
## Thought train
We're putting aside the support for AI clusters and focusing on the
HFT side of things for the time being. They have very different
network workloads, and one system that combines them both is not
really a good option.
This also means we can drop the plans for congestion control, which
in HFT infrastructure is mostly done at the application layer, not
the network layer. This will speed up development and let me focus on
getting the most essential parts to function as intended.
A single block of BRAM typically only allows two simultaneous
operations to non-conflicting addresses. This means that servicing
multiple interfaces at the same time is impractical.
So, I decided to implement a round-robin approach for both reads and
writes (feels like being back to the same spot a few weeks ago, but
there are some differences).
The approach is quite straightforward: every cycle, the hub selects
one of the interfaces to service and, while servicing it, checks
for both RX and TX side transmissions.
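
The polling order described above can be sketched as a small software
model. This is only a behavioral illustration in Python, not the RTL:
`run_hub`, `rx_pending`, and `tx_pending` are hypothetical names, and
the model assumes each service turn moves at most one byte per
direction.

```python
# Behavioral sketch only: models the round-robin order, not the hardware.
INTERFACE_CNT = 4  # same value as INTERFACE_CNT in params.sv


def run_hub(cycles, rx_pending, tx_pending, interface_cnt=INTERFACE_CNT):
    """Service one interface per cycle and check both of its sides.

    rx_pending / tx_pending are hypothetical per-interface byte counts.
    Returns a log of (cycle, interface, did_rx, did_tx) tuples.
    """
    rx = list(rx_pending)
    tx = list(tx_pending)
    curr_service = 0
    log = []
    for cycle in range(cycles):
        did_rx = rx[curr_service] > 0
        did_tx = tx[curr_service] > 0
        if did_rx:
            rx[curr_service] -= 1
        if did_tx:
            tx[curr_service] -= 1
        log.append((cycle, curr_service, did_rx, did_tx))
        # Wrap around the way a fixed-width counter would.
        curr_service = (curr_service + 1) % interface_cnt
    return log
```

With four interfaces, each one gets a turn every four cycles no matter
how busy the others are, which is the property the round-robin choice
is after.
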
Note that the interfaces are responsible for tracking when a packet
has been fully received, while the hub is responsible for collecting
the freed slots.
The interfaces will also have their own packet address queues to keep
track of their outgoing packets. Furthermore, this queue can be
limited to a fraction of the packet queue's size to allow control over
the maximum number of packets per packet buffer.
This centralized, dynamic memory allocation strategy should handle
bursts well and ensure that lightweight flows are still serviced
during a burst event, which is good for HFT-like workloads.
## Results
A very good evening of coding. I finished the following:
1. Reworked most of the hub's logic, implemented the RX side of things
and left some TODO notes.
2. Implemented the `free_queue` for allocating free queue slots for
incoming packets and enqueuing slots freed by the TX side logic.
3. Implemented the `memory_pool` for the packet memory.
4. Wrote the first draft of the FLORA/ROSE coding style guide.
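
As a rough picture of what the `free_queue` is meant to do, here is a
minimal software model in Python. It is a sketch under assumptions, not
the RTL in this commit: the class and method names are hypothetical, and
it assumes the queue starts out holding every slot index once (the
pre-load that is still a TODO in the hardware).

```python
class FreeQueue:
    """Software model of a free-slot FIFO (illustration, not the RTL)."""

    def __init__(self, queue_size=1024):
        # Assumption: the queue starts full, with every slot free.
        self.queue_size = queue_size
        self.slots = list(range(queue_size))
        self.head = 0  # next free slot to hand out
        self.tail = 0  # next position to store a freed slot
        self.count = queue_size

    @property
    def empty(self):
        return self.count == 0

    def request_new_slot(self):
        """Hand out a slot for an incoming packet; None means no slot."""
        if self.empty:
            return None
        slot = self.slots[self.head]
        self.head = (self.head + 1) % self.queue_size
        self.count -= 1
        return slot

    def enqueue_free_slot(self, slot):
        """Give a slot back once the TX side has sent its packet."""
        if self.count == self.queue_size:
            raise ValueError("free queue is already full")
        self.slots[self.tail] = slot
        self.tail = (self.tail + 1) % self.queue_size
        self.count += 1
```

In this model a `None` from `request_new_slot` corresponds to the case
where the hub would have to drop the packet.
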
## Reflections
1. Focus. Focus is the key to getting what you really want.
2. Modularize. Modularization will keep the work limited to more
manageable chunks, which is much more important when developing
alone.
3. Write everything down. Keep track of every thought, by handwritten
notes, documentation, or even ChatGPT conversation history. This
will help when there are a few dozen things to keep in mind every
day.
4. Start doing things. Start writing down thoughts, start discussions
about future plans, start coding. Start building momentum, and start
keeping it alive.
## Final thoughts
FPGAs are great tools, and I've only begun to scratch the surface of
them. Take implementing BRAM-based queues: I have to think about
how to sync all the components so that everything I need is
ready exactly when I want it.
I feel like I'm beginning the transformation from a sequential thinker
who thinks in steps into a clock-aligned combinational thinker: I
think about when each step happens, not in what order, but at what
time.
Also, knowing explicitly that `logic` implies ownership (one driver
per signal) helped me structure my code better.
## Next steps
Complete the hub, then move on to the interfaces.

@ -1,96 +1,167 @@
`include <params.sv>
// IMPORTANT: interfaces are supposed to keep track of their own packet states
module hub( module hub(
input logic rst,
input logic sys_clk, input logic sys_clk,
input logic [31:0] rx_cmd, // for routing-related commands input logic rst,
input logic [3:0] rx_cmd_valid, input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] rx_pkt_addr,
input logic [31:0] rx_byte, input logic [INTERFACE_CNT - 1:0][7:0] rx_byte,
input logic [3:0] rx_valid, input logic [INTERFACE_CNT - 1:0] rx_valid,
input logic [31:0] rx2tx_dest, // rx byte's destination input logic [INTERFACE_CNT - 1:0] tx_ready,
input logic [3:0] tx_ready, // if tx_byte is ready to be read input logic [INTERFACE_CNT - 1:0] tx_full,
output logic [3:0] rx_ready, // if rx_byte is ready to be read input logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] tx_pkt_addr,
output logic [7:0] tx_src, // tell the tx where the stream is coming from input logic [INTERFACE_CNT - 1:0] rx_new_packet,
output logic [31:0] tx_byte, output logic [INTERFACE_CNT - 1:0] rx_ready,
output logic [3:0] tx_valid, output logic [INTERFACE_CNT - 1:0][PACKET_ADDR_LEN - 1:0] tx_queue_addr,
output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes output logic [INTERFACE_CNT - 1:0] tx_queue_addr_valid,
output logic [INTERFACE_CNT - 1:0][7:0] tx_byte,
output logic [INTERFACE_CNT - 1:0] tx_valid);
timeunit 1ns; timeunit 1ns;
timeprecision 1ps; timeprecision 1ps;
// TBD: pre-agree on packet size logic [INTERFACE_CNT - 1:0] curr_service;
logic request_new_slot;
logic [QUEUE_ADDR_LEN - 1:0] new_slot_addr;
logic free_queue_empty;
logic [QUEUE_ADDR_LEN - 1:0] empty_slot_addr;
logic empty_slot_enqueue;
// use the round-robin strat to poll since the routing is much faster free_queue fqueue(.sys_clk(sys_clk),
// NOTE: To expand to more connected_devices, use a hierarchical design .rst(rst),
logic [1:0] curr_service = 0; .request_new_slot(request_new_slot),
logic [1:0] last_dest = 0; .empty_slot_addr(empty_slot_addr),
.empty_slot_enqueue(empty_slot_enqueue),
.new_slot_addr(new_slot_addr),
.queue_empty(free_queue_empty));
// src dest byte logic [INTERFACE_CNT - 1:0][MEMORY_POOL_ADDR_LEN - 1:0] rx_mem_addr;
typedef struct { logic [MEMORY_POOL_ADDR_LEN - 1:0] mem_read_addr;
logic [1:0] dest; logic [7:0] mem_read_byte;
logic [7:0] payload; logic [MEMORY_POOL_ADDR_LEN - 1:0] mem_write_addr;
} svc_buffer; logic mem_write_enable;
svc_buffer service_buffer [3:0]; logic [7:0] mem_write_byte;
svc_buffer curr_buffer;
assign curr_buffer = service_buffer[curr_service];
logic [3:0] in_buffer;
assign rx_ready = ~in_buffer;
always_ff @ (posedge sys_clk) begin memory_pool mpool(.sys_clk(sys_clk),
.rst(rst),
.read_addr(mem_read_addr),
.write_addr(mem_write_addr),
.write_byte(mem_write_byte),
.write_enable(mem_write_enable),
.read_byte(mem_read_byte));
always_ff @ (posedge sys_clk or posedge rst) begin
if (rst) begin if (rst) begin
in_buffer <= '0; tx_queue_addr <= '0;
tx_src <= '0; tx_queue_addr_valid <= '0;
tx_byte <= '0;
tx_valid <= '0; tx_valid <= '0;
packet_size <= '0;
curr_service <= '0; curr_service <= '0;
last_dest <= '0; rx_ready <= '0;
for (int i = 0; i < 4; i++) begin rx_mem_addr <= '0;
service_buffer[i] <= '0; mem_read_addr <= '0;
end mem_write_addr <= '0;
end else begin // if (rst) mem_write_enable <= 0;
// Handle RX side logic mem_write_byte <= '0;
for (int i = 0; i < 4; i++) begin end else begin
if (rx_valid[i]) begin // NOTE: signaled the servicing interface in the last cycle
if (!in_buffer[i]) begin rx_ready[curr_service] <= 0;
service_buffer[i].dest <= get_hop(rx2tx_dest, i[1:0]); rx_ready[curr_service + 1] <= 1;
service_buffer[i].payload <= get_byte(rx_byte, i[1:0]);
in_buffer[i] <= 1;
end
end
end
// Handle TX side logic // IMPORTANT: interfaces should send the byte no matter what, rx_ready is to prevent sending a new byte
if (in_buffer[curr_service] && tx_ready[curr_buffer.dest]) begin if (rx_valid[curr_service]) begin
tx_byte[{curr_buffer.dest, 3'b000} +: 8] // IMPORTANT: memory_write_addr is ready on the next cycle
<= curr_buffer.payload; if (rx_new_packet[curr_service]) begin
tx_src[{curr_buffer.dest, 1'b0} +: 2] if (free_queue_empty) begin
<= curr_service; // TODO: handle the drop logic
in_buffer[curr_service] <= 0; end else begin
tx_valid[curr_buffer.dest] <= 1; request_new_slot <= 1;
rx_mem_addr[{curr_service,
MEMORY_POOL_ADDR_SHIFT'(0)}
+:MEMORY_POOL_ADDR_LEN
] <= {new_slot_addr, PACKET_ADDR_LEN'(0)};
mem_write_addr <= {new_slot_addr, PACKET_ADDR_LEN'(0)};
end
end else begin // if (rx_new_packet[curr_service])
// NOTE: if memory
mem_write_addr <= mem_write_addr + 1;
request_new_slot <= 0;
end // else: !if(rx_new_packet[curr_service])
mem_write_byte <= rx_byte[{curr_service, 3'd0}+:8];
mem_write_enable <= 1;
end else // if (rx_valid[curr_service])
mem_write_enable <= 0;
end
end end
tx_valid[last_dest] <= 0;
last_dest <= service_buffer[curr_service].dest;
curr_service <= curr_service + 1;
end // else: !if(rst)
end // always_ff @ (posedge sys_clk)
endmodule // hub endmodule // hub
function automatic logic [7:0] get_byte(input logic [31:0] byte_arr, // IMPORTANT: the current queue_addr is always valid unless queue_empty
input logic [1:0] idx); // REQUIRES: hub does not request a new slot when the queue is empty
return byte_arr[{idx, 3'b000} +: 8]; module free_queue(input logic sys_clk,
endfunction // get_byte input logic rst,
input logic request_new_slot,
input logic [QUEUE_ADDR_LEN - 1:0] empty_slot_addr,
input logic empty_slot_enqueue,
output logic [QUEUE_ADDR_LEN - 1:0] new_slot_addr,
output logic queue_empty);
timeunit 1ns;
timeprecision 1ps;
// NOTE: addr 0 is always mapped to the fabric itself and caught before this logic [QUEUE_ADDR_LEN - 1:0] fqueue [QUEUE_SIZE - 1:0];
function automatic logic [1:0] get_hop(input logic [31:0] dest_map, logic [QUEUE_ADDR_LEN - 1:0] head;
input logic [1:0] idx); logic [QUEUE_ADDR_LEN - 1:0] tail;
case (dest_map[{idx, 3'b000} +: 8]) shortint queue_size;
8'b00000001:
return 2'b00; assign queue_empty = queue_size == 0;
8'b00000010:
return 2'b01; initial begin
8'b00000011: // TODO: pre-load the free queue with every slot possible
return 2'b10; end
8'b00000100:
return 2'b11; // IMPORTANT: rst must be held high for at least 2 sys_clk cycles
default: always_ff @ (posedge sys_clk or posedge rst) begin
return 0; if (rst) begin
endcase // case (dest_map[{idx, 3'b000} +: 8]) head <= '0;
endfunction // get_hop tail <= QUEUE_ADDR_LEN'(1);
queue_size <= QUEUE_SIZE;
new_slot_addr <= '0;
end else begin
if (request_new_slot) begin
head <= head + 1;
queue_size <= queue_size - 1;
end
new_slot_addr <= fqueue[head];
if (empty_slot_enqueue) begin
fqueue[tail] <= empty_slot_addr;
tail <= tail + 1;
queue_size <= queue_size + 1;
end
end
end
endmodule // free_queue
module memory_pool(input logic sys_clk,
input logic rst,
input logic [MEMORY_POOL_ADDR_LEN - 1:0] read_addr,
input logic [MEMORY_POOL_ADDR_LEN - 1:0] write_addr,
input logic [7:0] write_byte,
input logic write_enable,
output logic [7:0] read_byte);
timeunit 1ns;
timeprecision 1ps;
logic [7:0] mem_pool[MEMORY_POOL_SIZE - 1:0];
always_ff @ (posedge sys_clk or posedge rst) begin
if (rst) begin
read_byte <= 8'hFF;
end else begin
if (write_enable)
mem_pool[write_addr] <= write_byte;
read_byte <= mem_pool[read_addr];
end
end
endmodule // memory_pool

11
fabric/src/params.sv Normal file

@ -0,0 +1,11 @@
parameter int PACKET_SIZE = 64;
parameter int PACKET_ADDR_LEN = 6;
parameter int QUEUE_SIZE = 1024;
parameter int QUEUE_ADDR_LEN = 10;
parameter int MEMORY_POOL_SIZE = QUEUE_SIZE * PACKET_SIZE;
parameter int MEMORY_POOL_ADDR_LEN = QUEUE_ADDR_LEN + PACKET_ADDR_LEN;
parameter int MEMORY_POOL_ADDR_SHIFT = 4;
parameter int INTERFACE_QUEUE_SIZE = 512;
parameter int INTERFACE_QUEUE_ADDR_LEN = 9;
parameter int INTERFACE_CNT = 4;
parameter int CRC_BITS = 8;

12
plan.md

@ -44,9 +44,6 @@ Allow ROSE's DMA to be implemented in the drivers.
Note: This may be implemented as development of THORN goes into Note: This may be implemented as development of THORN goes into
action, or be facilitated by it. action, or be facilitated by it.
### [TODO] Implement congestion control
When the logic for the fabric is mature enough, it should be upgraded.
### [TODO] Implement mesh networks allowing inter-fabric routing ### [TODO] Implement mesh networks allowing inter-fabric routing
ROSE shouldn't be limited to only 1 fabric. ROSE shouldn't be limited to only 1 fabric.
@ -92,3 +89,12 @@ scratch my head every time I push an update to the logic.
Weigh testing against the cost of time and efficiency. If testing Weigh testing against the cost of time and efficiency. If testing
hinders development, then it should be separated from the development hinders development, then it should be separated from the development
cycle. cycle.
### Ditching features
I ditched the plans for supporting AI clusters, along with the plans
for congestion control. The focus is now on reducing latency and on an
implementation that's elegant and simple.
#### The lesson learned
Focus. Know what ROSE really stands for, and stop spending thoughts on
unnecessary things like trying to dual-wield AI and HFT workloads.

125
style.md Normal file

@ -0,0 +1,125 @@
# Style Guide for ROSE (and other FLORA projects)
Coding style matters a lot. Good coding style makes the code easier
on the eye, and can help avoid some pitfalls and confusion.
## Indentation
For all indentation, use **spaces**, not tabs.
The rationale behind this is to avoid different indent width settings
in different editors. Making your source file a little bigger is a
worthwhile trade-off for portability across editors.
### C
Use 8 spaces. This is not only to adhere to the Linux kernel's coding
style, but also to prevent your indentation levels from getting too
big.
### Verilog/SystemVerilog
Use 4 spaces. Unlike C, HDL is more on the combinational logic
side, so we can expect some more `if-else` clauses nested together.
**IMPORTANT: If the indentation is blowing lines off the 80-char
width, you should probably consider refactoring the logic.**
### Python
Use 4 spaces. This is enough for scripts, and it is the choice made by
the people behind Python (PEP 8).
### Shell Scripts
Use 4 spaces. There might be arguments to make it 2, but 4 is the
minimum if you want to spot something appearing in an incorrect level
when you've been staring at the screen for 15 hours.
### Line width
80 characters is preferred, but it can be extended by 20 characters or
so to accommodate longer identifiers.
If it breaches 80 characters, consider breaking it into multiple lines.
However, it is important to note that when passing many
parameters/logic, it should always be broken into logical chunks for
each line.
## Avoid magic numbers
Unless it's the bit-length of a byte or something that's commonly
known and obvious at first glance, use a constant to store it.
## Naming schemes
Names are only meaningful to humans, and the rationale behind the
following guidelines is to allow anyone reading the code to know what
an identifier refers to without scrolling back to its definition or
other references.
### Snake case or camel case?
Snake case.
### Scoping
For all identifiers, it's important to note the scope of their usage.
Names are there to avoid confusion, not add to it, and whether a name
is confusing should be judged within the scope where it is used.
### Abbreviating
Using abbreviations is okay and a good idea under the right
circumstances.
As a general rule of thumb, the aggressiveness of abbreviating words
is inversely proportional to the size of the scope. But it's a **bad
idea** to abbreviate global identifiers that are not commonly used.
### Constants
For all constants, use **ALL_CAPS**.
### Global identifiers
Use **FULL NAMES** unless the name is pre-agreed on or set by a
specification, like `mosi` or `sys_clk`.
## Commenting
Comments are great, but don't over-comment. They are there
for exactly two things:
1. Tell people **what** the code does
2. Give a signal for future development (e.g. implementation notes,
usage warnings, required guarantees)
If you need to explain how your code does something using comments,
it's a better idea to re-write the code.
### Signals
Comment signals should always be contained in a single line so that
you can `grep` for them. The only exception to this is within
documentation, where you usually search for them.
1. `TODO`: something to be done in the future
2. `NOTE`: keep note of something when using/running the code
3. `IMPORTANT`: knowing this is crucial to using/running the code
4. `REQUIRES`: guarantees for the code to run properly
5. `GUARANTEES`: guarantees that the code has this feature when run
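
To illustrate why keeping a signal on one line matters, here is a
minimal line-by-line scanner in Python. It is a sketch, not a project
tool: `find_signals` is a hypothetical helper, and it assumes each
signal is written as the keyword followed by a colon, as in the list
above.

```python
import re

SIGNALS = ("TODO", "NOTE", "IMPORTANT", "REQUIRES", "GUARANTEES")
# Assumption: a signal is the keyword, a colon, then the note text.
SIGNAL_RE = re.compile(r"\b(" + "|".join(SIGNALS) + r"):\s*(.*)")


def find_signals(source):
    """Return (line_number, signal, text) for each signal in source."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = SIGNAL_RE.search(line)
        if match:
            found.append((number, match.group(1), match.group(2).strip()))
    return found
```

Because the scan works one line at a time, just like `grep`, a signal
whose text wraps onto a second line would be cut off in the result.
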
## Output Messages
Like comment signals, messages should also be in complete lines and
`grep`-friendly.
All messages should use capitalized signals denoting the type of
message (e.g. `ERROR`, `WARNING`, `INFO`), enclosed in square
brackets (`[` and `]`) so they can be easily processed by `sed` or
`awk`.
If there is the need for a timestamp, put the timestamp after the
signal but within the closing bracket, leave no spaces between the
signal and the timestamp, and separate the two parts with a colon ':'.
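
A minimal sketch of this message format in Python, for illustration
only. `format_message` and `parse_message` are hypothetical helpers,
and two details are assumptions not fixed by this guide: the timestamp
is written without spaces (ISO 8601 here), and a single space separates
the closing bracket from the message text.

```python
import re

# Assumption: "[SIGNAL] text" or "[SIGNAL:timestamp] text", with a
# timestamp that contains no spaces or closing brackets.
MESSAGE_RE = re.compile(r"^\[([A-Z]+)(?::([^\]\s]+))?\] (.*)$")


def format_message(signal, text, timestamp=None):
    """Build one complete, grep-friendly message line."""
    tag = signal.upper()
    if timestamp is not None:
        tag = f"{tag}:{timestamp}"
    return f"[{tag}] {text}"


def parse_message(line):
    """Return (signal, timestamp or None, text), or None if no match."""
    match = MESSAGE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)
```

Since the signal is all capitals and ends at the first colon, a
timestamp that itself contains colons still parses unambiguously.
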
## Tricks and workarounds
Don't try to write "smart" code. Instead, write code that everyone can
understand without too much explanation.
## Styles specific to Verilog/SystemVerilog
### Always use `logic`
Unless absolutely necessary, use `logic` or types built on top of
`logic`. This is to incorporate the idea of ownership into the code.
Every bit of data should have only one unique driver.
### Avoid inferring latches
Every bit of data should be verbosely passed to other blocks of code.