began work on the central routing logic, updated some documentation

2025-05-14 22:27:40 -04:00
parent 24bf28db9d
commit b26a716ccf
5 changed files with 329 additions and 1 deletions


@@ -43,7 +43,8 @@ See `protocol.md` for details.
ROSE was designed to embrace newer possibilities as development continues.
## The planning
See the plan in `plan.md`.
See the plan in `plan.md`. This file also contains short summaries of
what I did at each step.
Most of ROSE's behaviors and features have been planned *before* the
first source file was even created. A good plan serves both as a good


@@ -0,0 +1,152 @@
# Fabric Logic
Date: 2025-05-13
## Goals and expectations
Not much in the way of expectations for this devlog. I fell ill,
probably with the spring flu. But hopefully, I can write down what
I've been planning.
I've decided to skip implementing sending back the reversed string
that the fabric received - there's not much use for it in
asynchronous communication.
## Reworked design
This is what I've been doing for the past few days - reworking some
design, or rather, solidifying what I have in my plan into designs.
### Rethinking the header
Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory, to
prevent cross-block BRAM access and simplify the logic, I figured I
only need a few fixed packet sizes: 128B, 256B, 512B, and
1024B (or whatever lengths align with the BRAM partitioning of the
FPGA). Note that this could even allow ultra-small packet sizes
like 16 bytes when the network needs them.
That decision was made with the existence of management ports in mind,
i.e. I expect that devices involved in controlling the workload on
a ROSE network will also have a management interface (e.g. Ethernet,
Wi-Fi, etc.). So there's no practical need for control packets a few
dozen bytes large to exist in a ROSE network; that traffic can be
left to more mature network stacks, even if they have higher latency.
The size can then directly eat up 2 bits of the header's command byte;
I'm confident that ROSE doesn't need 256 different commands for the
network - 64 is probably more than enough.
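As a sketch, the byte could be packed like this (the field order is my
assumption; the actual encoding is still TBD in `protocol.md`):

```systemverilog
// One possible packing of the command + size byte (field order assumed).
typedef struct packed {
  logic [5:0] cmd;   // up to 64 network commands
  logic [1:0] size;  // selects one of the 4 fixed packet sizes
} rose_cmd_byte_t;   // exactly 8 bits
```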
When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view:
for a given workload, the packet sizes will be almost consistent
across the entire job.
To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want the devices to each
have a 16KB TX queue. That means at least 16 slots, and up to 128
slots per queue. In a well-designed system, a device should only
receive one size category of packets (e.g. in a trading system,
specific devices handle order book digestion and expect larger
packets, while the devices that take in orders may only expect
smaller ones). Of course, this idea can be revisited in the future
when THORN actually generates different loads targeting different
systems.
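For concreteness, that slot range is just the queue size divided by
the fixed packet sizes: 16KB / 1024B = 16 slots at the largest size,
and 16KB / 128B = 128 slots at the smallest.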
This design would actually shrink the ROSE header dramatically: 1
byte for command + size, 1 byte for the destination address, 1 byte
for the source address, 4 bytes for a possible sequence number, and
one byte at the end for a CRC-8 checksum.
After some more thought, the sequence number can be moved into "feature
extensions" by utilizing the 64 commands I have. Even the CRC byte
could be encoded the same way. That brings the total overhead of a
standard ROSE packet down to 4 bytes.
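A hypothetical SystemVerilog view of that overhead (field order
assumed; since the CRC-8 trails the payload, only the first 3 bytes
form a contiguous header):

```systemverilog
// The standard per-packet overhead, minus the trailing CRC-8 byte.
typedef struct packed {
  logic [5:0] cmd;   // command, 64 possibilities
  logic [1:0] size;  // fixed packet size selector
  logic [7:0] dst;   // destination address
  logic [7:0] src;   // source address
} rose_header_t;     // 3 bytes; the CRC-8 byte follows the payload
```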
### Designing the router logic
The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to the
same one. So there has to be some internal packet queue.
I want the internal routing logic to run at 135MHz, with the SPI
interfaces capped at 50MHz, and to have the logic take in a byte at a
time instead of a bit. This means the internal logic runs much
faster than the interfaces, which would enable the fabric to handle
simultaneous inputs from different devices.
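To put rough numbers on that (my arithmetic from the clocks above):
four 50MHz SPI interfaces deliver at most 4 x 50Mbps = 200Mbps, or
25MB/s combined, while one byte per cycle at 135MHz gives the routing
logic 135MB/s - over 5x of headroom for servicing all four inputs at
once.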
The first design decision came right off the bat - multi-headed
queues. I plan to use 3 or 4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.
The first idea was to have a shared pool of memory for the queues,
which would handle congestion beautifully, since all TX queues would
be dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly guarantee fairness.
Then I thought of a second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency,
it performs wonderfully on fairness and latency when combined with
multi-headed queues and routing logic that is faster than the
interfaces.
To compensate for the loss of the central shared pool of memory,
each interface should also get its own RX buffer, big enough to hold
one packet while it gets collected by the routing logic.
To go with the RX buffer, the interface can directly tell the routing
logic where each collected byte should be sent. This means a
dramatic decrease in the complexity of the routing logic (it doesn't
have to buffer or parse headers), at the cost of more logic in the
interfaces.
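The `mem_hub` skeleton below reflects this: each RX lane arrives with
its own destination hint. Grouping one lane's flat signals into a
struct (my grouping, purely illustrative):

```systemverilog
// One RX lane's hand-off to the routing logic (hypothetical grouping
// of mem_hub's flat rx_byte / rx_valid / rx2tx_dest ports).
typedef struct packed {
  logic [7:0] data;  // the byte collected by the interface
  logic       valid; // a byte is waiting to be read
  logic [1:0] dest;  // TX interface the byte should be routed to
} rx_lane_t;
```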
### Leftover question
What kind of congestion control can be built on top of this design
to specifically harden incast and burst resiliency?
I already have credit-based flow control and DCTCP-style ECN in
mind, but that decision is left for the future me.
For now, focus on making the logic happen and let the packets drop.
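For future reference, here is a minimal sketch of the credit side of
that idea (a hypothetical module, explicitly not part of the current
design; the consume/replenish semantics are my assumptions):

```systemverilog
// Per-queue credit counter of the kind credit-based flow control
// would need: one credit per free buffer slot at the receiver.
module credit_ctr #(parameter int MAX_CREDITS = 16)
                   (input  logic clk,
                    input  logic rst,
                    input  logic consume,   // a packet was sent downstream
                    input  logic replenish, // receiver freed a buffer slot
                    output logic can_send); // gate traffic at zero credits
  logic [$clog2(MAX_CREDITS + 1)-1:0] credits;
  always_ff @ (posedge clk) begin
    if (rst)
      credits <= MAX_CREDITS;
    else if (consume && !replenish && credits != '0)
      credits <= credits - 1'b1;
    else if (replenish && !consume && credits != MAX_CREDITS)
      credits <= credits + 1'b1;
  end
  assign can_send = (credits != '0);
endmodule // credit_ctr
```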
## Reflections
> Even ideas should be reflected and refined.
1. SPI is very limiting. 50MHz (or 50Mbps of wire speed) is slow
compared to gigabit Ethernet. Hopefully the reduction of
latency in the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to industrial-grade
FPGAs, starting with replacing SPI with SERDES and increasing the
internal bus width from 8 bits (the bus through which packets are
collected from the interfaces by the routing logic) to 256 bits.
This would allow stable operation with 30Gbps SERDES
connections and 400MHz FPGA clocks (256 bits x 400MHz = 102.4Gbps
of internal bandwidth), making it comparable to modern-day Ethernet.
3. There were **a lot** of trade-offs when considering asymmetric
interface clocks. I'd like direct streaming from one interface to
another to be possible, but that clearly won't be the case. This
means one copy of packet data within the fabric itself, effectively
doubling the fabric's latency. But this "trade-off" must be
made, unless there's a magical way of syncing the interfaces (and
that would mean a direct connection, not a network).
## Final thoughts
I've been thinking a lot about the rationale behind ROSE: why build it
on such limiting hardware? And I arrived at this conclusion:
> Design systems in thought, in code, in the garage.
The hardware for ROSE would stay on my desk, but the things I learned
by doing it would stay in my mind and potentially be put to use in the
industry.
Ideas should scale both up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if an idea works in the industry, it should also be applicable (if
not necessarily practical) on consumer-grade hardware.
I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.
## Next goals
Implement the logic in sims. I've already started, but it's time to
actually get some results.

fabric/src/mem_hub.sv Normal file

@@ -0,0 +1,50 @@
module mem_hub (input  logic            rst,
                input  logic            sys_clk,
                input  logic [3:0]      connected_devices, // manually configured
                input  logic [3:0][7:0] rx_cmd,            // for routing-related commands
                input  logic [3:0]      rx_cmd_valid,
                input  logic [3:0][7:0] rx_byte,
                input  logic [3:0]      rx_valid,
                input  logic [3:0][1:0] rx2tx_dest,        // rx byte's destination
                input  logic [3:0]      tx_read,           // if tx_byte was read
                output logic [3:0]      rx_read,           // if rx_byte was read
                output logic [3:0][1:0] tx_src,            // tell the tx where the stream is coming from
                output logic [3:0][7:0] tx_byte,
                output logic [3:0]      tx_valid,
                output logic [1:0]      packet_size);      // 4 states for 4 fixed packet sizes
  timeunit 1ns;
  timeprecision 1ps;

  // TBD: pre-agree on packet size

  // [index][rx_src]
  logic [3:0][1:0] service_queue;
  logic [3:0]      in_queue;
  // rx_src -> tx_dest map, might not be useful
  logic [3:0][1:0] rx2tx_map;

  always_ff @ (posedge sys_clk) begin
    if (rst) begin
      rx_read       <= '0;
      tx_src        <= '0;
      tx_valid      <= '0;
      packet_size   <= '0;
      service_queue <= '0;
      in_queue      <= '0;
      rx2tx_map     <= '0;
    end else if (in_queue == 4'd0) begin // no one is in the queue yet
      if (rx_valid != 4'd0) begin // a byte is waiting on at least one RX lane
        for (int i = 0; i < 4; i++) begin
          // TODO: write the logic for enqueuing
        end
      end
    end else begin
      // TODO: service the lanes already enqueued
    end
  end
endmodule // mem_hub

plan.md Normal file

@@ -0,0 +1,94 @@
# The Plan/Roadmap for ROSE
> Plans turn fear into focus, risk into reach, and steps into a path.
This plan has been modified in the course of the development of ROSE.
And that was also in the plan itself: you plan at every step. See the
end for the changes made to the plan.
## The roadmap
This is a rough summary of what I did and what I plan to do.
### [DONE] Learning RTL and HDL and getting familiar with SPI
Implement a functional SPI slave on the FPGA. Add small logic to
manipulate the data. Learn about clock-domain-crossing design and
implementation.
### [TODO] Implement the routing logic along with the interfaces
This would be the core part of implementing ROSE on the fabric
side. This is a bare minimum implementation disregarding any
congestion control or inter-fabric routing.
### [TODO] Test on an RPi-FPGA setup
Getting the code to run in sims is one thing; getting it to run on
actual hardware is another. This entire step will be to ship the code
onto my setup and deal with any kind of synthesis and place-and-route
problems that sims won't reveal.
### [TODO] Implement logging to an external device via UART
This would lay the foundations for THORN and PETAL and also would come
in handy when analyzing congestion and other anomalies.
### [TODO] Test on an RPi-FPGA-RPi setup
This is where THORN would branch off from ROSE. ROSE should keep some
minimal unit test testbenches, and have fully functional test suites
and toolchains be implemented in THORN.
### [TODO] RPi's ROSE buffer implementation
`mmap` memory for the SPI drivers on the RPis to *simulate* zero-copy
on the RPis.
### [TODO] Modify the SPI kernel drivers for explicit DMA control
Allow ROSE's DMA to be implemented in the drivers.
### [TODO] Abstract ROSE into APIs or kernel devices
Note: This may be implemented as development of THORN goes into
action, or be facilitated by it.
### [TODO] Implement congestion control
When the logic for the fabric is mature enough, it should be upgraded.
### [TODO] Implement mesh networks allowing inter-fabric routing
ROSE shouldn't be limited to only 1 fabric.
## Changes to the plan
The plan is always changing, but it's important to remember what I
learned from every change.
### Ditching dynamic routing
In a datacenter or HFT setup, it's rarely expected that the connected
devices will change. Hardcoded routing paths are perfectly acceptable
and fit the deterministic nature of ROSE.
#### The lesson learned
Figure out the exact target of ROSE - it's not meant for generic
networks, so shave off any redundancy that it doesn't need.
### Not reversing the input when testing out the SPI interface
A few things to note:
1. Sending the bytes back incremented by 1 is sufficient to prove a
stable connection.
2. Reversing the input would require a double-ended queue, increasing
the complexity of the logic with little benefit to later steps.
So, I've decided to ditch this idea.
#### The lesson learned
Plan with the next step in mind, take actions with the next step in
mind. Know what is enough and what is too far.
### Deferring deployment onto the hardware until later
Originally, I planned to deploy the logic and test on real hardware
as soon as I had a working SPI module. But that's not really
practical: I'd be fixing synthesis with every step thereafter.
Better to finalize the design in sims first, and then solve the
FPGA-specific problems as a step of their own.
I'd rather scratch my head over a bunch of problems at once than
scratch my head every time I push an update to the logic.
#### The lesson learned
Weigh testing against the cost of time and efficiency. If testing
hinders development, then it should be separated from the development
cycle.

protocol.md Normal file

@@ -0,0 +1,31 @@
# The Specifications for ROSE
Extensions to the protocol may change these specifications; see the
devlogs for the specific decisions behind each change.
## Packet specifications
### Header
#### (1 byte) Command + packet size
- Packet sizes are chosen from 4 predetermined sizes, so only 2 bits
of this byte are needed to represent them.
- Commands are 6 bits with 64 possibilities; see the **Commands**
section for details.
#### (1 byte) Destination address
- This can refer to any end-device or fabric within the network.
#### (1 byte) Source address
- This can refer to any end-device or fabric within the network.
### Payload
Via commands, leading or trailing bytes in the payload can also be
repurposed as timestamps or other feature extensions.
### (1 byte) CRC-8
To verify the integrity of the packet.
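As a worked example using the fields above: in a 128-byte packet,
byte 0 carries command + size, byte 1 the destination, byte 2 the
source, bytes 3-126 the 124-byte payload, and byte 127 the CRC-8.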
## Commands
TBD.
### Feature Extensions
#### [CMD: TBD] Include timestamp
#### [CMD: TBD] Include sequence number