began work on the central routing logic, updated some documentation
@@ -43,7 +43,8 @@ See `protocol.md` for details.
 ROSE was designed to embrace newer possibilities as development continues.
 
 ## The planning
 
-See the plan in `plan.md`.
+See the plan in `plan.md`. This file also contains short summaries of
+what I did at each step.
 
 Most of ROSE's behaviors and features have been planned *before* the
 first source file was even created. A good plan serves both as a good
152 devlog/2025-05-13-Fabric-logic.md (Normal file)
@@ -0,0 +1,152 @@
# Fabric Logic

Date: 2025-05-13

## Goals and expectations

Not much in the way of expectations for this devlog. I fell ill,
probably the spring flu. But hopefully, I can write down what I've
been planning.

I've decided to skip implementing sending back the reversed string
that the fabric received - there's not much use for it in
asynchronous communication.

## Reworked design

This is what I've been doing for the past few days - reworking some
of the design, or rather, solidifying what I have in my plan into
concrete designs.

### Rethinking the header

Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory, to
prevent cross-block BRAM access and to simplify the logic, I figured
I only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even allow ultra-small packet sizes like 16 bytes
when the network needs them.

That decision was made with the existence of management ports in
mind, i.e. I expect that devices involved in controlling the workload
on a ROSE network will also have a management interface (e.g.
Ethernet, Wi-Fi). So there's no practical need for control packets a
few dozen bytes large to exist on a ROSE network; those can be left
to more mature network stacks, even if they have higher latency.

This directly frees up 2 bits on the command byte of the header. I'm
confident that ROSE doesn't need 256 different commands for the
network; 64 is probably more than enough.

When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view:
for a given workload, the packet sizes will be almost consistent
across the entire job.
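
The negotiation step above can be sketched at the protocol level. A
minimal sketch, assuming 2-bit size codes and a largest-common-size
policy (both my assumptions; the actual rule is still undecided):

```python
# Hypothetical boot-time packet-size negotiation.
# Size codes: 0 -> 128B, 1 -> 256B, 2 -> 512B, 3 -> 1024B.
SIZE_CODES = {0: 128, 1: 256, 2: 512, 3: 1024}

def negotiate_packet_size(supported_codes_per_device):
    """Pick the largest size code that every device supports.

    `supported_codes_per_device`: one set of size codes per device.
    """
    common = set(SIZE_CODES)  # start from all four codes
    for codes in supported_codes_per_device:
        common &= codes
    if not common:
        raise ValueError("devices share no common packet size")
    return max(common)  # prefer the largest shared size

# Three devices: the only size all of them support is 256B (code 1).
code = negotiate_packet_size([{0, 1, 2}, {1, 2, 3}, {0, 1}])
assert SIZE_CODES[code] == 256
```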

To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want each device to have
16KB of TX queue. That means at least 16 slots (for 1024B packets)
and up to 128 slots (for 128B packets) per queue. In a well-designed
system, a device should only receive one size category of packets
(e.g. in a trading system, specific devices handle order book
digestion and expect larger packets, while the devices that take
orders may only expect smaller packets). Of course, this idea can be
rethought in the future when THORN actually generates different loads
targeting different systems.
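
The slot counts follow directly from the 16KB-per-device budget; a
quick check with the numbers from the text:

```python
# TX queue slot counts for a 16KB-per-device budget.
TX_QUEUE_BYTES = 16 * 1024

slots = {size: TX_QUEUE_BYTES // size for size in (128, 256, 512, 1024)}

# 1024B packets give the minimum of 16 slots per queue,
# 128B packets the maximum of 128.
assert slots[1024] == 16
assert slots[128] == 128
```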

The above design would dramatically shrink the ROSE header: 1 byte
for command + size, 1 byte for the destination address, 1 byte for
the source address, 4 bytes for a possible sequence number, and one
byte at the end for a CRC-8 checksum.

After some more thought, the sequence number can be moved into
"feature extensions" by utilizing the 64 commands I have. Even the
CRC byte can be encoded the same way. That brings the total overhead
of a standard ROSE packet to 4 bytes.
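
For illustration, the 4-byte layout could be packed like this (the
exact bit placement of the size code and command is my assumption -
only the 2-bit/6-bit split is settled):

```python
def pack_header(command, size_code, dst, src, crc=0):
    """Pack a 4-byte ROSE header: size (2 bits) + command (6 bits),
    destination, source, CRC-8. Bit placement is illustrative."""
    assert 0 <= command < 64 and 0 <= size_code < 4
    return bytes([(size_code << 6) | command, dst, src, crc])

def unpack_header(header):
    byte0, dst, src, crc = header
    return {"size_code": byte0 >> 6, "command": byte0 & 0x3F,
            "dst": dst, "src": src, "crc": crc}

h = pack_header(command=5, size_code=2, dst=1, src=3)
assert unpack_header(h) == {"size_code": 2, "command": 5,
                            "dst": 1, "src": 3, "crc": 0}
```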

### Designing the router logic

The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to
the same one. So there has to be some internal packet queue.

I want the internal routing logic to run at 135MHz, with the SPI
interfaces capped at 50MHz, and have the logic take in a byte at a
time instead of a bit. This means the internal logic runs much faster
than the interfaces, which enables the fabric to handle simultaneous
inputs from different devices.

The first piece of the design came right off the bat - multi-headed
queues. I plan to use 3/4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.

The first idea was to have a shared pool of memory for the queue,
which would handle congestion beautifully, since it means all TX
queues are dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly mean fairness.

Then I thought of the second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
performs wonderfully on fairness and latency when combined with
multi-headed queues and routing logic that is faster than the
interfaces.

To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, big enough to hold one
packet while it gets collected by the routing logic.

Matching the RX buffer, the interface can directly tell the routing
logic where the collected byte should be sent. This means a dramatic
decrease in the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of an increase in the
interfaces' logic.
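
A behavioral sketch of that division of labor, with Python standing
in for the eventual RTL (class and method names are mine):

```python
from collections import deque

class FabricHub:
    """The interface parses the header and hands the hub a
    (destination, byte) pair, so the hub only forwards."""
    def __init__(self, n_ports=4):
        # One TX queue per interface; no shared buffering in the hub.
        self.tx_queues = [deque() for _ in range(n_ports)]

    def collect(self, src_port, dest_port, byte):
        # The hub never buffers or parses headers itself.
        self.tx_queues[dest_port].append((src_port, byte))

hub = FabricHub()
hub.collect(src_port=0, dest_port=2, byte=0xAB)
hub.collect(src_port=1, dest_port=2, byte=0xCD)
assert list(hub.tx_queues[2]) == [(0, 0xAB), (1, 0xCD)]
```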

### Leftover question

What kind of congestion control can be implemented on top of this
design, one that mainly hardens incast and burst resiliency?

I already have in mind using credit-based flow control or a
DCTCP-style ECN. But that is still left for future me to decide.

For now, focus on making the logic happen and let the packets drop.
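
If credit-based flow control is the route eventually taken, the core
loop is small. A minimal sketch (entirely hypothetical at this
stage):

```python
class CreditSender:
    """Credit-based flow control: a sender may only transmit while it
    holds credits; the receiver returns one credit per freed slot."""
    def __init__(self, initial_credits):
        self.credits = initial_credits

    def try_send(self, packet, wire):
        if self.credits == 0:
            return False  # no credit: hold the packet, don't drop it
        self.credits -= 1
        wire.append(packet)
        return True

    def on_credit_return(self, n=1):
        self.credits += n

wire = []
sender = CreditSender(initial_credits=2)
assert sender.try_send("p1", wire) and sender.try_send("p2", wire)
assert not sender.try_send("p3", wire)  # out of credits: backpressure
sender.on_credit_return()               # receiver freed a slot
assert sender.try_send("p3", wire)
assert wire == ["p1", "p2", "p3"]
```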

## Reflections

> Even ideas should be reflected and refined.

1. SPI is very limiting. 50MHz (or 50Mbps of wire speed) is slow
   compared to gigabit Ethernet. Hopefully the reduction of latency
   in the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to
   industrial-grade FPGAs, starting with replacing SPI with SERDES
   and widening the internal bus (the bus through which packets are
   collected from the interfaces by the routing logic) from 8 bits to
   256 bits. This would allow stable operation with 30Gbps SERDES
   connections and 400MHz FPGA clocks, which would make it comparable
   to modern-day Ethernet connections.
3. There were **a lot** of trade-offs when considering asymmetric
   interface clocks. I'd like direct streaming from one interface to
   another to be possible, but clearly that won't be the case. This
   means one copy of packet data within the fabric itself,
   effectively doubling the latency through the fabric. But this
   "trade-off" must be made, unless there's a magical way of syncing
   the interfaces (and that would mean a direct connection, not a
   network).
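
The scaling estimate in point 2 checks out with quick arithmetic (raw
bus bandwidth only, ignoring protocol overhead):

```python
# Internal bus bandwidth = bus width * clock frequency.
def bus_gbps(width_bits, clock_mhz):
    return width_bits * clock_mhz * 1e6 / 1e9

current = bus_gbps(8, 135)    # today: 8-bit bus at 135MHz
scaled  = bus_gbps(256, 400)  # industrial: 256-bit bus at 400MHz

assert round(current, 2) == 1.08   # ~1Gbps internally
assert round(scaled, 1) == 102.4   # enough to keep 30Gbps SERDES fed
```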

## Final thoughts

I've been thinking a lot about the motivation behind ROSE: why build
it on such limiting hardware? And I came to this conclusion:

> Design systems in thought, in code, in the garage.

The hardware for ROSE will stay on my desk, but the things I learned
by building it will stay in my mind and can potentially be put to use
in industry.

Ideas should scale up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if an idea works in industry, it should also be applicable (if not
necessarily practical) on consumer-grade hardware.

I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.

## Next goals

Implement the logic in sims. I've already started, but it's time to
actually get some results.
50 fabric/src/mem_hub.sv (Normal file)
@@ -0,0 +1,50 @@
module mem_hub (input logic rst,
                input logic sys_clk,
                input logic [3:0] connected_devices, // manually configured
                input logic [3:0][7:0] rx_cmd, // for routing-related commands
                input logic [3:0] rx_cmd_valid,
                input logic [3:0][7:0] rx_byte,
                input logic [3:0] rx_valid,
                input logic [3:0][1:0] rx2tx_dest, // rx byte's destination
                input logic [3:0] tx_read, // if tx_byte was read
                output logic [3:0] rx_read, // if rx_byte was read
                output logic [3:0][1:0] tx_src, // tell the tx where the stream is coming from
                output logic [3:0][7:0] tx_byte,
                output logic [3:0] tx_valid,
                output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes
   timeunit 1ns;
   timeprecision 1ps;

   // TBD: pre-agree on packet size

   // [index][rx_src]
   logic [3:0][1:0] service_queue;
   logic [3:0]      in_queue;

   // [rx_src][tx_dest], might not be useful
   logic [1:0][1:0] rx2tx_map;

   always_ff @ (posedge sys_clk) begin
      if (rst) begin
         rx_read <= '0;
         tx_src <= '0;
         tx_valid <= '0;
         packet_size <= '0;
         service_queue <= '0;
         in_queue <= '0;
         rx2tx_map <= '0;
      end else begin // keep routing activity out of the reset cycle
         if (in_queue == 4'd0) begin // no one is in the queue yet
            if (tx_valid != 4'd0) begin
               for (int i = 0; i < 4; i++) begin
                  // TODO: write the logic for enqueuing
               end
            end
         end else begin
         end
      end
   end
endmodule // mem_hub
94 plan.md (Normal file)
@@ -0,0 +1,94 @@
# The Plan/Roadmap for ROSE

> Plans turn fear into focus, risk into reach, and steps into a path.

This plan has been modified over the course of ROSE's development.
That was also part of the plan itself: you plan at every step. See
the end for the changes made to the plan.

## The roadmap

This is a rough summary of what I did and what I plan to do.

### [DONE] Learning RTL and HDL and getting familiar with SPI

Implement a functional SPI slave on the FPGA. Add small logic to
manipulate the data. Learn about cross-clock-domain design and
implementation.

### [TODO] Implement the routing logic along with the interfaces

This is the core part of implementing ROSE on the fabric side: a
bare-minimum implementation disregarding any congestion control or
inter-fabric routing.

### [TODO] Test on an RPi-FPGA setup

Getting the code to run in sims is one thing; getting it to run on
actual hardware is another. This entire step is to ship the code onto
my setup and deal with any synthesis and place-and-route problems
that sims won't reveal.

### [TODO] Implement logging to an external device via UART

This would lay the foundations for THORN and PETAL, and would also
come in handy when analyzing congestion and other anomalies.

### [TODO] Test on an RPi-FPGA-RPi setup

This is where THORN would branch off from ROSE. ROSE should keep some
minimal unit-test testbenches, with fully functional test suites and
toolchains implemented in THORN.

### [TODO] RPi's ROSE buffer implementation

`mmap` memory for the SPI drivers on the RPis to *simulate* zero-copy
on the RPis.

### [TODO] Modify the SPI kernel drivers for explicit DMA control

Allow ROSE's DMA to be implemented in the drivers.

### [TODO] Abstract ROSE into APIs or kernel devices

Note: this may be implemented as development of THORN gets going, or
be facilitated by it.

### [TODO] Implement congestion control

When the logic for the fabric is mature enough, it should be upgraded
with congestion control.

### [TODO] Implement mesh networks allowing inter-fabric routing

ROSE shouldn't be limited to a single fabric.

## Changes to the plan

The plan is always changing, but it's important to remember what I
learned from every change.

### Ditching dynamic routing

In a datacenter or HFT setup, the connected devices are rarely
expected to change. Hardcoded routing paths are perfectly acceptable
and fit the deterministic nature of ROSE.

#### The lesson learned

Figure out the exact target of ROSE - it's not meant for generic
networks, so shave off any redundancy it doesn't need.

### Not reversing the input when testing out the SPI interface

A few things to note:

1. Sending the bytes back incremented by 1 is sufficient to prove a
   stable connection.
2. Reversing the input would require a double-ended queue, increasing
   the complexity of the logic with little benefit to later steps.

So, I've decided to ditch this idea.

#### The lesson learned

Plan with the next step in mind; take action with the next step in
mind. Know what is enough and what is too far.

### Postponing deployment onto hardware until later

Originally, I planned to deploy the logic and test with real hardware
as soon as I had a working SPI module. But that's not really
workable: I'd be fixing the synthesis with every step thereafter.
Better to finalize the design in sims first, and then solve the
FPGA-specific problems as one entire step.

I'd rather scratch my head over a bunch of problems at once than
scratch my head every time I push an update to the logic.

#### The lesson learned

Weigh testing against the cost in time and efficiency. If testing
hinders development, it should be separated from the development
cycle.
31 protocol.md (Normal file)
@@ -0,0 +1,31 @@
# The Specifications for ROSE

Extensions to the protocol may change the specifications; see the
devlogs for the specific decisions behind changes.

## Packet specifications

### Header

#### (1 byte) Command + packet size

- Packet sizes are chosen from 4 predetermined sizes, so only 2 bits
  of this byte are needed to represent them.
- Commands are 6 bits with 64 possibilities; see the **Commands**
  section for details.

#### (1 byte) Destination address

- This can refer to any end device or fabric within the network.

#### (1 byte) Source address

- This can refer to any end device or fabric within the network.

### Payload

Via commands, leading or trailing bytes of the payload can be
repurposed for timestamps or other feature extensions.

### (1 byte) CRC-8

To verify the integrity of the packet.
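
The CRC-8 polynomial isn't pinned down yet. As an illustration, a
bitwise CRC-8 using the common 0x07 polynomial (the polynomial, zero
init, and append-and-check convention are all assumptions, not part
of this spec):

```python
def crc8(data, poly=0x07, init=0x00):
    """Bitwise CRC-8 (polynomial 0x07 assumed; the spec is TBD)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 \
                  else (crc << 1) & 0xFF
    return crc

# A packet that carries crc8(rest) as its last byte verifies to zero
# when the CRC is recomputed over the whole packet.
packet = bytes([0x85, 0x01, 0x03])      # cmd+size, dst, src
full = packet + bytes([crc8(packet)])
assert crc8(full) == 0
```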

## Commands

TBD.

### Feature Extensions

#### [CMD: TBD] Include timestamp

#### [CMD: TBD] Include sequence number