began work on the central routing logic, updated some documentation
@@ -43,7 +43,8 @@ See `protocol.md` for details.
 
 ROSE was designed to embrace newer possibilities as development continues.
 
 ## The planning
 
-See the plan in `plan.md`.
+See the plan in `plan.md`. This file also contains short summaries of
+what I did at each step.
 
 Most of ROSE's behaviors and features have been planned *before* the
 first source file was even created. A good plan serves both as a good

152 devlog/2025-05-13-Fabric-logic.md Normal file
@@ -0,0 +1,152 @@
# Fabric Logic

Date: 2025-05-13

## Goals and expectations

Not much in the way of expectations for this devlog. I fell ill,
probably the spring flu. But hopefully, I can write down what I've
been planning.

I've decided to skip implementing sending back the reversed string
that the fabric received - there's not much use for it in asynchronous
communication.

## Reworked design

This is what I've been doing for the past few days - reworking some of
the design, or rather, solidifying what I have in my plan into
designs.

### Rethinking the header

Initially, I designed ROSE's header to be 20 bytes, which included a
32-bit integer for the size and some redundancy. However, after
looking at some methods of organizing the FPGA's internal memory to
prevent cross-block BRAM access and simplify the logic, I figured I
only need a few fixed packet sizes: 128B, 256B, 512B, and 1024B (or
whatever lengths align with the BRAM partitioning of the FPGA). Note
that this could even allow ultra-small packet sizes like 16 bytes
when the network needs them.

That decision was made with the existence of management ports in mind,
i.e. I expect that devices involved in controlling the workload on a
ROSE network will also have a management interface (e.g. Ethernet,
Wi-Fi, etc.). So there's no practical need for control packets a few
dozen bytes large to exist in a ROSE network; those can be left to
more mature network stacks, even if they have higher latency.

And the size field can directly eat up 2 bits of the command byte in
the header; I'm confident that ROSE doesn't need 256 different
commands for the network - 64 is probably more than enough.

When booting up the network, the devices will send out a packet to
negotiate how big the packets are - all of the devices must share the
same packet size. This choice arose from a workload point of view:
for a given workload, the packet sizes will be almost consistent
across the entire job.

To take this further, I also have in mind letting devices declare the
sizes of their packet buffer slots. With about 103KB of BRAM on the
Tang Primer 20K and 4 connected devices, I want the devices to each
have 16KB of TX queue. That means at least 16 slots, and up to 128
slots per queue. In a well-designed system, a device should only
receive one size category of packets (e.g. in a trading system,
specific devices handle order-book digestion and expect larger
packets, while the devices that handle orders may only expect smaller
packets). Of course, this idea can be rethought in the future when
THORN actually generates different loads targeting different systems.
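
To make the slot math concrete, here is a tiny simulation-only sketch
(the module and parameter names are my own, not from the fabric
source) that prints the slots-per-queue for each fixed packet size:

```systemverilog
// Simulation-only illustration of the TX-queue slot arithmetic.
module tx_queue_sizing;
  localparam int QUEUE_BYTES = 16 * 1024;  // 16KB of TX queue per device
  localparam int PKT_BYTES [4] = '{128, 256, 512, 1024};  // the 4 fixed sizes

  initial begin
    // 1024B packets -> 16 slots; 128B packets -> 128 slots.
    for (int i = 0; i < 4; i++)
      $display("%4dB packets: %0d slots", PKT_BYTES[i],
               QUEUE_BYTES / PKT_BYTES[i]);
  end
endmodule
```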

The above design would dramatically shrink the ROSE header: 1 byte
for command + size, 1 byte for the destination address, 1 byte for
the source address, 4 bytes for a possible sequence number, and one
byte at the end for a CRC-8 checksum.

After some more thought, the sequence number can be moved into
"feature extensions" by utilizing the 64 commands I have. Even the
CRC byte can be encoded the same way. That brings the total overhead
of a standard ROSE packet down to 4 bytes.
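
As a sketch (the field names are mine, and the command encoding is
still TBD in `protocol.md`), the minimal overhead could be laid out
like this:

```systemverilog
// Illustrative layout of the minimal ROSE overhead. The CRC-8 byte
// trails the payload, so the header proper is 3 bytes, for 4 bytes
// of total overhead per packet.
typedef struct packed {
  logic [5:0] cmd;   // 6-bit command: 64 possibilities
  logic [1:0] size;  // 2-bit size code selecting one of the 4 fixed sizes
  logic [7:0] dst;   // destination address (device or fabric)
  logic [7:0] src;   // source address
} rose_header_t;     // 3 bytes on the wire; CRC-8 follows the payload
```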

### Designing the router logic

The fabric needs some kind of internal buffer to handle a) asymmetric
interface speeds and b) multiple devices trying to send packets to
the same one. So, there has to be some internal packet queue.

I want the internal routing logic to run at 135 MHz while the SPI
interfaces are capped at 50 MHz, and to have the logic take in a byte
at a time instead of a bit. This means the internal logic runs much
faster than the interfaces (at one byte per cycle, 135 MHz gives
1.08 Gbps of internal bandwidth, comfortably above the 200 Mbps
aggregate of four 50 MHz SPI links), which enables the fabric to
handle simultaneous inputs from different devices.

The first thing about the design came right off the bat - multi-headed
queues. I plan to use 3/4 heads for the TX queues, so that when
receiving from the higher-speed internal logic, the interfaces can
handle multiple input sources.

The first idea was to have a shared pool of memory for the queue,
which would handle congestion beautifully, since it means that all TX
queues are dynamically allocated. However, it would be a disaster for
latency, since a FIFO queue doesn't exactly mean fairness.

Then I thought of the second idea: separate TX queues for each
interface. Although this means less incast and burst resiliency, it
performs wonderfully in fairness and latency, combined with
multi-headed queues and routing logic that is faster than the
interfaces.
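
A rough skeleton of what such a per-interface, multi-headed TX queue
could look like (all names and port shapes are my own guesses; the
enqueue/dequeue bookkeeping, which is the real work, is deliberately
elided):

```systemverilog
module tx_queue #(
  parameter int N_SLOTS  = 16,   // e.g. a 16KB queue of 1024B slots
  parameter int SLOT_LEN = 1024,
  parameter int N_HEADS  = 3     // concurrent write heads from the router
) (
  input  logic                    clk, rst,
  // one write head per concurrent source
  input  logic [N_HEADS-1:0]      wr_valid,
  input  logic [N_HEADS-1:0][7:0] wr_byte,
  // drain side: one byte at a time toward the SPI interface
  input  logic                    rd_ready,
  output logic [7:0]              rd_byte,
  output logic                    rd_valid
);
  // Slot memory: each active head owns one slot while a packet streams in.
  logic [7:0] mem [N_SLOTS][SLOT_LEN];
  // Per-head slot index and write offset.
  logic [$clog2(N_SLOTS)-1:0]  head_slot [N_HEADS];
  logic [$clog2(SLOT_LEN)-1:0] head_off  [N_HEADS];

  // Placeholders: slot allocation, in-order drain, and completion
  // tracking are not sketched here.
  assign rd_byte  = '0;
  assign rd_valid = 1'b0;
endmodule
```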

To compensate for the loss of the central shared pool of memory, each
interface should also get its own RX buffer, big enough to hold one
packet while it gets collected by the routing logic.

Matching the RX buffer, the interface can directly tell the routing
logic where the collected byte should be sent. This means a dramatic
decrease in the complexity of the routing logic (it doesn't have to
buffer or parse headers), at the cost of an increase in the
interfaces' logic.
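
Here's a sketch of that interface-side idea (all names are
illustrative, and the byte offset assumes the 3-byte header sketched
earlier): the interface watches the incoming stream, latches the
destination byte, and hands the router a ready-made target port
alongside every byte.

```systemverilog
module rx_dest_tap (
  input  logic       sys_clk, rst,
  input  logic       byte_valid,  // a full byte arrived from the SPI side
  input  logic [7:0] byte_in,
  output logic [1:0] rx2tx_dest   // target port for the routing logic
);
  logic [$clog2(1024)-1:0] byte_idx;  // wide enough for the largest packet

  always_ff @(posedge sys_clk) begin
    if (rst) begin
      byte_idx   <= '0;
      rx2tx_dest <= '0;
    end else if (byte_valid) begin
      if (byte_idx == 1)            // header byte 1: destination address
        rx2tx_dest <= byte_in[1:0]; // low bits pick one of 4 ports here
      byte_idx <= byte_idx + 1'b1;  // would wrap per-packet in real code
    end
  end
endmodule
```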

### Leftover question

What kind of congestion control can be implemented on top of this
design that mainly hardens incast and burst resiliency?

I already have in mind using credit-based flow control or a
DCTCP-style ECN. But this will be left for the future me to decide.

For now, focus on making the logic happen and let the packets drop.
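
If the credit idea wins out, one possible shape (purely a sketch of a
future option; nothing like this exists in the fabric yet, and all
names are mine) is a per-destination counter sized to the peer's
TX-queue slots:

```systemverilog
module credit_gate #(
  parameter int MAX_CREDITS = 16   // e.g. slots in the peer's TX queue
) (
  input  logic clk, rst,
  input  logic pkt_sent,           // we consumed one slot at the peer
  input  logic credit_returned,    // the peer freed a slot
  output logic can_send            // gate for launching a new packet
);
  logic [$clog2(MAX_CREDITS+1)-1:0] credits;

  always_ff @(posedge clk) begin
    if (rst)
      credits <= MAX_CREDITS;
    else
      case ({pkt_sent, credit_returned})
        2'b10:   credits <= credits - 1'b1;
        2'b01:   credits <= credits + 1'b1;
        default: credits <= credits;  // both or neither: net zero
      endcase
  end

  assign can_send = (credits != 0);
endmodule
```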

## Reflections

> Even ideas should be reflected on and refined.

1. SPI is very limiting. 50 MHz (or 50 Mbps of wire speed) is slow
   compared to gigabit Ethernet. Hopefully the reduction of latency
   in the network and transport layers can make up for this.
2. I gave a lot of thought to how ROSE would scale to industrial-grade
   FPGAs, starting with replacing SPI with SERDES and increasing the
   internal bus width from 8 bits (the bus through which packets are
   collected from the interfaces by the routing logic) to 256 bits.
   This would allow stable operation at the scale of 30 Gbps SERDES
   connections and 400 MHz FPGA clocks, which would make it comparable
   to modern-day Ethernet connections.
3. There were **a lot** of trade-offs when considering asymmetric
   interface clocks. I'd like direct streaming from one interface to
   another to be possible, but clearly that won't be the case. This
   means one copy of packet data within the fabric itself, effectively
   doubling the latency through the fabric. But this "trade-off" must
   be made, unless there's a magical way of syncing the interfaces
   (and that would mean a direct connection, not a network).

## Final thoughts

I've been thinking a lot about the initiative behind ROSE: why build
it on such limiting hardware? And I came to this conclusion:

> Design systems in thought, in code, in the garage.

The hardware for ROSE will stay on my desk, but the things I learn by
building it will stay in my mind and can potentially be put to use in
the industry.

Ideas should be scalable up and down. If something works with cheap,
off-the-shelf FPGAs, I'd expect it to work on industrial-grade ones;
if some idea works in the industry, it should also be applicable (not
necessarily practical) on consumer-grade hardware.

I consider myself a scientist: I create ideas, and I'm not limited by
hardware or software stacks.

## Next goals

Implement the logic in sims. I've already started, but it's time to
actually get some results.

50 fabric/src/mem_hub.sv Normal file
@@ -0,0 +1,50 @@
module mem_hub (input logic rst,
                input logic sys_clk,
                input logic [3:0] connected_devices, // manually configured
                input logic [3:0][7:0] rx_cmd, // for routing-related commands
                input logic [3:0] rx_cmd_valid,
                input logic [3:0][7:0] rx_byte,
                input logic [3:0] rx_valid,
                input logic [3:0][1:0] rx2tx_dest, // rx byte's destination
                input logic [3:0] tx_read, // if tx_byte was read
                output logic [3:0] rx_read, // if rx_byte was read
                output logic [3:0][1:0] tx_src, // tell the tx where the stream is coming from
                output logic [3:0][7:0] tx_byte,
                output logic [3:0] tx_valid,
                output logic [1:0] packet_size); // 4 states for 4 fixed packet sizes
   timeunit 1ns;
   timeprecision 1ps;

   // TBD: pre-agree on packet size

   // [index][rx_src]
   logic [3:0][1:0] service_queue;
   logic [3:0] in_queue;

   // [rx_src][tx_dest], might not be useful
   // (one 2-bit tx destination entry per rx source)
   logic [3:0][1:0] rx2tx_map;

   always_ff @ (posedge sys_clk) begin
      if (rst) begin
         rx_read <= '0;
         tx_src <= '0;
         tx_valid <= '0;
         packet_size <= '0;
         service_queue <= '0;
         in_queue <= '0;
         rx2tx_map <= '0;
      end else if (in_queue == 4'd0) begin // no one is in the queue yet
         if (tx_valid != 4'd0) begin
            for (int i = 0; i < 4; i++) begin
               // TODO: write the logic for enqueuing
            end
         end
      end else begin
         // TODO: service the rx sources already in the queue
      end
   end
endmodule // mem_hub

94 plan.md Normal file
@@ -0,0 +1,94 @@
# The Plan/Roadmap for ROSE

> Plans turn fear into focus, risk into reach, and steps into a path.

This plan has been modified in the course of the development of ROSE.
And that was also in the plan itself: you plan at every step. See the
end for the changes made to the plan.

## The roadmap

This is a rough summary of what I did and what I plan to do.

### [DONE] Learning RTL and HDL and getting familiar with SPI

Implement a functional SPI slave on the FPGA. Add small logic to
manipulate the data. Learn about cross-clock domain design and
implementation.

### [TODO] Implement the routing logic along with the interfaces

This will be the core part of implementing ROSE on the fabric side.
It is a bare-minimum implementation that disregards any congestion
control or inter-fabric routing.

### [TODO] Test on a RPi-FPGA setup

Getting the code to run in sims is one thing; getting it to run on
actual hardware is another. This entire step will be to ship the code
onto my setup and deal with any kind of synthesis and place-and-route
problems that sims won't reveal.

### [TODO] Implement logging to an external device via UART

This will lay the foundations for THORN and PETAL, and will also come
in handy when analyzing congestion and other anomalies.

### [TODO] Test on a RPi-FPGA-RPi setup

This is where THORN will branch off from ROSE. ROSE should keep some
minimal unit-test testbenches, and have fully functional test suites
and toolchains be implemented in THORN.

### [TODO] RPi's ROSE buffer implementation

`mmap` memory for the SPI drivers on the RPi's to *simulate* zero-copy
on the RPi's.

### [TODO] Modify the SPI kernel drivers for explicit DMA control

Allow ROSE's DMA to be implemented in the drivers.

### [TODO] Abstract ROSE into APIs or kernel devices

Note: This may be implemented as THORN's development gets under way,
or be facilitated by it.

### [TODO] Implement congestion control

When the logic for the fabric is mature enough, it should be upgraded
with congestion control.

### [TODO] Implement mesh networks allowing inter-fabric routing

ROSE shouldn't be limited to only 1 fabric.

## Changes to the plan

The plan is always changing, but it's important to remember what I
learned from every change.

### Ditching dynamic routing

In a datacenter or HFT setup, it's rarely expected that the connected
devices will change. Hardcoded routing paths are perfectly acceptable
and keep with the deterministic nature of ROSE.

#### The lesson learned

Figure out the exact target of ROSE - it's not meant for generic
networks, so shave off any redundancy that it doesn't need.

### Not reversing the input when testing out the SPI interface

A few things to note:

1. Sending the bytes back incremented by 1 is sufficient to prove a
   stable connection.
2. Reversing the input would require a double-ended queue, increasing
   the complexity of the logic with little benefit to later steps.

So, I've decided to ditch this idea.

#### The lesson learned

Plan with the next step in mind, take actions with the next step in
mind. Know what is enough and what is too far.

### Deferring deployment onto the hardware until later

Originally, I planned to deploy the logic and test with real hardware
as soon as I had a working SPI module. But that's not really workable:
I'd be fixing the synthesis with every step thereafter. Better to
finalize the design in sims first, and then solve the FPGA-specific
problems as one entire step.

I'd rather scratch my head over a bunch of problems at once than
scratch my head every time I push an update to the logic.

#### The lesson learned

Weigh testing against its cost in time and efficiency. If testing
hinders development, then it should be separated from the development
cycle.

31 protocol.md Normal file
@@ -0,0 +1,31 @@
# The Specifications for ROSE

Extensions to the protocol may change the specifications; see the
devlogs for specific decisions on changes.

## Packet specifications

### Header

#### (1 byte) Command + packet size

- Packet sizes are chosen from 4 predetermined sizes, so only 2 bits
  of this byte are needed to represent them.
- Commands are 6 bits with 64 possibilities; see the **Commands**
  section for details.

#### (1 byte) Destination address

- This can refer to any end-device or fabric within the network.

#### (1 byte) Source address

- This can refer to any end-device or fabric within the network.

### Payload

Via commands, leading or trailing bytes in the payload can also be
repurposed for timestamps or other feature extensions.

### (1 byte) CRC-8

To verify packet integrity.
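
A byte-serial update function could look like the sketch below. Note
that the polynomial is an assumption on my part (0x07, the common
CRC-8 default); the spec hasn't fixed one yet.

```systemverilog
// Bit-serial CRC-8, MSB first. Polynomial 0x07 (x^8 + x^2 + x + 1)
// is assumed purely for illustration; ROSE has not specified one yet.
function automatic logic [7:0] crc8_byte (
  input logic [7:0] crc,   // running CRC value
  input logic [7:0] data   // next byte of the packet
);
  logic [7:0] c;
  c = crc ^ data;
  for (int i = 0; i < 8; i++)
    c = c[7] ? ((c << 1) ^ 8'h07) : (c << 1);
  return c;
endfunction
```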

## Commands

TBD.

### Feature Extensions

#### [CMD: TBD] Include timestamp

#### [CMD: TBD] Include sequence number