# DOFS - Datacenter Observability and Failover Simulator
A simulator for datacenter networks and compute nodes focusing on
observability, reproducibility, and policy/algorithm modeling.

DOFS ships with some defaults and examples, but it's really meant to
be a framework that you should adapt to your own datacenter
architecture.

## What is DOFS?
DOFS (pronounced "doofus") is an open-source project that integrates a
network simulator and a compute node simulator. It is designed with
modularity in mind, for scalability and parallel simulations. Planned
updates will add features for easy configuration and smooth data
analysis.

## Why DOFS?
> Keep it simple, stupid.

Systems should be simple and easily hackable. DOFS is a system that
wants users to hack it, not use it.

## Built-in support
### Core simulator
- Multiple event-driven simulator instances (TBD).
- Randomness generator abstraction suite.
- Simulator instance-bound logging.
- Thread-safe console messaging.

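As a rough illustration of what an event-driven simulator instance involves, here is a minimal discrete-event loop in C++. It is a sketch only: `Simulator`, `Event`, `Later`, `schedule`, and `run` are hypothetical names chosen for this example and are not the DOFS API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the DOFS API.
using Time = std::uint64_t;

struct Event {
    Time at;                       // simulated time at which the event fires
    std::uint64_t seq;             // insertion order, breaks ties deterministically
    std::function<void()> action;  // work to run when the event fires
};

// Orders the priority queue so the earliest (time, seq) pair is on top.
struct Later {
    bool operator()(const Event &a, const Event &b) const {
        if (a.at != b.at) return a.at > b.at;
        return a.seq > b.seq;
    }
};

class Simulator {
public:
    Time now() const { return now_; }

    // Schedule `action` to run `delay` time units after the current time.
    void schedule(Time delay, std::function<void()> action) {
        queue_.push(Event{now_ + delay, next_seq_++, std::move(action)});
    }

    // Run events in time order until the queue is empty.
    void run() {
        while (!queue_.empty()) {
            Event ev = queue_.top();  // copy out before popping
            queue_.pop();
            assert(ev.at >= now_);    // simulated time never moves backwards
            now_ = ev.at;
            ev.action();              // may schedule further events
        }
    }

private:
    Time now_ = 0;
    std::uint64_t next_seq_ = 0;
    std::priority_queue<Event, std::vector<Event>, Later> queue_;
};
```

Each `Simulator` object owns its own clock and queue, so several instances can coexist without sharing state. The `seq` counter keeps events with equal timestamps in insertion order, which helps runs stay reproducible.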
### Networking
#### Switch side
- Shared and dedicated switch buffers with RED-style ECN generation.
- Unicast multipath spraying with ECMP.
- Group-based multicast.
- Throttling of switch forwarding latency.
- Shutting down and rebooting of switches.

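To make "RED-style ECN generation" concrete, the sketch below marks packets with a probability that grows linearly between two queue-depth thresholds. All names (`RedParams`, `mark_probability`, `should_mark`) are hypothetical. As a simplifying assumption, the sketch uses the instantaneous queue depth rather than the averaged depth of classic RED, and it does not claim to match how DOFS switch buffers compute marks.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical parameter and function names; not taken from DOFS.
struct RedParams {
    std::size_t min_threshold;  // below this queue depth, never mark
    std::size_t max_threshold;  // at or above this depth, always mark
    double max_probability;     // marking probability just below max_threshold
};

// Marking probability for a given queue depth (bytes or packets).
double mark_probability(const RedParams &p, std::size_t depth) {
    assert(p.min_threshold < p.max_threshold);
    if (depth < p.min_threshold) return 0.0;
    if (depth >= p.max_threshold) return 1.0;
    const double span = static_cast<double>(p.max_threshold - p.min_threshold);
    const double above = static_cast<double>(depth - p.min_threshold);
    return p.max_probability * above / span;  // linear ramp between thresholds
}

// Decide whether to set the ECN congestion mark on one packet.
// The caller supplies the random engine so runs can be seeded and repeated.
template <class Rng>
bool should_mark(const RedParams &p, std::size_t depth, Rng &rng) {
    std::bernoulli_distribution mark(mark_probability(p, depth));
    return mark(rng);
}
```

Passing the random engine in explicitly, instead of using a global one, fits the goal of reproducible simulations: the same seed yields the same marking decisions.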
#### NIC side
- Assumes RDMA access for NICs.
- NIC-side congestion control and unicast/multicast load balancing.
- Automatic packet generation and protocol handling.
- Throttling of NIC latency.
- Shutting down and rebooting of NICs.
#### Link side
- Throttling and shutting down of links.
### Compute nodes (hosts)
- Management messaging via a deterministic network.
- Simple publisher/subscriber model.
- Reattaching of NICs.
- Throttling, shutting down, and rebooting.

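For readers unfamiliar with the publisher/subscriber pattern mentioned above, here is a minimal in-process sketch. `Bus`, `subscribe`, `publish`, and the topic strings are hypothetical and only illustrate the pattern; the actual DOFS host messaging interface may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical type; not the DOFS host messaging interface.
class Bus {
public:
    using Handler = std::function<void(const std::string &payload)>;

    // Register `handler` to be called for every message on `topic`.
    void subscribe(const std::string &topic, Handler handler) {
        assert(static_cast<bool>(handler));  // reject empty handlers
        subscribers_[topic].push_back(std::move(handler));
    }

    // Deliver `payload` to the current subscribers of `topic`, in
    // subscription order. Returns the number of handlers invoked.
    std::size_t publish(const std::string &topic, const std::string &payload) {
        auto it = subscribers_.find(topic);
        if (it == subscribers_.end()) return 0;
        // Iterate over a copy so handlers may subscribe during delivery.
        std::vector<Handler> handlers = it->second;
        for (const Handler &h : handlers) h(payload);
        return handlers.size();
    }

private:
    std::unordered_map<std::string, std::vector<Handler>> subscribers_;
};
```

A host could, for instance, subscribe to a topic such as `"link/down"` and react when a failure is published; publishers never need to know who is listening.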
## Getting started
### Requirements
1. CMake (version 3.20 or later)
2. [`yaml-cpp`](https://git.peisongxiao.com/peisongxiao/yaml-cpp) (mirror provided in a separate repo)
3. Clang 20
4. (Optional) Emacs with flycheck-mode, company-mode, and cmake-ide
### Configuration and Compilation
For now, use `cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON`
to configure and `cmake --build build` to build. Later updates will
introduce Python scripts that wrap these operations.

## Coding in style
Coding style is important. Code is for both the human and the
machine. The machine doesn't care about style, but humans do. A good
style helps development a lot.

See [`style.md`](style.md) for details.

## The planning
See the plan in [`plan.md`](plan.md).

The immediate next step is to add topology generation (fat-tree) and a
configuration module.

Most of DOFS's features were planned *before* the first source
file was even created. A good plan serves both as a good guideline
and a good reward mechanism. You know early when you're running
into trouble, and you know when you've made a solid step in
realizing the project, even if it's just a few header outlines.

> Plans turn fear into focus, risk into reach, and steps into a path.

When you dream big, use a plan to ground it with smaller, more
manageable structures. And most people like it when their dreams come
true.

## Why DOFS (the story)?
> DOFS was designed to do one thing, and that one thing well.

I remember first touching an `ns-3` based repo and thinking: why do I
have to compile for Wi-Fi when my workload is based on RDMA networks
over RoCE?

Then I wanted to add link failures to the existing topology for
experiments, and I found out that adding dynamic failures is not that
easy. I had to invent my own APIs, and there wasn't much tooling for
simulating dynamic link failures.

Then I came across `ASTRA-sim`, which simulates entire AI workloads
including the networking components. And I thought: why can't I build
my own simulator, but scope it to datacenters only?

So DOFS, designed to be laser-focused on datacenters, came into
being.

I wanted to build, from the ground up, something that can scale
networks and compute nodes, something that lets people easily
swap different components of a datacenter in and out at a high level.
And I wanted something modern, something designed with
modern datacenters in mind, integrating things like the latest
specifications from UEC.

The mission is not to replace `ns-3` or `ASTRA-sim` entirely, but to
define an area where you only need the bare minimum to get the desired
results, and to leave room for future updates.

## Special thanks
I'd like to express my gratitude to ChatGPT and other AI-driven tools
for helping to realize this project. I used them to write the actual
code, and I used them to explore my ideas, to plan my path, and to
catch anything that I overlooked in the process. They are powerful in
that way: they can help expand what you have in mind, and they can offer
insights into areas that you've never even heard of. For that, I'm
thankful to the ever-evolving world of technology, and to the countless
researchers and their efforts to make the world a better place.

## Afterthoughts
AI has written a large part of this repo, and I'm not ashamed of it. I
went through every line of the AI-generated code and performed checks
with deterministic tools. AI has allowed this project to expand in a
short amount of time, and good prompting has allowed consistency
across many different modules.