added README and some other documentation

2025-09-13 20:03:42 -04:00
parent defe696710
commit 1a7408bf0c
3 changed files with 271 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,130 @@
# DOFS - Datacenter Observability and Failover Simulator
A simulator for datacenter networks and compute nodes focusing on
observability, reproducibility, and policy/algorithm modeling.
DOFS ships with some defaults and examples, but it's really meant to
be a framework that you should adapt to your own datacenter
architecture.
## What is DOFS?
DOFS (pronounced "doofus") is an open-source project that integrates a
network simulator and a compute node simulator. It is designed with
modularity in mind, for scalability and parallel simulations. Planned
updates will add features for easy configuration and smooth data
analysis.
## Why DOFS?
> Keep it simple, stupid.
Systems should be simple and easily hackable. DOFS is a system that
wants users to hack it, not use it.
## Built-in support
### Core simulator
- Multiple event-driven simulator instances (TBD).
- Randomness generator abstraction suite.
- Simulator instance-bound logging.
- Thread-safe console messaging.
### Networking
#### Switch side
- Shared and dedicated switch buffers with RED-style ECN generation.
- Unicast multipath spraying with ECMP.
- Group-based multicast.
- Throttling of switch forwarding latency.
- Shutting down and rebooting of switches.
#### NIC side
- Assumes RDMA access for NICs.
- NIC-side congestion control and unicast/multicast load balancing.
- Automatic packet generation and protocol handling.
- Throttling of NIC latency.
- Shutting down and rebooting of NICs.
#### Link side
- Throttling and shutting down of links.
### Compute nodes (hosts)
- Management messaging via deterministic network.
- Simple publisher/subscriber model.
- Reattaching of NICs.
- Throttling, shutting down and rebooting.
## Getting started
### Requirements
1. cmake (version at least 3.20)
2. [yaml-cpp](https://git.peisongxiao.com/peisongxiao/yaml-cpp) (mirror provided in a separate repo)
3. clang20
4. (Optional) emacs with flycheck-mode, company-mode, and cmake-ide
### Configuration and Compilation
For now, use `cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON`
to configure and `cmake --build build` to build. Later updates will
introduce Python scripts that wrap these operations.
## Coding in style
Coding style is important. Code is written for both the human and the
machine. The machine doesn't care about style, but humans do, and a
good style helps development a lot.
See [`style.md`](style.md) for details.
## The planning
See the plan in [`plan.md`](plan.md).
The immediate next step is to add topology generation (fat-tree) and a
configuration module.
Most of DOFS's features have been planned *before* the first source
file was even created. A good plan serves both as a good guideline
and a good reward mechanism. You'd know early when you're running
into trouble, and you'd know when you've made a solid step in
realizing the project, even if it's just a few header outlines.
> Plans turn fear into focus, risk into reach, and steps into a path.
When you dream big, use a plan to ground it with smaller, more
manageable structures. And most people like it when their dreams come
true.
## Why DOFS (the story)?
> DOFS was designed to do one thing, and that one thing well.
I remember first touching an `ns-3` based repo and thinking: why do I
have to compile for Wi-Fi when my workload is based on RDMA networks
over RoCE?
Then I wanted to add link failures to the existing topology for
experiments, and I found out that adding dynamic failures is not that
easy. I had to invent my own APIs, and there wasn't much tooling for
simulating dynamic link failures.
Then I came across `ASTRA-sim`, which simulated entire AI workloads
including the networking components. And I thought: why can't I build
my own simulator, but extend that to datacenters only?
So, DOFS, designed to be laser focused on datacenters, came into
being.
I wanted to build, from the ground up, something that can scale
networks and compute nodes, something that can allow people to easily
swap in and out different components of a datacenter at a high level.
And I wanted something modern, something that was designed with
modern datacenters in mind, integrating things like the latest
specifications from UEC.
The mission is not to replace `ns-3` or `ASTRA-sim` entirely, but to
define an area where you only need the bare minimum to get the desired
results, and to leave room for future updates.
## Special thanks
I'd like to express my gratitude to ChatGPT and other AI-driven tools
for helping to realize this project. I used them to write the actual
code, to explore my ideas, to plan my path, and to catch anything that
I overlooked in the process. They are powerful in that way: they can
help expand what you have in mind, and they can offer insights into
areas that you've never even heard of. For that, I'm thankful to the
ever-evolving world of technology, and to the countless researchers
whose efforts make the world a better place to live in.
## Afterthoughts
AI has written a large part of this repo, and I'm not ashamed of it. I
went through every line of the AI-generated code and performed checks
with deterministic tools. AI has allowed this project to expand in a
short amount of time, and good prompting has kept many different
modules consistent.

plan.md Normal file

@@ -0,0 +1,37 @@
# The Plan/Roadmap for DOFS
> Plans turn fear into focus, risk into reach, and steps into a path.
This plan has been modified in the course of the development of DOFS.
And that was also in the plan itself: you plan at every step. See the
end for the changes made to the plan.
## The roadmap
This is a rough summary of what I did and what I plan to do.
### [DONE] Core single instance simulator features
Implement an event-driven simulator, console logging, and output
features.
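
As a rough illustration of what "event-driven" means here, the sketch
below keeps pending events in a priority queue ordered by timestamp and
advances simulated time by jumping from one event to the next. All
names (`Event`, `EventLoop`, `schedule`, `run`) are hypothetical and
are not taken from the DOFS sources.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the DOFS API.
using Time = std::uint64_t;

struct Event {
        Time fire_at;
        std::uint64_t seq; // insertion order, breaks timestamp ties
        std::function<void()> action;
};

// Orders the queue so the earliest event (then lowest seq) is on top.
struct EventLater {
        bool operator()(const Event &a, const Event &b) const {
                if (a.fire_at != b.fire_at)
                        return a.fire_at > b.fire_at;
                return a.seq > b.seq;
        }
};

class EventLoop {
public:
        Time now() const {
                return now_;
        }

        // Schedules action to run delay time units after the current time.
        void schedule(Time delay, std::function<void()> action) {
                queue_.push(Event{now_ + delay, next_seq_++,
                                  std::move(action)});
        }

        // Runs events in timestamp order until the queue drains.
        void run() {
                while (!queue_.empty()) {
                        Event event = queue_.top();
                        queue_.pop();
                        assert(event.fire_at >= now_);
                        now_ = event.fire_at;
                        event.action();
                }
        }

private:
        std::priority_queue<Event, std::vector<Event>, EventLater> queue_;
        Time now_ = 0;
        std::uint64_t next_seq_ = 0;
};
```

Breaking ties by insertion order is one simple way to keep runs
reproducible when two events share a timestamp.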
### [DONE] Core networking components
Implement links, switches, and NICs, and leave sufficient yet simple
interfaces for the hosts and the simulator orchestration system to
interact with them.
### [DONE] Core hosts with examples
Simple publisher/subscriber model with simple policies as examples of
how to build hosts in DOFS.
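
For readers unfamiliar with the pattern, a publisher/subscriber model
can be sketched in a few lines. The `TopicBus` class below is a
hypothetical, in-process illustration only. It does not model the
network, and it is not how DOFS hosts are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-process bus; DOFS hosts talk over a simulated network.
class TopicBus {
public:
        using Handler = std::function<void(const std::string &)>;

        // Registers handler to be called for every message on topic.
        void subscribe(const std::string &topic, Handler handler) {
                assert(handler != nullptr);
                subscribers_[topic].push_back(std::move(handler));
        }

        // Delivers payload to every subscriber of topic.
        // Returns how many subscribers received it.
        std::size_t publish(const std::string &topic,
                            const std::string &payload) const {
                const auto it = subscribers_.find(topic);
                if (it == subscribers_.end())
                        return 0;
                for (const Handler &handler : it->second)
                        handler(payload);
                return it->second.size();
        }

private:
        std::map<std::string, std::vector<Handler>> subscribers_;
};
```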
### [TODO] Topology generation
Define how topology factories should work, and implement a concrete
fat-tree example.
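
To give a feel for what a fat-tree factory has to produce, the sketch
below computes the element counts of the classic three-tier k-ary
fat-tree built from k-port switches. This is an assumption about which
fat-tree variant is meant, and the `FatTreeSize` struct and
`fat_tree_size` function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary type; not part of DOFS.
struct FatTreeSize {
        std::size_t pods;
        std::size_t edge_switches;
        std::size_t aggregation_switches;
        std::size_t core_switches;
        std::size_t hosts;
};

// Sizes of a three-tier fat-tree built from k-port switches.
// REQUIRES: k is even and at least 2.
FatTreeSize fat_tree_size(std::size_t k) {
        assert(k >= 2 && k % 2 == 0);
        const std::size_t half = k / 2;
        FatTreeSize size;
        size.pods = k;
        size.edge_switches = k * half;        // k/2 per pod
        size.aggregation_switches = k * half; // k/2 per pod
        size.core_switches = half * half;
        size.hosts = k * half * half;         // k/2 per edge switch
        return size;
}
```

For example, `fat_tree_size(4)` describes 4 pods, 8 edge switches,
8 aggregation switches, 4 core switches, and 16 hosts.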
### [TODO] Configuration of simulations using YAML
Define the format of configuration.
### [TODO] Multithreaded simulation orchestrator
Implement a system that can spin up multiple simulations with
different configurations to allow parallel and queued processing.
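
One possible shape for this, sketched under stated assumptions: a fixed
pool of worker threads claims queued configurations through an atomic
counter, so simulations run in parallel while the rest wait their turn.
`SimConfig`, `SimResult`, `run_simulation`, and `run_all` are
hypothetical placeholders, since this part of DOFS is not designed yet.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins; DOFS has not defined these types yet.
struct SimConfig {
        unsigned seed;
        unsigned num_events;
};

struct SimResult {
        unsigned seed;
        unsigned long checksum;
};

// Placeholder for one independent simulator instance.
SimResult run_simulation(const SimConfig &config) {
        unsigned long checksum = config.seed;
        for (unsigned i = 0; i < config.num_events; ++i)
                checksum = checksum * 31 + i;
        return SimResult{config.seed, checksum};
}

// Runs every configuration on num_workers threads.
// results[i] always corresponds to configs[i].
std::vector<SimResult> run_all(const std::vector<SimConfig> &configs,
                               std::size_t num_workers) {
        assert(num_workers > 0);
        std::vector<SimResult> results(configs.size());
        std::atomic<std::size_t> next{0};

        auto worker = [&] {
                // Each worker claims the next queued configuration.
                for (;;) {
                        const std::size_t i = next.fetch_add(1);
                        if (i >= configs.size())
                                return;
                        results[i] = run_simulation(configs[i]);
                }
        };

        std::vector<std::thread> workers;
        for (std::size_t w = 0; w < num_workers; ++w)
                workers.emplace_back(worker);
        for (std::thread &thread : workers)
                thread.join();
        return results;
}
```

Each worker writes only to its own slot in `results`, so no lock is
needed around the output vector.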
### [TODO] Configuration and results manager
Use databases to store configurations and simulation results, and
access them in a git-like fashion.

style.md Normal file

@@ -0,0 +1,104 @@
# Style Guide for DOFS
Coding style matters a lot. Good coding style makes the code easier on
the eye, and can help avoid some pitfalls and confusion.
## Sane Defaults
When editing C/C++ code, it is preferred to use the following `astyle`
configuration (consider putting it in your `.astylerc`):
```
--style=java -k3 -W3 -m0 -f -p -H --squeeze-lines=3 -xb -xf -xh -c --max-code-length=80 -xL -Y --indent=spaces=8
```
## Indentation
For all indentation, use **spaces**, not tabs.
The rationale behind this is to avoid different indent width settings
in different editors. Making your source files a little bigger is a
good trade-off for portability across editors.
### C/C++
Use 8 spaces. This is not only to adhere to the Linux kernel's coding
style, but also to prevent your indentation levels from getting too
big.
**IMPORTANT: If the indentation is blowing lines off the 80-char
width, you should probably consider refactoring the logic.**
### Python
Use 4 spaces. This is enough for scripts, and it is the convention
chosen by the people behind Python.
### Shell scripts
Use 4 spaces. There might be arguments to make it 2, but 4 is the
minimum if you want to spot something appearing in an incorrect level
when you've been staring at the screen for 15 hours.
### Line width
80 characters is preferred, but a line can be extended by 20 characters
or so to accommodate longer identifiers.
If a line exceeds 80 characters, consider breaking it into multiple
lines.
When passing many parameters or writing long expressions, always break
them so that each line holds one logical chunk.
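
A small sketch of what breaking lines into logical chunks can look
like. The functions and constants below are hypothetical and exist only
to show the line breaks.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t BITS_PER_BYTE = 8;
constexpr std::uint64_t NS_PER_SECOND = 1000000000;

// Time for one packet to cross one link: serialization plus propagation.
std::uint64_t link_delay_ns(std::uint64_t payload_bytes,
                            std::uint64_t bandwidth_bps,
                            std::uint64_t propagation_ns) {
        assert(bandwidth_bps > 0);
        // Numerator on one line, divisor on the next.
        const std::uint64_t serialization_ns =
                payload_bytes * BITS_PER_BYTE * NS_PER_SECOND
                / bandwidth_bps;
        return serialization_ns + propagation_ns;
}

// Store-and-forward delay over hop_count identical links.
std::uint64_t path_delay_ns(std::uint64_t payload_bytes,
                            std::uint64_t bandwidth_bps,
                            std::uint64_t propagation_ns,
                            std::uint64_t hop_count) {
        // What is sent goes on the first line, the link properties on
        // the second.
        return hop_count * link_delay_ns(payload_bytes,
                                         bandwidth_bps, propagation_ns);
}
```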
## Avoid magic numbers
Unless it's the bit-length of a byte or something that's commonly
known and obvious at first glance, use a constant to store it.
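
As an illustration, the sketch below names every size it depends on
instead of scattering literals through the arithmetic. The specific
values and names are hypothetical, not DOFS defaults.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical values; DOFS may use different ones.
constexpr std::size_t MTU_BYTES = 4096;
constexpr std::size_t HEADER_BYTES = 64;
constexpr std::size_t MAX_PAYLOAD_BYTES = MTU_BYTES - HEADER_BYTES;

// Number of packets needed to carry message_bytes of payload.
std::size_t packets_needed(std::size_t message_bytes) {
        assert(message_bytes > 0);
        return (message_bytes + MAX_PAYLOAD_BYTES - 1) / MAX_PAYLOAD_BYTES;
}
```

Compare this with `(n + 4031) / 4032`, where a reader has to guess where
both numbers come from.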
## Arrays
Avoid using arrays with more than 2 dimensions. If you need to store
multiple dimensions of data, consider using a `struct` or different
containers to clarify what each dimension stores.
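
For instance, instead of something like `counters[switch][port][kind]`,
the innermost dimension can become a `struct` with named fields. The
types below are a hypothetical sketch, not DOFS code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t PORTS_PER_SWITCH = 4;

// Replaces an anonymous third dimension with named fields.
struct PortCounters {
        std::uint64_t forwarded = 0;
        std::uint64_t dropped = 0;
        std::uint64_t ecn_marked = 0;
};

struct SwitchCounters {
        std::array<PortCounters, PORTS_PER_SWITCH> ports;

        // Counts one dropped packet on the given port.
        void record_drop(std::size_t port) {
                assert(port < PORTS_PER_SWITCH);
                ++ports[port].dropped;
        }

        // Sums dropped packets over all ports of this switch.
        std::uint64_t total_dropped() const {
                std::uint64_t total = 0;
                for (const PortCounters &port : ports)
                        total += port.dropped;
                return total;
        }
};
```

A reader of `ports[3].dropped` does not need to remember which index
meant "dropped".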
## Naming schemes
Names are only meaningful to humans, and the rationale behind the
following guidelines is to allow anyone reading the code to know what
an identifier refers to without scrolling back to its definition or
other references.
### Snake case or camel case?
Use snake case for function and variable names, and camel case for
class/type/enum names.
### Scoping
For all identifiers, it's important to note the scope of their usage.
Names are there to avoid confusion, not add to it, so judge whether a
name is confusing within the scope where it is used.
### Abbreviating
Using abbreviations is okay and a good idea under the right
circumstances.
As a general rule of thumb, the aggressiveness of abbreviating words
is inversely proportional to the size of the scope. But it's a **bad
idea** to abbreviate global identifiers that are not commonly used.
### Constants
For all constants and enum values, use **ALL_CAPS**.
### Global identifiers
Use **FULL NAMES** unless the name is pre-agreed on or defined by a
specification, like `mosi` or `sys_clk`.
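
The naming rules above could look like this in practice. The sketch
reads "camel case" as capitalized names such as `LinkState`, and reads
the ALL_CAPS rule as applying to enum values rather than the enum's
type name. Both readings are assumptions, and every identifier is
hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// ALL_CAPS for constants, full name because it is global.
constexpr std::uint32_t MAX_RETRANSMISSIONS = 3;

// Camel case for the enum's type name, ALL_CAPS for its values.
enum class LinkState { UP, DOWN };

// Camel case for class/type names.
struct RetransmissionTracker {
        std::uint32_t attempts = 0;

        // Snake case for functions and variables.
        bool can_retransmit(LinkState link_state) const {
                return link_state == LinkState::UP
                       && attempts < MAX_RETRANSMISSIONS;
        }
};

// Counts how many more attempts the tracker allows on a healthy link.
std::uint32_t count_allowed_attempts(RetransmissionTracker tracker) {
        assert(tracker.attempts <= MAX_RETRANSMISSIONS);
        // Tiny scope, so a short name like `n` is acceptable here.
        std::uint32_t n = 0;
        while (tracker.can_retransmit(LinkState::UP)) {
                ++tracker.attempts;
                ++n;
        }
        return n;
}
```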
## Commenting
Comments are great, but don't over-comment. They are there
for exactly two things:
1. Tell people **what** the code does
2. Give a signal for future development (e.g. implementation notes,
usage warnings, required guarantees)
If you need to explain how your code does something using comments,
it's a better idea to re-write the code.
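
A sketch of both kinds of comment on one function. The `REQUIRES:` and
`NOTE:` prefixes are a hypothetical convention chosen for this example.
The guide itself does not prescribe specific tags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the index of the least-loaded port.
// REQUIRES: port_count > 0 and loads points to port_count entries.
// NOTE: ties go to the lowest index so runs stay reproducible.
std::size_t least_loaded_port(const std::uint64_t *loads,
                              std::size_t port_count) {
        assert(loads != nullptr && port_count > 0);
        std::size_t best = 0;
        for (std::size_t i = 1; i < port_count; ++i) {
                if (loads[i] < loads[best])
                        best = i;
        }
        return best;
}
```

The body carries no comments because the loop is plain enough to read
on its own.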
## Output Messages
Like comment signals, messages should also be in complete lines and
`grep`-friendly.
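
One way to keep messages `grep`-friendly is to emit exactly one line per
event, starting with a fixed tag and followed by `key=value` fields.
The format and the `format_link_down` helper below are hypothetical,
not the DOFS logging API.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Builds a single-line message: fixed tag first, then key=value fields.
// REQUIRES: reason contains no newline, so the message stays on one line.
std::string format_link_down(std::uint64_t time_ns, std::uint32_t link_id,
                             const std::string &reason) {
        assert(reason.find('\n') == std::string::npos);
        std::ostringstream line;
        line << "LINK_DOWN time_ns=" << time_ns
             << " link=" << link_id
             << " reason=" << reason;
        return line.str();
}
```

With this shape, `grep '^LINK_DOWN'` finds every link failure and
`grep 'link=7 '` narrows the output to one link.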
## Tricks and workarounds
Don't try to write "smart" code. Instead, write code that everyone can
understand without too much explanation.
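
As a hypothetical illustration, both functions below round a size up to
a multiple of 8, but only the second one says so.

```cpp
#include <cassert>
#include <cstddef>

// "Smart": only works because 8 is a power of two, and hides the intent.
std::size_t round_up_smart(std::size_t n) {
        return (n + 7) & ~std::size_t{7};
}

// Clear: states the intent and works for any positive multiple.
std::size_t round_up_to_multiple(std::size_t n, std::size_t multiple) {
        assert(multiple > 0);
        const std::size_t remainder = n % multiple;
        if (remainder == 0)
                return n;
        return n + (multiple - remainder);
}
```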