added README and some other documentation
130
README.md
Normal file
@@ -0,0 +1,130 @@
# DOFS - Datacenter Observability and Failover Simulator
A simulator for datacenter networks and compute nodes focusing on
observability, reproducibility, and policy/algorithm modeling.

DOFS ships with some defaults and examples, but it's really meant to
be a framework that you should adapt to your own datacenter
architecture.

## What is DOFS?
DOFS (pronounced "doofus") is an open-source project that integrates a
network simulator and a compute node simulator. It is designed with
modularity in mind, for scalability and parallel simulations. Planned
updates will add features for easy configuration and smooth data
analysis.

## Why DOFS?
> Keep it simple, stupid.

Systems should be simple and easily hackable. DOFS is a system that
wants users to hack it, not use it.

## Built-in support
### Core simulator
- Multiple event-driven simulator instances (TBD).
- Randomness generator abstraction suite.
- Simulator instance-bound logging.
- Thread-safe console messaging.
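
As a rough illustration of what "event-driven" means in the list above, the sketch below keeps timestamped events in a priority queue and always runs the earliest one next. It is not DOFS's actual API: `Event`, `EventQueue`, `schedule`, and `run` are hypothetical names chosen for this example.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical event type: a callback that fires at a simulated time.
struct Event {
        std::uint64_t time_ns;
        std::uint64_t seq; // insertion order, breaks timestamp ties
        std::function<void()> action;
};

// Orders the priority queue so that the earliest event is on top.
struct LaterFirst {
        bool operator()(const Event &a, const Event &b) const {
                if (a.time_ns != b.time_ns)
                        return a.time_ns > b.time_ns;
                return a.seq > b.seq;
        }
};

class EventQueue {
public:
        void schedule(std::uint64_t time_ns, std::function<void()> action) {
                events.push(Event{time_ns, next_seq++, std::move(action)});
        }

        // Runs events in timestamp order until none are left and
        // returns the simulated time of the last event that ran.
        std::uint64_t run() {
                while (!events.empty()) {
                        Event ev = events.top();
                        events.pop();
                        now_ns = ev.time_ns;
                        ev.action();
                }
                return now_ns;
        }

private:
        std::priority_queue<Event, std::vector<Event>, LaterFirst> events;
        std::uint64_t next_seq = 0;
        std::uint64_t now_ns = 0;
};
```

Breaking ties by insertion order is one simple way to keep runs reproducible: two events scheduled for the same simulated time always fire in the order they were scheduled.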
### Networking
#### Switch side
- Shared and dedicated switch buffers with RED-style ECN generation.
- Unicast multipath spraying with ECMP.
- Group-based multicast.
- Throttling of switch forwarding latency.
- Shutting down and rebooting of switches.
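
To make "RED-style ECN generation" concrete, here is a minimal sketch of the usual RED marking curve: no marking below a minimum threshold, a linear ramp up to a maximum probability, and marking everything at or above the maximum threshold. It assumes the instantaneous queue occupancy is used (classic RED uses a moving average), and `RedParams`, `ecn_mark_probability`, and `should_mark_ecn` are hypothetical names, not DOFS's interfaces.

```cpp
#include <cstddef>

// Hypothetical RED parameters; names and units are assumptions.
struct RedParams {
        std::size_t min_threshold_bytes;
        std::size_t max_threshold_bytes;
        double max_probability; // probability just below the max threshold
};

// Marking probability for the current queue occupancy.
double ecn_mark_probability(std::size_t queue_bytes, const RedParams &p) {
        if (queue_bytes <= p.min_threshold_bytes)
                return 0.0;
        if (queue_bytes >= p.max_threshold_bytes)
                return 1.0;
        const double span = static_cast<double>(
                p.max_threshold_bytes - p.min_threshold_bytes);
        const double above = static_cast<double>(
                queue_bytes - p.min_threshold_bytes);
        return p.max_probability * above / span;
}

// uniform_draw is expected to be a random value in [0, 1).
bool should_mark_ecn(std::size_t queue_bytes, const RedParams &p,
                     double uniform_draw) {
        return uniform_draw < ecn_mark_probability(queue_bytes, p);
}
```

Passing the random draw in as an argument keeps the marking decision a pure function, which fits a simulator that routes all randomness through its own generator abstraction.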
#### NIC side
- Assumes RDMA access for NICs.
- NIC-side congestion control and unicast and multicast load balancing.
- Automatic packet generation and protocol handling.
- Throttling of NIC latency.
- Shutting down and rebooting of NICs.
#### Link side
- Throttling and shutting down of links.
### Compute nodes (hosts)
- Management messaging via deterministic network.
- Simple publisher/subscriber model.
- Reattaching of NICs.
- Throttling, shutting down and rebooting.

## Getting started
### Requirements
1. cmake (version 3.20 or later)
2. [yaml-cpp](https://git.peisongxiao.com/peisongxiao/yaml-cpp) (mirror provided in a separate repo)
3. clang 20
4. (Optional) emacs with flycheck-mode, company-mode, and cmake-ide
### Configuration and Compilation
For now, use `cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON`
to configure and `cmake --build build` to build. Later updates will
introduce Python scripts that wrap these operations.

## Coding in style
Coding style is important. Code is for both the human and the
machine. The machine doesn't care about style, but humans do. A good
style helps development a lot.

See [`style.md`](style.md) for details.

## The planning
See the plan in [`plan.md`](plan.md).

The immediate next step is to add topology generation (fat-tree) and a
configuration module.

Most of DOFS's features were planned *before* the first source
file was even created. A good plan serves both as a good guideline
and a good reward mechanism. You'd know early when you're running
into trouble, and you'd know when you've made a solid step in
realizing the project, even if it's just a few header outlines.

> Plans turn fear into focus, risk into reach, and steps into a path.

When you dream big, use a plan to ground it with smaller, more
manageable structures. And most people like it when their dreams come
true.

## Why DOFS (the story)?
> DOFS was designed to do one thing, and that one thing well.

I remember first touching an `ns-3` based repo and thinking: why do I
have to compile for Wi-Fi when my workload is based on RDMA networks
over RoCE?

And then I wanted to add link failures to the existing topology for
experiments, and I found out that adding dynamic failures is not that
easy: I had to invent my own APIs, and there wasn't much tooling for
simulating dynamic link failures.

Then I came across `ASTRA-sim`, which simulated entire AI workloads
including the networking components. And I thought: why can't I build
my own simulator, but extend that to datacenters only?

So, DOFS, designed to be laser focused on datacenters, came into
being.

I wanted to build, from the ground up, something that can scale
networks and compute nodes, something that can allow people to easily
swap in and out different components of a datacenter at a high level.
And I wanted something modern, something that was designed with
modern datacenters in mind, integrating things like the latest
specifications from UEC.

The mission is not to replace `ns-3` or `ASTRA-sim` entirely, but to
define an area where you only need the bare minimum to get the desired
results, and to leave room for future updates.

## Special thanks
I'd like to express my gratitude to ChatGPT and other AI-driven tools
for helping to realize this project. I used them to write the actual
code, and I used them to explore my ideas, to plan my path, and to
catch anything that I overlooked in the process. They are powerful in
that way: they can help expand what you have in mind, and they can
offer insights into areas that you've never even heard of. And for
that, I'm thankful to the ever-evolving world of technology, and to
the countless researchers and their efforts to help us live in a
better world.

## Afterthoughts
AI has written a large part of this repo, and I'm not ashamed of it.
I went through every line of the AI-generated code and performed
checks with deterministic tools. AI has allowed this project to expand
in a short amount of time, and good prompting has allowed consistency
across many different modules.
37
plan.md
Normal file
@@ -0,0 +1,37 @@
# The Plan/Roadmap for DOFS
> Plans turn fear into focus, risk into reach, and steps into a path.

This plan has been modified in the course of the development of DOFS.
And that was also in the plan itself: you plan at every step. See the
end for the changes made to the plan.

## The roadmap
This is a rough summary of what I did and what I plan to do.

### [DONE] Core single instance simulator features
Implement an event-driven simulator, along with console logging and
output features.

### [DONE] Core networking components
Implement links, switches, and NICs, and expose simple but sufficient
interfaces for the hosts and the simulator orchestration system to
interact with them.

### [DONE] Core hosts with examples
Simple publisher/subscriber model with simple policies as examples of
how to build hosts in DOFS.

### [TODO] Topology generation
Define how topology factories should work, and implement a concrete
fat-tree example.
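
For a sense of scale, the sketch below computes the element counts of the standard k-ary fat-tree (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, k/2 hosts per edge switch). It assumes that this is the fat-tree variant meant here. `FatTreeSize` and `fat_tree_size` are hypothetical names and say nothing about how DOFS's topology factories will eventually look.

```cpp
#include <cstdint>
#include <stdexcept>

// Element counts of a k-ary fat-tree built from k-port switches.
struct FatTreeSize {
        std::uint64_t pods;
        std::uint64_t edge_switches;
        std::uint64_t aggregation_switches;
        std::uint64_t core_switches;
        std::uint64_t hosts;
};

FatTreeSize fat_tree_size(std::uint64_t k) {
        if (k < 2 || k % 2 != 0)
                throw std::invalid_argument("k must be an even number >= 2");
        const std::uint64_t half = k / 2;
        FatTreeSize size{};
        size.pods = k;
        size.edge_switches = k * half;        // k/2 per pod
        size.aggregation_switches = k * half; // k/2 per pod
        size.core_switches = half * half;
        size.hosts = k * half * half;         // k/2 per edge switch
        return size;
}
```

For example, `fat_tree_size(4)` describes the familiar small case with 16 hosts and 20 switches in total.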

### [TODO] Configuration of simulations using YAML
Define the format of configuration.

### [TODO] Multithreaded simulation orchestrator
Implement a system that can spin up multiple simulations with
different configurations to allow parallel and queued processing.
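
One possible shape for "parallel and queued processing" is a fixed pool of worker threads that pull the next queued configuration from a shared counter. The sketch below only illustrates that idea and is not the planned design: `SimConfig`, `SimFunction`, and `run_all` are hypothetical, and the `simulate` callback is assumed to be safe to call from several threads at once.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-ins: a "configuration" is just a seed, and a
// "simulation" is any function from a configuration to a result.
struct SimConfig {
        unsigned seed;
};
using SimFunction = std::function<long(const SimConfig &)>;

// Runs every configuration on a pool of worker threads. Workers claim
// the next queued index, so results[i] always belongs to configs[i].
std::vector<long> run_all(const std::vector<SimConfig> &configs,
                          const SimFunction &simulate,
                          std::size_t worker_count) {
        std::vector<long> results(configs.size());
        std::atomic<std::size_t> next{0};

        auto worker = [&]() {
                for (;;) {
                        const std::size_t i = next.fetch_add(1);
                        if (i >= configs.size())
                                return;
                        results[i] = simulate(configs[i]);
                }
        };

        if (worker_count == 0)
                worker_count = 1;
        std::vector<std::thread> workers;
        for (std::size_t w = 0; w < worker_count; ++w)
                workers.emplace_back(worker);
        for (std::thread &t : workers)
                t.join();
        return results;
}
```

Each worker writes only to its own slot in `results`, so no lock is needed around the output, and the result order does not depend on thread scheduling.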

### [TODO] Configuration and results manager
Use databases to store configurations and simulation results, and
access them in a git-like fashion.
104
style.md
Normal file
@@ -0,0 +1,104 @@
# Style Guide for DOFS
Coding style matters a lot. Good coding style makes the code easier
on the eye, and can help avoid some pitfalls and confusion.

## Sane Defaults
When editing C/C++ code, it is preferred to use the following `astyle`
configuration (consider putting it in your `.astylerc`):

```
--style=java -k3 -W3 -m0 -f -p -H --squeeze-lines=3 -xb -xf -xh -c --max-code-length=80 -xL -Y --indent=spaces=8
```

## Indentation
For all indentation, use **spaces**, not tabs.

The rationale behind this is to avoid different indent width settings
in different editors. Making your source file a little bigger is a
worthwhile trade-off for portability across editors.

### C/C++
Use 8 spaces. This not only matches the 8-character indentation width
of the Linux kernel's coding style, but also keeps your indentation
levels from getting too deep.

**IMPORTANT: If the indentation is pushing lines past the 80-char
width, you should probably consider refactoring the logic.**

### Python
Use 4 spaces. This is enough for scripts, and it is the choice made by
the people behind Python (PEP 8).

### Shell scripts
Use 4 spaces. There might be arguments to make it 2, but 4 is the
minimum if you want to spot something appearing in an incorrect level
when you've been staring at the screen for 15 hours.

### Line width
80 characters is preferred, but a line can be extended by 20
characters or so to accommodate longer identifiers.

If a line exceeds 80 characters, consider breaking it into multiple
lines.

However, it is important to note that when passing many
parameters/logic, the expression should always be broken into logical
chunks, one per line.
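
As an illustration of "logical chunks", the sketch below groups parameters by meaning (endpoints, then link characteristics, then state) rather than filling each line up to the width limit. `LinkSpec`, `make_link`, `default_link`, and the constants are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <string>

// Hypothetical link description, used only to illustrate line breaking.
struct LinkSpec {
        std::string src_node;
        std::string dst_node;
        std::uint64_t bandwidth_bps;
        std::uint64_t latency_ns;
        bool enabled;
};

constexpr std::uint64_t LINK_BANDWIDTH_BPS = 100'000'000'000ULL;
constexpr std::uint64_t LINK_LATENCY_NS = 500;

// One logical group per line: endpoints, characteristics, state.
LinkSpec make_link(const std::string &src_node, const std::string &dst_node,
                   std::uint64_t bandwidth_bps, std::uint64_t latency_ns,
                   bool enabled) {
        return LinkSpec{src_node, dst_node,
                        bandwidth_bps, latency_ns,
                        enabled};
}

// The call site mirrors the same grouping.
LinkSpec default_link(const std::string &src_node,
                      const std::string &dst_node) {
        return make_link(src_node, dst_node,
                         LINK_BANDWIDTH_BPS, LINK_LATENCY_NS,
                         true);
}
```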

## Avoid magic numbers
Unless it's the bit-length of a byte or something that's commonly
known and obvious at first glance, use a constant to store it.
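
A small sketch of the rule, with made-up values: the sizes get named constants, while the number of bits in a byte stays inline because it is obvious at first glance. `MTU_BYTES`, `HEADER_BYTES`, and both functions are hypothetical.

```cpp
#include <cstddef>

// Hypothetical values, named so that their meaning is visible at the
// point of use.
constexpr std::size_t MTU_BYTES = 1500;
constexpr std::size_t HEADER_BYTES = 64;
constexpr std::size_t MAX_PAYLOAD_BYTES = MTU_BYTES - HEADER_BYTES;

// Number of packets needed to carry a message of the given size.
std::size_t packets_needed(std::size_t message_bytes) {
        return (message_bytes + MAX_PAYLOAD_BYTES - 1) / MAX_PAYLOAD_BYTES;
}

// 8 bits per byte is commonly known, so it is fine to leave inline.
std::size_t bits_in(std::size_t byte_count) {
        return byte_count * 8;
}
```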

## Arrays
Avoid arrays with more than 2 dimensions. If you need to
store multiple dimensions of data, consider using `struct` or
different containers to clarify what each dimension stores.
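
For instance, per-port counters that might otherwise live in something like `counters[pod][switch][port][2]` can be split into named levels. The types below are hypothetical and only show the idea.

```cpp
#include <cstdint>
#include <vector>

// Each former array dimension becomes a named type or field.
struct PortCounters {
        std::uint64_t tx_bytes = 0;
        std::uint64_t rx_bytes = 0;
};

struct SwitchStats {
        std::vector<PortCounters> ports;
};

struct PodStats {
        std::vector<SwitchStats> switches;
};

// Sums transmitted bytes over every port of every switch in every pod.
std::uint64_t total_tx_bytes(const std::vector<PodStats> &pods) {
        std::uint64_t total = 0;
        for (const PodStats &pod : pods) {
                for (const SwitchStats &sw : pod.switches) {
                        for (const PortCounters &port : sw.ports)
                                total += port.tx_bytes;
                }
        }
        return total;
}
```

A reader of `port.tx_bytes` no longer has to remember whether index 0 or 1 of the last dimension meant "transmit".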

## Naming schemes
Names are only meaningful to humans, and the rationale behind the
following guidelines is to allow anyone reading the code to know what
an identifier refers to without scrolling back to its definition or
other references.

### Snake case or camel case?
Use snake case for function and variable names, and camel case for
class/type/enum names.

### Scoping
For all identifiers, it's important to note the scope of their usage.
Names are there to avoid confusion, not add to it, and the
considerations about confusion should fall in the same scope as their
usage.

### Abbreviating
Using abbreviations is okay and a good idea under the right
circumstances.

As a general rule of thumb, the aggressiveness of abbreviating words
is inversely proportional to the size of the scope. But it's a **bad
idea** to abbreviate global identifiers that are not commonly used.

### Constants
For all constants and enums, use **ALL_CAPS**.

### Global identifiers
Use **FULL NAMES** unless the abbreviation is pre-agreed on or defined
by a specification, like `mosi` or `sys_clk`.
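
The sketch below puts these naming rules side by side. It reads "camel case for enum names" plus "ALL_CAPS for enums" as camel case for the enum type and ALL_CAPS for its values, which is an interpretation rather than something this guide states outright. All identifiers are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t MAX_QUEUE_DEPTH = 4; // constant: ALL_CAPS

enum class LinkState { // type name: camel case
        UP,            // enum values: ALL_CAPS (assumed reading)
        DOWN,
};

class PacketQueue { // class name: camel case
public:
        bool try_push() { // function name: snake case
                if (queued_packets >= MAX_QUEUE_DEPTH)
                        return false;
                ++queued_packets;
                return true;
        }

private:
        std::size_t queued_packets = 0; // variable name: snake case
};

bool is_usable(LinkState state) {
        return state == LinkState::UP;
}

std::size_t fill_queue(PacketQueue &queue) {
        std::size_t n = 0; // tiny scope, so a short name is fine
        while (queue.try_push())
                ++n;
        return n;
}
```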

## Commenting
Comments are great, but don't over-comment. They are there
for exactly two things:

1. Tell people **what** the code does
2. Give a signal for future development (e.g. implementation notes,
   usage warnings, required guarantees)

If you need to explain how your code does something using comments,
it's a better idea to rewrite the code.
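
A short sketch of both kinds of comment: one line that says what the function does, followed by signal lines for future readers. The `REQUIRES:` and `NOTE:` tags are an assumed convention for this example, not one defined by this guide, and the function itself is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the least-loaded port.
// REQUIRES: queue_bytes is not empty.
// NOTE: ties go to the lowest index so that results stay reproducible.
std::size_t least_loaded_port(const std::vector<std::uint64_t> &queue_bytes) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < queue_bytes.size(); ++i) {
                if (queue_bytes[i] < queue_bytes[best])
                        best = i;
        }
        return best;
}
```

The loop body carries no comments because it needs none: the names already say how the search works.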

## Output Messages
Like comment signals, messages should also be in complete lines and
`grep`-friendly.
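
One way to keep a message `grep`-friendly is to emit it as a single complete line with a fixed tag and `key=value` fields. The format and the function name below are assumptions made for this sketch, not DOFS's logging format.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Builds one complete, single-line message with a fixed tag, so that
// searching for "LINK_DOWN" finds every occurrence.
std::string link_down_message(std::uint64_t time_ns, const std::string &link) {
        std::ostringstream out;
        out << "[LINK_DOWN] time_ns=" << time_ns << " link=" << link;
        return out.str();
}
```

With this shape, `grep "LINK_DOWN" run.log` lists every link failure in a (hypothetical) log file, and each hit is usable without reading neighbouring lines.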

## Tricks and workarounds
Don't try to write "smart" code. Instead, write code that everyone can
understand without too much explanation.