diff --git a/README.md b/README.md
new file mode 100644
index 0000000..250e6a1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,130 @@
+# DOFS - Datacenter Observability and Failover Simulator
+A simulator for datacenter networks and compute nodes focusing on
+observability, reproducibility, and policy/algorithm modeling.
+
+DOFS ships with some defaults and examples, but it's really meant to
+be a framework that you should adapt to your own datacenter
+architecture.
+
+## What is DOFS?
+DOFS (pronounced "doofus") is an open-source project that integrates
+a network simulator and a compute node simulator. It is designed with
+modularity in mind for scalability and parallel simulations. Planned
+updates will add features for easy configuration and smooth data
+analysis.
+
+## Why DOFS?
+> Keep it simple, stupid.
+
+Systems should be simple and easily hackable. DOFS is a system that
+wants users to hack it, not use it.
+
+## Built-in support
+### Core simulator
+- Multiple event-driven simulator instances (TBD).
+- Randomness generator abstraction suite.
+- Simulator instance-bound logging.
+- Thread-safe console messaging.
+### Networking
+#### Switch side
+- Shared and dedicated switch buffers with RED-style ECN generation.
+- Unicast multipath spraying with ECMP.
+- Group-based multicast.
+- Throttling of switch forwarding latency.
+- Shutting down and rebooting of switches.
+#### NIC side
+- Assumes RDMA access for NICs.
+- NIC-side congestion control and unicast/multicast load balancing.
+- Automatic packet generation and protocol handling.
+- Throttling of NIC latency.
+- Shutting down and rebooting of NICs.
+#### Link side
+- Throttling and shutting down of links.
+### Compute nodes (hosts)
+- Management messaging via a deterministic network.
+- Simple publisher/subscriber model.
+- Reattaching of NICs.
+- Throttling, shutting down and rebooting.
+
+## Getting started
+### Requirements
+1. cmake (version at least 3.20)
+2. [`yaml-cpp`](https://git.peisongxiao.com/peisongxiao/yaml-cpp) (mirror provided in a separate repo)
+3. clang20
+4. (Optional) emacs with flycheck-mode, company-mode, and cmake-ide
+### Configuration and Compilation
+For now, use `cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON`
+to configure and `cmake --build build` to build. Later updates will
+introduce Python scripts that wrap these operations.
+
+## Coding in style
+Coding style is important. Code is for both the human and the
+machine; the machine doesn't care about style, but humans do. A good
+style helps development a lot.
+
+See [`style.md`](style.md) for details.
+
+## The planning
+See the plan in [`plan.md`](plan.md).
+
+The immediate next step is to add topology generation (fat-tree) and a
+configuration module.
+
+Most of DOFS's features were planned *before* the first source
+file was even created. A good plan serves both as a good guideline
+and a good reward mechanism. You'd know early when you're running
+into trouble, and you'd know when you've made a solid step in
+realizing the project, even if it's just a few header outlines.
+
+> Plans turn fear into focus, risk into reach, and steps into a path.
+
+When you dream big, use a plan to ground it with smaller, more
+manageable structures. And most people like it when their dreams come
+true.
+
+## Why DOFS (the story)?
+> DOFS was designed to do one thing, and that one thing well.
+
+I remember first touching an `ns-3`-based repo and thinking: why do I
+have to compile for Wi-Fi when my workload is based on RDMA networks
+over RoCE?
+
+And then I wanted to add link failures to the existing topology for
+experiments, and I found out that adding dynamic failures is not that
+easy: I had to invent my own APIs, and there wasn't much tooling for
+simulating dynamic link failures.
+
+Then I came across `ASTRA-sim`, which simulated entire AI workloads
+including the networking components. And I thought: why can't I build
+my own simulator, but extend that to datacenters only?
+
+So, DOFS, designed to be laser-focused on datacenters, came into
+being.
+
+I wanted to build, from the ground up, something that can scale
+networks and compute nodes, something that can allow people to easily
+swap in and out different components of a datacenter at a high level.
+And I wanted something modern, something that was designed with
+modern datacenters in mind, integrating things like the latest
+specifications from UEC.
+
+The mission is not to replace `ns-3` or `ASTRA-sim` entirely, but to
+define an area where you only need the bare minimum to get desired
+results, and to leave room for future updates.
+
+## Special thanks
+I'd like to express my gratitude to ChatGPT and other AI-driven tools
+for helping to realize this project. I used them to write the actual
+code, and I used them to explore my ideas, to plan my path, and to
+catch anything that I overlooked in the process. They are powerful in
+that way: they can help expand what you have in mind, and they can
+offer insights into areas that you've never even heard of. For that,
+I'm thankful to the ever-evolving world of technology, and to the
+countless researchers whose efforts help us live in a better world.
+
+## Afterthoughts
+AI has written a large part of this repo, and I'm not ashamed of it.
+I went through every line of the AI-generated code and performed
+checks with deterministic tools. AI has allowed this project to
+expand in a short amount of time, and good prompting has allowed
+consistency across many different modules.
diff --git a/plan.md b/plan.md
new file mode 100644
index 0000000..76910c2
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,37 @@
+# The Plan/Roadmap for DOFS
+> Plans turn fear into focus, risk into reach, and steps into a path.
+
+This plan has been modified in the course of the development of DOFS.
+And that was also in the plan itself: you plan at every step. See the
+end for the changes made to the plan.
+
+## The roadmap
+This is a rough summary of what I did and what I plan to do.
+
+### [DONE] Core single-instance simulator features
+Implement an event-driven simulator and logging to console and output
+features.
+
+### [DONE] Core networking components
+Implement links, switches, and NICs, and leave sufficient but simple
+interfaces for the hosts and the simulator orchestration system to
+interact with them.
+
+### [DONE] Core hosts with examples
+Simple publisher/subscriber model with simple policies as examples of
+how to build hosts in DOFS.
+
+### [TODO] Topology generation
+Define how topology factories should work, and implement a concrete
+fat-tree example.
+
+### [TODO] Configuration of simulations using YAML
+Define the format of configuration.
+
+### [TODO] Multithreaded simulation orchestrator
+Implement a system that can spin up multiple simulations with
+different configurations to allow parallel and queued processing.
+
+### [TODO] Configuration and results manager
+Use databases to store configurations and simulation results and
+access them in a git-like fashion.
diff --git a/style.md b/style.md
new file mode 100644
index 0000000..70a1f94
--- /dev/null
+++ b/style.md
@@ -0,0 +1,104 @@
+# Style Guide for DOFS
+Coding style matters a lot. Good coding style makes the code easier
+on the eye, and can help mitigate some pitfalls and confusion.
+
+## Sane Defaults
+When editing C/C++ code, it is preferred to use the following `astyle`
+configuration (consider putting it in your `.astylerc`):
+
+```
+--style=java -k3 -W3 -m0 -f -p -H --squeeze-lines=3 -xb -xf -xh -c --max-code-length=80 -xL -Y --indent=spaces=8
+```
+
+## Indentation
+For all indentation, use **spaces**, not tabs.
+
+The rationale behind this is to avoid different indent width settings
+in different editors. It's a good trade-off: the source file gets a
+little bigger, but it looks the same in every editor.
+
+### C/C++
+Use 8 spaces. This is not only to adhere to the Linux kernel's coding
+style, but also to prevent your indentation levels from getting too
+big.
+
+**IMPORTANT: If the indentation is blowing lines off the 80-char
+width, you should probably consider refactoring the logic.**
+
+### Python
+Use 4 spaces. This is enough for scripts, and a choice by the people
+behind Python.
+
+### Shell scripts
+Use 4 spaces. There might be arguments to make it 2, but 4 is the
+minimum if you want to spot something appearing in an incorrect level
+when you've been staring at the screen for 15 hours.
+
+### Line width
+80 characters is preferred, but it can be extended by 20 characters or
+so to accommodate longer identifiers.
+
+If it breaches 80 characters, consider breaking it into multiple lines.
+
+However, when passing many parameters or writing long logic
+expressions, always break them into lines so that each line holds one
+logical chunk.
+
+## Avoid magic numbers
+Unless it's the bit-length of a byte or something that's commonly
+known and obvious at first glance, use a constant to store it.
+
+## Arrays
+Avoid using arrays with more than 2 dimensions. If you need to
+store multiple dimensions of data, consider using `struct` or
+different containers to clarify what each dimension stores.
+
+## Naming schemes
+Names are only meaningful to humans, and the rationale behind the
+following guidelines is to allow anyone reading the code to know what
+an identifier refers to without scrolling back to its definition or
+other references.
+
+### Snake case or camel case?
+Use snake case for function and variable names, and camel case for
+class/type/enum names.
+
+### Scoping
+For all identifiers, it's important to note the scope of their usage.
+Names are there to avoid confusion, not add to it, and the
+considerations about confusion should fall in the same scope as their
+usage.
+
+### Abbreviating
+Using abbreviations is okay and a good idea under the right
+circumstances.
+
+As a general rule of thumb, the aggressiveness of abbreviating words
+is inversely proportional to the size of the scope. But it's a **bad
+idea** to abbreviate global identifiers that are not commonly used.
+
+### Constants
+For all constants and enums, use **ALL_CAPS**.
+
+### Global identifiers
+Use **FULL NAMES** unless it's something agreed on beforehand or
+defined by a specification, like `mosi` or `sys_clk`.
+
+## Commenting
+Comments are great, but don't over-comment. They are there
+for exactly two things:
+
+1. Tell people **what** the code does
+2. Give a signal for future development (e.g. implementation notes,
+   usage warnings, required guarantees)
+
+If you need to explain how your code does something using comments,
+it's a better idea to re-write the code.
+
+## Output Messages
+Like comment signals, messages should also be in complete lines and
+`grep`-friendly.
+
+## Tricks and workarounds
+Don't try to write "smart" code; instead, write code that everyone can
+understand without too much explanation.
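
A minimal C++ sketch of how several of the style guide's rules might look when applied together: 8-space indentation, a named constant instead of a magic number, snake case for functions and variables, camel case for types, and a `struct` that names each field. Every identifier here (`PortStats`, `LinkState`, `ECN_MARK_THRESHOLD`, `MAX_QUEUE_DEPTH`, `should_mark_ecn`, `count_marked_ports`) is hypothetical and does not come from the DOFS sources. Writing enumerators in ALL_CAPS is an assumption about how the constants rule applies to enum values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; none of these come from the DOFS sources.

// Constants use ALL_CAPS and replace magic numbers.
constexpr std::uint32_t MAX_QUEUE_DEPTH = 64;
constexpr std::uint32_t ECN_MARK_THRESHOLD = 48;

// Type names use camel case; enumerators are assumed to be ALL_CAPS.
enum class LinkState { UP, DOWN };

// A struct names each field instead of hiding it in an array dimension.
struct PortStats {
        std::uint32_t queued_packets;
        LinkState state;
};

// Returns true when a port that is up has reached the marking threshold.
// Required guarantee: queued_packets never exceeds MAX_QUEUE_DEPTH.
bool should_mark_ecn(const PortStats &stats)
{
        assert(stats.queued_packets <= MAX_QUEUE_DEPTH);
        return stats.state == LinkState::UP
               && stats.queued_packets >= ECN_MARK_THRESHOLD;
}

// Counts the ports that would currently be marked.
std::size_t count_marked_ports(const std::vector<PortStats> &ports)
{
        std::size_t marked = 0;
        for (const PortStats &port : ports) {
                if (should_mark_ecn(port))
                        ++marked;
        }
        return marked;
}
```

The comments say what each piece does and flag one required guarantee, which are the two uses the Commenting section allows.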