added README and some other documentation

2025-09-13 20:03:42 -04:00
parent defe696710
commit 1a7408bf0c
3 changed files with 271 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,130 @@
 # DOFS - Datacenter Observability and Failover Simulator
 A simulator for datacenter networks and compute nodes focusing on
 observability, reproducibility, and policy/algorithm modeling.
 DOFS ships with some defaults and examples, but it's really meant to
 be a framework that you should adapt to your own datacenter
 architecture.
 ## What is DOFS?
 DOFS (pronounced "doofus") is an open-source project that integrates a
 network simulator and a compute node simulator.  It is designed with
 modularity in mind, for scalability and parallel simulations.  Planned
 updates will add features for easy configuration and smooth data
 analysis.
 ## Why DOFS?
 > Keep it simple, stupid.
 Systems should be simple and easily hackable.  DOFS is a system that
 wants users to hack it, not use it.
 ## Built-in support
 ### Core simulator
 - Multiple event-driven simulator instances (TBD).
 - Randomness generator abstraction suite.
 - Simulator instance-bound logging.
 - Thread-safe console messaging.
 ### Networking
 #### Switch side
 - Shared and dedicated switch buffers with RED-style ECN generation.
 - Unicast multipath spraying with ECMP.
 - Group-based multicast.
 - Throttling of switch forwarding latency.
 - Shutting down and rebooting of switches.
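As an illustration of the RED-style ECN generation listed above, here is a minimal C++ sketch of a marking-probability ramp.  It is not DOFS's actual code: `RedParams`, `ecn_mark_probability`, and the threshold fields are hypothetical names.  The sketch also uses the instantaneous queue depth for brevity, whereas classic RED works on a moving average of the queue length.

```cpp
#include <cstdint>

// Hypothetical parameter set for illustration; not taken from DOFS.
struct RedParams {
        std::uint32_t min_threshold_bytes;
        std::uint32_t max_threshold_bytes;
        double max_mark_probability;
};

// Probability of ECN-marking a packet at the given queue depth: 0 below the
// minimum threshold, 1 at or above the maximum threshold, and a linear ramp
// from 0 up to max_mark_probability in between.
// Required guarantee: min_threshold_bytes <= max_threshold_bytes.
constexpr double ecn_mark_probability(const RedParams &params,
                                      std::uint32_t queue_bytes) {
        if (queue_bytes < params.min_threshold_bytes) {
                return 0.0;
        }
        if (queue_bytes >= params.max_threshold_bytes) {
                return 1.0;
        }
        const double span = params.max_threshold_bytes -
                            params.min_threshold_bytes;
        const double excess = queue_bytes - params.min_threshold_bytes;
        return params.max_mark_probability * excess / span;
}
```

With thresholds of 100 and 300 bytes and a maximum marking probability of 0.5, a queue depth of 200 bytes gives a marking probability of 0.25.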
 #### NIC side
 - Assumes RDMA access for NICs.
 - NIC-side congestion control and unicast/multicast load balancing.
 - Automatic packet generation and protocol handling.
 - Throttling of NIC latency.
 - Shutting down and rebooting of NICs.
 #### Link side
 - Throttling and shutting down of links.
 ### Compute nodes (hosts)
 - Management messaging via deterministic network.
 - Simple publisher/subscriber model.
 - Reattaching of NICs.
 - Throttling, shutting down and rebooting.
 ## Getting started
 ### Requirements
 1. cmake (version at least 3.20)
 2. [`yaml-cpp`](https://git.peisongxiao.com/peisongxiao/yaml-cpp) (mirror provided in a separate repo)
 3. clang20
 4. (Optional) emacs with flycheck-mode, company-mode, and cmake-ide
 ### Configuration and Compilation
 For now, use `cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON`
 to configure and `cmake --build build` to build.  Later updates will
 introduce Python scripts that wrap these operations.
 ## Coding in style
 Coding style is important.  Code is for both the human and the
 machine: the machine doesn't care about style, but humans do.  A good
 style helps development a lot.
 See [`style.md`](style.md) for details.
 ## The planning
 See the plan in [`plan.md`](plan.md).
 The immediate next step is to add topology generation (fat-tree) and a
 configuration module.
 Most of DOFS's features have been planned *before* the first source
 file was even created.  A good plan serves both as a good guideline
 and a good reward mechanism.  You'd know early when you're running
 into trouble, and you'd know when you've made a solid step in
 realizing the project, even if it's just a few header outlines.
 > Plans turn fear into focus, risk into reach, and steps into a path.
 When you dream big, use a plan to ground it with smaller, more
 manageable structures.  And most people like it when their dreams come
 true.
 ## Why DOFS (the story)?
 > DOFS was designed to do one thing, and that one thing well.
 I remember first touching an `ns-3` based repo and thinking: why do I
 have to compile for Wi-Fi when my workload is based on RDMA networks
 over RoCE?
 Then I wanted to add link failures to the existing topology for
 experiments, and I found out that adding dynamic failures is not that
 easy: I had to invent my own APIs, and there wasn't much tooling for
 simulating dynamic link failures.
 Then I came across `ASTRA-sim`, which simulated entire AI workloads
 including the networking components.  And I thought: why can't I build
 my own simulator, but one focused only on datacenters?
 So, DOFS, designed to be laser focused on datacenters, came into
 being.
 I wanted to build, from the ground up, something that can scale
 networks and compute nodes, something that can allow people to easily
 swap in and out different components of a datacenter at a high level.
 And I wanted something modern, something that was designed with 
 modern datacenters in mind, integrating things like the latest
 specifications from UEC.
 The mission is not to replace `ns-3` or `ASTRA-sim` entirely, but to
 define an area where you only need the bare minimum to get the desired
 results, and to leave room for future updates.
 ## Special thanks
 I'd like to express my gratitude to ChatGPT and other AI-driven tools
 for helping to realize this project.  I used them to write the actual
 code, and I used them to explore my ideas, to plan my path, and to
 catch anything that I overlooked in the process.  They are powerful in
 that way: they can help expand what you have in mind, and they can
 offer insights into areas that you've never even heard of.  For that,
 I'm thankful to the ever-evolving world of technology, and to the
 countless researchers whose efforts make the world a better place.
 ## Afterthoughts
 AI has written a large part of this repo, and I'm not ashamed of it.  I
 went through every line of the AI-generated code and performed checks
 with deterministic tools.  AI has allowed this project to expand in a
 short amount of time, and good prompting has kept many different
 modules consistent.
--- a/plan.md
+++ b/plan.md
@@ -0,0 +1,37 @@
 # The Plan/Roadmap for DOFS
 > Plans turn fear into focus, risk into reach, and steps into a path.
 This plan has been modified in the course of the development of DOFS.
 And that was also in the plan itself: you plan at every step.  See the
 end for the changes made to the plan.
 ## The roadmap
 This is a rough summary of what I did and what I plan to do.
 ### [DONE] Core single instance simulator features
 Implement an event-driven simulator, console logging, and output
 features.
 ### [DONE] Core networking components
 Implement links, switches, and NICs, and expose interfaces that are
 simple yet sufficient for the hosts and the simulator orchestration
 system to interact with them.
 ### [DONE] Core hosts with examples
 A simple publisher/subscriber model with simple policies, as examples
 of how to build hosts in DOFS.
 ### [TODO] Topology generation
 Define how topology factories should work, and implement a concrete
 fat-tree example.
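To make the fat-tree target concrete, the sketch below computes how many switches and hosts a standard k-ary fat-tree contains for a given (even) switch port count k.  The topology factory interface is not defined yet, so `FatTreeSize` and `fat_tree_size` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Element counts of a k-ary fat-tree built from k-port switches.
struct FatTreeSize {
        std::uint64_t core_switches;
        std::uint64_t aggregation_switches;
        std::uint64_t edge_switches;
        std::uint64_t hosts;
};

// Computes the size of a k-ary fat-tree.
// Required guarantee: port_count is even.
constexpr FatTreeSize fat_tree_size(std::uint64_t port_count) {
        const std::uint64_t half = port_count / 2;
        // There are k pods, each with k/2 aggregation and k/2 edge
        // switches.  Every edge switch serves k/2 hosts, and (k/2)^2 core
        // switches connect the pods.
        return FatTreeSize{half * half, port_count * half,
                           port_count * half, port_count * half * half};
}
```

For 4-port switches this gives 4 core, 8 aggregation, and 8 edge switches serving 16 hosts.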
 ### [TODO] Configuration of simulations using YAML
 Define the format of configuration.
 ### [TODO] Multithreaded simulation orchestrator
 Implement a system that can spin up multiple simulations with
 different configurations to allow parallel and queued processing.
 ### [TODO] Configuration and results manager
 Use databases to store configurations and simulation results, and
 access them in a git-like fashion.
--- a/style.md
+++ b/style.md
@@ -0,0 +1,104 @@
 # Style Guide for DOFS
 Coding style matters a lot.  Good coding style makes the code easier
 on the eye, and can help avoid some pitfalls and confusion.
 ## Sane Defaults
 When editing C/C++ code, it is preferred to use the following `astyle`
 configuration (consider putting it in your `.astylerc`):
 ```
 --style=java -k3 -W3 -m0 -f -p -H --squeeze-lines=3 -xb -xf -xh -c --max-code-length=80 -xL -Y --indent=spaces=8
 ```
 ## Indentation
 For all indentation, use **spaces**, not tabs.
 The rationale behind this is to avoid different indent width settings
 in different editors.  Making your source file a little bigger is a
 worthwhile trade-off for portability across editors.
 ### C/C++
 Use 8 spaces.  This is not only to adhere to the Linux kernel's coding
 style, but also to prevent your indentation levels from getting too
 big.
 **IMPORTANT: If the indentation is blowing lines off the 80-char
 width, you should probably consider refactoring the logic.**
 ### Python
 Use 4 spaces.  This is enough for scripts, and it is the convention
 chosen by the people behind Python.
 ### Shell scripts
 Use 4 spaces.  There might be arguments to make it 2, but 4 is the
 minimum if you want to spot something appearing in an incorrect level
 when you've been staring at the screen for 15 hours.
 ### Line width
 80 characters is preferred, but a line can extend by 20 characters or
 so to accommodate longer identifiers.
 If a line breaches 80 characters, consider breaking it into multiple
 lines.
 When passing many parameters or writing long logic, always break the
 line into logical chunks, one chunk per line.
 ## Avoid magic numbers
 Unless it's the bit-length of a byte or something that's commonly
 known and obvious at first glance, use a constant to store it.
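A minimal sketch of this rule follows.  The constants and their values (`MTU_BYTES`, `HEADER_BYTES`) and both functions are hypothetical, chosen only to show named constants next to the one kind of literal the rule permits.

```cpp
#include <cstdint>

// Hypothetical values, chosen only for illustration.
constexpr std::uint32_t MTU_BYTES = 4096;
constexpr std::uint32_t HEADER_BYTES = 64;
constexpr std::uint32_t PAYLOAD_BYTES = MTU_BYTES - HEADER_BYTES;

// Number of packets needed to carry message_bytes of payload, rounded up.
constexpr std::uint64_t packet_count(std::uint64_t message_bytes) {
        return (message_bytes + PAYLOAD_BYTES - 1) / PAYLOAD_BYTES;
}

// 8 is the bit-length of a byte, so it may stay a literal.
constexpr std::uint64_t bytes_to_bits(std::uint64_t bytes) {
        return bytes * 8;
}
```

A reader can tell what `PAYLOAD_BYTES` means without hunting for where a bare 4032 came from, and changing the header size only touches one line.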
 ## Arrays
 Avoid using arrays with more than 2 dimensions.  If you need to
 store multiple dimensions of data, consider using `struct` or
 different containers to clarify what each dimension stores.
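One possible way to apply this, sketched with hypothetical types (`SwitchQueues`, `Pod`) and sizes that are not taken from DOFS: instead of a three-dimensional array indexed by pod, switch, and port, each dimension gets a named type.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t PORTS_PER_SWITCH = 4;
constexpr std::size_t SWITCHES_PER_POD = 2;

// Replaces a raw queue_bytes[pod][switch][port] array: every level now has
// a name that says what it stores.
struct SwitchQueues {
        std::array<std::uint32_t, PORTS_PER_SWITCH> port_queue_bytes;
};

struct Pod {
        std::array<SwitchQueues, SWITCHES_PER_POD> switches;
};

// Total number of bytes queued on all ports of all switches in the pod.
constexpr std::uint64_t pod_queued_bytes(const Pod &pod) {
        std::uint64_t total = 0;
        for (std::size_t sw = 0; sw < SWITCHES_PER_POD; ++sw) {
                for (std::size_t port = 0; port < PORTS_PER_SWITCH; ++port) {
                        total += pod.switches[sw].port_queue_bytes[port];
                }
        }
        return total;
}
```

An access such as `pod.switches[sw].port_queue_bytes[port]` is longer than `queue_bytes[p][s][q]`, but it cannot be misread as indexing the wrong dimension.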
 ## Naming schemes
 Names are only meaningful to humans, and the rationale behind the
 following guidelines is to allow anyone reading the code to know what
 an identifier refers to without scrolling back to its definition or
 other references.
 ### Snake case or camel case?
 Use snake case for function and variable names, and camel case for
 class/type/enum names.
 ### Scoping
 For all identifiers, it's important to note the scope of their usage.
 Names are there to avoid confusion, not add to it, and the
 considerations about confusion should fall in the same scope as their
 usage.
 ### Abbreviating
 Using abbreviations is okay and a good idea under the right
 circumstances.
 As a general rule of thumb, how aggressively you abbreviate should be
 inversely proportional to the size of the scope.  But it's a **bad
 idea** to abbreviate global identifiers that are not commonly used.
 ### Constants
 For all constants and enums, use **ALL_CAPS**.
 ### Global identifiers
 Use **FULL NAMES** unless the name is agreed on beforehand or defined
 by a specification, like `mosi` or `sys_clk`.
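The naming rules above can be sketched together in one short example.  All identifiers here (`LinkState`, `LinkConfig`, `THROTTLE_FACTOR`, and so on) are hypothetical and not taken from the DOFS sources, and "camel case" is assumed to mean upper camel case for type names.

```cpp
#include <cstdint>

// Camel case for type names, ALL_CAPS for enum values.
enum class LinkState : std::uint8_t {
        UP,
        THROTTLED,
        DOWN,
};

// ALL_CAPS, unabbreviated names for global constants (hypothetical values).
constexpr std::uint32_t DEFAULT_LINK_LATENCY_NS = 500;
constexpr std::uint32_t THROTTLE_FACTOR = 4;

struct LinkConfig {
        LinkState state;
        std::uint32_t base_latency_ns;
};

// Snake case for functions and variables.
constexpr bool can_forward(const LinkConfig &link_config) {
        return link_config.state != LinkState::DOWN;
}

// A short name like `cfg` is acceptable here because its scope is tiny.
constexpr std::uint32_t effective_latency_ns(const LinkConfig &cfg) {
        if (cfg.state == LinkState::THROTTLED) {
                return cfg.base_latency_ns * THROTTLE_FACTOR;
        }
        return cfg.base_latency_ns;
}
```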
 ## Commenting
 Comments are great, but don't over-comment.  They are there
 for exactly two things:
 1. Tell people **what** the code does
 2. Give a signal for future development (e.g. implementation notes,
   usage warnings, required guarantees)
 If you need to explain how your code does something using comments,
 it's a better idea to re-write the code.
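As a small illustration of the two comment purposes listed above, the hypothetical function below (not from the DOFS sources) carries one comment that says what it does and one that signals a usage warning.  The body is short enough that no comment needs to explain how it works.

```cpp
#include <cstdint>

// Returns the time needed to put a packet on the wire, in nanoseconds,
// rounded down.
// Usage warning: link_gbps must be non-zero.
constexpr std::uint64_t serialization_delay_ns(std::uint64_t packet_bytes,
                                               std::uint64_t link_gbps) {
        return packet_bytes * 8 / link_gbps;
}
```

Because 1 Gbit/s is 1 bit per nanosecond, a 1500-byte packet on a 100 Gbit/s link takes 12000 / 100 = 120 ns.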
 ## Output Messages
 Like comment signals, messages should also be in complete lines and
 `grep`-friendly.
 ## Tricks and workarounds
 Don't try to write "smart" code.  Instead, write code that everyone can
 understand without too much explanation.