# DOFS - Datacenter Observability and Failover Simulator

A simulator for datacenter networks and compute nodes focusing on observability, reproducibility, and policy/algorithm modeling.

DOFS ships with some defaults and examples, but it's really meant to be a framework that you should adapt to your own datacenter architecture.

## What is DOFS?

DOFS (pronounced "doofus") is an open-source project that integrates a network simulator and a compute node simulator. It is designed with modularity in mind for scalability and parallel simulations, and planned updates will add features for easy configuration and smooth data analysis.

## Why DOFS?

> Keep it simple, stupid.

Systems should be simple and easily hackable. DOFS is a system that wants users to hack it, not use it.

## Built-in support

### Core simulator

- Multiple event-driven simulator instances (TBD).
- Randomness generator abstraction suite.
- Simulator instance-bound logging.
- Thread-safe console messaging.

### Networking

#### Switch side

- Shared and dedicated switch buffers with RED-style ECN generation.
- Unicast multipath spraying with ECMP.
- Group-based multicast.
- Throttling of switch forwarding latency.
- Shutting down and rebooting of switches.

#### NIC side

- Assumes RDMA access for NICs.
- NIC-side congestion control and unicast and multicast load balancing.
- Automatic packet generation and protocol handling.
- Throttling of NIC latency.
- Shutting down and rebooting of NICs.

#### Link side

- Throttling and shutting down of links.

### Compute nodes (hosts)

- Management messaging via a deterministic network.
- Simple publisher/subscriber model.
- Reattaching of NICs.
- Throttling, shutting down, and rebooting.

## Getting started

### Requirements

1. cmake (version at least 3.20)
2. [`yaml-cpp`](https://git.peisongxiao.com/peisongxiao/yaml-cpp) (mirror provided in a separate repo)
3. clang 20
4. (Optional) emacs with flycheck-mode, company-mode, and cmake-ide

### Configuration and Compilation

For now, use `cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON` to configure and `cmake --build build` to build.

Later updates will introduce Python scripts that wrap these operations.

## Coding in style

Coding style is important. Code is for both the human and the machine: the machine doesn't care about style, but humans do. A good style helps development a lot.

See [`style.md`](style.md) for details.

## The planning

See the plan in [`plan.md`](plan.md). The immediate next step is to add topology generation (fat-tree) and a configuration module.

Most of DOFS's features were planned *before* the first source file was even created. A good plan serves both as a good guideline and a good reward mechanism. You'd know early when you're running into trouble, and you'd know when you've made a solid step in realizing the project, even if it's just a few header outlines.

> Plans turn fear into focus, risk into reach, and steps into a path.

When you dream big, use a plan to ground it with smaller, more manageable structures. And most people like it when their dreams come true.

## Why DOFS (the story)?

> DOFS was designed to do one thing, and that one thing well.

I remember first touching an `ns-3` based repo and thinking: why do I have to compile for Wi-Fi when my workload is based on RDMA networks over RoCE? Then I wanted to add link failures to the existing topology for experiments, and I found out that adding dynamic failures is not that easy. I had to invent my own APIs, and there wasn't much tooling for simulating dynamic link failures.

Then I came across `ASTRA-sim`, which simulated entire AI workloads including the networking components. And I thought: why can't I build my own simulator, but one scoped to datacenters only?

So DOFS, designed to be laser-focused on datacenters, came into being.
I wanted to build, from the ground up, something that can scale networks and compute nodes, something that lets people easily swap different components of a datacenter in and out at a high level. And I wanted something modern, something designed with modern datacenters in mind, integrating things like the latest specifications from UEC.

The mission is not to replace `ns-3` or `ASTRA-sim` entirely, but to define an area where you only need the bare minimum to get the desired results, and to leave room for future updates.

## Special thanks

I'd like to express my gratitude to ChatGPT and other AI-driven tools for helping to realize this project. I used them to write the actual code, to explore my ideas, to plan my path, and to catch anything that I overlooked in the process. They are powerful in that way: they can help expand what you have in mind, and they can offer insights into areas that you've never even heard of.

And for that, I'm thankful to the ever-evolving world of technology, and to the countless researchers whose efforts make the world a better place to live in.

## Afterthoughts

AI has written a large part of this repo, and I'm not ashamed of it. I went through every line of the AI-generated code and performed checks with deterministic tools. AI has allowed this project to expand in a short amount of time, and good prompting has kept many different modules consistent.