DOFS - Datacenter Observability and Failover Simulator

A simulator for datacenter networks and compute nodes focusing on observability, reproducibility, and policy/algorithm modeling.

DOFS ships with some defaults and examples, but it's really meant to be a framework that you should adapt to your own datacenter architecture.

What is DOFS?

DOFS (pronounced "doofus") is an open-source project that integrates a network simulator with a compute node simulator. It is designed to be modular so that simulations can scale and run in parallel. Planned updates will add features for easier configuration and smoother data analysis.

Why DOFS?

Keep it simple, stupid.

Systems should be simple and easily hackable. DOFS is a system that wants users to hack it, not use it.

Built-in support

Core simulator

  • Multiple event-driven simulator instances (TBD).
  • Randomness generator abstraction suite.
  • Simulator instance-bound logging.
  • Thread-safe console messaging.
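As a rough illustration of what "event-driven" means here, the sketch below shows a minimal discrete-event loop: callbacks are queued with a timestamp and executed in time order, with insertion order breaking ties so that runs stay reproducible. All names (`EventLoop`, `SimTime`, `schedule`, `run`) are hypothetical and are not taken from the DOFS sources; the real core may be structured quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not DOFS's actual API.
using SimTime = std::uint64_t;  // simulated time, e.g. in nanoseconds

class EventLoop {
public:
    using Callback = std::function<void()>;

    // Schedule `cb` to fire `delay` time units after the current simulated time.
    void schedule(SimTime delay, Callback cb) {
        queue_.push(Event{now_ + delay, next_seq_++, std::move(cb)});
    }

    // Run until no events remain; returns the number of events executed.
    std::size_t run() {
        std::size_t executed = 0;
        while (!queue_.empty()) {
            // top() returns a const reference, so copy the event out before popping.
            Event ev = queue_.top();
            queue_.pop();
            assert(ev.time >= now_);  // simulated time never moves backwards
            now_ = ev.time;
            ev.cb();  // the callback may schedule further events
            ++executed;
        }
        return executed;
    }

    SimTime now() const { return now_; }

private:
    struct Event {
        SimTime time;
        std::uint64_t seq;  // insertion order, breaks timestamp ties deterministically
        Callback cb;
    };

    // Orders the priority queue so the earliest (time, seq) pair is on top.
    struct Later {
        bool operator()(const Event& a, const Event& b) const {
            if (a.time != b.time) return a.time > b.time;
            return a.seq > b.seq;
        }
    };

    std::priority_queue<Event, std::vector<Event>, Later> queue_;
    SimTime now_ = 0;
    std::uint64_t next_seq_ = 0;
};
```

Because each `EventLoop` owns its own queue and clock, several instances could in principle run side by side on separate threads, which is one way to read the "multiple simulator instances" item above.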

Networking

Switch side

  • Shared and dedicated switch buffers with RED-style ECN generation.
  • Unicast multipath spraying with ECMP.
  • Group-based multicast.
  • Throttling of switch forwarding latency.
  • Shutting down and rebooting of switches.
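To make two of the switch-side items concrete, here is a small sketch of RED-style ECN marking and of hash-based ECMP path selection. It rests on several assumptions: the marking decision uses the instantaneous queue depth (classic RED uses a moving average), the thresholds are in bytes, and ECMP hashes per flow rather than per packet. The names `RedEcnConfig`, `ecn_mark_probability`, `should_mark_ecn`, `FlowKey`, and `ecmp_pick` are invented for illustration and do not come from DOFS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical parameter names; not taken from the DOFS sources.
struct RedEcnConfig {
    std::size_t min_threshold;  // bytes: below this, never mark
    std::size_t max_threshold;  // bytes: at or above this, always mark
    double max_probability;     // marking probability just below max_threshold
};

// Marking probability as a function of queue occupancy: 0 below min,
// a linear ramp up to max_probability between min and max, 1 at or above max.
double ecn_mark_probability(const RedEcnConfig& cfg, std::size_t queue_bytes) {
    assert(cfg.min_threshold < cfg.max_threshold);
    if (queue_bytes < cfg.min_threshold) return 0.0;
    if (queue_bytes >= cfg.max_threshold) return 1.0;
    const double span = static_cast<double>(cfg.max_threshold - cfg.min_threshold);
    const double depth = static_cast<double>(queue_bytes - cfg.min_threshold);
    return cfg.max_probability * depth / span;
}

// Decide whether to mark one packet. The RNG is supplied by the caller so
// that a fixed seed reproduces the same marking decisions.
template <typename Rng>
bool should_mark_ecn(const RedEcnConfig& cfg, std::size_t queue_bytes, Rng& rng) {
    std::bernoulli_distribution mark(ecn_mark_probability(cfg, queue_bytes));
    return mark(rng);
}

// Illustrative flow identifier (addresses and ports only).
struct FlowKey {
    std::uint32_t src_addr;
    std::uint32_t dst_addr;
    std::uint16_t src_port;
    std::uint16_t dst_port;
};

// 64-bit mixer (the splitmix64 finalizer), used here as a simple hash.
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Pick one of `n_paths` equal-cost next hops; the same flow always maps
// to the same path index.
std::size_t ecmp_pick(const FlowKey& key, std::size_t n_paths) {
    assert(n_paths > 0);
    const std::uint64_t addrs = (static_cast<std::uint64_t>(key.src_addr) << 32) | key.dst_addr;
    const std::uint64_t ports = (static_cast<std::uint64_t>(key.src_port) << 16) | key.dst_port;
    return static_cast<std::size_t>(mix64(mix64(addrs) ^ ports) % n_paths);
}
```

With `RedEcnConfig{1000, 3000, 0.5}`, a queue holding 2000 bytes sits halfway up the ramp, so the marking probability is 0.25. Per-packet spraying could reuse `ecmp_pick` by folding a packet sequence number into the hashed key, though whether DOFS does so is not stated here.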

NIC side

  • Assumes RDMA access for NICs.
  • NIC-side congestion control and unicast/multicast load balancing.
  • Automatic packet generation and protocol handling.
  • Throttling of NIC latency.
  • Shutting down and rebooting of NICs.
  • Throttling and shutting down of links.

Compute nodes (hosts)

  • Management messaging via a deterministic network.
  • Simple publisher/subscriber model.
  • Reattaching of NICs.
  • Throttling, shutting down and rebooting.
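The publisher/subscriber item can be pictured with a very small in-process message bus: handlers register for a topic, and publishing to that topic invokes them in subscription order. This is only a sketch under assumed names (`MessageBus`, `subscribe`, `publish`, string topics and payloads); the actual DOFS host model and its message types may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-process bus; names and types are illustrative only.
class MessageBus {
public:
    using Handler = std::function<void(const std::string& payload)>;

    // Register `handler` to be called for every message published on `topic`.
    void subscribe(const std::string& topic, Handler handler) {
        assert(handler != nullptr);  // reject empty handlers
        subscribers_[topic].push_back(std::move(handler));
    }

    // Deliver `payload` to every subscriber of `topic`, in subscription order.
    // Returns the number of handlers invoked.
    std::size_t publish(const std::string& topic, const std::string& payload) const {
        const auto it = subscribers_.find(topic);
        if (it == subscribers_.end()) return 0;
        // Copy the handler list so a handler may call subscribe() during delivery
        // without invalidating the iteration.
        const std::vector<Handler> handlers = it->second;
        for (const Handler& handler : handlers) handler(payload);
        return handlers.size();
    }

private:
    std::unordered_map<std::string, std::vector<Handler>> subscribers_;
};
```

A host could, for example, subscribe to a topic such as "nic/down" and react when a NIC is shut down; the topic name is made up for this example.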

Getting started

Requirements

  1. cmake (version at least 3.20)
  2. https://git.peisongxiao.com/peisongxiao/yaml-cpp (mirror provided in a separate repo)
  3. clang20
  4. (Optional) emacs with flycheck-mode, company-mode, and cmake-ide

Configuration and Compilation

For now, configure and build with:

  cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
  cmake --build build

Later updates will introduce Python scripts that wrap these operations.

Coding in style

Coding style is important. Code is written for both humans and machines; the machine doesn't care about style, but humans do. A good style helps development a lot.

See style.md for details.

The planning

See the plan in plan.md.

The immediate next step is to add topology generation (fat-tree) and a configuration module.

Most of DOFS's features were planned before the first source file was even created. A good plan serves as both a guideline and a reward mechanism: you know early when you're running into trouble, and you know when you've taken a solid step toward realizing the project, even if it's just a few header outlines.

Plans turn fear into focus, risk into reach, and steps into a path.

When you dream big, use a plan to ground it with smaller, more manageable structures. And most people like it when their dreams come true.

Why DOFS (the story)?

DOFS was designed to do one thing, and to do that one thing well.

I remember first touching an ns-3 based repo and thinking: why do I have to compile for Wi-Fi when my workload is based on RDMA networks over RoCE?

Then I wanted to add link failures to the existing topology for experiments, and I found that adding dynamic failures was not that easy: I had to invent my own APIs, and there wasn't much tooling for simulating dynamic link failures.

Then I came across ASTRA-sim, which simulates entire AI workloads, including the networking components. And I thought: why can't I build my own simulator that extends that idea to datacenters, and only datacenters?

So, DOFS, designed to be laser focused on datacenters, came into being.

I wanted to build, from the ground up, something that can scale networks and compute nodes, something that can allow people to easily swap in and out different components of a datacenter at a high level. And I wanted something modern, something that was designed with modern datacenters in mind, integrating things like the latest specifications from UEC.

The mission is not to replace ns-3 or ASTRA-sim entirely, but to define an area where you only need the bare minimum to get the desired results, and to leave room for future updates.

Special thanks

I'd like to express my gratitude to ChatGPT and other AI-driven tools for helping to realize this project. I used them to write the actual code, to explore my ideas, to plan my path, and to catch anything I overlooked in the process. They are powerful in that way: they can help expand what you have in mind, and they can offer insights into areas you've never even heard of. For that, I'm thankful to the ever-evolving world of technology, and to the countless researchers whose efforts make the world a better place to live in.

Afterthoughts

AI has written a large part of this repo, and I'm not ashamed of it. I went through every line of the AI-generated code and performed checks with deterministic tools. AI has allowed this project to expand in a short amount of time, and good prompting has kept many different modules consistent.
