Introduced KempnerForge at the 2026 Research Computing and Data Summit at Harvard


I had the opportunity to introduce KempnerForge at the 2026 Research Computing and Data Summit at Harvard.

Presenting KempnerForge at the 2026 RCD Summit, Harvard

KempnerForge is an open-source, PyTorch-native training framework developed by the Research Engineering Team at the Kempner Institute. It is designed for reliable foundation-model training on shared AI clusters, where interruptions, preemptions, changing allocations, and heterogeneous research workloads are part of everyday life.

The goal is to make large-scale training more reliable, more reproducible, and easier to operate on shared infrastructure.

Key features include:

  • Preemption-safe checkpointing and auto-resume
  • Distributed checkpointing with N-to-M resharding
  • SLURM-native multi-node launch
  • FSDP2, tensor parallelism, expert parallelism, pipeline parallelism, and FP8 support
  • Dense, MoE, and vision-language model support
  • Typed TOML configuration validated before training starts
  • NaN detection, NCCL/GPU health checks, and profiler integration
  • MFU tracking, WandB/TensorBoard logging, and reproducible runs
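
To give a flavor of what the first item in this list means in practice, here is a minimal, standard-library-only sketch of the general pattern behind preemption-safe checkpointing and auto-resume. It is not KempnerForge's code or API: `StopFlag`, `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names chosen for illustration, a small JSON file stands in for real model and optimizer state, and it assumes the scheduler sends a catchable signal such as SIGTERM before killing the job.

```python
import json
import os
import signal
import tempfile
from pathlib import Path
from typing import Callable


class StopFlag:
    """Hypothetical helper: set by a signal handler, polled by the training loop."""

    def __init__(self) -> None:
        self.requested = False

    def install(self, signum: int = signal.SIGTERM) -> None:
        # The handler only flips a flag; the loop reacts at a step boundary.
        signal.signal(signum, lambda _signum, _frame: self.request())

    def request(self) -> None:
        self.requested = True


def save_checkpoint(state: dict, path: Path) -> None:
    """Write to a temp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename: a reader sees the old file or the new one, never a partial write.
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Auto-resume: pick up the last checkpoint if present, else start fresh."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "loss": None}


def train(path: Path, total_steps: int, save_every: int,
          stop_requested: Callable[[], bool]) -> dict:
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real optimizer step
        if stop_requested():
            break  # preempted: fall through to the final save
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final save on completion or preemption
    return state
```

In a real job one would create a `StopFlag`, call `install()` once at startup, and pass `lambda: flag.requested` as `stop_requested`. Relaunching the same command then continues from the last saved step. The key design choice is that the signal handler does no I/O itself: it only records the request, and the loop saves at a step boundary where the state is consistent.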

If you are training foundation models on a shared AI cluster, KempnerForge is built for you.



Photos from RCD Summit 2026: speaking, audience view, closing slide, and Q&A.