Introduced KempnerForge at the 2026 Research Computing and Data Summit at Harvard
I had the opportunity to introduce KempnerForge at the 2026 Research Computing and Data Summit at Harvard.
KempnerForge is an open-source, PyTorch-native training framework developed by the Research Engineering Team at the Kempner Institute. It is designed for reliable foundation-model training on shared AI clusters, where interruptions, preemptions, changing allocations, and heterogeneous research workloads are part of everyday life.
The goal is to make large-scale training runs reproducible and easier to operate under those conditions, rather than assuming a dedicated, always-available cluster.
Key features include:
- Preemption-safe checkpointing and auto-resume
- Distributed checkpointing with N-to-M resharding
- SLURM-native multi-node launch
- FSDP2, tensor parallelism, expert parallelism, pipeline parallelism, and FP8 support
- Dense, MoE, and vision-language model support
- Typed TOML configuration validated before training starts
- NaN detection, NCCL/GPU health checks, and profiler integration
- MFU tracking, WandB/TensorBoard logging, and reproducible runs
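To make the first item on the list concrete, here is a minimal sketch of the general pattern behind preemption-safe checkpointing and auto-resume. It is not KempnerForge's API: every name in it (`stop`, `request_stop`, `install_preemption_handler`, `save_checkpoint`, `load_checkpoint`, `train`) is hypothetical, the "model state" is a small JSON dictionary, and it assumes the scheduler delivers SIGTERM shortly before ending the job, which is typical for SLURM but depends on cluster configuration.

```python
import json
import os
import signal
import tempfile

# Illustrative sketch only; all names here are hypothetical, not KempnerForge's API.
stop = {"requested": False}


def request_stop(signum, frame):
    """Signal handler: only record the request; the loop saves at a safe point."""
    stop["requested"] = True


def install_preemption_handler():
    """Assumption: the scheduler sends SIGTERM before it ends the job."""
    signal.signal(signal.SIGTERM, request_stop)


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then rename atomically,
    so an interrupted write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Auto-resume: return the saved state if present, else a fresh one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, save_every=100):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if stop["requested"] or state["step"] % save_every == 0:
            save_checkpoint(path, state)
        if stop["requested"]:
            break  # exit cleanly; the next launch resumes from this step
    save_checkpoint(path, state)
    return state
```

In a real job you would call `install_preemption_handler()` once at startup and then `train(...)` with the same checkpoint path on every launch, so a requeued job picks up where the previous one stopped. A production framework additionally has to save optimizer, data-loader, and RNG state and coordinate the save across ranks, which this sketch leaves out.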
If you are training foundation models on a shared AI cluster, KempnerForge is built for you.
- Full Presentation: Download the slides (PDF)
- Project Repository: github.com/KempnerInstitute/KempnerForge
