Introduced KempnerForge at the 2026 Research Computing and Data Summit at Harvard

Published: June 09, 2026

I had the opportunity to introduce KempnerForge at the 2026 Research Computing and Data Summit at Harvard.

Presenting KempnerForge at the 2026 RCD Summit, Harvard

KempnerForge is an open-source, PyTorch-native training framework developed by the Research Engineering Team at the Kempner Institute. It is designed for reliable foundation-model training on shared AI clusters, where interruptions, preemptions, changing allocations, and heterogeneous research workloads are part of everyday life.

KempnerForge is built to make large-scale training on real shared infrastructure more reliable, more reproducible, and easier to operate.

Key features include:

Preemption-safe checkpointing and auto-resume
Distributed checkpointing with N-to-M resharding
SLURM-native multi-node launch
FSDP2, tensor parallelism, expert parallelism, pipeline parallelism, and FP8 support
Dense, MoE, and vision-language model support
Typed TOML configuration validated before training starts
NaN detection, NCCL/GPU health checks, and profiler integration
MFU tracking, WandB/TensorBoard logging, and reproducible runs
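
To make the first feature concrete, here is a minimal sketch of the general idea behind preemption-safe checkpointing and auto-resume. It is not KempnerForge's API: every name in it (PreemptionFlag, save_checkpoint, load_checkpoint, train) is hypothetical. It uses only the Python standard library, with a JSON file standing in for a real model checkpoint. It assumes the scheduler sends SIGTERM some time before it kills the job.

```python
import json
import os
import signal
import tempfile


class PreemptionFlag:
    """Hypothetical helper: remembers that a termination signal arrived."""

    def __init__(self):
        self.requested = False

    def install(self, signals=(signal.SIGTERM,)):
        # Must be called from the main thread of the training process.
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the training loop saves at a safe step boundary.
        self.requested = True


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then rename atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see either the old or the new file


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, flag, save_every=100, on_step=None):
    """Resume from `path` if present; stop cleanly when preemption is flagged.

    Returns (state, finished).
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a real training step
        if on_step is not None:
            on_step(state["step"])
        if flag.requested or state["step"] % save_every == 0:
            save_checkpoint(path, state)
        if flag.requested:
            return state, False  # preempted; the next launch resumes from `path`
    save_checkpoint(path, state)
    return state, True
```

In a real job script, one would call flag.install() before train(...) and have the scheduler requeue the job, so that the next launch picks up the saved step automatically. The write-then-rename step matters because a job killed in the middle of a save would otherwise leave a truncated checkpoint behind.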

If you are training foundation models on a shared AI cluster, KempnerForge is built for you.

Full Presentation: Download the slides (PDF)
Project Repository: github.com/KempnerInstitute/KempnerForge

Photo Gallery

Naeem Khoshnevis
