Fine-tuning LLMs with Training Hub

training_hub is a Python library that wraps Supervised Fine-Tuning (SFT) and Orthogonal Subspace Fine-Tuning (OSFT) behind a single function call (sft(...), osft(...)) that handles single-GPU, multi-GPU, and multi-node training uniformly.

  • Automatic memory management: max_tokens_per_gpu caps GPU memory and auto-computes micro-batch size and gradient accumulation to hit your target effective_batch_size.
  • OSFT implements Nayak et al., 2025 (arXiv:2504.07097): restricting weight updates to orthogonal subspaces prevents catastrophic forgetting without replay data.
  • Built-in checkpointing, experiment tracking, and Liger kernel support.

Aspect                | SFT                        | OSFT
Use case              | Initial instruction tuning | Continual domain adaptation of tuned models
Forgetting mitigation | Mix/replay data            | Algorithmic (orthogonal subspaces)
Key parameter         | Standard hyperparameters   | unfreeze_rank_ratio (0.0–1.0)
Backend               | instructlab-training       | mini-trainer
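
To make the orthogonal-subspace idea more concrete, here is a minimal NumPy sketch. It is a simplified illustration, not the mini-trainer implementation, and the exact projection used in the paper may differ. It assumes the simplest reading: take the SVD of a weight matrix, treat the directions with the largest singular values as frozen, and strip from a candidate update every component that touches those directions. The function name project_update and the way unfreeze_rank_ratio is mapped to a rank are assumptions made for this example.

```python
import numpy as np

def project_update(weight, update, unfreeze_rank_ratio):
    """Remove from `update` every component that touches the frozen subspace."""
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    rank = u.shape[1]
    # Assumption: the ratio is the share of singular directions left trainable.
    n_frozen = rank - int(round(unfreeze_rank_ratio * rank))
    u_f = u[:, :n_frozen]       # frozen left singular vectors (largest values first)
    v_f = vt[:n_frozen, :].T    # frozen right singular vectors
    # Two-sided projection: (I - U_f U_f^T) @ update @ (I - V_f V_f^T)
    left = update - u_f @ (u_f.T @ update)
    return left - (left @ v_f) @ v_f.T

rng = np.random.default_rng(0)
weight = rng.standard_normal((6, 4))   # toy stand-in for a model weight matrix
update = rng.standard_normal((6, 4))   # toy stand-in for a gradient step
projected = project_update(weight, update, unfreeze_rank_ratio=0.25)
```

With a ratio of 1.0 nothing is frozen and the update passes through unchanged; lower ratios freeze more of the dominant directions, which matches the "lower = more preservation" reading of unfreeze_rank_ratio.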

Requirements

  • Alauda AI Workbench installed in your cluster.
  • A workbench with internet (or internal PyPI mirror), at least one NVIDIA GPU, and persistent storage for checkpoints.
  • HuggingFace model name or local path.
  • Training data in JSONL format (see below).

Data format

Each line is a conversation:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of AI..."}]}

Roles: system, user, assistant, pretraining. Masking:

  • SFT (default) — only assistant content contributes to loss. Add "unmask": true to a sample to include all non-system content.
  • OSFT — controlled by unmask_messages (default False).

Pre-tokenized datasets with input_ids / labels are supported via use_processed_dataset=True.
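
Before launching a long run, it can help to check that every line of a conversation-format file parses and uses the roles listed above. The following is a small standard-library sketch; validate_line and validate_file are hypothetical helpers written for this page, not part of training_hub, and they do not apply to pre-tokenized datasets.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "pretraining"}

def validate_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(sample, dict):
        return ["sample is not a JSON object"]
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    return problems

def validate_file(path):
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    report = {}
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.strip():  # ignore blank lines
                problems = validate_line(line)
                if problems:
                    report[lineno] = problems
    return report
```

Calling validate_file on your data_path returns an empty dict when every non-blank line is a well-formed conversation.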

Run the example notebooks

Download into your workbench and execute cell-by-cell:

Notebook                    | Algorithm | Download
SFT comprehensive tutorial  | SFT       | sft-comprehensive-tutorial.ipynb
OSFT comprehensive tutorial | OSFT      | osft-comprehensive-tutorial.ipynb

Install and configure:

pip install training-hub
# `training-hub` pulls a stray `attr` package that shadows `attrs.attr` and
# breaks `aiohttp`; uninstall it.
pip uninstall -y attr

Note: On the prebuilt traininghub0.1-cu126-amd64:v0.1.0 runtime image, install training-hub inside a fresh venv. Running pip install --user training-hub there upgrades transformers to a version incompatible with the bundled peft:

python -m venv /workspace/venv
source /workspace/venv/bin/activate
pip install training-hub
pip uninstall -y attr
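
To confirm that the stray attr distribution is really gone from the active environment, a quick check with the standard library's importlib.metadata is one option. This is a sketch; is_installed is a hypothetical helper name. It looks up the distribution named attr, which is distinct from the attrs distribution that legitimately provides the attr module.

```python
from importlib import metadata

def is_installed(dist_name):
    """Return True if a distribution with this name is installed."""
    try:
        metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True

if is_installed("attr"):
    print("stray 'attr' distribution still present; run: pip uninstall -y attr")
else:
    print("no stray 'attr' distribution found")
```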

Edit the parameter cells:

model_path      = "Qwen/Qwen2.5-7B-Instruct"   # HF name or local path
data_path       = "/path/to/your/training_data.jsonl"
ckpt_output_dir = "/path/to/checkpoints/my_experiment"
selected_distributed = "single_node_8gpu"      # or single_gpu_dev, multi_node_master, ...
# OSFT only:
unfreeze_rank_ratio = 0.25                      # 0.1–0.3 conservative, 0.3–0.5 balanced

Bundled model presets cover Qwen 2.5 7B, Llama 3.1 8B, Phi 4 Mini, and generic 7B / small models.

Run all cells. The final training cell calls:

from training_hub import sft, osft

# SFT notebook
result = sft(model_path=model_path, data_path=data_path, ckpt_output_dir=ckpt_output_dir,
             effective_batch_size=128, max_tokens_per_gpu=20000, max_seq_len=16384,
             learning_rate=1e-5, num_epochs=3, nproc_per_node=8)

# OSFT notebook
result = osft(model_path=model_path, data_path=data_path, ckpt_output_dir=ckpt_output_dir,
              unfreeze_rank_ratio=0.25,
              effective_batch_size=128, max_tokens_per_gpu=10000, max_seq_len=8192,
              learning_rate=5e-6, num_epochs=1, nproc_per_node=8)

Checkpoints land in ckpt_output_dir at each epoch (controlled by checkpoint_at_epoch).

Key parameters

Common (SFT and OSFT):

Parameter                                  | Required | Description
model_path                                 | Yes      | HF name or local path
data_path                                  | Yes      | JSONL training data
ckpt_output_dir                            | Yes      | Checkpoint directory
effective_batch_size                       | Yes      | Global effective batch size
max_tokens_per_gpu                         | Yes      | Per-GPU token budget; auto-computes micro-batch size
max_seq_len                                | Yes      | Maximum sequence length
learning_rate                              | Yes      | Optimizer LR
num_epochs                                 | No       | Default 1
lr_scheduler, warmup_steps                 | No       | LR schedule
use_liger                                  | No       | Liger kernels (default True for OSFT)
seed                                       | No       | Default 42
data_output_dir                            | No       | Processed data cache; "/dev/shm" for RAM-disk
use_processed_dataset                      | No       | Skip tokenization if data has input_ids / labels
checkpoint_at_epoch, save_final_checkpoint | No       | Default True
nproc_per_node, nnodes, node_rank          | No       | Distributed topology
rdzv_id, rdzv_endpoint                     | No       | Multi-node rendezvous
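
The relationship between effective_batch_size, the number of GPUs, and gradient accumulation can be illustrated with a little arithmetic. In data-parallel training, the effective batch is the per-GPU micro-batch times the GPU count times the number of accumulation steps. The sketch below is not the library's actual logic: it assumes a fixed average sample length (avg_tokens_per_sample, a made-up input) to turn the token budget into a sample count, whereas the real backends work from the actual token lengths in your data.

```python
import math

def plan_batches(effective_batch_size, world_size, max_tokens_per_gpu, avg_tokens_per_sample):
    """Return (micro_batch_per_gpu, grad_accum_steps) for a data-parallel run."""
    # Assumption: a GPU fits as many average-length samples as the token budget allows.
    micro_batch = max(1, max_tokens_per_gpu // avg_tokens_per_sample)
    # Never plan more samples per step than the target batch calls for.
    micro_batch = min(micro_batch, max(1, effective_batch_size // world_size))
    samples_per_step = micro_batch * world_size
    grad_accum = math.ceil(effective_batch_size / samples_per_step)
    return micro_batch, grad_accum

# 8 GPUs, a 20000-token budget, and a hypothetical 2500-token average sample.
micro_batch, grad_accum = plan_batches(128, 8, 20000, 2500)
print(micro_batch, grad_accum, micro_batch * 8 * grad_accum)  # prints: 8 2 128
```

Halving max_tokens_per_gpu to 10000 in this toy model halves the micro-batch and doubles the accumulation steps, leaving the effective batch at 128.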

OSFT-only:

Parameter           | Required | Description
unfreeze_rank_ratio | Yes      | Fraction of each weight matrix updateable (0.0–1.0). Lower = more preservation.
unmask_messages     | No       | If True, train on all non-system content
target_patterns     | No       | Substring patterns to restrict OSFT to specific layers
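
As an illustration of what substring patterns mean for layer selection, the sketch below filters a list of parameter names by substring. The helper select_osft_params and the Hugging Face-style parameter names are hypothetical, and treating an empty target_patterns as "no restriction" is an assumption; the library's own default selection may differ.

```python
def select_osft_params(param_names, target_patterns=None):
    """Keep names containing at least one pattern; no patterns means keep all."""
    if not target_patterns:
        return list(param_names)
    return [name for name in param_names
            if any(pattern in name for pattern in target_patterns)]

# Hypothetical parameter names in the style of a Hugging Face decoder model.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
]
selected = select_osft_params(names, ["self_attn"])
```

Here selected holds only the two attention projections, so a pattern such as "self_attn" would leave the MLP weights out of OSFT under this reading.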

Multi-node

Run the notebook (or script) on every node with the same rdzv_id / rdzv_endpoint and a distinct node_rank per node:

nproc_per_node = 8
nnodes         = 2
rdzv_id        = 42
rdzv_endpoint  = "10.0.0.1:29500"
node_rank      = 0   # 1 on the worker

All nodes need network reachability to rdzv_endpoint before training starts.
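
One way to check reachability from a worker is a plain TCP connection attempt to the rendezvous address. The sketch below uses only the standard library; split_endpoint and can_reach are hypothetical helper names. Note that a connection only succeeds when something is already listening on that port, so a False result before the rank-0 node has started does not by itself indicate a routing or firewall problem.

```python
import socket

def split_endpoint(endpoint):
    """Split 'host:port' into (host, port), raising ValueError if malformed."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    return host, int(port)

def can_reach(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = split_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False
```

On a worker node, can_reach("10.0.0.1:29500") returning True while the master is starting up suggests the port is open end to end.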