Micrograd-style autograd.
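A minimal sketch of the core idea: a scalar `Value` class (the name is illustrative) that records its parents and applies the chain rule in reverse topological order.

```python
class Value:
    """Scalar that remembers how it was computed, for reverse-mode autodiff."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad              # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then run the chain rule from the output backwards.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```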
NumPy/PyTorch, no nn.Transformer.
Byte-pair encoding implementation.
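A toy sketch of the training loop: count adjacent pairs, merge the most frequent, repeat. Byte-level regex pre-splitting, special tokens, and an efficient pair index are left out.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent token-id pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    ids = list(text.encode("utf-8"))        # start from raw bytes (ids 0..255)
    merges = {}                             # (a, b) -> new token id
    next_id = 256
    while next_id < vocab_size:
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges

print(train_bpe("aaabdaaabac", 260))
```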
Decoder-only transformer.
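The part most worth writing by hand is causal self-attention; a single-head sketch in plain PyTorch (no dropout, no fused kernels).

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq, dim); w_*: (dim, dim). Single head, no KV caching."""
    B, T, D = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    att = (q @ k.transpose(-2, -1)) / D**0.5         # (B, T, T) scores
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
    att = att.masked_fill(~mask, float("-inf"))      # no attending to the future
    return F.softmax(att, dim=-1) @ v                # (B, T, D)

x = torch.randn(2, 5, 16)
w = [torch.randn(16, 16) for _ in range(3)]
print(causal_self_attention(x, *w).shape)  # torch.Size([2, 5, 16])
```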
100M-1B params, real data pipeline, proper training loop.
Batching, concurrent requests, measure bottlenecks.
Simple CUDA matmul implementation.
Fused softmax kernel.
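The kernel itself is CUDA or Triton work, but what you fuse is a single-pass, numerically stable softmax; a NumPy reference of that online algorithm (the same trick FlashAttention builds on) to check a kernel against.

```python
import numpy as np

def online_softmax(row):
    """One pass: keep a running max and rescale the running sum when it grows."""
    m, s = -np.inf, 0.0
    for x in row:
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return np.exp(row - m) / s

row = np.random.randn(1024).astype(np.float32)
ref = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
print(np.allclose(online_softmax(row), ref, atol=1e-6))  # True
```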
Custom attention kernel.
Full FlashAttention implementation.
KV cache with optimization.
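A bare-bones sketch of the cache itself (single head, append-only); paging, quantization, and eviction are where the optimization happens.

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Append-only K/V storage so each decode step attends without recomputing the past."""
    def __init__(self):
        self.k = None  # (B, T_cached, D)
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

def decode_step(x_new, w_q, w_k, w_v, cache):
    """x_new: (B, 1, D), only the newest token's activations."""
    q = x_new @ w_q
    k, v = cache.append(x_new @ w_k, x_new @ w_v)
    att = F.softmax((q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5, dim=-1)
    return att @ v  # (B, 1, D)

cache = KVCache()
w = [torch.randn(16, 16) for _ in range(3)]
for _ in range(4):                       # 4 decode steps; the cache grows each time
    out = decode_step(torch.randn(1, 1, 16), *w, cache)
print(cache.k.shape)  # torch.Size([1, 4, 16])
```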
Implement GPTQ from scratch.
Implement AWQ from scratch.
Speculative decoding implementation.
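A sketch of the accept/reject rule, with the draft and target models abstracted to per-position probability tables (the `draft_probs`/`target_probs` arrays below stand in for real model calls).

```python
import numpy as np

def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """Accept drafted token i with prob min(1, p_target/p_draft); on the first
    rejection, resample from the residual max(0, p_target - p_draft) and stop."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            residual = np.maximum(target_probs[i] - draft_probs[i], 0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted                  # discard the rest of the draft
    return accepted                          # all accepted; caller samples a bonus token

rng = np.random.default_rng(0)
V, k = 8, 3
draft = rng.dirichlet(np.ones(V), size=k)    # toy stand-ins for model outputs
target = rng.dirichlet(np.ones(V), size=k)
tokens = [int(rng.choice(V, p=draft[i])) for i in range(k)]
print(speculative_step(tokens, draft, target, rng))
```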
PagedAttention implementation.
Dynamic batching for inference.
Combine all the above, benchmark against vLLM.
PyTorch-like API → optimized GPU kernels, TinyGrad-style.
Implement LoRA from scratch.
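The core is small: freeze the pretrained weight and learn a low-rank update scaled by alpha/r. A sketch around a frozen `nn.Linear`.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 1024 trainable params
```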
Implement QLoRA from scratch.
Implement HNSW or IVF from scratch.
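IVF is the gentler of the two to start with; a NumPy sketch: crude k-means for the cells, then search only the `nprobe` nearest cells.

```python
import numpy as np

def build_ivf(xs, n_cells=16, iters=10, seed=0):
    """A few Lloyd iterations for centroids, then bucket every vector by cell."""
    rng = np.random.default_rng(seed)
    centroids = xs[rng.choice(len(xs), n_cells, replace=False)]
    for _ in range(iters):
        assign = np.argmin(np.linalg.norm(xs[:, None] - centroids[None], axis=-1), axis=1)
        for c in range(n_cells):
            if (assign == c).any():
                centroids[c] = xs[assign == c].mean(axis=0)
    lists = {c: np.where(assign == c)[0] for c in range(n_cells)}
    return centroids, lists

def ivf_search(q, xs, centroids, lists, k=5, nprobe=4):
    """Scan only the nprobe closest cells instead of the whole dataset."""
    nearest_cells = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in nearest_cells])
    dists = np.linalg.norm(xs[cand] - q, axis=1)
    return cand[np.argsort(dists)[:k]]

xs = np.random.randn(2000, 32).astype(np.float32)
centroids, lists = build_ivf(xs)
print(ivf_search(xs[7], xs, centroids, lists))  # index 7 should be its own top hit
```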
Full retrieval-augmented generation pipeline.
Implement DDPM from scratch.
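Training reduces to noise prediction; a sketch of the closed-form forward process and one loss computation, with a hypothetical `model(x_t, t)` that predicts epsilon.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def q_sample(x0, t, eps):
    """Forward process in closed form: x_t = sqrt(a_bar) x0 + sqrt(1 - a_bar) eps."""
    a = alphas_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

def ddpm_loss(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return F.mse_loss(model(x_t, t), eps)        # the "simple" epsilon-prediction loss

# toy check with a model that always predicts zero noise
loss = ddpm_loss(lambda x_t, t: torch.zeros_like(x_t), torch.randn(4, 3, 32, 32))
print(loss.item())  # ~1.0, the variance of the target noise
```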
Implement DDIM.
Implement latent diffusion.
Few-step generation.
Implement ViT from scratch.
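The genuinely new piece versus a text transformer is turning an image into a token sequence; a patch-embedding sketch.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly project each to a token."""
    def __init__(self, img=224, patch=16, chans=3, dim=768):
        super().__init__()
        self.n = (img // patch) ** 2
        # A strided conv is exactly "unfold into patches + shared linear projection".
        self.proj = nn.Conv2d(chans, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))

    def forward(self, x):                               # x: (B, C, H, W)
        tok = self.proj(x).flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tok], dim=1) + self.pos  # ready for the encoder blocks

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```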
Contrastive training.
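A CLIP-style symmetric InfoNCE loss in a few lines (random embeddings stand in for real encoder outputs).

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (image, text) pairs live on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) +        # image -> text direction
            F.cross_entropy(logits.T, labels)) / 2   # text -> image direction

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```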
Vision encoder + LLM + projection layer.
Document understanding.
Temporal consistency, 3D attention.
Game environment, action-conditioned.
Learned video compression.
Unified text/image/video representation.
Deploy on real or simulated hardware.
Write non-trivial programs.
BIOS/UEFI, get to protected mode, load a kernel.
Interrupt handling implementation.
Physical memory management.
Virtual memory implementation.
Process scheduler.
Syscall interface.
Run ELF binaries.
Basic filesystem.
Add journaling support.
Implement TCP/IP from scratch.
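The project itself lives at the raw-frame level, but a reference for the RFC 1071 one's-complement checksum (shared by IPv4, TCP, and UDP) is worth having to test against; a Python sketch.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071: sum 16-bit words with end-around carry, then one's-complement."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)     # fold carries back in
    return ~total & 0xFFFF

data = bytes.fromhex("0001f203f4f5f6f7")
c = internet_checksum(data)
# Appending the checksum makes the whole buffer sum to zero -- the receiver's check.
print(hex(c), internet_checksum(data + struct.pack("!H", c)) == 0)
```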
Port something real (bash, vim, etc.).
Run Linux inside your hypervisor.
Implement VT-x/AMD-V support.
Talk to hardware, submit commands.
Define your own ML accelerator ISA.
For your accelerator.
Your compiler → your runtime → your kernels → beat PyTorch.
Phone/edge, requires quantization + architecture search + kernel optimization.
WebGPU/WASM, real model, acceptable perf.
From scratch: all-reduce, gradient compression, fault tolerance.
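A single-process simulation of ring all-reduce, useful for getting the chunking and the two phases (reduce-scatter, then all-gather) right before touching sockets or NCCL.

```python
import numpy as np

def ring_allreduce(grads):
    """grads: one array per 'rank', all the same shape. Every rank ends up with the sum.
    Simulated in-process; a real version sends each chunk to the next rank over the network."""
    n = len(grads)
    chunks = [np.array_split(g.copy(), n) for g in grads]   # chunks[rank][piece]
    # Reduce-scatter: in step s, rank r sends piece (r - s) % n to rank (r + 1) % n.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, chunks[r][(r - s) % n].copy()) for r in range(n)]
        for r, c, data in sends:
            chunks[(r + 1) % n][c] += data
    # Now rank r holds the fully reduced piece (r + 1) % n.
    # All-gather: circulate the completed pieces around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, chunks[r][(r + 1 - s) % n].copy()) for r in range(n)]
        for r, c, data in sends:
            chunks[(r + 1) % n][c] = data
    return [np.concatenate(c) for c in chunks]

grads = [np.random.randn(10) for _ in range(4)]
out = ring_allreduce(grads)
print(np.allclose(out[0], sum(grads)), np.allclose(out[0], out[3]))  # True True
```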
Modifies its own inference code based on profiling.
DDP, FSDP, pipeline parallelism.
Implementation from scratch.
Implement loss scaling, understand numerics.
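A sketch of dynamic loss scaling by hand: scale the loss so small fp16 gradients don't flush to zero, unscale before the optimizer step, and skip the step (shrinking the scale) on overflow. The toy below runs in fp32 just to show the control flow.

```python
import torch

def scaled_step(model, loss, opt, scale_state, growth_interval=2000):
    """Backward on the scaled loss, then either unscale-and-step or skip on inf/nan."""
    scale = scale_state["scale"]
    (loss * scale).backward()
    found_inf = any(not torch.isfinite(p.grad).all()
                    for p in model.parameters() if p.grad is not None)
    if found_inf:
        scale_state["scale"] = scale / 2             # overflow: back off and skip the step
        scale_state["good_steps"] = 0
    else:
        for p in model.parameters():
            if p.grad is not None:
                p.grad /= scale                       # unscale back to true gradients
        opt.step()
        scale_state["good_steps"] += 1
        if scale_state["good_steps"] % growth_interval == 0:
            scale_state["scale"] = scale * 2          # stable for a while: grow the scale
    opt.zero_grad(set_to_none=True)

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
state = {"scale": 2.0**16, "good_steps": 0}
loss = model(torch.randn(8, 16)).pow(2).mean()
scaled_step(model, loss, opt, state)
print(state)
```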
Don't let your GPUs starve.
Data mixing strategies.
Not just fine-tuning.
Full implementation.
Run experiments, understand compute-optimal training.
Loss spikes, instabilities, diagnose what went wrong.
From scratch.
Safe code execution environment.
MCTS, tree search with LLMs.
Coordination and communication.
Extended task execution.
Beyond RAG — working memory, episodic memory.
Correction loops.
Rigorously, not just vibes.
Build evaluation frameworks.
Harder than it sounds.
Understand what representations encode.
Causal tracing.
On activations.
Induction heads, etc.
Find algorithms in weights.
Adversarial robustness.
Crawling, filtering, deduplication.
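Near-duplicate detection is the interesting part; a MinHash sketch over word shingles (real pipelines add LSH banding so you never do the all-pairs comparison).

```python
import hashlib

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(shingle_set, num_hashes=128):
    """Signature: for each seed, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                       for s in shingle_set))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
print(est_jaccard(minhash(shingles(a)), minhash(shingles(b))))  # high: near-duplicate
```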
Quality filtering models.
Privacy-preserving data processing.
Generate training data.
Underrated — taste matters.
Slurm, Kubernetes for ML.
Reproducibility.
Spot instances, efficient scheduling.
Not just tricks — systematic optimization.
Constrained generation.
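The mechanism underneath most approaches is masking logits so only tokens the grammar or schema allows can be sampled; a sketch with a hypothetical allowed-token set standing in for a real grammar state machine.

```python
import torch

def constrained_next_token(logits, allowed_ids):
    """logits: (vocab,). Forbidden tokens get probability zero, then sample."""
    mask = torch.full_like(logits, float("-inf"))
    mask[list(allowed_ids)] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)   # disallowed tokens -> p = 0
    return torch.multinomial(probs, 1).item()

# Toy example: vocab of 10, the (hypothetical) grammar only permits tokens {2, 5, 7}.
logits = torch.randn(10)
tok = constrained_next_token(logits, {2, 5, 7})
print(tok, tok in {2, 5, 7})  # always True
```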
Real-time user experience.
Semantic cache, KV cache reuse across requests.
Safety systems.
For production.
Experimentation infrastructure.
People actually use.