$ whoami

mihir waknis

software engineer · georgia tech mscs · ml systems

$ cat about.md

Building ML systems where compiler passes, GPU kernels, and training infrastructure meet. Incoming Systems & Infrastructure intern at LinkedIn (May–Aug 2026) and MSCS student in Computing Systems at Georgia Tech.

2026 cuda-perf-lab [cuda · nsight compute · cub · cmake]

CUDA Kernel Performance Lab

Architecture-aware CUDA kernel optimization lab covering reduction, softmax, and Black-Scholes Monte Carlo workloads, with reproducible benchmarks and Nsight Compute profiling.

built >

Built staged kernels (naive → shared-memory trees → warp-shuffle reductions → vectorized float4 loads) against CUB baselines, with deterministic inputs, CPU double-precision reference validation, CUDA-event timing, CSV/JSON result capture, and Nsight Compute profiling. Roofline and occupancy notes explain why each kernel is fast or slow.

impact >

Custom vectorized_float4 reduction reached 1.006× CUB latency (about 0.6% slower than CUB) at n=67M on RTX 5060 Ti; warp-per-row softmax was the fastest implementation across all tested shapes, including 1024×4096. Nsight Compute reports point to memory-dependency and long-scoreboard stalls as the main bottlenecks.
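
The warp-shuffle reduction stage can be illustrated with a small host-side model. This is a Python simulation of the data movement only, not the lab's CUDA code: `shfl_down` is a hypothetical stand-in for the behavior of CUDA's `__shfl_down_sync`, and the 32-lane warp size is the value on current NVIDIA GPUs.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def shfl_down(lanes, offset):
    """Model of a shuffle-down: lane i reads the value held by lane i + offset.

    Lanes whose source would fall outside the warp keep their own value,
    which mirrors what the hardware intrinsic does.
    """
    n = len(lanes)
    return [lanes[i + offset] if i + offset < n else lanes[i] for i in range(n)]


def warp_reduce_sum(values):
    """Sum one warp's values in log2(WARP_SIZE) steps; lane 0 ends with the total."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(lanes, offset)
        # Every lane adds the value it received; only lane 0's result is meaningful.
        lanes = [own + received for own, received in zip(lanes, shifted)]
        offset //= 2
    return lanes[0]


data = [float(i) for i in range(WARP_SIZE)]
total = warp_reduce_sum(data)
```

The halving offsets (16, 8, 4, 2, 1) are what let a warp finish its partial sum in five register-to-register exchanges, without the shared-memory traffic and block-wide barriers of a tree reduction.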

2026 mlir-softmax-backend [c++17 · mlir · llvm · nvptx · cuda driver]

MLIR Softmax Backend

An end-to-end compiler backend that takes MLIR softmax input through lowering, NVPTX codegen, PTX emission, and CUDA Driver kernel launch.

built >

Built the C++17 pipeline, a custom LICM strength-reduction pass, CUDA runtime integration, and FileCheck plus CTest coverage around the full compiler path.

impact >

When the denominator is loop-invariant, the pass replaces N in-loop divisions with one division (a hoisted reciprocal) plus N multiplications. The compiled kernel's output is then checked for correctness on the GPU.
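
The effect of that rewrite can be sketched at source level. The Python below is illustrative only: the real pass operates on MLIR, and the function names here are hypothetical. It shows the softmax normalization loop before and after the reciprocal is hoisted.

```python
import math


def softmax_divide(row):
    """Baseline: one division per element inside the normalization loop."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    denom = sum(exps)                  # loop-invariant denominator
    return [e / denom for e in exps]   # N divisions


def softmax_reciprocal(row):
    """After the rewrite: one division, hoisted out of the loop, then N multiplications."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    denom = sum(exps)
    inv = 1.0 / denom                  # the single remaining division
    return [e * inv for e in exps]     # N multiplications


row = [0.5, -1.25, 3.0, 2.0]
before = softmax_divide(row)
after = softmax_reciprocal(row)
```

Under IEEE floating point, `e * (1.0 / d)` is not guaranteed to be bit-identical to `e / d`, so the two versions can differ in the last bits. That is one reason a numerical check against a reference is a natural companion to this kind of transformation.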

2025 simclr-from-scratch [pytorch · simclr · cifar-10 · ci / docker]

SimCLR From Scratch

A compact, production-quality reimplementation of SimCLR for self-supervised visual representation learning on CIFAR-10, plus a linear-probe evaluation.

built >

Implemented two-crop augmentations, a ResNet encoder, MLP projection head, and NT-Xent loss, plus a frozen-encoder linear-probe protocol. Wired up config-driven CLI overrides, CSV + TensorBoard logging, automatic CUDA/MPS/CPU selection, AMP, cosine LR with warmup, and deterministic seeds.

impact >

Demonstrates research correctness (the full SimCLR pipeline and linear-probe protocol) alongside engineering hygiene: pytest unit tests, GitHub Actions CI, Dockerfile, Conda and pip environments.
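
For readers unfamiliar with the NT-Xent loss named above, here is a minimal dependency-free sketch. It is not the repository's PyTorch code: the function names and the default temperature of 0.5 are assumptions, and a real implementation would be vectorized on tensors.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent over N positive pairs: view_a[i] and view_b[i] embed the same image."""
    n = len(view_a)
    z = [normalize(v) for v in view_a + view_b]   # 2N embeddings
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)                   # index of i's positive partner
        # Temperature-scaled cosine similarity to every other embedding (self excluded).
        sims = {
            k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
            if k != i
        }
        # Cross-entropy with the positive as the target, via a stable log-sum-exp.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += log_denom - sims[pos]
    return total / (2 * n)


aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each embedding is closest to its own augmented partner (`aligned`) than when the partners are mismatched (`shuffled`), which is the signal the encoder is trained on.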

2024 open-rlhf-pipeline [python · pytorch · trl · gradio · triton]

Open RLHF Pipeline

A reproducible RLHF training stack covering supervised fine-tuning, reward modeling, PPO or DPO policy optimization, evaluation hooks, and a chat demo.

built >

Organized data prep, training scripts, hardware configs, web demo tooling, and kernel experiments into a single repository that is easy to run and extend.

impact >

Documents a full alignment workflow that can run on one consumer GPU and scale to multi-GPU setups, lowering the barrier to hands-on RLHF experimentation.
