Mihir Waknis | ML Systems Engineer

$ whoami

mihir waknis

software engineer · georgia tech mscs · ml systems

$ cat about.md

Building ML systems where compiler passes, GPU kernels, and training infrastructure meet. Incoming Systems & Infrastructure intern at LinkedIn (May–Aug 2026) and MSCS Computing Systems student at Georgia Tech.

$ ls -la work/

2026  cuda-perf-lab  [cuda · nsight compute · cub · cmake]

CUDA Kernel Performance Lab

Architecture-aware CUDA kernel optimization lab covering reduction, softmax, and Black-Scholes Monte Carlo workloads, with reproducible benchmarks and Nsight Compute profiling.

built >

Built staged kernels — naive → shared-memory trees → warp-shuffle reductions → vectorized float4 loads — against CUB baselines, with deterministic inputs, CPU double-precision reference validation, CUDA-event timing, CSV/JSON result capture, and Nsight Compute profiling. Roofline and occupancy notes explain why each kernel is fast or slow.
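The warp-shuffle stage is easiest to picture as a CPU-side simulation. The sketch below is not code from the repository: `shfl_down` is a hypothetical pure-Python stand-in for CUDA's `__shfl_down_sync`, and `warp_reduce_sum` only mirrors the halving-offset pattern such kernels typically use, assuming a 32-lane warp.

```python
WARP_SIZE = 32  # assumption: one full 32-lane warp


def shfl_down(values, offset):
    """Stand-in for __shfl_down_sync: lane i reads the value held by lane i + offset.

    Lanes whose source lane would be out of range keep their own value.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_reduce_sum(lane_values):
    """Tree-reduce one warp's values; the full sum ends up in lane 0."""
    assert len(lane_values) == WARP_SIZE
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2  # offsets 16, 8, 4, 2, 1: five steps, no shared memory
    return vals[0]  # only lane 0 is guaranteed to hold the complete sum


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(32))))
```

On a GPU the same pattern runs in registers, so the shared-memory traffic and block-level synchronization of the tree stage drop out for the last 32 elements.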

impact >

Custom vectorized_float4 reduction reached 1.006× CUB latency at n=67M on RTX 5060 Ti; warp-per-row softmax was the fastest implementation across tested shapes including 1024×4096. Nsight Compute reports surface memory-dependency and long-scoreboard stalls as the central bottleneck.

→ github.com/Waknis/cuda-perf-lab

2026  mlir-softmax-backend  [c++17 · mlir · llvm · nvptx · cuda driver]

MLIR Softmax Backend

An end-to-end compiler backend that takes MLIR softmax input through lowering, NVPTX codegen, PTX emission, and CUDA Driver kernel launch.

built >

Built the C++17 pipeline, a custom LICM strength-reduction pass, CUDA runtime integration, and FileCheck plus CTest coverage around the full compiler path.

impact >

When the denominator is loop-invariant, the pass replaces N in-loop divisions with one division plus N multiplications; the compiled kernel's output is then checked for correctness on GPU.
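The effect of that rewrite can be shown at source level. This is a minimal Python sketch of the before and after forms of the softmax normalization loop, not the MLIR pass itself; the function names are hypothetical.

```python
import math


def softmax_divide(xs):
    """Baseline: one division per element inside the normalization loop."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]  # N divisions


def softmax_reciprocal(xs):
    """Rewritten: hoist the loop-invariant denominator and multiply instead."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    inv_total = 1.0 / sum(exps)  # 1 division, computed once outside the loop
    return [e * inv_total for e in exps]  # N multiplications


if __name__ == "__main__":
    xs = [0.5, 1.5, -2.0, 3.0]
    print(softmax_divide(xs))
    print(softmax_reciprocal(xs))
```

The two forms are not guaranteed to be bit-identical in floating point, because `e / total` and `e * (1.0 / total)` can round differently. Validating such a rewrite against a reference is therefore usually done with a tolerance rather than exact equality.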

→ github.com/Waknis/mlir-softmax-backend

2025  simclr-from-scratch  [pytorch · simclr · cifar-10 · ci / docker]

SimCLR From Scratch

A compact, production-quality reimplementation of SimCLR for self-supervised visual representation learning on CIFAR-10, plus a linear-probe evaluation.

built >

Implemented two-crop augmentations, a ResNet encoder, MLP projection head, and NT-Xent loss, plus a frozen-encoder linear-probe protocol. Wired up config-driven CLI overrides, CSV + TensorBoard logging, automatic CUDA/MPS/CPU selection, AMP, cosine LR with warmup, and deterministic seeds.
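For reference, the NT-Xent loss named above can be written out in a few lines. The sketch below is plain Python rather than the repository's PyTorch code; the `nt_xent` signature and the default temperature of 0.5 are assumptions for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over N positive pairs; z1[i] and z2[i] are two views of sample i."""
    z = [_normalize(v) for v in z1 + z2]
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        logits = [
            sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
        ]
        # Denominator: log-sum-exp over every other sample (self-similarity excluded).
        others = [logits[k] for k in range(2 * n) if k != i]
        m = max(others)
        lse = m + math.log(sum(math.exp(l - m) for l in others))
        total += lse - logits[pos]  # equals -log softmax probability of the positive
    return total / (2 * n)


if __name__ == "__main__":
    views_a = [[1.0, 0.0], [0.0, 1.0]]
    print(nt_xent(views_a, [[1.0, 0.0], [0.0, 1.0]]))  # matched views
    print(nt_xent(views_a, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched views
```

Matched views should score a lower loss than mismatched ones, which is the property the contrastive objective optimizes for.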

impact >

Demonstrates research correctness — the full SimCLR pipeline and linear-probe protocol — alongside engineering hygiene: pytest unit tests, GitHub Actions CI, Dockerfile, Conda and pip environments.

→ github.com/Waknis/simclr_from_scratch

2024  open-rlhf-pipeline  [python · pytorch · trl · gradio · triton]

Open RLHF Pipeline

A reproducible RLHF training stack covering supervised fine-tuning, reward modeling, PPO or DPO policy optimization, evaluation hooks, and a chat demo.

built >

Organized data prep, training scripts, hardware configs, web demo tooling, and kernel experiments into a single repository that is easy to run and extend.

impact >

Documents a full alignment workflow that can run on one consumer GPU and scale to multi-GPU setups, lowering the barrier to hands-on RLHF experimentation.

→ github.com/Waknis/open_rlhf_pipeline


$ git log --author="mihir"

May 2026  linkedin  software engineer intern
- Systems & Infrastructure — incoming Summer 2026 intern.
Sep 2021  itron  software engineer intern (full-time alongside university, 4.5-year tenure)
- Built performance-focused Python and C++ data-processing modules for 10M+ time-series events using vectorized kernels and cache-friendly SoA layouts, reducing end-to-end runtime by 25%.
- Profiled PyTorch inference hot paths with Nsight Systems timelines and the PyTorch profiler; isolated host-side kernel launch overhead and H2D stalls, then recovered throughput with batching, pinned memory, and async copies on dedicated CUDA streams.
- Wrote custom CUDA extensions for small but frequent tensor ops, replacing Python-level loops and applying memory coalescing plus SoA layout fixes — lifted GPU inference throughput 1.6× in controlled benchmarks.
- Built microbenchmark harnesses around kernel-level changes, tracking p50/p95 latency, achieved SM occupancy, and HBM bandwidth via Nsight Compute counters to catch regressions before merge.
- Hardened inference services with unit tests, structured logging, and CI perf guardrails that flagged kernel latency and throughput regressions against baseline budgets.
- Automated retraining and evaluation workflows using CI/CD plus DVC-style dataset versioning, cutting deployment cycle time from one week to two days.
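A latency guardrail of the kind described in the bullets above can be sketched as follows. This is a hypothetical illustration, not the harness used at Itron: `benchmark`, `within_budget`, and the 10% tolerance are all assumed names and values.

```python
import statistics
import time


def benchmark(fn, warmup=5, iters=50):
    """Time fn() on the host and report p50/p95 latency in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialization before timing
        fn()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def within_budget(result, baseline, tolerance=0.10):
    """True if every tracked percentile stays within baseline * (1 + tolerance)."""
    return all(result[k] <= baseline[k] * (1 + tolerance) for k in baseline)


if __name__ == "__main__":
    result = benchmark(lambda: sum(range(10_000)))
    print(result, within_budget(result, {"p50": 5.0, "p95": 10.0}))
```

Host-side timers like this only bound GPU work if the timed function synchronizes with the device before returning, since kernel launches are asynchronous; CUDA events are the usual alternative for kernel-level timing.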
Jan 2024  asu-acec  ai/ml undergraduate student researcher
- Built a GPU-accelerated retrieval-augmented generation pipeline for medical literature analysis, optimized for embedding throughput and end-to-end query latency.
- Designed an embedding ingestion and indexing pipeline for 15,000+ papers using batched encoder inference, pinned memory transfers, and async prefetch to sustain stable GPU utilization on a single consumer GPU.
- Implemented ranking and re-ranking features using citation metrics, publication dates, and author signals; benchmarked retrieval quality against latency overhead per re-rank stage.
- Benchmarked vLLM, llama.cpp, and HF transformers for medical QA — tracked p50/p95 latency, tokens/sec throughput, and memory footprint across FP16 and INT8 quantization regimes; documented accuracy-vs-throughput tradeoffs.
- Received the Global Impact Award at the ASU Fusion Symposium.

$ cat skills.txt

languages   rust c++ cuda python sql
compilers   mlir llvm-ir nvptx
ml-systems  pytorch triton vllm
profiling   nsight-systems nsight-compute pytorch-profiler
tools       linux git docker ci/cd

$ cat education.txt

georgia tech  m.s. computer science · computing systems
2025 – 2027 (expected)
deep learning · high-performance computing · gpu hardware & software · hpc architecture

arizona state  b.s. data science · cs concentration
2021 – 2025
data structures & algorithms · applied linear algebra · machine learning