mihir waknis
software engineer · georgia tech mscs · ml systems
Building ML systems where compiler passes, GPU kernels, and training infrastructure meet. Incoming Systems & Infrastructure intern at LinkedIn (May–Aug 2026) and MSCS Computing Systems student at Georgia Tech.
- → email: mihirwaknis282@gmail.com
- → github: github.com/Waknis
- → linked: linkedin.com/in/waknis
- → x: x.com/_Waknis
2026 · cuda-perf-lab [cuda · nsight compute · cub · cmake]
CUDA Kernel Performance Lab
Architecture-aware CUDA kernel optimization lab covering reduction, softmax, and Black-Scholes Monte Carlo workloads, with reproducible benchmarks and Nsight Compute profiling.
built >
Built staged kernels — naive → shared-memory trees → warp-shuffle reductions → vectorized float4 loads — against CUB baselines, with deterministic inputs, CPU double-precision reference validation, CUDA-event timing, CSV/JSON result capture, and Nsight Compute profiling. Roofline and occupancy notes explain why each kernel is fast or slow.
impact >
Custom vectorized_float4 reduction reached 1.006× CUB latency at n=67M on RTX 5060 Ti; warp-per-row softmax was the fastest implementation across tested shapes including 1024×4096. Nsight Compute reports surface memory-dependency and long-scoreboard stalls as the central bottleneck.
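The warp-shuffle stage mentioned above can be illustrated without a GPU. The following is a minimal Python model of the idea, not the lab's kernel: `shfl_down` is a hypothetical stand-in for CUDA's `__shfl_down_sync`, and the warp is just a list of 32 values. It shows why the reduction finishes in log2(32) = 5 steps with no shared-memory traffic.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def shfl_down(lanes, offset):
    """Model of a shuffle-down: lane i reads the value held by lane i + offset.

    Lanes whose source would fall outside the warp keep their own value,
    which matches the behaviour of __shfl_down_sync.
    """
    n = len(lanes)
    return [lanes[i + offset] if i + offset < n else lanes[i] for i in range(n)]


def warp_reduce_sum(lanes):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds its partner's value; only the low lanes stay meaningful.
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]
```

Only lane 0 is read at the end, so the partial sums left in the upper lanes are discarded, as they would be in the real kernel.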
2026 · mlir-softmax-backend [c++17 · mlir · llvm · nvptx · cuda driver]
MLIR Softmax Backend
An end-to-end compiler backend that takes MLIR softmax input through lowering, NVPTX codegen, PTX emission, and CUDA Driver kernel launch.
built >
Built the C++17 pipeline, a custom LICM strength-reduction pass, CUDA runtime integration, and FileCheck plus CTest coverage around the full compiler path.
impact >
The optimization replaces N loop divisions with 1 division plus N multiplications when the denominator is loop-invariant, then verifies correctness on GPU.
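The rewrite described above is easy to see on a scalar softmax. This is a sketch in plain Python rather than the MLIR pass itself; the function names are illustrative. The normalizer does not change inside the loop, so its reciprocal can be hoisted and each per-element division becomes a multiplication.

```python
import math


def softmax_naive(row):
    """Reference form: one division per element (N divisions)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    denom = sum(exps)
    return [e / denom for e in exps]


def softmax_reciprocal(row):
    """Strength-reduced form: 1 division, then N multiplications."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    inv_denom = 1.0 / sum(exps)  # loop-invariant, computed once
    return [e * inv_denom for e in exps]
```

Multiplying by a reciprocal is not guaranteed to be bit-identical to dividing, so a correctness check for this kind of rewrite would normally compare against the reference within a small tolerance rather than for exact equality.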
2025 · simclr-from-scratch [pytorch · simclr · cifar-10 · ci / docker]
SimCLR From Scratch
A compact, production-quality reimplementation of SimCLR for self-supervised visual representation learning on CIFAR-10, plus a linear-probe evaluation.
built >
Implemented two-crop augmentations, a ResNet encoder, MLP projection head, and NT-Xent loss, plus a frozen-encoder linear-probe protocol. Wired up config-driven CLI overrides, CSV + TensorBoard logging, automatic CUDA/MPS/CPU selection, AMP, cosine LR with warmup, and deterministic seeds.
impact >
Demonstrates research correctness — the full SimCLR pipeline and linear-probe protocol — alongside engineering hygiene: pytest unit tests, GitHub Actions CI, Dockerfile, Conda and pip environments.
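For readers unfamiliar with NT-Xent, here is a dependency-free sketch of the loss on plain Python lists. It is an illustration under assumptions, not the repository's PyTorch code: the function name, the default temperature of 0.5, and the list-of-lists input format are all chosen for the example, and embeddings are assumed to be non-zero.

```python
import math


def _normalize(v):
    # Scale to unit length so a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent(view_a, view_b, temperature=0.5):
    """view_a[i] and view_b[i] embed two augmentations of the same sample."""
    z = [_normalize(v) for v in view_a + view_b]
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of sample i
        logits = [sum(a * b for a, b in zip(z[i], z[k])) / temperature
                  for k in range(2 * n)]
        # Log-sum-exp over every k != i, shifted by the max for stability.
        others = [logits[k] for k in range(2 * n) if k != i]
        m = max(others)
        lse = m + math.log(sum(math.exp(l - m) for l in others))
        total += lse - logits[pos]
    return total / (2 * n)
```

The loss is low when each embedding is closest to its own second view and rises when the pairing is scrambled, which is the signal the encoder trains on.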
2024 · open-rlhf-pipeline [python · pytorch · trl · gradio · triton]
Open RLHF Pipeline
A reproducible RLHF training stack covering supervised fine-tuning, reward modeling, PPO or DPO policy optimization, evaluation hooks, and a chat demo.
built >
Organized data prep, training scripts, hardware configs, web demo tooling, and kernel experiments into a single repository that is easy to run and extend.
impact >
Documents a full alignment workflow that can run on one consumer GPU and scale to multi-GPU setups, lowering the barrier to hands-on RLHF experimentation.
- May 2026 · linkedin · software engineer intern
- Systems & Infrastructure — incoming Summer 2026 intern.
- Sep 2021 · itron · software engineer intern (full-time alongside university, 4.5-year tenure)
- Built performance-focused Python and C++ data-processing modules for 10M+ time-series events using vectorized kernels and cache-friendly SoA layouts, reducing end-to-end runtime by 25%.
- Profiled PyTorch inference hot paths with Nsight Systems timelines and the PyTorch profiler; isolated host-side kernel launch overhead and H2D stalls, then recovered throughput with batching, pinned memory, and async copies on dedicated CUDA streams.
- Wrote custom CUDA extensions for small but frequent tensor ops, replacing Python-level loops and applying memory coalescing plus SoA layout fixes — lifted GPU inference throughput 1.6× in controlled benchmarks.
- Built microbenchmark harnesses around kernel-level changes, tracking p50/p95 latency, achieved SM occupancy, and HBM bandwidth via Nsight Compute counters to catch regressions before merge.
- Hardened inference services with unit tests, structured logging, and CI perf guardrails that flagged kernel latency and throughput regressions against baseline budgets.
- Automated retraining and evaluation workflows using CI/CD plus DVC-style dataset versioning, cutting deployment cycle time from one week to two days.
- Jan 2024 · asu-acec · ai/ml undergraduate student researcher
- Built a GPU-accelerated retrieval-augmented generation pipeline for medical literature analysis, optimized for embedding throughput and end-to-end query latency.
- Designed an embedding ingestion and indexing pipeline for 15,000+ papers using batched encoder inference, pinned memory transfers, and async prefetch to sustain stable GPU utilization on a single consumer GPU.
- Implemented ranking and re-ranking features using citation metrics, publication dates, and author signals; benchmarked retrieval quality against latency overhead per re-rank stage.
- Benchmarked vLLM, llama.cpp, and HF transformers for medical QA — tracked p50/p95 latency, tokens/sec throughput, and memory footprint across FP16 and INT8 quantization regimes; documented accuracy-vs-throughput tradeoffs.
- Received the Global Impact Award at the ASU Fusion Symposium.
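Both roles above track p50/p95 latency against a baseline. A minimal sketch of such a harness is shown here, with hypothetical names (`bench`, `check_budget`) and an assumed 10% tolerance. It times a CPU callable with `time.perf_counter`; timing GPU work would additionally require synchronizing the device (for example with CUDA events), which this sketch does not do.

```python
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list; pct is an int 1..100."""
    n = len(sorted_samples)
    rank = max(1, (pct * n + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_samples[rank - 1]


def bench(fn, warmup=10, iters=200):
    """Run fn repeatedly and report p50/p95 wall-clock latency in seconds."""
    for _ in range(warmup):  # discard cold-start iterations
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


def check_budget(result, baseline_p95, tolerance=1.10):
    """True if p95 stays within the allowed slack over the stored baseline."""
    return result["p95"] <= baseline_p95 * tolerance
```

A CI guardrail in this style would call `bench` on the changed code path and fail the job when `check_budget` returns `False`.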
- languages: rust, c++, cuda, python, sql
- compilers: mlir, llvm-ir, nvptx
- ml-systems: pytorch, triton, vllm
- profiling: nsight-systems, nsight-compute, pytorch-profiler
- tools: linux, git, docker, ci/cd
- georgia tech · m.s. computer science · computing systems
2025 – 2027 (expected)
deep learning · high-performance computing · gpu hardware & software · hpc architecture
- arizona state · b.s. data science · cs concentration
2021 – 2025
data structures & algorithms · applied linear algebra · machine learning