Engineering Manager of the PyTorch nvFuser Team at NVIDIA
San Francisco, California, United States
Summary
🤩 Rockstar
🎓 Top School
Kevin Stephano is the Engineering Manager of the PyTorch nvFuser team at NVIDIA, based in San Francisco, with six years focused on deep learning infrastructure and a long history of GPU and compiler work. He leads development of the nvFuser Python frontend and fusion caching while continuing to contribute hands-on C++/CUDA and Python code to PyTorch and NVIDIA/apex, including novel multihead-attention and fused-optimizer implementations. His optimizations have driven measurable performance wins, from a rewrite of Transformer attention that eliminates copies and transposes to MLPerf training submissions that delivered 2x single-GPU speedups and scaled out to hundreds of GPUs. Comfortable spanning hardware and software, he pairs low-level GPU/RTL experience (including earlier GPU simulator and FPU design work) with production ML system engineering, making him effective at turning research primitives into deployable, high-performance kernels.
6 years of coding experience
17 years of employment as a software developer
Data Science Course at General Assembly
A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
Role in this project:
ML Engineer
Contributions: 6 reviews, 6 commits, 6 PRs in 2 years 4 months
Contributions summary: Kevin contributed significantly to the `nvidia/apex` repository, focusing on enhancing and optimizing multihead attention for PyTorch. He added a C++ multihead attention implementation to the contrib module and wrote several Python versions of the attention models. He also improved the performance of existing kernels by updating them to use the current CUDA stream, and worked on integrating the fused LAMB optimizer. The changes cover both forward and backward passes, reflecting a focus on both model functionality and training efficiency.
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Role in this project:
ML Engineer
Contributions: 87 reviews, 44 commits, 19 PRs in 2 years 2 months
Contributions summary: Kevin primarily contributes to the nvFuser Python frontend within the PyTorch repository, a key component for accelerating deep learning computations. His work focuses on enhancing the nvFuser framework, including implementing caching mechanisms for fusion reuse and improving batch normalization functionality. These changes touch the C++ and Python bindings for nvFuser, specifically adding support for new primitives such as `rand_like` and improving code organization and the printing of function definitions. This work aims to improve both performance and usability.
python, gpu-acceleration, deep-learning, gpu, numpy