Kevin Stephano

Engineering Manager of the PyTorch nvFuser Team at NVIDIA

San Francisco, California, United States

Summary

🤩 Rockstar

🎓 Top School

Kevin Stephano is the Engineering Manager of the PyTorch nvFuser team at NVIDIA, based in San Francisco, with six years focused on deep learning infrastructure and a long history of GPU and compiler work. He leads development of the nvFuser Python frontend and fusion caching while continuing to contribute hands-on C++/CUDA and Python code to PyTorch and NVIDIA/apex, including novel multihead-attention and fused-optimizer implementations. His optimizations have driven measurable performance wins, from a rewrite of Transformer attention that eliminates copies and transposes to MLPerf training submissions that delivered 2x single-GPU speedups and scaled out to hundreds of GPUs. Comfortable spanning hardware and software, he pairs low-level GPU and RTL experience (including earlier GPU simulator and FPU design work) with production ML systems engineering, which makes him effective at turning research primitives into deployable, high-performance kernels.

6 years of coding experience

17 years of employment as a software developer

Data Science Course, General Assembly

University of Illinois Urbana-Champaign

English

GitHub Skills (14)

cuda: 10

pytorch: 10

machine-learning: 10

c-language: 10

deep-learning: 10

cprogramming-language: 10

python: 10

gpu: 9

neural-network: 9

cublas: 9

tensor: 8

autograd: 7

numpy: 6

tensorflow: 3

Programming languages (2)

C++, Python

GitHub contributions (5)

NVIDIA/apex

Feb 2020 - Jun 2022

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Role in this project:

ML Engineer

Contributions: 6 reviews, 6 commits, 6 PRs in 2 years 4 months

Contributions summary: Kevin contributed significantly to the `NVIDIA/apex` repository, focusing on enhancing and optimizing multihead attention for PyTorch. He implemented a C++ multihead attention module in the contrib package and wrote several Python versions of the attention models. He also improved existing kernels by updating them to use the current CUDA stream, and worked on integrating the fused LAMB optimizer. The changes cover both forward and backward passes, targeting model functionality as well as training efficiency.

pytorch, ray, mixed-precision, deep-learning, temporal-data

pytorch/pytorch

Sep 2020 - Nov 2022

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Role in this project:

ML Engineer

Contributions: 87 reviews, 44 commits, 19 PRs in 2 years 2 months

Contributions summary: Kevin primarily contributes to the nvFuser Python frontend in the PyTorch repository, a key component for accelerating deep learning computations. His work focuses on enhancing the nvFuser framework, including caching mechanisms that let fusions be reused and improvements to batch normalization. These changes touch the nvFuser C++ code and its Python bindings: adding support for new primitives such as `rand_like`, improving code organization, and improving how function definitions are printed. The work aims to improve both performance and usability.

python, gpu-acceleration, deep-learning, gpu, numpy
