Devendar Bureddy is a Distinguished Engineer based in California with 12 years of experience building high-performance, GPU-accelerated system software for industry-leading vendors including NVIDIA and Mellanox. He specializes in low-latency distributed communication and MPI ecosystems, contributing upstream to prominent open-source projects like UCX and Open MPI where he implemented CUDA integration, memory-type support, and optimized collective operations. Known for solving hard problems in memory registration, datatype handling, and progress efficiency, he blends deep kernel-to-user-space systems expertise with pragmatic engineering leadership. A master’s graduate from IIT Kanpur, Devendar brings a research-informed approach to production-grade software and a track record of influencing both vendor platforms and community-driven HPC stacks.
12 years of coding experience
18 years of employment as a software developer
Jawaharlal Nehru Technological University Hyderabad
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
Role in this project:
Backend Developer
Contributions: 150 reviews, 254 commits, 165 PRs in 4 years 3 months
Contributions summary: Devendar primarily contributed CUDA-related functionality to the Unified Communication X (UCX) library. His work included adding build configurations and flags for CUDA, configuring GDRCOPY, and introducing and modifying various UCT/API interfaces. He also made changes to incorporate and test the new memory type support. These contributions show a focus on integrating CUDA and GPU memory support into the UCX framework.
Contributions: 12 reviews, 51 commits, 11 PRs in 9 years 2 months
Contributions summary: Devendar primarily contributed to the Open MPI project by modifying and enhancing the HCOLL collective communications module. His changes included implementing support for new collective operations such as gatherv and alltoallv, fixing datatype handling issues, and improving performance through progress-related optimizations. He also addressed memory management issues, particularly memory registration limits within the OpenIB BTL (Byte Transfer Layer) component. These modifications show a focus on improving the efficiency and functionality of MPI collective operations.
mpi, cluster-computing, fortran, openmpi, petsc
Devendar Bureddy - Distinguished Engineer at NVIDIA