Summary
Xi Luo is a middleware development engineer with a decade of experience specializing in high-performance computing, MPI collective operations, and performance analysis. Based in the San Francisco Bay Area, he builds low-latency runtime systems and has a strong track record of optimizing collective communication across CPU, GPU, and FPGA platforms. His work spans industry, national labs, and academic research: at SLAC he contributed FPGA offload modules for Realm, and during his PhD research he developed autotuned, topology-aware collective frameworks in Open MPI, with award-winning results at scale. Xi combines systems-level C/C++ expertise with practical Python tooling for profiling and in-situ analytics. He has reduced autotuning time and, in simulator validations, improved model accuracy to within 10% of real performance. Notably, he designed lock-free intra-node memory pools and novel cost models that materially improved MPI_Bcast and MPI_Allreduce performance on thousands of cores. He brings a research-driven approach to production middleware, translating academic innovations into deployable runtime components.
10 years of coding experience
8 years of employment as a software developer
Bachelor of Science - BS, Computer Science at Sichuan University
Master of Science - MS, Computer Science at Stevens Institute of Technology
Doctor of Philosophy - PhD, Computer Science at University of Tennessee, Knoxville