Simon Mo is a software leader and founder with nine years of experience building high-throughput inference and serving systems for machine learning. As CEO and co-founder of Inferact and a core maintainer of the popular open-source vLLM project, he focuses on making LLM inference faster and cheaper while driving production-ready tooling. His background spans building Ray Serve at Anyscale, production GPU efficiency work at Character.AI, and core contributions to Ray, Modin, and Clipper that improved distributed runtimes, data I/O, and low-latency prediction serving. He combines hands-on backend and DevOps expertise (containerization, Kubernetes, Prometheus metrics, and performance tuning) with research rigor from his PhD work at UC Berkeley. Based in Berkeley, he has a track record of turning prototype models into scalable deployments, and his open-source work includes migrating linters and adding observability hooks that improved maintainability in major repositories.
9 years of coding experience
7 years of employment as a software developer
Doctor of Philosophy (PhD), Computer Science, University of California, Berkeley
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
Role in this project:
Backend & DevOps Engineer
Contributions: 1021 reviews, 1290 PRs, 1119 pushes in 1 year 5 months
Contributions summary: Simon fixed a bug involving duplicated sequence groups within the engine step, reflecting a focus on core engine functionality. He also documented deployment with the official Docker image, updating the documentation to cover shared memory usage. In addition, he migrated the linter from `pylint` to `ruff`, improving code quality and maintainability, and added production metrics exposed in Prometheus format.
Contributions: 1 release, 29 commits, 97 PRs in 2 years 9 months
Contributions summary: Simon primarily focused on improving the Clipper project's infrastructure and backend functionality. His contributions include fixing broken links, resolving Docker and exception-handling issues, and implementing a metrics monitoring system, including frontend exporters. He also made significant changes to the Docker container manager and added Kubernetes support for multi-tenancy.
python, serving, prediction, deep-learning, latency