Summary
Ti Cheng is a Senior Software Engineer specializing in LLMs and NLP, with seven years of experience delivering faster, more accurate AI inference for production systems. He has hands-on expertise in Transformer architectures and inference acceleration techniques, including speculative decoding variants (Kangaroo, Medusa, Hydra, Eagle) and ongoing Arctic Inference research, applied with tools such as vLLM, SGLang, PyTorch, CUDA, and Triton. Ti has led end-to-end initiatives ranging from LoRA/QLoRA fine-tuning to building RAG systems for intelligent customer service, achieving measurable gains such as 100% inference speedups and model accuracies of up to 97% in commercial projects. An active open-source contributor (HuggingFace TRL, FlagEmbedding) and blogger at Clay-Technology World, he blends research-driven optimization with production engineering. Based in Taipei, he is currently focused on fast-llm-inference and low-level GPU kernel optimization to make large models more accessible and efficient.
7 years of coding experience
4 years of employment as a software developer
Master of Science - MS, Computer Science at 國立政治大學 (National Chengchi University)