Featherless AI
Role Overview
We are seeking an AI Researcher with deep experience in inference optimization to design, evaluate, and deploy high-performance inference systems for large-scale machine learning models. You will work at the intersection of model architecture, systems engineering, and hardware-aware optimization, improving latency, throughput, and cost efficiency across real-world production environments.
Key Responsibilities
- Research and develop techniques to optimize inference performance for large neural networks.
- Improve latency, throughput, memory efficiency, and cost per inference.
- Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications).
- Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs. decode optimization).
- Benchmark inference workloads across hardware accelerators.
- Collaborate with engineering teams to deploy optimized inference pipelines.
- Translate research insights into production-ready improvements.
Required Qualifications
- Strong background in machine learning, deep learning, or AI systems.
- Hands-on experience optimizing inference for large-scale models.
- Proficiency in Python and modern ML frameworks (e.g., PyTorch).
- Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime).
- Ability to design experiments and communicate results clearly.
Preferred / Nice-to-Have Qualifications
- Experience deploying production inference systems at scale.
- Familiarity with distributed and multi-GPU inference.
- Experience contributing to open-source ML or inference frameworks.
- Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields.
- Experience working close to hardware (CUDA, ROCm, profiling tools).
What Success Looks Like
- Measurable gains in latency, throughput, and cost efficiency.
- Optimized inference systems running reliably in production.
- Research ideas successfully translated into deployable systems.
- Clear benchmarks and documentation that inform product decisions.
Relevant Research Areas (Bonus)
- Long-context inference optimization
- Speculative decoding
- KV-cache compression and paging
- Efficient decoding strategies
- Hardware-aware inference design
Originally posted on Himalayas
To apply for this job, please visit himalayas.app.