Featherless AI
About the Role
We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
What You’ll Do
- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and remove bottlenecks (memory, kernels, batching, IO)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads
What We’re Looking For
- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfortable working in fast-moving startup environments with ownership and ambiguity
Nice to Have
- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services
Why Join Us
- Real ownership over performance-critical systems
- Direct impact on product reliability and unit economics
- Close collaboration with research, infra, and product
- Competitive compensation + meaningful equity at Series A
- A team that cares about engineering quality, not hype
Originally posted on Himalayas
To apply for this job, please visit himalayas.app.