While newer, pre-trained AI models and inference infrastructure typically support transactional, online workloads like chatbots and IDEs, they are also incredibly useful for offline, large-scale data analysis and generation workloads.

Sutro offers a managed, scalable, and cost-effective platform for batch and offline inference with LLMs. We handle the infrastructure, inference optimizations, and cost management so you can focus on solving the problems you care about. You simply bring your data (and, optionally, your models), and we'll handle the rest.

With Sutro, you can expect improvements in:
Speed: large-scale jobs (thousands of requests or more) finish in minutes, allowing you to iterate quickly.
Scale: our platform handles jobs ranging from a few tokens to billions of tokens.
Cost: in certain cases, you can pay up to 90% less than with real-time inference providers.
Security: set custom data retention policies and optionally bring your own storage for zero-visibility deployments.
Ease of use: get up and running quickly with our Python SDK, and monitor jobs and view results in our web observability app.
Ready to get started? Refer to the Python SDK reference for more details, or check out our Quick Start page to get up and running immediately.
Sutro is a good fit if you're using LLMs for offline, analytical, or generation workloads — likely over tables or collections of objects — where latency matters far less than cost or quality.
Sutro is not the right fit if you're serving a user-facing application or product with one-off, online inference calls (e.g. a chatbot) where latency is critical. For such use cases, we recommend an inference provider that optimizes for latency.
Not sure if Sutro is right for you, or not sure how to get started?
We'd love to hear about your use case and help you figure out whether we offer the right solution. We also offer bespoke, custom solutions for enterprise customers. Please contact us at team@sutro.sh.