Intercloud Brokers

May 16, 2023

Vicuna 13B was fine-tuned from LLaMA for about $300 using SkyPilot's managed spot instances. The 7B model was trained for about $140.

SkyPilot is a framework for training large models on spot instances. It comes from Ion Stoica’s lab at UC Berkeley (Stoica is a co-founder of Databricks and was previously its CEO).

But it does more than make training large language models cheap. It tracks pricing and availability across clouds as they change, and it optimizes placement across the application DAG, considering egress fees, resource availability, quota availability, and cross-cloud differences.
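To make the placement idea concrete, here is a toy sketch of choosing a cloud for each task in a two-stage pipeline. It is not SkyPilot's optimizer or API: every name, price, and data size below is a made-up assumption, and the search is plain brute force. It only shows how compute prices, egress fees, and quota availability can interact.

```python
from itertools import product

# All numbers are hypothetical placeholders, not real cloud quotes.
PRICE_PER_HOUR = {
    "preprocess": {"aws": 0.40, "gcp": 0.35, "azure": 0.45},
    "train": {"aws": 12.0, "gcp": 14.0, "azure": 9.0},
}
HOURS = {"preprocess": 2.0, "train": 10.0}

# Hypothetical egress fee (USD per GB) charged by the cloud the data leaves.
EGRESS_PER_GB = {"aws": 0.09, "gcp": 0.12, "azure": 0.087}

# Edges of the application DAG: (upstream task, downstream task, GB moved).
EDGES = [("preprocess", "train", 500.0)]

# (task, cloud) pairs with no quota or capacity right now (assumed).
UNAVAILABLE = {("train", "azure")}


def plan_cost(placement):
    """Total cost of a placement: compute plus cross-cloud egress."""
    compute = sum(PRICE_PER_HOUR[t][c] * HOURS[t] for t, c in placement.items())
    egress = sum(
        EGRESS_PER_GB[placement[src]] * gb
        for src, dst, gb in EDGES
        if placement[src] != placement[dst]  # same-cloud transfers are free here
    )
    return compute + egress


def cheapest_plan():
    """Brute-force every task-to-cloud assignment and keep the cheapest."""
    tasks = list(PRICE_PER_HOUR)
    best = None
    for clouds in product(*(PRICE_PER_HOUR[t] for t in tasks)):
        placement = dict(zip(tasks, clouds))
        if any((t, c) in UNAVAILABLE for t, c in placement.items()):
            continue  # skip plans that need capacity we cannot get
        cost = plan_cost(placement)
        if best is None or cost < best[1]:
            best = (placement, cost)
    return best


if __name__ == "__main__":
    placement, cost = cheapest_plan()
    print(placement, round(cost, 2))
```

With these made-up numbers, the cheapest accelerator is on a cloud with no available quota, and moving 500 GB between clouds costs more than the compute savings, so the whole pipeline lands on one cloud. Brute force is fine for two tasks; a real optimizer would need something smarter for larger DAGs.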

The benefit today is managed spot jobs for computationally heavy batch workloads: find spot resources across regions, checkpoint work, recover from preemptions and other failures, then auto-stop when the job is complete.
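The checkpoint-and-recover loop is simple enough to sketch in a few lines. This is a simulation under stated assumptions, not SkyPilot code: `Preempted`, `train`, and `run_with_recovery` are hypothetical names, the "work" is a counter, and a preemption is just an exception raised at chosen steps.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Stands in for a spot instance being reclaimed mid-job."""


def load_step(path):
    """Return the last checkpointed step, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path, step):
    """Write to a temp file, then rename, so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path, total_steps, preempt_at):
    """Resume from the checkpoint and run until done or preempted."""
    step = load_step(path)
    while step < total_steps:
        if step in preempt_at:
            preempt_at.discard(step)  # each simulated preemption fires once
            raise Preempted(step)
        step += 1  # placeholder for one unit of real work
        if step % 10 == 0:
            save_step(path, step)  # checkpoint every 10 steps
    save_step(path, step)
    return step


def run_with_recovery(path, total_steps, preempt_at):
    """Restart the job after every preemption until it finishes."""
    restarts = 0
    while True:
        try:
            return train(path, total_steps, preempt_at), restarts
        except Preempted:
            restarts += 1  # a real controller would provision a new instance here


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.json")
        print(run_with_recovery(ckpt, 100, {25, 60}))
```

Work done since the last checkpoint is lost on each preemption (here, steps 21 through 25 get redone), so the checkpoint interval is a trade-off between write overhead and wasted compute.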

Cloud costs are real, especially for machine-learning-focused companies that require accelerators. Multi-cloud is hard across an entire application stack, but it just might work if the domain is constrained enough and the compute is big enough (see multi-model vs. multi-cloud).