Local vs. Cloud
One of the most practical choices you make with LLMs is whether to run them locally or call them through a cloud API.
Local models
Local models run on your own hardware. That often gives you:
- better privacy and data control
- offline access
- predictable marginal cost once hardware is in place
- more room to experiment with open-weight models
The downside is that you inherit the operational work, such as model downloads and runtime setup, and you are bound by your hardware: the size and quality of the models you can run are limited by what your machine can handle.
Cloud models
Cloud models are hosted by a provider. That often gives you:
- access to stronger frontier systems
- simpler setup
- managed scaling
- more mature ecosystem features such as structured output, tool calling, and enterprise controls
The tradeoff is ongoing usage cost and less control over where the model runs or how it is updated.
The right answer depends on your priorities. If privacy and flexibility matter most, local can be compelling. If capability and convenience matter most, cloud often wins.
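The cost side of this tradeoff can be made concrete with a rough break-even estimate. The sketch below assumes a simple model in which local use has a one-time hardware cost plus a small per-token running cost (electricity, for example), while cloud use has only a per-token price. The function name and all figures are hypothetical and chosen for illustration; they are not real prices.

```python
import math


def break_even_tokens(hardware_cost: float,
                      local_cost_per_million: float,
                      cloud_cost_per_million: float) -> float:
    """Return the token volume at which local and cloud total costs are equal.

    All costs share one currency. Per-token costs are expressed per
    million tokens. Returns math.inf if cloud is not more expensive per
    token than local, because the hardware cost is then never recovered.
    """
    saving_per_million = cloud_cost_per_million - local_cost_per_million
    if saving_per_million <= 0:
        return math.inf
    # Millions of tokens needed to pay off the hardware, converted to tokens.
    return hardware_cost / saving_per_million * 1_000_000


# Hypothetical figures: 2000 for hardware, 0.50 per million tokens locally,
# 4.50 per million tokens in the cloud.
tokens = break_even_tokens(2000, 0.5, 4.5)
print(f"Break-even at about {tokens:,.0f} tokens")
```

Below the break-even volume, cloud is cheaper under these assumptions; above it, local is. The estimate ignores setup time, maintenance, and quality differences between models, so treat it as a starting point rather than a full comparison.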
Modality
Not all models work with the same kinds of input and output. This is where modality comes in.
Modality describes the form of information a model can accept or produce, such as:
- text
- images
- audio
- video
- mixed combinations of the above
Common categories
A text-only LLM works with language tokens. A vision-language model can take text plus images. Speech models work with audio. Some systems are more broadly multimodal, meaning they can interpret or generate several kinds of media.
Why modality matters
Modality changes the design space. A model that can see screenshots can help with UI review. A model that can hear audio can support transcription or voice workflows. A text-only model may still be excellent for planning, coding, or document work.
So when choosing a model, do not just ask “which one is smartest?” Also ask “which one works with the kinds of information this task actually uses?”
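One way to apply that second question is to filter candidate models by the modalities a task needs before comparing them on quality. The following is a minimal sketch; the `ModelProfile` type, the catalog entries, and their modality sets are hypothetical placeholders, not descriptions of real models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical description of what a model accepts and produces."""
    name: str
    inputs: frozenset
    outputs: frozenset


# Illustrative catalog only; real models should be checked against
# their own documentation.
CATALOG = [
    ModelProfile("text-only-llm", frozenset({"text"}), frozenset({"text"})),
    ModelProfile("vision-language-model",
                 frozenset({"text", "image"}), frozenset({"text"})),
    ModelProfile("speech-model", frozenset({"audio"}), frozenset({"text"})),
]


def compatible_models(catalog, needed_inputs, needed_outputs):
    """Return names of models that cover every required modality."""
    needed_in = frozenset(needed_inputs)
    needed_out = frozenset(needed_outputs)
    return [
        model.name
        for model in catalog
        # A model qualifies only if the needed sets are subsets of its own.
        if needed_in <= model.inputs and needed_out <= model.outputs
    ]


# A UI review task that sends screenshots plus a text prompt.
print(compatible_models(CATALOG, {"text", "image"}, {"text"}))
```

Filtering first keeps the "which one is smartest?" comparison limited to models that can actually handle the task's inputs and outputs.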
Fast vs. Thinking
Some models are designed to respond as quickly as possible. Others are built to reason through a problem before committing to an answer. These two groups are often called fast models and thinking models.
Fast models
Fast models prioritize low latency. They generate responses token by token without an extended internal reasoning step. For most everyday tasks (summarization, drafting, classification, simple Q&A), a fast model is the right default. When the task is well defined, speed matters more than deliberation.
Thinking models
Thinking models spend additional compute before producing a final response. This often takes the form of an internal chain of reasoning that the model works through before replying. The output tends to be more reliable on problems that require multi-step logic, mathematics, code debugging, or anything where rushing to an answer increases the chance of error.
The tradeoff is latency and cost. Thinking models are slower and more expensive per response. That makes them a poor fit for high-throughput or real-time use cases, but a strong choice when accuracy on hard problems matters more than speed.
In practice, many teams use both: a fast model for routine operations and a thinking model for the cases where getting it right is worth the wait.
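A setup like this usually comes down to a small routing rule. The sketch below is one possible version, assuming tasks arrive with a type label and a flag for real-time use; the model names, task labels, and `choose_model` function are hypothetical and stand in for whatever a given team actually uses.

```python
# Hypothetical model identifiers, used only for illustration.
FAST_MODEL = "fast-model"
THINKING_MODEL = "thinking-model"

# Task types where extra reasoning tends to pay off, following the
# categories named in the text. The labels themselves are assumptions.
HARD_TASKS = {"multi_step_logic", "mathematics", "code_debugging"}


def choose_model(task_type: str, realtime: bool = False) -> str:
    """Pick a model for a task using a simple two-rule policy."""
    # Real-time paths cannot absorb the extra latency, so they stay fast.
    if realtime:
        return FAST_MODEL
    # Hard problems get the slower, more deliberate model.
    if task_type in HARD_TASKS:
        return THINKING_MODEL
    # Everything else uses the fast default.
    return FAST_MODEL


print(choose_model("summarization"))
print(choose_model("code_debugging"))
print(choose_model("code_debugging", realtime=True))
```

Checking the real-time flag first is a design choice: it treats latency as a hard constraint and accuracy as a preference. A team that values accuracy above responsiveness might reverse the order of the two checks.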