Latency refers to the time it takes for the model to process a prompt and generate an output. Latency can be influenced by various factors, such as the size of the model, the complexity of the prompt, and the underlying infrastructure supporting the model and the point of interaction.
It's always better to first engineer a prompt that works well without model or prompt constraints, and then try latency reduction strategies afterward. Trying to reduce latency prematurely might prevent you from discovering what top performance looks like.
When discussing latency, you may come across several terms and measurements, such as baseline latency and time to first token (TTFT).
For a more in-depth understanding of these terms, check out our glossary.
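For a rough sense of these measurements in your own environment, you can simply time a request end to end. The following is a minimal sketch using the Python SDK; the model and prompt are placeholders to swap for your own.

```python
import time

import anthropic

client = anthropic.Anthropic()

# Time a single request to get a rough round-trip latency measurement
# for a given model and prompt (both are placeholders here).
start = time.perf_counter()
message = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=100,
    messages=[{"role": "user", "content": "Give a one-sentence status update."}],
)
elapsed = time.perf_counter() - start

print(f"Round-trip latency: {elapsed:.2f}s")
print(f"Output tokens generated: {message.usage.output_tokens}")
```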
One of the most straightforward ways to reduce latency is to select the appropriate model for your use case. Anthropic offers a range of models with different capabilities and performance characteristics. Consider your specific requirements and choose the model that best fits your needs in terms of speed and output quality.
For speed-critical applications, Claude Haiku 4.5 offers the fastest response times while maintaining high intelligence:
```python
import anthropic

client = anthropic.Anthropic()

# For time-sensitive applications, use Claude Haiku 4.5
message = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": "Summarize this customer feedback in 2 sentences: [feedback text]"
    }]
)
```

For more details about model metrics, see our models overview page.
Minimize the number of tokens in both your input prompt and the expected output, while still maintaining high performance. The fewer tokens the model has to process and generate, the faster the response will be.
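If you want to compare the token cost of different phrasings before sending a request, the token counting endpoint can help. Here is a minimal sketch using the Python SDK; the prompt is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

# Count the input tokens a prompt would consume without generating a response,
# useful for comparing shorter and longer phrasings of the same instruction.
count = client.messages.count_tokens(
    model="claude-haiku-4-5",
    messages=[{
        "role": "user",
        "content": "Summarize this customer feedback in 2 sentences: [feedback text]"
    }]
)

print(count.input_tokens)
```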
Here are some tips to help you optimize your prompts and outputs:
- Use the `max_tokens` parameter to set a hard limit on the maximum length of the generated response. This prevents Claude from generating overly long outputs. Note: when the response reaches `max_tokens` tokens, it will be cut off, perhaps mid-sentence or mid-word, so this is a blunt technique that may require post-processing and is usually most appropriate for multiple choice or short answer responses where the answer comes right at the beginning.
- The `temperature` parameter controls the randomness of the output. Lower values (e.g., 0.2) can sometimes lead to more focused and shorter responses, while higher values (e.g., 0.8) may result in more diverse but potentially longer outputs.

Finding the right balance between prompt clarity, output quality, and token count may require some experimentation; the sketch below shows one possible starting point.
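As an illustration of these tips, the following sketch combines a concise prompt with a small `max_tokens` cap and a low temperature. The specific values are illustrative starting points, not recommendations.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative settings: a tight max_tokens cap and a low temperature tend to
# keep responses short and focused. Tune both for your own use case.
message = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=50,       # hard cap on output length; may cut off mid-sentence
    temperature=0.2,     # lower values often produce shorter, more focused output
    messages=[{
        "role": "user",
        "content": "In one sentence, what is the main complaint in this review? [review text]"
    }]
)

# stop_reason is "max_tokens" if the response was cut off by the cap
print(message.stop_reason)
print(message.content[0].text)
```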
Streaming is a feature that allows the model to start sending back its response before the full output is complete. This can significantly improve the perceived responsiveness of your application, as users can see the model's output in real-time.
With streaming enabled, you can process the model's output as it arrives, updating your user interface or performing other tasks in parallel. This can greatly enhance the user experience and make your application feel more interactive and responsive.
Visit streaming Messages to learn how you can implement streaming for your use case.
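As a minimal sketch using the Python SDK's streaming helper, you can print text as it arrives rather than waiting for the full response; the prompt here is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

# Stream the response and print each text delta as it arrives, so users see
# output immediately instead of waiting for the full completion.
with client.messages.stream(
    model="claude-haiku-4-5",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": "Draft a short reply thanking the customer for their feedback."
    }],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```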