Coordinates and bounding boxes

MessagesImages and vision

Coordinates and bounding boxes

How Claude resizes images, and how to work with the pixel coordinates it returns for bounding boxes, points, and UI elements.

Claude can locate and label regions of an image (for example, returning bounding boxes for tables, form fields, chart elements, or UI components). This guide covers how Claude resizes images before processing them and how to work with the pixel coordinates it returns, so that boxes and points line up with your original image.

You'll need this for OCR pipelines, form extraction, chart parsing, UI element location, and any task where you act on a specific region of an image. For sending images, supported formats, and per-model resolution limits, see Vision.

Claude works best with absolute pixel coordinates. Ask for them explicitly in your prompt. For example: "Return the bounding box of each table as [x1, y1, x2, y2] in pixel coordinates." Claude does not work well when you ask for normalized coordinates, for example: "Return bounding box coordinates between 0 and 1000." Always ask for pixel coordinates and normalize in your own code if you need to.

Coordinates follow the standard image convention: the origin (0, 0) is the top-left corner of the image, with x increasing to the right and y increasing downward. The coordinates Claude returns are pixel positions in the image Claude sees: your image after Claude resizes it to fit the model's native resolution (see How Claude resizes and pads images). To get coordinates you can use directly, either pre-resize your image so the coordinates map one-to-one onto the image you have (see Resize your image before uploading), or rescale the coordinates Claude returns (see Rescale coordinates when you cannot pre-resize).

Claude's spatial reasoning has limits (see Limitations). Coordinate accuracy is best when you state the expected coordinate format in your prompt and spot-check results visually before processing at scale. For PDF support, pages are rasterized to images server-side at dimensions you don't control, so the returned coordinates can't be reliably mapped back onto the page. To work with coordinates on PDF content, rasterize the pages to images yourself and use the pre-resize approach.

Was this page helpful?

import math


def count_image_tokens(width: int, height: int) -> int:
    """Visual tokens consumed by an image: one token per 28x28 pixel patch."""
    return math.ceil(width / 28) * math.ceil(height / 28)


def resized_size(
    width: int,
    height: int,
    max_edge: int = 1568,
    max_tokens: int = 1568,
) -> tuple[int, int]:
    """The size Claude resizes an image to before padding.

    Defaults are for the standard resolution tier. For high-resolution-tier
    models, use max_edge=2576 and max_tokens=4784. Returns (width, height).
    Images that already fit within the limits are returned unchanged.
    """

    def fits(w: int, h: int) -> bool:
        return (
            math.ceil(w / 28) * 28 <= max_edge
            and math.ceil(h / 28) * 28 <= max_edge
            and count_image_tokens(w, h) <= max_tokens
        )

    if fits(width, height):
        return (width, height)
    if height > width:
        resized_h, resized_w = resized_size(height, width, max_edge, max_tokens)
        return (resized_w, resized_h)

    # Binary search along the long edge for the largest aspect-preserving
    # size that fits.
    aspect_ratio = width / height
    lo, hi = 1, width  # lo always fits; hi never fits
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fits(mid, max(round(mid / aspect_ratio), 1)):
            lo = mid
        else:
            hi = mid
    return (lo, max(round(lo / aspect_ratio), 1))


# The A4 example from "How Claude resizes and pads images":
print(resized_size(1075, 1520))  # (924, 1307)

def to_relative_coordinates(
    x: float,
    y: float,
    original_width: int,
    original_height: int,
    max_edge: int = 1568,
    max_tokens: int = 1568,
) -> tuple[float, float]:
    """Map a pixel coordinate returned by Claude to relative coordinates in [0, 1].

    Pass the dimensions of the image you uploaded. For high-resolution-tier
    models, use max_edge=2576 and max_tokens=4784.
    """
    resized_w, resized_h = resized_size(
        original_width, original_height, max_edge, max_tokens
    )
    return (x / resized_w, y / resized_h)


# To express the coordinate in your original image's pixel space, multiply the
# relative coordinate by your original dimensions:
# (rel_x * original_width, rel_y * original_height)

How Claude resizes and pads images

Resize your image before uploading

Rescale coordinates when you cannot pre-resize

Related

How Claude resizes and pads images

Resize your image before uploading

Rescale coordinates when you cannot pre-resize

Related