Engines#

PanoOCR uses dependency injection for OCR engines. Provide any object with a matching recognize() method.

OCREngine Protocol#

OCREngine #

Bases: Protocol

Protocol for OCR engines (structural typing).

Any class with a matching recognize() method can be used. No inheritance required.

recognize #

recognize(image: Image) -> list[FlatOCRResult]

Recognize text in an image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
list[FlatOCRResult]

List of FlatOCRResult objects with normalized bounding boxes (0-1 range).

Source code in src/panoocr/api/models.py
def recognize(self, image: Image.Image) -> list[FlatOCRResult]:
    """Recognize text in an image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult objects with normalized bounding boxes (0-1 range).
    """
    ...
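
Because OCREngine is a structural Protocol, any class with a matching recognize() can be injected. A minimal sketch of this — FlatOCRResult is stubbed as a plain dataclass for illustration here; the real class lives in panoocr.api.models, and the exact bounding-box shape is omitted:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class FlatOCRResult:  # stand-in for panoocr.api.models.FlatOCRResult
    text: str
    confidence: float
    bounding_box: object  # normalized 0-1 box; exact shape omitted here
    engine: str


@runtime_checkable
class OCREngine(Protocol):
    def recognize(self, image) -> list[FlatOCRResult]: ...


class EchoEngine:
    """Toy engine: no inheritance, just a matching recognize() signature."""

    def recognize(self, image) -> list[FlatOCRResult]:
        return [FlatOCRResult("hello", 1.0, None, "ECHO")]


# The structural match is checkable at runtime:
assert isinstance(EchoEngine(), OCREngine)
```

Any of the engines below, or your own class shaped like EchoEngine, can be passed wherever PanoOCR expects an engine.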

Structured Engines#

Structured engines return per-word bounding boxes, enabling geographic text indexing.
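
All of these engines ultimately report boxes in the normalized 0-1 range required by the protocol. As an illustration of the conversion their result classes perform in to_flat(), here is a hedged sketch (the helper name normalize_box is ours, not the library's):

```python
def normalize_box(polygon, image_width, image_height):
    """Reduce a pixel-space polygon to an axis-aligned box in the 0-1 range.

    Illustrative only: the real conversion lives in each engine's
    result class (e.g. EasyOCRResult.to_flat()).
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (
        min(xs) / image_width,   # left
        min(ys) / image_height,  # top
        max(xs) / image_width,   # right
        max(ys) / image_height,  # bottom
    )


# A 100x40 px word box inside a 200x100 px image:
box = normalize_box([[10, 20], [110, 20], [110, 60], [10, 60]], 200, 100)
# box == (0.05, 0.2, 0.55, 0.6)
```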

MacOCREngine#

Uses Apple's Vision Framework for fast, accurate OCR on macOS. Requires the [macocr] extra.

pip install "panoocr[macocr]"

MacOCREngine #

MacOCREngine(config: Dict[str, Any] | None = None)

OCR engine using Apple Vision Framework via ocrmac.

This engine uses macOS's built-in Vision Framework for text recognition. It provides excellent accuracy for many languages on Apple Silicon.

Attributes:

Name Type Description
language_preference

List of language codes to use for recognition.

recognition_level

Recognition accuracy level ("fast" or "accurate").

Example

from panoocr.engines.macocr import (
    MacOCREngine,
    MacOCRLanguageCode,
    MacOCRRecognitionLevel,
)

engine = MacOCREngine(config={
    "language_preference": [MacOCRLanguageCode.ENGLISH_US],
    "recognition_level": MacOCRRecognitionLevel.ACCURATE,
})
results = engine.recognize(image)

Note

Requires macOS and the ocrmac package. Install with: pip install "panoocr[macocr]"

Initialize the MacOCR engine.

Parameters:

Name Type Description Default
config Dict[str, Any] | None

Configuration dictionary with optional keys:

- language_preference: List of MacOCRLanguageCode values.
- recognition_level: MacOCRRecognitionLevel value.

None

Raises:

Type Description
ImportError

If ocrmac is not installed.

ValueError

If configuration values are invalid.

Source code in src/panoocr/engines/macocr.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    """Initialize the MacOCR engine.

    Args:
        config: Configuration dictionary with optional keys:
            - language_preference: List of MacOCRLanguageCode values.
            - recognition_level: MacOCRRecognitionLevel value.

    Raises:
        ImportError: If ocrmac is not installed.
        ValueError: If configuration values are invalid.
    """
    # Check dependencies first
    _check_macocr_dependencies()

    config = config or {}

    # Parse language preference
    language_preference = config.get(
        "language_preference", DEFAULT_LANGUAGE_PREFERENCE
    )
    try:
        self.language_preference = [
            lang.value if isinstance(lang, MacOCRLanguageCode) else lang
            for lang in language_preference
        ]
    except (KeyError, AttributeError):
        raise ValueError("Invalid language code in language_preference")

    # Parse recognition level
    recognition_level = config.get("recognition_level", DEFAULT_RECOGNITION_LEVEL)
    try:
        self.recognition_level = (
            recognition_level.value
            if isinstance(recognition_level, MacOCRRecognitionLevel)
            else recognition_level
        )
    except (KeyError, AttributeError):
        raise ValueError("Invalid recognition level")

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List of FlatOCRResult with normalized bounding boxes.

Source code in src/panoocr/engines/macocr.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult with normalized bounding boxes.
    """
    import ocrmac.ocrmac

    annotations = ocrmac.ocrmac.OCR(
        image,
        recognition_level=self.recognition_level,
        language_preference=self.language_preference,
    ).recognize()

    mac_ocr_results = [
        MacOCRResult(
            text=annotation[0],
            confidence=annotation[1],
            bounding_box=annotation[2],
        )
        for annotation in annotations
    ]

    return [result.to_flat() for result in mac_ocr_results]

RapidOCREngine#

Runs PaddleOCR PP-OCRv4/v5 models via ONNX Runtime, bypassing the PaddlePaddle framework entirely (which avoids its Apple Silicon compatibility issues). Supports the v4 and v5 model versions. Requires the [rapidocr] extra.

pip install "panoocr[rapidocr]"

RapidOCREngine #

RapidOCREngine(config: Dict[str, Any] | None = None)

OCR engine using RapidOCR (PP-OCRv4/v5 via ONNX Runtime).

Attributes:

Name Type Description
ocr_version

Which PP-OCR model version to use ("PP-OCRv4" or "PP-OCRv5").

Example

from panoocr.engines.rapidocr_engine import RapidOCREngine

engine_v4 = RapidOCREngine()
engine_v5 = RapidOCREngine(config={"ocr_version": "PP-OCRv5"})
results = engine_v4.recognize(image)

Note

Install with: pip install "panoocr[rapidocr]" onnxruntime

Source code in src/panoocr/engines/rapidocr_engine.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    _check_rapidocr_dependencies()

    from rapidocr import RapidOCR, OCRVersion

    config = config or {}
    version_str = config.get("ocr_version", "PP-OCRv4")

    params: Dict[str, Any] = {}

    if version_str == "PP-OCRv5":
        params["Rec.ocr_version"] = OCRVersion.PPOCRV5
        self._engine_tag = "RAPID_OCR_V5"
    else:
        self._engine_tag = "RAPID_OCR_V4"

    self.ocr = RapidOCR(params=params if params else None)
    self.ocr_version = version_str

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List of FlatOCRResult with normalized bounding boxes.

Source code in src/panoocr/engines/rapidocr_engine.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult with normalized bounding boxes.
    """
    image_array = np.array(image)
    result = self.ocr(image_array)

    if not result or len(result) == 0:
        return []

    rapid_results = []
    for box, txt, score in zip(result.boxes, result.txts, result.scores):
        if not txt or not txt.strip():
            continue

        rapid_results.append(
            RapidOCRResult(
                text=txt,
                bounding_box=box,
                confidence=float(score),
                image_width=image.width,
                image_height=image.height,
            )
        )

    return [r.to_flat(self._engine_tag) for r in rapid_results]

EasyOCREngine#

Cross-platform OCR supporting 80+ languages. Requires the [easyocr] extra.

pip install "panoocr[easyocr]"

EasyOCREngine #

EasyOCREngine(config: Dict[str, Any] | None = None)

OCR engine using EasyOCR library.

EasyOCR supports 80+ languages and can run on CPU or GPU. It provides good accuracy for many scripts including CJK.

Attributes:

Name Type Description
language_preference

List of language codes to use.

reader

EasyOCR Reader instance.

Example

from panoocr.engines.easyocr import EasyOCREngine, EasyOCRLanguageCode

engine = EasyOCREngine(config={
    "language_preference": [EasyOCRLanguageCode.ENGLISH],
    "gpu": True,
})
results = engine.recognize(image)

Note

Install with: pip install "panoocr[easyocr]" For GPU support, install PyTorch with CUDA.

Initialize the EasyOCR engine.

Parameters:

Name Type Description Default
config Dict[str, Any] | None

Configuration dictionary with optional keys:

- language_preference: List of EasyOCRLanguageCode values.
- gpu: Whether to use GPU (default: True).

None

Raises:

Type Description
ImportError

If easyocr is not installed.

ValueError

If configuration values are invalid.

Source code in src/panoocr/engines/easyocr.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    """Initialize the EasyOCR engine.

    Args:
        config: Configuration dictionary with optional keys:
            - language_preference: List of EasyOCRLanguageCode values.
            - gpu: Whether to use GPU (default: True).

    Raises:
        ImportError: If easyocr is not installed.
        ValueError: If configuration values are invalid.
    """
    # Check dependencies first
    _check_easyocr_dependencies()

    import easyocr

    config = config or {}

    # Parse language preference
    language_preference = config.get(
        "language_preference", DEFAULT_LANGUAGE_PREFERENCE
    )
    try:
        self.language_preference = [
            lang.value if isinstance(lang, EasyOCRLanguageCode) else lang
            for lang in language_preference
        ]
    except (KeyError, AttributeError):
        raise ValueError("Invalid language code in language_preference")

    # Parse GPU setting
    use_gpu = config.get("gpu", True)

    # Initialize reader
    self.reader = easyocr.Reader(self.language_preference, gpu=use_gpu)

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List of FlatOCRResult with normalized bounding boxes.

Source code in src/panoocr/engines/easyocr.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult with normalized bounding boxes.
    """
    image_array = np.array(image)
    annotations = self.reader.readtext(image_array)

    easyocr_results = []
    for annotation in annotations:
        bounding_box = annotation[0]
        text = annotation[1]
        confidence = annotation[2]

        easyocr_results.append(
            EasyOCRResult(
                text=text,
                confidence=confidence,
                bounding_box=bounding_box,
                image_width=image.width,
                image_height=image.height,
            )
        )

    return [result.to_flat() for result in easyocr_results]

PaddleOCREngine#

PaddlePaddle-based OCR supporting multiple languages with automatic model management. Requires the [paddleocr] extra (includes both paddleocr and paddlepaddle).

pip install "panoocr[paddleocr]"

PaddleOCREngine #

PaddleOCREngine(config: Dict[str, Any] | None = None)

OCR engine using PaddleOCR library (v3.x).

PaddleOCR is developed by PaddlePaddle and supports multiple languages. It provides good accuracy with automatic model management.

Attributes:

Name Type Description
language_preference

Language code for recognition.

recognize_upside_down

Whether to use textline orientation classifier.

ocr_version

The PP-OCR model version to use.

Example

from panoocr.engines.paddleocr import PaddleOCREngine, PaddleOCRLanguageCode

engine = PaddleOCREngine(config={
    "language_preference": PaddleOCRLanguageCode.CHINESE,
})
results = engine.recognize(image)

Note

Install with: pip install "panoocr[paddleocr]" paddlepaddle For GPU support, install paddlepaddle-gpu instead.

Initialize the PaddleOCR engine.

Parameters:

Name Type Description Default
config Dict[str, Any] | None

Configuration dictionary with optional keys:

- language_preference: PaddleOCRLanguageCode value.
- recognize_upside_down: Enable textline orientation classifier (default: False).
- ocr_version: PaddleOCRVersion or string like "PP-OCRv5" (default: auto-selected by PaddleOCR based on language).
- text_detection_model_name: Override detection model name.
- text_recognition_model_name: Override recognition model name.
- text_det_limit_side_len: Max side length for detection input.
- text_rec_score_thresh: Minimum recognition score threshold.

None

Raises:

Type Description
ImportError

If paddleocr or paddlepaddle is not installed.

ValueError

If configuration values are invalid.

Source code in src/panoocr/engines/paddleocr.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    """Initialize the PaddleOCR engine.

    Args:
        config: Configuration dictionary with optional keys:
            - language_preference: PaddleOCRLanguageCode value.
            - recognize_upside_down: Enable textline orientation classifier
              (default: False).
            - ocr_version: PaddleOCRVersion or string like "PP-OCRv5"
              (default: auto-selected by PaddleOCR based on language).
            - text_detection_model_name: Override detection model name.
            - text_recognition_model_name: Override recognition model name.
            - text_det_limit_side_len: Max side length for detection input.
            - text_rec_score_thresh: Minimum recognition score threshold.

    Raises:
        ImportError: If paddleocr or paddlepaddle is not installed.
        ValueError: If configuration values are invalid.
    """
    # Check dependencies first
    _check_paddleocr_dependencies()

    from paddleocr import PaddleOCR

    config = config or {}

    # Parse language preference
    language = config.get("language_preference", DEFAULT_LANGUAGE)
    try:
        self.language_preference = (
            language.value if isinstance(language, PaddleOCRLanguageCode) else language
        )
    except (KeyError, AttributeError):
        raise ValueError("Invalid language code")

    # Parse textline orientation (replaces use_angle_cls from v2)
    self.recognize_upside_down = config.get(
        "recognize_upside_down", DEFAULT_RECOGNIZE_UPSIDE_DOWN
    )
    if not isinstance(self.recognize_upside_down, bool):
        raise ValueError("recognize_upside_down must be a boolean")

    # Parse OCR version
    ocr_version = config.get("ocr_version", None)
    if isinstance(ocr_version, PaddleOCRVersion):
        ocr_version = ocr_version.value
    self.ocr_version = ocr_version

    # Handle deprecated use_v4_server config
    if config.get("use_v4_server", False):
        warnings.warn(
            "use_v4_server is deprecated in PaddleOCR 3.x. "
            "Use ocr_version='PP-OCRv4' instead. "
            "Model management is now handled automatically by PaddleOCR.",
            DeprecationWarning,
            stacklevel=2,
        )
        if ocr_version is None:
            ocr_version = "PP-OCRv4"

    # Handle deprecated use_gpu config
    if "use_gpu" in config:
        warnings.warn(
            "use_gpu is deprecated in PaddleOCR 3.x. "
            "GPU/CPU device selection is now automatic.",
            DeprecationWarning,
            stacklevel=2,
        )

    # Build PaddleOCR kwargs
    ocr_kwargs: Dict[str, Any] = {
        "lang": self.language_preference,
        "use_textline_orientation": self.recognize_upside_down,
        # Disable document preprocessing for perspective images
        # (these are designed for scanned documents, not perspective crops)
        "use_doc_orientation_classify": False,
        "use_doc_unwarping": False,
    }

    if ocr_version is not None:
        ocr_kwargs["ocr_version"] = ocr_version

    # Allow custom model overrides
    for key in (
        "text_detection_model_name",
        "text_detection_model_dir",
        "text_recognition_model_name",
        "text_recognition_model_dir",
        "text_det_limit_side_len",
        "text_rec_score_thresh",
    ):
        if key in config:
            ocr_kwargs[key] = config[key]

    # Suppress connectivity check if env var is not set
    if "PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK" not in os.environ:
        os.environ["PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK"] = "True"

    self.ocr = PaddleOCR(**ocr_kwargs)

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List of FlatOCRResult with normalized bounding boxes.

Source code in src/panoocr/engines/paddleocr.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult with normalized bounding boxes.
    """
    image_array = np.array(image)

    # Use the predict API (PaddleOCR 3.x)
    results_iter = self.ocr.predict(image_array)

    paddle_results = []
    for result in results_iter:
        # PaddleOCR 3.x returns OCRResult objects with:
        # - rec_texts: list of recognized text strings
        # - rec_scores: list of confidence scores
        # - dt_polys: list of detection polygons (numpy arrays)
        # Support both dict-style and attribute-style OCRResult access
        if hasattr(result, "get"):
            texts = result.get("rec_texts", [])
            scores = result.get("rec_scores", [])
            polys = result.get("dt_polys", [])
        else:
            texts = getattr(result, "rec_texts", [])
            scores = getattr(result, "rec_scores", [])
            polys = getattr(result, "dt_polys", [])

        for text, score, poly in zip(texts, scores, polys):
            # Skip empty or very low confidence results
            if not text or not text.strip():
                continue

            # Convert numpy array polygon to list of [x, y] points
            if hasattr(poly, "tolist"):
                bounding_box = poly.tolist()
            else:
                bounding_box = list(poly)

            paddle_results.append(
                PaddleOCRResult(
                    text=text,
                    confidence=float(score),
                    bounding_box=bounding_box,
                    image_width=image.width,
                    image_height=image.height,
                )
            )

    return [result.to_flat() for result in paddle_results]

GoogleVisionEngine#

Google Cloud Vision API (TEXT_DETECTION). Uses REST API with an API key. Requires the [google-vision] extra and GOOGLE_VISION_API_KEY environment variable.

pip install "panoocr[google-vision]"

GoogleVisionEngine #

GoogleVisionEngine(config: Dict[str, Any] | None = None)

OCR engine using Google Cloud Vision API (TEXT_DETECTION).

Uses the REST API with an API key. Set GOOGLE_VISION_API_KEY in environment or .env file, or pass via config.

Example

from panoocr.engines.google_vision import GoogleVisionEngine

engine = GoogleVisionEngine()
results = engine.recognize(image)

Note

Install with: pip install "panoocr[google-vision]" Requires GOOGLE_VISION_API_KEY environment variable.

Source code in src/panoocr/engines/google_vision.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    _check_google_vision_dependencies()

    config = config or {}

    self._load_dotenv()

    self.api_key = config.get(
        "api_key",
        os.environ.get("GOOGLE_VISION_API_KEY"),
    )
    if not self.api_key:
        raise ValueError(
            "Google Vision API key not found. Set GOOGLE_VISION_API_KEY "
            "environment variable or pass api_key in config."
        )

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image via Google Cloud Vision API.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List of FlatOCRResult with normalized bounding boxes.

Source code in src/panoocr/engines/google_vision.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image via Google Cloud Vision API.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult with normalized bounding boxes.
    """
    import requests

    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=90)
    image_b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

    payload = {
        "requests": [
            {
                "image": {"content": image_b64},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }

    response = requests.post(
        VISION_API_URL,
        params={"key": self.api_key},
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()

    annotations = (
        data.get("responses", [{}])[0].get("textAnnotations", [])
    )

    if len(annotations) <= 1:
        return []

    # annotations[0] is the full text block; [1:] are individual words
    results = []
    for ann in annotations[1:]:
        text = ann.get("description", "")
        vertices = ann.get("boundingPoly", {}).get("vertices", [])
        if not text.strip() or len(vertices) < 3:
            continue

        results.append(
            GoogleVisionResult(
                text=text,
                vertices=vertices,
                image_width=image.width,
                image_height=image.height,
            )
        )

    return [r.to_flat() for r in results]

Florence2OCREngine#

Microsoft's Florence-2 vision-language model via transformers + torch. Requires the [florence2] extra.

pip install "panoocr[florence2]"

Florence2OCREngine #

Florence2OCREngine(config: Dict[str, Any] | None = None)

OCR engine using Microsoft's Florence-2 model.

Florence-2 is a vision-language model that can perform OCR with region detection. It provides good accuracy across many languages and can detect text in various orientations.

Attributes:

Name Type Description
device

Device to run inference on (cuda, mps, or cpu).

model

The Florence-2 model.

processor

The Florence-2 processor.

Example

from panoocr.engines.florence2 import Florence2OCREngine

engine = Florence2OCREngine(config={
    "model_id": "microsoft/Florence-2-large",
})
results = engine.recognize(image)

Note

Install with: pip install "panoocr[florence2]" For GPU support, install PyTorch with CUDA.

Initialize the Florence-2 engine.

Parameters:

Name Type Description Default
config Dict[str, Any] | None

Configuration dictionary with optional keys:

- model_id: HuggingFace model ID (default: "microsoft/Florence-2-large").
- device: Device to use ("cuda", "mps", "cpu", or None for auto).

None

Raises:

Type Description
ImportError

If dependencies are not installed.

Source code in src/panoocr/engines/florence2.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    """Initialize the Florence-2 engine.

    Args:
        config: Configuration dictionary with optional keys:
            - model_id: HuggingFace model ID (default: "microsoft/Florence-2-large").
            - device: Device to use ("cuda", "mps", "cpu", or None for auto).

    Raises:
        ImportError: If dependencies are not installed.
    """
    # Check dependencies first
    _check_florence2_dependencies()

    import torch
    from transformers import AutoProcessor, AutoModelForCausalLM

    config = config or {}

    model_id = config.get("model_id", "microsoft/Florence-2-large")

    # Auto-detect device
    device = config.get("device")
    if device is None:
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
        else:
            device = "cpu"

    self.device = device

    # Select dtype based on device
    if torch.cuda.is_available() and device == "cuda":
        self.dtype = torch.float16
    else:
        self.dtype = torch.float32

    print(f"Loading Florence-2 model on {device}...")
    self.model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=self.dtype, trust_remote_code=True
    ).to(device)
    self.processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    print("Florence-2 model loaded successfully.")

    self.prompt = "<OCR_WITH_REGION>"

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List of FlatOCRResult with normalized bounding boxes.

Source code in src/panoocr/engines/florence2.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List of FlatOCRResult with normalized bounding boxes.
    """
    import torch

    inputs = self.processor(
        text=self.prompt, images=image, return_tensors="pt"
    ).to(self.device, self.dtype)

    generated_ids = self.model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
    )

    generated_text = self.processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]

    parsed_answer = self.processor.post_process_generation(
        generated_text,
        task="<OCR_WITH_REGION>",
        image_size=(image.width, image.height),
    )

    florence2_results = []
    try:
        ocr_data = parsed_answer.get("<OCR_WITH_REGION>", {})
        quad_boxes = ocr_data.get("quad_boxes", [])
        labels = ocr_data.get("labels", [])

        for quad_box, label in zip(quad_boxes, labels):
            # Clean up text
            label = label.replace("</s>", "").replace("<s>", "")

            # Convert quad_box [x1,y1,x2,y2,x3,y3,x4,y4] to corner points
            bounding_box = [
                [quad_box[0], quad_box[1]],
                [quad_box[2], quad_box[3]],
                [quad_box[4], quad_box[5]],
                [quad_box[6], quad_box[7]],
            ]

            florence2_results.append(
                Florence2OCRResult(
                    text=label,
                    bounding_box=bounding_box,
                    image_width=image.width,
                    image_height=image.height,
                )
            )
    except KeyError:
        print("Error parsing OCR results, returning empty list")

    # Clean up to prevent memory leak
    del inputs
    del generated_ids
    del generated_text
    del parsed_answer
    gc.collect()

    if str(self.device).startswith("cuda"):
        import torch
        torch.cuda.empty_cache()

    return [result.to_flat() for result in florence2_results]

Florence2MLXEngine#

Florence-2 via mlx-vlm with <OCR_WITH_REGION> structured output. The only VLM engine that returns per-word bounding boxes. Requires macOS Apple Silicon with the [mlx-vlm] extra, plus torch and torchvision.

pip install "panoocr[mlx-vlm]" torch torchvision

Florence2MLXEngine #

Florence2MLXEngine(config: Dict[str, Any] | None = None)

OCR engine using Florence-2 via mlx-vlm with <OCR_WITH_REGION>.

This is a structured engine -- it returns per-word bounding boxes.

Example

from panoocr.engines.florence2_mlx import Florence2MLXEngine

engine = Florence2MLXEngine()
results = engine.recognize(image)

Note

Install with: pip install "panoocr[mlx-vlm]" torch

Source code in src/panoocr/engines/florence2_mlx.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    _check_florence2_mlx_dependencies()
    _patch_florence2_config()

    from mlx_vlm import load

    config = config or {}
    model_id = config.get("model_id", "mlx-community/Florence-2-large-ft-8bit")
    self.max_tokens = config.get("max_tokens", 1024)

    print(f"Loading Florence-2 MLX model: {model_id}...")
    self.model, self.processor = load(model_id, trust_remote_code=False)
    print("Florence-2 MLX model loaded.")

recognize #

recognize(image: Image) -> List[FlatOCRResult]
Source code in src/panoocr/engines/florence2_mlx.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    from mlx_vlm import generate

    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        image.save(f, format="JPEG")
        tmp_path = f.name

    try:
        result = generate(
            self.model,
            self.processor,
            image=tmp_path,
            prompt="<OCR_WITH_REGION>",
            max_tokens=self.max_tokens,
            verbose=False,
        )
    finally:
        import os
        os.unlink(tmp_path)

    output_text = result.text if hasattr(result, "text") else str(result)
    parsed = _parse_ocr_with_region(output_text, image.width, image.height)

    results = []
    for label, coords in parsed:
        xs = [c[0] for c in coords]
        ys = [c[1] for c in coords]
        if max(xs) - min(xs) < 1 and max(ys) - min(ys) < 1:
            continue
        results.append(
            Florence2MLXResult(
                text=label,
                bounding_box=coords,
                image_width=image.width,
                image_height=image.height,
            )
        )

    return [r.to_flat() for r in results]

Unstructured Engines#

Unstructured engines return text without bounding boxes. Each detection gets a full-image bounding box for crop-level attribution in the panoocr pipeline.
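
The crop-level attribution described above can be sketched as follows. The FULL_IMAGE_BBOX value, the dict shape, and the helper name are assumptions for illustration only; panoocr's actual _FULL_IMAGE_BBOX constant and FlatOCRResult class are not reproduced here:

```python
# Assumed: a normalized box spanning the whole crop (0-1 range).
FULL_IMAGE_BBOX = (0.0, 0.0, 1.0, 1.0)


def attribute_lines(lines, engine_tag):
    """Pair each non-empty text line with the shared full-image box."""
    return [
        {"text": line, "bounding_box": FULL_IMAGE_BBOX, "engine": engine_tag}
        for line in (raw.strip() for raw in lines)
        if line
    ]


rows = attribute_lines(["EXIT", "", "Platform 2"], "GEMINI")
# Every detection maps back to the same crop, not to a word position.
```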

GeminiEngine#

Google Gemini API (Gemini 2.5 Flash / Pro). Requires the [gemini] extra and GOOGLE_GEMINI_API_KEY environment variable.

pip install "panoocr[gemini]"

GeminiEngine #

GeminiEngine(config: Dict[str, Any] | None = None)

OCR engine using Gemini API with spatial grounding.

Returns structured text with per-detection bounding boxes by default. Set config "use_bbox" to False for plain text mode (crop-level attribution).

Example

from panoocr.engines.gemini import GeminiEngine

engine = GeminiEngine(config={"model": "gemini-2.5-flash"})
results = engine.recognize(image)

Note

Install with: pip install "panoocr[gemini]" Requires GOOGLE_GEMINI_API_KEY environment variable.

Source code in src/panoocr/engines/gemini.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    _check_gemini_dependencies()

    from google import genai

    config = config or {}

    self._load_dotenv()

    api_key = config.get(
        "api_key",
        os.environ.get("GOOGLE_GEMINI_API_KEY",
            os.environ.get("GEMINI_API_KEY",
                os.environ.get("GOOGLE_API_KEY"))),
    )
    if not api_key:
        raise ValueError(
            "Gemini API key not found. Set GOOGLE_GEMINI_API_KEY, "
            "GEMINI_API_KEY, or GOOGLE_API_KEY environment variable, "
            "or pass api_key in config."
        )

    self.client = genai.Client(api_key=api_key)
    self.model_name = config.get("model", "gemini-2.5-pro")
    self.use_bbox = config.get("use_bbox", True)

    if self.use_bbox:
        self.prompt = config.get("prompt", BBOX_PROMPT)
    else:
        self.prompt = config.get("prompt", PLAIN_PROMPT)

    if "flash" in self.model_name:
        self._engine_tag = "GEMINI_2_5_FLASH"
    elif "pro" in self.model_name:
        self._engine_tag = "GEMINI_2_5_PRO"
    else:
        self._engine_tag = "GEMINI"

recognize #

recognize(image: Image) -> List[FlatOCRResult]
Source code in src/panoocr/engines/gemini.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    from google.genai import types

    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=90)
    image_bytes = buf.getvalue()

    response = self.client.models.generate_content(
        model=self.model_name,
        contents=[
            types.Content(
                parts=[
                    types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
                    types.Part.from_text(text=self.prompt),
                ]
            )
        ],
    )

    output_text = response.text or ""

    if self.use_bbox:
        return _parse_bbox_response(output_text, self._engine_tag)

    results = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("```") or line.startswith("---"):
            continue
        results.append(
            FlatOCRResult(
                text=line,
                confidence=1.0,
                bounding_box=_FULL_IMAGE_BBOX,
                engine=self._engine_tag,
            )
        )

    return results
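With use_bbox=False, the tail of recognize() above is just a line filter: keep non-empty lines, drop the markdown fences and horizontal rules that VLMs sometimes wrap around their answer. That step in isolation (split_plain_output is an illustrative helper; plain strings stand in for FlatOCRResult):

```python
def split_plain_output(output_text: str) -> list:
    # Same filtering as the plain-text branch of recognize(): skip blank
    # lines and lines that look like markdown fences or rules.
    lines = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("```") or line.startswith("---"):
            continue
        lines.append(line)
    return lines

raw = "```\nOPEN 24 HOURS\n\n---\nNO PARKING\n```"
print(split_plain_output(raw))  # ['OPEN 24 HOURS', 'NO PARKING']
```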

GlmOCREngine#

GLM-OCR (0.9B) via mlx-vlm. Document-focused VLM with limited effectiveness on scene text. Requires macOS Apple Silicon with the [mlx-vlm] extra.

pip install "panoocr[mlx-vlm]" torch

GlmOCREngine #

GlmOCREngine(config: Dict[str, Any] | None = None)

OCR engine using GLM-OCR (0.9B) via mlx-vlm.

This is an unstructured engine: it returns text without bounding boxes. Each text line gets a full-image bounding box for crop-level attribution.

Example

from panoocr.engines.glm_ocr import GlmOCREngine

engine = GlmOCREngine()
results = engine.recognize(image)

Note

Install with: pip install "panoocr[mlx-vlm]" torch

Source code in src/panoocr/engines/glm_ocr.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    _check_glm_ocr_dependencies()

    from mlx_vlm import load

    config = config or {}
    model_id = config.get("model_id", "mlx-community/GLM-OCR-4bit")
    self.max_tokens = config.get("max_tokens", 512)
    self.prompt = config.get("prompt", DEFAULT_PROMPT)

    print(f"Loading GLM-OCR model: {model_id}...")
    self.model, self.processor = load(model_id)
    print("GLM-OCR model loaded.")

recognize #

recognize(image: Image) -> List[FlatOCRResult]
Source code in src/panoocr/engines/glm_ocr.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    from mlx_vlm import generate

    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        image.save(f, format="JPEG")
        tmp_path = f.name

    try:
        result = generate(
            self.model,
            self.processor,
            image=tmp_path,
            prompt=self.prompt,
            max_tokens=self.max_tokens,
            verbose=False,
        )
    finally:
        import os
        os.unlink(tmp_path)

    output_text = result.text if hasattr(result, "text") else str(result)

    results = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Skip markdown artifacts from VLM output
        if line.startswith("```") or line.startswith("---"):
            continue
        results.append(
            FlatOCRResult(
                text=line,
                confidence=1.0,
                bounding_box=_FULL_IMAGE_BBOX,
                engine="GLM_OCR",
            )
        )

    return results
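recognize() round-trips the image through a temporary JPEG because mlx_vlm.generate is given a file path: delete=False keeps the file alive after the with block closes the handle, and the finally clause guarantees cleanup even if generation raises. The same lifecycle in isolation (run_on_temp_file and its callback are illustrative, not part of the library):

```python
import os
import tempfile

def run_on_temp_file(data: bytes, fn) -> str:
    # delete=False so the path stays valid after the handle closes;
    # the finally block removes the file even if fn raises.
    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        f.write(data)
        tmp_path = f.name
    try:
        return fn(tmp_path)
    finally:
        os.unlink(tmp_path)

print(run_on_temp_file(b"fake-jpeg-bytes", lambda p: "ok"))  # ok
```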

DotsOCREngine#

DOTS.OCR (2.9B) via mlx-vlm. Document layout parser with limited effectiveness on scene text. Requires macOS Apple Silicon with the [mlx-vlm] extra.

pip install "panoocr[mlx-vlm]" torch

DotsOCREngine #

DotsOCREngine(config: Dict[str, Any] | None = None)

OCR engine using DOTS.OCR (2.9B) via mlx-vlm.

This is an unstructured engine: it returns text without bounding boxes. Each text line gets a full-image bounding box for crop-level attribution.

Example

from panoocr.engines.dots_ocr import DotsOCREngine

engine = DotsOCREngine()
results = engine.recognize(image)

Note

Install with: pip install "panoocr[mlx-vlm]" torch

Source code in src/panoocr/engines/dots_ocr.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    _check_dots_ocr_dependencies()

    from mlx_vlm import load

    config = config or {}
    model_id = config.get("model_id", "mlx-community/dots.ocr-4bit")
    self.max_tokens = config.get("max_tokens", 512)
    self.prompt = config.get("prompt", DEFAULT_PROMPT)

    print(f"Loading DOTS.OCR model: {model_id}...")
    self.model, self.processor = load(model_id)
    print("DOTS.OCR model loaded.")

recognize #

recognize(image: Image) -> List[FlatOCRResult]
Source code in src/panoocr/engines/dots_ocr.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    from mlx_vlm import generate

    with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as f:
        image.save(f, format="JPEG")
        tmp_path = f.name

    try:
        result = generate(
            self.model,
            self.processor,
            image=tmp_path,
            prompt=self.prompt,
            max_tokens=self.max_tokens,
            verbose=False,
        )
    finally:
        import os
        os.unlink(tmp_path)

    output_text = result.text if hasattr(result, "text") else str(result)

    results = []
    for line in output_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("```") or line.startswith("---"):
            continue
        results.append(
            FlatOCRResult(
                text=line,
                confidence=1.0,
                bounding_box=_FULL_IMAGE_BBOX,
                engine="DOTS_OCR",
            )
        )

    return results

TrOCREngine (experimental)#

Microsoft's TrOCR transformer-based OCR. Requires the [trocr] extra.

pip install "panoocr[trocr]"

TrOCREngine #

TrOCREngine(config: Dict[str, Any] | None = None)

OCR engine using Microsoft's TrOCR model.

TrOCR is a transformer-based OCR model that excels at single-line text recognition. It does NOT provide bounding boxes: it reads the entire image as a single text line.

WARNING: This engine is experimental and may not work well for panorama OCR since it doesn't detect text regions. Consider using Florence2OCREngine or other engines that provide region detection.

Attributes:

Name Type Description
model

The TrOCR model.

processor

The TrOCR processor.

Example

from panoocr.engines.trocr import TrOCREngine, TrOCRModel

engine = TrOCREngine(config={
    "model": TrOCRModel.MICROSOFT_TROCR_LARGE_PRINTED,
})

# Note: returns a single result for the entire image
results = engine.recognize(cropped_text_image)

Note

Install with: pip install "panoocr[trocr]"
For GPU support, install PyTorch with CUDA.

Initialize the TrOCR engine.

Parameters:

Name Type Description Default
config Dict[str, Any] | None

Configuration dictionary with optional keys: - model: TrOCRModel enum value or model ID string. - device: Device to use ("cuda", "mps", "cpu", or None for auto).

None

Raises:

Type Description
ImportError

If dependencies are not installed.

ValueError

If configuration values are invalid.

Source code in src/panoocr/engines/trocr.py
def __init__(self, config: Dict[str, Any] | None = None) -> None:
    """Initialize the TrOCR engine.

    Args:
        config: Configuration dictionary with optional keys:
            - model: TrOCRModel enum value or model ID string.
            - device: Device to use ("cuda", "mps", "cpu", or None for auto).

    Raises:
        ImportError: If dependencies are not installed.
        ValueError: If configuration values are invalid.
    """
    # Check dependencies first
    _check_trocr_dependencies()

    import torch
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    config = config or {}

    # Parse model
    model = config.get("model", DEFAULT_MODEL)
    try:
        model_id = model.value if isinstance(model, TrOCRModel) else model
    except (KeyError, AttributeError):
        raise ValueError("Invalid model specified")

    # Auto-detect device
    device = config.get("device")
    if device is None:
        if torch.cuda.is_available():
            device = "cuda"
        elif torch.backends.mps.is_available():
            device = "mps"
        else:
            device = "cpu"

    self.device = device

    print(f"Loading TrOCR model {model_id} on {device}...")
    self.processor = TrOCRProcessor.from_pretrained(model_id)
    self.model = VisionEncoderDecoderModel.from_pretrained(model_id).to(device)
    print("TrOCR model loaded successfully.")

recognize #

recognize(image: Image) -> List[FlatOCRResult]

Recognize text in an image.

NOTE: TrOCR treats the entire image as a single text line and does not provide bounding boxes. This makes it unsuitable for most panorama OCR use cases. The result will have a bounding box covering the entire image.

Parameters:

Name Type Description Default
image Image

Input image as PIL Image.

required

Returns:

Type Description
List[FlatOCRResult]

List with a single FlatOCRResult covering the entire image, or an empty list if no text is recognized.

Source code in src/panoocr/engines/trocr.py
def recognize(self, image: Image.Image) -> List[FlatOCRResult]:
    """Recognize text in an image.

    NOTE: TrOCR treats the entire image as a single text line and does
    not provide bounding boxes. This makes it unsuitable for most panorama
    OCR use cases. The result will have a bounding box covering the entire
    image.

    Args:
        image: Input image as PIL Image.

    Returns:
        List with single FlatOCRResult covering the entire image, or empty
        list if no text is recognized.
    """
    import torch

    # Convert to RGB if needed
    if image.mode != "RGB":
        image = image.convert("RGB")

    pixel_values = self.processor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(self.device)

    with torch.no_grad():
        generated_ids = self.model.generate(pixel_values)

    generated_text = self.processor.batch_decode(
        generated_ids, skip_special_tokens=True
    )[0]

    # TrOCR doesn't provide bounding boxes - return full image bbox
    if generated_text.strip():
        return [
            FlatOCRResult(
                text=generated_text.strip(),
                confidence=1.0,  # TrOCR doesn't provide confidence
                bounding_box=BoundingBox(
                    left=0.0,
                    top=0.0,
                    right=1.0,
                    bottom=1.0,
                    width=1.0,
                    height=1.0,
                ),
                engine="TROCR",
            )
        ]

    return []
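The return contract of recognize() (a single full-image result, or an empty list when nothing was decoded) can be sketched without torch; wrap_single_line is an illustrative helper and plain dicts stand in for FlatOCRResult:

```python
# Full-image bounding box in the normalized 0-1 coordinates panoocr uses.
FULL_IMAGE_BBOX = {"left": 0.0, "top": 0.0, "right": 1.0, "bottom": 1.0,
                   "width": 1.0, "height": 1.0}

def wrap_single_line(generated_text: str) -> list:
    # Mirrors the tail of TrOCREngine.recognize: one result with a
    # full-image box, or [] when the decoder produced no text.
    text = generated_text.strip()
    if not text:
        return []
    return [{"text": text, "confidence": 1.0,
             "bounding_box": FULL_IMAGE_BBOX, "engine": "TROCR"}]

print(wrap_single_line("  STOP  ")[0]["text"])  # STOP
print(wrap_single_line("   "))                  # []
```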

Custom Engines#

Any class with a compatible recognize() method works:

from panoocr import PanoOCR, FlatOCRResult, BoundingBox
from PIL import Image

class MyEngine:
    def recognize(self, image: Image.Image) -> list[FlatOCRResult]:
        # Return list of FlatOCRResult with normalized bounding boxes (0-1)
        return [
            FlatOCRResult(
                text="Hello",
                confidence=0.95,
                bounding_box=BoundingBox(
                    left=0.1, top=0.2, right=0.4, bottom=0.3,
                    width=0.3, height=0.1
                ),
                engine="my_engine",
            )
        ]

pano = PanoOCR(engine=MyEngine())
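What makes this work is structural typing: OCREngine is a typing.Protocol, so compatibility is decided by method shape, not inheritance. A minimal, panoocr-independent illustration (SupportsRecognize here is a stand-in, not the library's actual protocol):

```python
from typing import Any, List, Protocol, runtime_checkable

@runtime_checkable
class SupportsRecognize(Protocol):
    # Stand-in for panoocr's OCREngine protocol.
    def recognize(self, image: Any) -> List[Any]: ...

class NullEngine:
    # No inheritance from the protocol: a matching method is enough.
    def recognize(self, image: Any) -> List[Any]:
        return []

print(isinstance(NullEngine(), SupportsRecognize))  # True
```

Note that runtime_checkable isinstance checks only that the method exists, not its signature; static checkers like mypy verify the full signature.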