Deduplication

Removes duplicate text detections when the same text appears in multiple perspective views.

Algorithm

OCR results are deduplicated using spatial overlap and text similarity:

  1. For each result pair, check both text similarity and region overlap
  2. If the texts are similar (normalized Levenshtein similarity) or overlapping, and the regions intersect by at least the ratio required for that kind of match, mark the pair as duplicates
  3. Keep the result with longer text, or higher confidence if equal length
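
The text-similarity half of the check can be sketched as follows. The engine's own helper (`_get_texts_similarity`) is not shown on this page, so this is only an assumption about what a "Levenshtein normalized similarity (0-1)" might look like; dividing by the longer string's length is one common convention, and the function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this convention, an OCR misread such as "0PEN" for "OPEN" (one substitution in four characters) scores 0.75.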

Adaptive Processing Strategy

PanoOCR automatically selects a deduplication strategy based on how the perspectives are arranged:

  • Sequential pairwise (fast): Used when perspectives form a simple horizontal ring (same pitch, sorted by yaw). Compares only adjacent perspective pairs plus wrap-around.
  • Incremental master list (thorough): Used for arbitrary perspective arrangements (multiple pitch levels, custom arrangements). Compares each result against all previous results.

This is handled automatically - no configuration needed.
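
To make the two cases concrete, here is a minimal sketch of how a horizontal ring might be recognized and which frame pairs the sequential strategy would compare. The `(yaw, pitch)` tuple representation and both function names are hypothetical; PanoOCR's actual selection code is not shown on this page.

```python
from typing import List, Tuple

# Hypothetical stand-in: each perspective is a (yaw_degrees, pitch_degrees) pair.
Perspective = Tuple[float, float]


def is_horizontal_ring(perspectives: List[Perspective]) -> bool:
    """True when all views share one pitch and are sorted by strictly increasing yaw."""
    if len(perspectives) < 2:
        return False
    pitches = {pitch for _, pitch in perspectives}
    yaws = [yaw for yaw, _ in perspectives]
    return len(pitches) == 1 and all(a < b for a, b in zip(yaws, yaws[1:]))


def adjacent_pairs(count: int) -> List[Tuple[int, int]]:
    """Index pairs of neighbouring views, plus the wrap-around pair (last, first)."""
    pairs = [(i, i + 1) for i in range(count - 1)]
    if count > 2:
        pairs.append((count - 1, 0))  # close the ring
    return pairs
```

For four views at yaw 0, 90, 180 and 270 with equal pitch, the sequential strategy would compare four frame pairs, whereas the incremental strategy compares every new result against everything kept so far.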

SphereOCRDuplicationDetectionEngine

SphereOCRDuplicationDetectionEngine(min_text_similarity: float = DEFAULT_MIN_TEXT_SIMILARITY, min_intersection_ratio_for_similar_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT, min_text_overlap: float = DEFAULT_MIN_TEXT_OVERLAP, min_intersection_ratio_for_overlapping_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT, min_intersection_ratio: float = DEFAULT_MIN_INTERSECTION_RATIO)

Engine for detecting and removing duplicate OCR results.

Duplicates are identified by comparing both spatial overlap and text similarity between results from overlapping perspective views.

Attributes:

Name | Type | Description
min_text_similarity | float | Minimum Levenshtein similarity threshold.
min_intersection_ratio_for_similar_text | float | Required overlap for similar texts.
min_text_overlap | float | Minimum overlap similarity threshold.
min_intersection_ratio_for_overlapping_text | float | Required overlap for overlapping texts.
min_intersection_ratio | float | Minimum intersection ratio to consider.

Initialize the deduplication engine.

Parameters:

Name | Type | Description | Default
min_text_similarity | float | Minimum Levenshtein normalized similarity (0-1). | DEFAULT_MIN_TEXT_SIMILARITY
min_intersection_ratio_for_similar_text | float | Minimum region overlap for text matches based on similarity. | DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT
min_text_overlap | float | Minimum overlap normalized similarity (0-1). | DEFAULT_MIN_TEXT_OVERLAP
min_intersection_ratio_for_overlapping_text | float | Minimum region overlap for text matches based on overlap. | DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT
min_intersection_ratio | float | Absolute minimum intersection ratio to consider any match. | DEFAULT_MIN_INTERSECTION_RATIO

Source code in src/panoocr/dedup/detection.py
def __init__(
    self,
    min_text_similarity: float = DEFAULT_MIN_TEXT_SIMILARITY,
    min_intersection_ratio_for_similar_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT,
    min_text_overlap: float = DEFAULT_MIN_TEXT_OVERLAP,
    min_intersection_ratio_for_overlapping_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT,
    min_intersection_ratio: float = DEFAULT_MIN_INTERSECTION_RATIO,
):
    """Initialize the deduplication engine.

    Args:
        min_text_similarity: Minimum Levenshtein normalized similarity (0-1).
        min_intersection_ratio_for_similar_text: Minimum region overlap for
            text matches based on similarity.
        min_text_overlap: Minimum overlap normalized similarity (0-1).
        min_intersection_ratio_for_overlapping_text: Minimum region overlap
            for text matches based on overlap.
        min_intersection_ratio: Absolute minimum intersection ratio to
            consider any match.
    """
    self.min_text_similarity = min_text_similarity
    self.min_intersection_ratio_for_similar_text = (
        min_intersection_ratio_for_similar_text
    )
    self.min_text_overlap = min_text_overlap
    self.min_intersection_ratio_for_overlapping_text = (
        min_intersection_ratio_for_overlapping_text
    )
    self.min_intersection_ratio = min_intersection_ratio

check_duplication

check_duplication(ocr_result_1: SphereOCRResult, ocr_result_2: SphereOCRResult) -> bool

Check if two OCR results are duplicates.

Parameters:

Name | Type | Description | Default
ocr_result_1 | SphereOCRResult | First OCR result. | required
ocr_result_2 | SphereOCRResult | Second OCR result. | required

Returns:

Type | Description
bool | True if the results are considered duplicates.

Source code in src/panoocr/dedup/detection.py
def check_duplication(
    self, ocr_result_1: SphereOCRResult, ocr_result_2: SphereOCRResult
) -> bool:
    """Check if two OCR results are duplicates.

    Args:
        ocr_result_1: First OCR result.
        ocr_result_2: Second OCR result.

    Returns:
        True if the results are considered duplicates.
    """
    text_similarity = self._get_texts_similarity(
        ocr_result_1.text, ocr_result_2.text
    )
    text_overlap = self._get_texts_overlap(ocr_result_1.text, ocr_result_2.text)

    # If texts are neither similar nor overlapping, not duplicates
    if (text_similarity < self.min_text_similarity) and (
        text_overlap < self.min_text_overlap
    ):
        return False

    # Check spatial intersection
    intersection = self._intersect_ocr_results(ocr_result_1, ocr_result_2)
    if intersection is None:
        return False

    if intersection.intersection_ratio < self.min_intersection_ratio:
        return False

    # Check if texts overlap and regions overlap sufficiently
    if (
        text_overlap >= self.min_text_overlap
        and intersection.intersection_ratio
        >= self.min_intersection_ratio_for_overlapping_text
    ):
        return True

    # Check if texts are similar and regions overlap sufficiently
    if (
        text_similarity >= self.min_text_similarity
        and intersection.intersection_ratio
        >= self.min_intersection_ratio_for_similar_text
    ):
        return True

    return False
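
The branch order above is easier to follow on precomputed scores. The sketch below mirrors it in a standalone function; the `Thresholds` class and its numeric values are illustrative assumptions only (the library's `DEFAULT_*` constants are not shown on this page), and the case where the regions do not intersect at all is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; NOT the library's DEFAULT_* constants.
    min_text_similarity: float = 0.5
    min_intersection_ratio_for_similar_text: float = 0.5
    min_text_overlap: float = 0.5
    min_intersection_ratio_for_overlapping_text: float = 0.15
    min_intersection_ratio: float = 0.1


def is_duplicate(similarity: float, overlap: float, ratio: float,
                 t: Thresholds = Thresholds()) -> bool:
    """Mirror the branch order of check_duplication on precomputed scores."""
    similar = similarity >= t.min_text_similarity
    overlapping = overlap >= t.min_text_overlap
    if not (similar or overlapping):
        return False  # text gate: neither similar nor overlapping
    if ratio < t.min_intersection_ratio:
        return False  # absolute spatial floor
    if overlapping and ratio >= t.min_intersection_ratio_for_overlapping_text:
        return True   # overlapping text with enough region overlap
    # Otherwise only the similarity path can still succeed.
    return similar and ratio >= t.min_intersection_ratio_for_similar_text
```

With these illustrative thresholds, `is_duplicate(0.9, 0.0, 0.3)` is false because similar text needs a region ratio of at least 0.5, while `is_duplicate(0.2, 0.8, 0.3)` is true because overlapping text only needs 0.15.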

remove_duplication_for_two_lists

remove_duplication_for_two_lists(ocr_results_0: List[SphereOCRResult], ocr_results_1: List[SphereOCRResult]) -> Tuple[List[SphereOCRResult], List[SphereOCRResult]]

Remove duplicates between two lists of OCR results.

When duplicates are found, keeps the result with longer text, or higher confidence if texts are equal length.

Parameters:

Name | Type | Description | Default
ocr_results_0 | List[SphereOCRResult] | First list of OCR results (modified in place). | required
ocr_results_1 | List[SphereOCRResult] | Second list of OCR results (modified in place). | required

Returns:

Type | Description
Tuple[List[SphereOCRResult], List[SphereOCRResult]] | Tuple of the two lists with duplicates removed.

Source code in src/panoocr/dedup/detection.py
def remove_duplication_for_two_lists(
    self,
    ocr_results_0: List[SphereOCRResult],
    ocr_results_1: List[SphereOCRResult],
) -> Tuple[List[SphereOCRResult], List[SphereOCRResult]]:
    """Remove duplicates between two lists of OCR results.

    When duplicates are found, keeps the result with longer text,
    or higher confidence if texts are equal length.

    Args:
        ocr_results_0: First list of OCR results (modified in place).
        ocr_results_1: Second list of OCR results (modified in place).

    Returns:
        Tuple of the two lists with duplicates removed.
    """
    # Find all duplicate pairs
    duplications = []
    for i, ocr_result_0 in enumerate(ocr_results_0):
        for j, ocr_result_1 in enumerate(ocr_results_1):
            if self.check_duplication(ocr_result_0, ocr_result_1):
                duplications.append((i, j))

    # Determine which to remove from each list
    indices_to_remove_from_0: List[int] = []
    indices_to_remove_from_1: List[int] = []

    for i, j in duplications:
        candidate_0 = ocr_results_0[i]
        candidate_1 = ocr_results_1[j]

        if len(candidate_0.text) == len(candidate_1.text):
            # Equal length: prefer higher confidence
            if candidate_0.confidence < candidate_1.confidence:
                indices_to_remove_from_0.append(i)
            else:
                indices_to_remove_from_1.append(j)
        elif len(candidate_0.text) > len(candidate_1.text):
            # Prefer longer text
            indices_to_remove_from_1.append(j)
        else:
            indices_to_remove_from_0.append(i)

    # Remove duplicates (in reverse order to preserve indices)
    indices_to_remove_from_0 = sorted(set(indices_to_remove_from_0), reverse=True)
    indices_to_remove_from_1 = sorted(set(indices_to_remove_from_1), reverse=True)

    for index in indices_to_remove_from_0:
        ocr_results_0.pop(index)

    for index in indices_to_remove_from_1:
        ocr_results_1.pop(index)

    return ocr_results_0, ocr_results_1
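
The keep rule applied to each duplicate pair can be isolated as a small function. `Result` below is a hypothetical stand-in for `SphereOCRResult` that carries only the two fields the rule reads, and `loser_side` is an illustrative name, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical stand-in for SphereOCRResult with only the fields used here.
    text: str
    confidence: float


def loser_side(left: Result, right: Result) -> int:
    """Return 0 if `left` should be dropped, 1 if `right` should be dropped."""
    if len(left.text) == len(right.text):
        # Equal length: drop the lower-confidence result; ties drop `right`.
        return 0 if left.confidence < right.confidence else 1
    # Different lengths: drop the shorter text regardless of confidence.
    return 1 if len(left.text) > len(right.text) else 0
```

Note that a longer text wins even when its confidence is lower. Because the method removes items from the lists it is given, callers that still need the originals can pass copies, for example `list(frame)`.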

deduplicate_frames

deduplicate_frames(frames: List[List[SphereOCRResult]]) -> List[SphereOCRResult]

Deduplicate OCR results across multiple frames using incremental merging.

Processes frames one by one, maintaining a master list. Each result from a new frame is compared against the entire master list. When duplicates are found, keeps the result with longer text or higher confidence.

This approach is slower than pairwise deduplication but handles arbitrary perspective arrangements correctly (e.g., multiple pitch levels, custom arrangements).

Parameters:

Name | Type | Description | Default
frames | List[List[SphereOCRResult]] | List of frames, each containing a list of OCR results. | required

Returns:

Type | Description
List[SphereOCRResult] | Deduplicated list of sphere OCR results.

Source code in src/panoocr/dedup/detection.py
def deduplicate_frames(
    self,
    frames: List[List[SphereOCRResult]],
) -> List[SphereOCRResult]:
    """Deduplicate OCR results across multiple frames using incremental merging.

    Processes frames one by one, maintaining a master list. Each result from
    a new frame is compared against the entire master list. When duplicates
    are found, keeps the result with longer text or higher confidence.

    This approach is slower than pairwise deduplication but handles arbitrary
    perspective arrangements correctly (e.g., multiple pitch levels, custom
    arrangements).

    Args:
        frames: List of frames, each containing a list of OCR results.

    Returns:
        Deduplicated list of sphere OCR results.
    """
    if not frames:
        return []

    # Start with the first frame as the master list
    master_list: List[SphereOCRResult] = list(frames[0])

    # Process each subsequent frame
    for frame_results in frames[1:]:
        for new_result in frame_results:
            # Check for overlaps with existing master list
            duplicate_indices = []

            for i, master_result in enumerate(master_list):
                if self.check_duplication(new_result, master_result):
                    duplicate_indices.append(i)

            if not duplicate_indices:
                # No duplicates found, add to master list
                master_list.append(new_result)
            else:
                # Found duplicates - select best result among all candidates
                candidates = [new_result] + [
                    master_list[i] for i in duplicate_indices
                ]

                # Find the best result
                best_result = candidates[0]
                for candidate in candidates[1:]:
                    best_result = self._select_best_result(best_result, candidate)

                # Remove all duplicates from master list (in reverse order)
                for i in sorted(duplicate_indices, reverse=True):
                    master_list.pop(i)

                # Add the best result
                master_list.append(best_result)

    return master_list
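
The same incremental pattern can be written generically, which makes it easy to try on plain strings. This is a sketch, not the library's code: `merge_frames`, `is_duplicate` and `pick_best` are hypothetical names, with the two callables standing in for `check_duplication` and `_select_best_result`.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def merge_frames(
    frames: List[List[T]],
    is_duplicate: Callable[[T, T], bool],
    pick_best: Callable[[T, T], T],
) -> List[T]:
    """Incrementally merge frames, collapsing each duplicate group to one item."""
    if not frames:
        return []
    master = list(frames[0])  # the first frame seeds the master list as-is
    for frame in frames[1:]:
        for item in frame:
            hits = [i for i, kept in enumerate(master) if is_duplicate(item, kept)]
            best = item
            for i in hits:
                best = pick_best(best, master[i])
            for i in reversed(hits):  # pop from the end so indices stay valid
                master.pop(i)
            master.append(best)
    return master
```

A quick run with toy rules (case-insensitive equality or substring containment counts as a duplicate, and the longer string wins):

```python
frames = [["EXIT", "CAFE"], ["EXIT ONLY", "BAR"]]
same = lambda a, b: a.lower() == b.lower() or a in b or b in a
longer = lambda a, b: a if len(a) >= len(b) else b
merged = merge_frames(frames, same, longer)  # "EXIT" is replaced by "EXIT ONLY"
```

One consequence of seeding the master list with the first frame is that results within that first frame are never compared with each other.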

Data Classes

RegionIntersection dataclass

RegionIntersection(region_1_area: float, region_2_area: float, intersection_area: float, intersection_ratio: float)

Result of intersecting two regions.

Attributes:

Name | Type | Description
region_1_area | float | Area of the first region.
region_2_area | float | Area of the second region.
intersection_area | float | Area of the intersection.
intersection_ratio | float | Intersection area divided by minimum region area.
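
The ratio definition can be checked with a few numbers. The helper below is a sketch of that one formula; the function name and the handling of zero-area regions are assumptions, since the library's `_intersect_ocr_results` is not shown on this page.

```python
def intersection_ratio(region_1_area: float, region_2_area: float,
                       intersection_area: float) -> float:
    """Intersection area divided by the smaller of the two region areas."""
    smaller = min(region_1_area, region_2_area)
    if smaller <= 0:
        return 0.0  # assumption: treat degenerate regions as non-overlapping
    return intersection_area / smaller
```

Dividing by the smaller area means a short detection fully contained in a longer one scores 1.0: with areas 10 and 100 and an intersection of 10, the ratio is 1.0, whereas intersection-over-union for the same regions would be only 0.1. This suits the case where one view captures a fragment of text that another view captures in full.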