Deduplication#

Removes duplicate text detections when the same text appears in multiple perspective views.

Algorithm#

OCR results are deduplicated using spatial overlap and text similarity:

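For each pair of results, compute both the text similarity and the region overlap
If the texts are similar (normalized Levenshtein similarity) or overlapping, and the regions intersect by at least the threshold for that kind of match, mark the pair as duplicates
Keep the result with the longer text, or the one with higher confidence if the texts are equal in length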
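The text-similarity part of the steps above can be sketched as follows. This is a minimal illustration, not the library's code: the private helper `_get_texts_similarity` is not shown on this page, so both function names below and the choice to normalize by the longer string's length are assumptions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to a 0-1 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

Under this normalization, identical strings score 1.0 and strings with no characters in common score 0.0, which matches the 0-1 range documented for `min_text_similarity`.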

Adaptive Processing Strategy#

PanoOCR selects the deduplication strategy automatically, based on how the perspectives are arranged:

Sequential pairwise (fast): Used when perspectives form a simple horizontal ring (same pitch, sorted by yaw). Compares only adjacent perspective pairs plus wrap-around.
Incremental master list (thorough): Used for arbitrary perspective arrangements (multiple pitch levels, custom arrangements). Compares each result against all previous results.

No configuration is needed.
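To make the two strategies concrete, here is a rough sketch of how a horizontal ring might be recognized and which frame pairs the sequential strategy would then compare. The selection code itself is not shown on this page, so `is_horizontal_ring`, `ring_pairs`, and the `(yaw, pitch)` tuple representation are hypothetical names chosen for illustration.

```python
from typing import List, Tuple


def is_horizontal_ring(views: List[Tuple[float, float]]) -> bool:
    """Return True if all views share one pitch and are sorted by yaw.

    Each view is a hypothetical (yaw_degrees, pitch_degrees) pair.
    """
    if len(views) < 2:
        return False
    yaws = [yaw for yaw, _ in views]
    pitches = {pitch for _, pitch in views}
    return len(pitches) == 1 and yaws == sorted(yaws)


def ring_pairs(count: int) -> List[Tuple[int, int]]:
    """Index pairs of adjacent views, plus the wrap-around pair."""
    if count < 2:
        return []
    pairs = [(i, i + 1) for i in range(count - 1)]
    if count > 2:
        # The last view overlaps the first one across the 360-degree seam.
        pairs.append((count - 1, 0))
    return pairs
```

For a ring of `n` views this yields `n` frame pairs, so the amount of work grows linearly with the number of perspectives, whereas the incremental strategy compares every new result against everything kept so far.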

SphereOCRDuplicationDetectionEngine#

SphereOCRDuplicationDetectionEngine(min_text_similarity: float = DEFAULT_MIN_TEXT_SIMILARITY, min_intersection_ratio_for_similar_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT, min_text_overlap: float = DEFAULT_MIN_TEXT_OVERLAP, min_intersection_ratio_for_overlapping_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT, min_intersection_ratio: float = DEFAULT_MIN_INTERSECTION_RATIO)

Engine for detecting and removing duplicate OCR results.

Duplicates are identified by comparing both spatial overlap and text similarity between results from overlapping perspective views.

Attributes:

Name	Type	Description
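`min_text_similarity`	`float`	Minimum normalized Levenshtein similarity (0-1) for two texts to count as similar.
`min_intersection_ratio_for_similar_text`	`float`	Region overlap required when the texts match by similarity.
`min_text_overlap`	`float`	Minimum normalized overlap score (0-1) for two texts to count as overlapping.
`min_intersection_ratio_for_overlapping_text`	`float`	Region overlap required when the texts match by overlap.
`min_intersection_ratio`	`float`	Absolute minimum intersection ratio below which no pair is considered a match.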

Initialize the deduplication engine.

Parameters:

Name	Type	Description	Default
`min_text_similarity`	`float`	Minimum Levenshtein normalized similarity (0-1).	`DEFAULT_MIN_TEXT_SIMILARITY`
`min_intersection_ratio_for_similar_text`	`float`	Minimum region overlap for text matches based on similarity.	`DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT`
`min_text_overlap`	`float`	Minimum overlap normalized similarity (0-1).	`DEFAULT_MIN_TEXT_OVERLAP`
`min_intersection_ratio_for_overlapping_text`	`float`	Minimum region overlap for text matches based on overlap.	`DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT`
`min_intersection_ratio`	`float`	Absolute minimum intersection ratio to consider any match.	`DEFAULT_MIN_INTERSECTION_RATIO`

Source code in src/panoocr/dedup/detection.py

def __init__(
    self,
    min_text_similarity: float = DEFAULT_MIN_TEXT_SIMILARITY,
    min_intersection_ratio_for_similar_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT,
    min_text_overlap: float = DEFAULT_MIN_TEXT_OVERLAP,
    min_intersection_ratio_for_overlapping_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT,
    min_intersection_ratio: float = DEFAULT_MIN_INTERSECTION_RATIO,
):
    """Initialize the deduplication engine.

    Args:
        min_text_similarity: Minimum Levenshtein normalized similarity (0-1).
        min_intersection_ratio_for_similar_text: Minimum region overlap for
            text matches based on similarity.
        min_text_overlap: Minimum overlap normalized similarity (0-1).
        min_intersection_ratio_for_overlapping_text: Minimum region overlap
            for text matches based on overlap.
        min_intersection_ratio: Absolute minimum intersection ratio to
            consider any match.
    """
    self.min_text_similarity = min_text_similarity
    self.min_intersection_ratio_for_similar_text = (
        min_intersection_ratio_for_similar_text
    )
    self.min_text_overlap = min_text_overlap
    self.min_intersection_ratio_for_overlapping_text = (
        min_intersection_ratio_for_overlapping_text
    )
    self.min_intersection_ratio = min_intersection_ratio

check_duplication #

check_duplication(ocr_result_1: SphereOCRResult, ocr_result_2: SphereOCRResult) -> bool

Check if two OCR results are duplicates.

Parameters:

Name	Type	Description	Default
`ocr_result_1`	`SphereOCRResult`	First OCR result.	required
`ocr_result_2`	`SphereOCRResult`	Second OCR result.	required

Returns:

Type	Description
`bool`	True if the results are considered duplicates.

Source code in src/panoocr/dedup/detection.py

def check_duplication(
    self, ocr_result_1: SphereOCRResult, ocr_result_2: SphereOCRResult
) -> bool:
    """Check if two OCR results are duplicates.

    Args:
        ocr_result_1: First OCR result.
        ocr_result_2: Second OCR result.

    Returns:
        True if the results are considered duplicates.
    """
    text_similarity = self._get_texts_similarity(
        ocr_result_1.text, ocr_result_2.text
    )
    text_overlap = self._get_texts_overlap(ocr_result_1.text, ocr_result_2.text)

    # If texts are neither similar nor overlapping, not duplicates
    if (text_similarity < self.min_text_similarity) and (
        text_overlap < self.min_text_overlap
    ):
        return False

    # Check spatial intersection
    intersection = self._intersect_ocr_results(ocr_result_1, ocr_result_2)
    if intersection is None:
        return False

    if intersection.intersection_ratio < self.min_intersection_ratio:
        return False

    # Check if texts overlap and regions overlap sufficiently
    if (
        text_overlap >= self.min_text_overlap
        and intersection.intersection_ratio
        >= self.min_intersection_ratio_for_overlapping_text
    ):
        return True

    # Check if texts are similar and regions overlap sufficiently
    if (
        text_similarity >= self.min_text_similarity
        and intersection.intersection_ratio
        >= self.min_intersection_ratio_for_similar_text
    ):
        return True

    return False
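The decision logic in `check_duplication` can be isolated from the OCR types and tried out on plain numbers. The sketch below mirrors the order of checks in the source listing; the function name `is_duplicate` is hypothetical, and the default threshold values are placeholders chosen for the example, since the actual `DEFAULT_*` constants are not shown on this page.

```python
from typing import Optional


def is_duplicate(
    text_similarity: float,
    text_overlap: float,
    intersection_ratio: Optional[float],  # None means the regions do not intersect
    min_text_similarity: float = 0.5,                          # placeholder value
    min_intersection_ratio_for_similar_text: float = 0.5,      # placeholder value
    min_text_overlap: float = 0.5,                             # placeholder value
    min_intersection_ratio_for_overlapping_text: float = 0.15, # placeholder value
    min_intersection_ratio: float = 0.1,                       # placeholder value
) -> bool:
    similar = text_similarity >= min_text_similarity
    overlapping = text_overlap >= min_text_overlap

    # Texts that are neither similar nor overlapping are never duplicates.
    if not (similar or overlapping):
        return False

    # The regions must intersect, and by at least the absolute minimum.
    if intersection_ratio is None or intersection_ratio < min_intersection_ratio:
        return False

    # Each kind of text match has its own region threshold.
    if overlapping and intersection_ratio >= min_intersection_ratio_for_overlapping_text:
        return True
    return similar and intersection_ratio >= min_intersection_ratio_for_similar_text
```

With these placeholder thresholds, `is_duplicate(0.9, 0.0, 0.3)` is rejected because similar texts need a region overlap of 0.5, while `is_duplicate(0.2, 0.8, 0.3)` is accepted because overlapping texts only need 0.15.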

remove_duplication_for_two_lists #

remove_duplication_for_two_lists(ocr_results_0: List[SphereOCRResult], ocr_results_1: List[SphereOCRResult]) -> Tuple[List[SphereOCRResult], List[SphereOCRResult]]

Remove duplicates between two lists of OCR results.

When duplicates are found, keeps the result with longer text, or higher confidence if texts are equal length.

Parameters:

Name	Type	Description	Default
`ocr_results_0`	`List[SphereOCRResult]`	First list of OCR results (modified in place).	required
`ocr_results_1`	`List[SphereOCRResult]`	Second list of OCR results (modified in place).	required

Returns:

Type	Description
`Tuple[List[SphereOCRResult], List[SphereOCRResult]]`	Tuple of the two lists with duplicates removed.

Source code in src/panoocr/dedup/detection.py

def remove_duplication_for_two_lists(
    self,
    ocr_results_0: List[SphereOCRResult],
    ocr_results_1: List[SphereOCRResult],
) -> Tuple[List[SphereOCRResult], List[SphereOCRResult]]:
    """Remove duplicates between two lists of OCR results.

    When duplicates are found, keeps the result with longer text,
    or higher confidence if texts are equal length.

    Args:
        ocr_results_0: First list of OCR results (modified in place).
        ocr_results_1: Second list of OCR results (modified in place).

    Returns:
        Tuple of the two lists with duplicates removed.
    """
    # Find all duplicate pairs
    duplications = []
    for i, ocr_result_0 in enumerate(ocr_results_0):
        for j, ocr_result_1 in enumerate(ocr_results_1):
            if self.check_duplication(ocr_result_0, ocr_result_1):
                duplications.append((i, j))

    # Determine which to remove from each list
    indices_to_remove_from_0: List[int] = []
    indices_to_remove_from_1: List[int] = []

    for i, j in duplications:
        candidate_0 = ocr_results_0[i]
        candidate_1 = ocr_results_1[j]

        if len(candidate_0.text) == len(candidate_1.text):
            # Equal length: prefer higher confidence
            if candidate_0.confidence < candidate_1.confidence:
                indices_to_remove_from_0.append(i)
            else:
                indices_to_remove_from_1.append(j)
        elif len(candidate_0.text) > len(candidate_1.text):
            # Prefer longer text
            indices_to_remove_from_1.append(j)
        else:
            indices_to_remove_from_0.append(i)

    # Remove duplicates (in reverse order to preserve indices)
    indices_to_remove_from_0 = sorted(set(indices_to_remove_from_0), reverse=True)
    indices_to_remove_from_1 = sorted(set(indices_to_remove_from_1), reverse=True)

    for index in indices_to_remove_from_0:
        ocr_results_0.pop(index)

    for index in indices_to_remove_from_1:
        ocr_results_1.pop(index)

    return ocr_results_0, ocr_results_1
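The tie-breaking rule used above is easy to check in isolation. In this sketch, `Detection` is a hypothetical stand-in for `SphereOCRResult` that carries only the two fields the rule reads, and `keep_first` is an illustrative name, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in with only the fields the rule needs."""
    text: str
    confidence: float


def keep_first(first: Detection, second: Detection) -> bool:
    """Return True if `first` survives, False if `second` survives."""
    if len(first.text) != len(second.text):
        # Longer text wins regardless of confidence.
        return len(first.text) > len(second.text)
    # Equal length: higher confidence wins; an exact tie keeps the first,
    # matching the `else` branch in the listing above.
    return first.confidence >= second.confidence
```

Because both input lists are modified in place, callers that still need the original results may want to pass copies, for example `list(results)`.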

deduplicate_frames #

deduplicate_frames(frames: List[List[SphereOCRResult]]) -> List[SphereOCRResult]

Deduplicate OCR results across multiple frames using incremental merging.

Processes frames one by one, maintaining a master list. Each result from a new frame is compared against the entire master list. When duplicates are found, keeps the result with longer text or higher confidence.

This approach is slower than pairwise deduplication but handles arbitrary perspective arrangements correctly (e.g., multiple pitch levels, custom arrangements).

Parameters:

Name	Type	Description	Default
`frames`	`List[List[SphereOCRResult]]`	List of frames, each containing a list of OCR results.	required

Returns:

Type	Description
`List[SphereOCRResult]`	Deduplicated list of sphere OCR results.

Source code in src/panoocr/dedup/detection.py

def deduplicate_frames(
    self,
    frames: List[List[SphereOCRResult]],
) -> List[SphereOCRResult]:
    """Deduplicate OCR results across multiple frames using incremental merging.

    Processes frames one by one, maintaining a master list. Each result from
    a new frame is compared against the entire master list. When duplicates
    are found, keeps the result with longer text or higher confidence.

    This approach is slower than pairwise deduplication but handles arbitrary
    perspective arrangements correctly (e.g., multiple pitch levels, custom
    arrangements).

    Args:
        frames: List of frames, each containing a list of OCR results.

    Returns:
        Deduplicated list of sphere OCR results.
    """
    if not frames:
        return []

    # Start with the first frame as the master list
    master_list: List[SphereOCRResult] = list(frames[0])

    # Process each subsequent frame
    for frame_results in frames[1:]:
        for new_result in frame_results:
            # Check for overlaps with existing master list
            duplicate_indices = []

            for i, master_result in enumerate(master_list):
                if self.check_duplication(new_result, master_result):
                    duplicate_indices.append(i)

            if not duplicate_indices:
                # No duplicates found, add to master list
                master_list.append(new_result)
            else:
                # Found duplicates - select best result among all candidates
                candidates = [new_result] + [
                    master_list[i] for i in duplicate_indices
                ]

                # Find the best result
                best_result = candidates[0]
                for candidate in candidates[1:]:
                    best_result = self._select_best_result(best_result, candidate)

                # Remove all duplicates from master list (in reverse order)
                for i in sorted(duplicate_indices, reverse=True):
                    master_list.pop(i)

                # Add the best result
                master_list.append(best_result)

    return master_list
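The incremental master-list pattern does not depend on the OCR types, so it can be sketched generically. Here `merge_frames`, `is_duplicate`, and `pick_best` are hypothetical names: the two callables stand in for `check_duplication` and the private `_select_best_result`, whose implementation is not shown on this page.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def merge_frames(
    frames: List[List[T]],
    is_duplicate: Callable[[T, T], bool],
    pick_best: Callable[[T, T], T],
) -> List[T]:
    if not frames:
        return []
    # The first frame seeds the master list without being compared to itself.
    master: List[T] = list(frames[0])
    for frame in frames[1:]:
        for item in frame:
            hits = [i for i, kept in enumerate(master) if is_duplicate(item, kept)]
            best = item
            for i in hits:
                best = pick_best(best, master[i])
            # Pop from the back so earlier indices stay valid.
            for i in reversed(hits):
                master.pop(i)
            master.append(best)
    return master


# Toy usage: strings are duplicates if one is a prefix of the other,
# and the longer string is kept.
frames = [["EXIT", "CAFE"], ["EXIT 12"], ["CAFE", "BANK"]]
merged = merge_frames(
    frames,
    is_duplicate=lambda a, b: a.startswith(b) or b.startswith(a),
    pick_best=lambda a, b: a if len(a) >= len(b) else b,
)
print(merged)  # ['EXIT 12', 'CAFE', 'BANK']
```

Each new item is checked against the whole master list, so the cost grows with the number of results already kept. That is the price of not needing to know which frames are adjacent.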

Data Classes#

RegionIntersection `dataclass` #

RegionIntersection(region_1_area: float, region_2_area: float, intersection_area: float, intersection_ratio: float)

Result of intersecting two regions.

Attributes:

Name	Type	Description
`region_1_area`	`float`	Area of the first region.
`region_2_area`	`float`	Area of the second region.
`intersection_area`	`float`	Area of the intersection.
`intersection_ratio`	`float`	Intersection area divided by minimum region area.
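As a worked illustration of the `intersection_ratio` definition, the sketch below computes the same four fields for flat, axis-aligned boxes. This is a simplifying assumption: the library intersects regions of spherical OCR results through the private `_intersect_ocr_results`, which is not shown here, and `Box`, `Intersection`, and `intersect` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Intersection:
    """Mirrors the documented fields of RegionIntersection."""
    region_1_area: float
    region_2_area: float
    intersection_area: float
    intersection_ratio: float


def intersect(box_1: Box, box_2: Box) -> Optional[Intersection]:
    width = min(box_1[2], box_2[2]) - max(box_1[0], box_2[0])
    height = min(box_1[3], box_2[3]) - max(box_1[1], box_2[1])
    if width <= 0 or height <= 0:
        return None  # the boxes do not overlap
    area_1 = (box_1[2] - box_1[0]) * (box_1[3] - box_1[1])
    area_2 = (box_2[2] - box_2[0]) * (box_2[3] - box_2[1])
    area = width * height
    # Divide by the smaller region, as in the documented definition.
    return Intersection(area_1, area_2, area, area / min(area_1, area_2))
```

Dividing by the smaller area rather than by the union means a short fragment that lies entirely inside a larger detection scores 1.0, which suits the case where one view sees only part of a sign.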
