Deduplication#
Removes duplicate text detections when the same text appears in multiple perspective views.
Algorithm#
OCR results are deduplicated using spatial overlap and text similarity:
- For each result pair, check both text similarity and region overlap
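- If the texts are sufficiently similar (normalized Levenshtein similarity) or sufficiently overlapping, and their regions intersect by at least the threshold for that kind of match, mark the pair as a duplicate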
- Keep the result with longer text, or higher confidence if equal length
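The decision rule above can be sketched as follows. This is a minimal illustration, not the library's implementation: the threshold names come from the constructor parameters documented below, but the numeric values are placeholders (the `DEFAULT_*` constants are not given here), the way the thresholds are combined is one plausible reading of the parameter descriptions, and the text-overlap score is passed in as a number because its definition is not stated.

```python
from dataclasses import dataclass


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to a 0-1 score; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values for illustration only, NOT the library defaults.
    min_text_similarity: float = 0.5
    min_intersection_ratio_for_similar_text: float = 0.5
    min_text_overlap: float = 0.5
    min_intersection_ratio_for_overlapping_text: float = 0.15
    min_intersection_ratio: float = 0.1


def is_duplicate(text_similarity: float, text_overlap: float,
                 intersection_ratio: float,
                 t: Thresholds = Thresholds()) -> bool:
    """Assumed combination: a global floor, then either match type may succeed."""
    if intersection_ratio < t.min_intersection_ratio:
        return False
    similar = (text_similarity >= t.min_text_similarity
               and intersection_ratio >= t.min_intersection_ratio_for_similar_text)
    overlapping = (text_overlap >= t.min_text_overlap
                   and intersection_ratio >= t.min_intersection_ratio_for_overlapping_text)
    return similar or overlapping
```

Keeping a separate region threshold per match type lets a strong text match tolerate a weaker spatial match, and the reverse.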
Adaptive Processing Strategy#
PanoOCR automatically selects the optimal deduplication strategy based on perspective arrangement:
- Sequential pairwise (fast): Used when perspectives form a simple horizontal ring (same pitch, sorted by yaw). Compares only adjacent perspective pairs plus wrap-around.
- Incremental master list (thorough): Used for arbitrary perspective arrangements (multiple pitch levels, custom arrangements). Compares each result against all previous results.
The strategy is chosen automatically; no configuration is needed.
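To make the two strategies concrete, the sketch below shows one way the choice could be made. It is an illustration under assumptions: perspectives are represented as hypothetical `(yaw, pitch)` tuples in degrees, and the function names are invented for this example rather than taken from PanoOCR.

```python
from typing import List, Tuple

# Hypothetical representation: (yaw_degrees, pitch_degrees).
Perspective = Tuple[float, float]


def is_simple_ring(perspectives: List[Perspective]) -> bool:
    """True when all perspectives share one pitch and are sorted by yaw."""
    if len(perspectives) < 2:
        return False
    yaws = [yaw for yaw, _ in perspectives]
    pitches = {pitch for _, pitch in perspectives}
    return len(pitches) == 1 and yaws == sorted(yaws)


def adjacent_pairs(count: int) -> List[Tuple[int, int]]:
    """Index pairs (i, i + 1) plus the wrap-around pair (count - 1, 0)."""
    pairs = [(i, i + 1) for i in range(count - 1)]
    if count > 2:  # with two views the wrap-around repeats the only pair
        pairs.append((count - 1, 0))
    return pairs


def choose_strategy(perspectives: List[Perspective]) -> str:
    """Return a label for the strategy this sketch would pick."""
    if is_simple_ring(perspectives):
        return "sequential_pairwise"
    return "incremental_master_list"
```

For a ring of n views the pairwise strategy compares n pairs of frames, while the master-list strategy compares every new result against everything kept so far, which is why it is described as the slower but more general option.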
SphereOCRDuplicationDetectionEngine#
SphereOCRDuplicationDetectionEngine(min_text_similarity: float = DEFAULT_MIN_TEXT_SIMILARITY, min_intersection_ratio_for_similar_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT, min_text_overlap: float = DEFAULT_MIN_TEXT_OVERLAP, min_intersection_ratio_for_overlapping_text: float = DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT, min_intersection_ratio: float = DEFAULT_MIN_INTERSECTION_RATIO)
Engine for detecting and removing duplicate OCR results.
Duplicates are identified by comparing both spatial overlap and text similarity between results from overlapping perspective views.
Attributes:
| Name | Type | Description |
|---|---|---|
| `min_text_similarity` | `float` | Minimum Levenshtein similarity threshold. |
| `min_intersection_ratio_for_similar_text` | `float` | Required overlap for similar texts. |
| `min_text_overlap` | `float` | Minimum overlap similarity threshold. |
| `min_intersection_ratio_for_overlapping_text` | `float` | Required overlap for overlapping texts. |
| `min_intersection_ratio` | `float` | Minimum intersection ratio to consider. |
Initialize the deduplication engine.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `min_text_similarity` | `float` | Minimum Levenshtein normalized similarity (0-1). | `DEFAULT_MIN_TEXT_SIMILARITY` |
| `min_intersection_ratio_for_similar_text` | `float` | Minimum region overlap for text matches based on similarity. | `DEFAULT_MIN_INTERSECTION_RATIO_FOR_SIMILAR_TEXT` |
| `min_text_overlap` | `float` | Minimum overlap normalized similarity (0-1). | `DEFAULT_MIN_TEXT_OVERLAP` |
| `min_intersection_ratio_for_overlapping_text` | `float` | Minimum region overlap for text matches based on overlap. | `DEFAULT_MIN_INTERSECTION_RATIO_FOR_OVERLAPPING_TEXT` |
| `min_intersection_ratio` | `float` | Absolute minimum intersection ratio to consider any match. | `DEFAULT_MIN_INTERSECTION_RATIO` |
Source code in src/panoocr/dedup/detection.py
check_duplication#
Check if two OCR results are duplicates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ocr_result_1` | `SphereOCRResult` | First OCR result. | required |
| `ocr_result_2` | `SphereOCRResult` | Second OCR result. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | `True` if the results are considered duplicates. |
remove_duplication_for_two_lists#
remove_duplication_for_two_lists(ocr_results_0: List[SphereOCRResult], ocr_results_1: List[SphereOCRResult]) -> Tuple[List[SphereOCRResult], List[SphereOCRResult]]
Remove duplicates between two lists of OCR results.
When duplicates are found, keeps the result with longer text, or higher confidence if texts are equal length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ocr_results_0` | `List[SphereOCRResult]` | First list of OCR results (modified in place). | required |
| `ocr_results_1` | `List[SphereOCRResult]` | Second list of OCR results (modified in place). | required |
Returns:
| Type | Description |
|---|---|
| `Tuple[List[SphereOCRResult], List[SphereOCRResult]]` | Tuple of the two lists with duplicates removed. |
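The keep rule for a pair of lists can be illustrated with a small, self-contained sketch. Everything here is a stand-in: `Result` is a hypothetical replacement for `SphereOCRResult` with only the two fields the rule needs, the duplicate test is supplied by the caller, ties in both length and confidence are resolved in favor of the first list (an assumption), and the sketch returns new lists instead of modifying its inputs in place.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Result:
    """Hypothetical stand-in for SphereOCRResult."""
    text: str
    confidence: float


def prefer_first(a: Result, b: Result) -> bool:
    """Longer text wins; on equal length, higher confidence wins."""
    if len(a.text) != len(b.text):
        return len(a.text) > len(b.text)
    return a.confidence >= b.confidence


def remove_duplicates_between(
    list_0: List[Result],
    list_1: List[Result],
    is_duplicate: Callable[[Result, Result], bool],
) -> Tuple[List[Result], List[Result]]:
    """Drop the weaker member of every duplicate pair across the two lists."""
    drop_0: Set[int] = set()
    drop_1: Set[int] = set()
    for i, a in enumerate(list_0):
        for j, b in enumerate(list_1):
            if i in drop_0 or j in drop_1:
                continue  # one side has already lost an earlier comparison
            if is_duplicate(a, b):
                if prefer_first(a, b):
                    drop_1.add(j)
                else:
                    drop_0.add(i)
    kept_0 = [r for i, r in enumerate(list_0) if i not in drop_0]
    kept_1 = [r for j, r in enumerate(list_1) if j not in drop_1]
    return kept_0, kept_1
```

Preferring the longer text favors the view in which the text was captured most completely, since a detection cut off at a perspective edge tends to be shorter.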
deduplicate_frames#
Deduplicate OCR results across multiple frames using incremental merging.
Processes frames one by one, maintaining a master list. Each result from a new frame is compared against the entire master list. When duplicates are found, keeps the result with the longer text, or the higher confidence if the texts are equal in length.
This approach is slower than pairwise deduplication but handles arbitrary perspective arrangements correctly (e.g., multiple pitch levels, custom arrangements).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `frames` | `List[List[SphereOCRResult]]` | List of frames, each containing a list of OCR results. | required |
Returns:
| Type | Description |
|---|---|
| `List[SphereOCRResult]` | Deduplicated list of sphere OCR results. |
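Incremental merging into a master list can be sketched as below. This is an illustration rather than the code in `detection.py`: `Result` is a hypothetical stand-in for `SphereOCRResult`, the duplicate test is passed in by the caller, and the sketch stops at the first matching master entry, which may differ from how the library treats a result that matches several entries.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    """Hypothetical stand-in for SphereOCRResult."""
    text: str
    confidence: float


def keep_existing(existing: Result, candidate: Result) -> bool:
    """Longer text wins; on equal length, higher confidence wins."""
    if len(existing.text) != len(candidate.text):
        return len(existing.text) > len(candidate.text)
    return existing.confidence >= candidate.confidence


def deduplicate_incrementally(
    frames: List[List[Result]],
    is_duplicate: Callable[[Result, Result], bool],
) -> List[Result]:
    """Merge frames one by one into a single master list."""
    master: List[Result] = []
    for frame in frames:
        for candidate in frame:
            for index, existing in enumerate(master):
                if is_duplicate(existing, candidate):
                    if not keep_existing(existing, candidate):
                        master[index] = candidate  # candidate is the better copy
                    break
            else:
                master.append(candidate)  # no duplicate found in the master list
    return master
```

Because every candidate is checked against the whole master list, the cost grows with the total number of kept results, but no assumption about which frames are adjacent is needed.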
Data Classes#
RegionIntersection (dataclass)#
RegionIntersection(region_1_area: float, region_2_area: float, intersection_area: float, intersection_ratio: float)
Result of intersecting two regions.
Attributes:
| Name | Type | Description |
|---|---|---|
| `region_1_area` | `float` | Area of the first region. |
| `region_2_area` | `float` | Area of the second region. |
| `intersection_area` | `float` | Area of the intersection. |
| `intersection_ratio` | `float` | Intersection area divided by minimum region area. |
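The ratio definition can be shown with a short worked example. The sketch below uses flat axis-aligned rectangles purely for illustration; how PanoOCR represents and intersects regions on the sphere is not described here, and `RegionIntersectionSketch` and `intersect_rectangles` are hypothetical names that only mirror the four documented fields.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical flat rectangle: (left, bottom, right, top).
Rect = Tuple[float, float, float, float]


@dataclass
class RegionIntersectionSketch:
    """Mirrors the documented fields; not the library class."""
    region_1_area: float
    region_2_area: float
    intersection_area: float
    intersection_ratio: float


def rect_area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def intersect_rectangles(r1: Rect, r2: Rect) -> RegionIntersectionSketch:
    """Intersection area divided by the smaller of the two region areas."""
    width = max(0.0, min(r1[2], r2[2]) - max(r1[0], r2[0]))
    height = max(0.0, min(r1[3], r2[3]) - max(r1[1], r2[1]))
    intersection = width * height
    area_1, area_2 = rect_area(r1), rect_area(r2)
    smaller = min(area_1, area_2)
    ratio = intersection / smaller if smaller > 0 else 0.0
    return RegionIntersectionSketch(area_1, area_2, intersection, ratio)
```

Dividing by the smaller area rather than the union means a short detection that lies entirely inside a longer one gets a ratio of 1.0, which suits the case where one view captured only part of the text.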