improving heuristic reading order #206

bertsky · 2025-10-20T15:45:09Z

~~WIP~~, starting off with regressions from 0.5.0 and old issues (IndexError etc)

TODO:

modify return_boxes_of_images_by_order_of_reading_new such that it becomes mildly recursive, in order to avoid cutting through regions: if (for some y slice) some columns have much higher peaks than others, then pick those first and search for new y splitters within the others

(also, simplify `run` and separate `run_single`)

extend horizontal separators to full img width if they do not overlap any other regions (only as regards to returned `splitter_y` result, but without changing returned separators mask)

regarding `splitter_y` result, for headings, instead of cutting right through them via center line, add their toplines and baselines as if they were horizontal separators

- enumeration instead of indexing - array instead of list operations - add better plotting (but commented out)

- when handling lines without mother, and biggest line already accounts for all columns, but some are too close to the top and therefore must be removed, avoid invalidating `biggest` index, causing `IndexError` - remove try-catch (now unnecessary) - array instead of list operations

simplify and document - simplify - rename identifiers to make readable: - `y_sep` → `y_mid` (because the cy gets passed) - `y_diff` → `y_max` (because the ymax gets passed) - array instead of list operations - add docstring and in-line comments - return (zero-length) numpy array instead of empty list

when calculating `reading_order_type`, upper limit on column range (`x_end`) needs to be `+1` here as well

- array instead of list operations - return array of index pairs instead of list objects

- array instead of list operations - add better plotting (but commented out) - add more debug printing (but commented out) - add more inline comments for documentation - rename identifiers to make more readable: - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed) - `lines` → `seps` - `y_type_2` → `y_mid` - `y_diff_type_2` → `y_max` - `y_lines_by_order` → `y_mid_by_order` - `y_lines_without_mother` → `y_mid_without_mother` - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother` - `y_column` → `y_mid_column` - `y_column_nc` → `y_mid_column_nc` - `y_all_between_nm_wc` → `y_mid_between_nm_wc` - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator` - `y_in_cols` and `y_down` → `y_mid_next` - use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing

when y slice (`top:bot`) is not a significant part of the page, viz. less than 22% (as in `find_number_of_columns_in_document`), avoid forcing `find_num_col` to reach `num_col_classifier` (allows large headers not to be split up and thus better ordered)

(by removing unnecessary conditional)

- use array instead of list operations - rename identifiers: - `pixel` → `label` - `line` → `sep`

- drop connected components analysis to test overlaps between horizontal separators and (horizontal) neighbours (introduced in ab17a92) - instead of converting headings to topline and baseline during `find_number_of_columns_in_document` (introduced in 9f1595d7), add them to the matrix unchanged, but mark as extra type (besides horizontal and vertical separtors) - convert headings to toplines and baselines no earlier than in `return_boxes_of_images_by_order_of_reading_new` - for both headings and horizontal separators, if they already span multiple columns, check if they would overlap (horizontal) neighbours by looking at successively larger (left and right) intervals of columns (and pick the largest elongation which does not introduce any overlaps)

bertsky · 2025-10-25T12:01:35Z

Sorry for the force-push! I had accidentally rebased back to a2a06a8 (which now became cd35241).

Still have not addressed the big TODO (which is coming shortly), but found some more useful changes along the way:

b2a79cc was clearly a bug
66a0e55 is an improvement: find_num_col should not enforce the overall num_col_classifier result for every part of the page, but rather (as in find_number_of_columns_in_document) only for the big parts, so there would still be room for sections with headings stretching multiple columns
19b2c3f improves on the previous change to elongate horizontal separators if that does not introduce any overlap: now this is done later, when the column split is already determined, so one can also try partial spans; it also covers headings now, but single-column separators/headings are exempt from this now

There are also lots of new plotting directives (commented out). I'll upload some explanatory images next.

(so we can be sure they do not fall through the "pixel cracks": bboxes are delimited by integers, and we do not want to assign contours between boxes)

- `lines` → `seps` (to distinguish from textlines) - `text_regions_p_1_n` → `text_regions_p_d` (because all other deskewed variables are called like this) - `pixel` → `label`

- rename `return_x_start_end_mothers_childs_and_type_of_reading_order` → `return_multicol_separators_x_start_end`, and drop all the analysis pertaining to mother/child relationships and full-span separators, also drop the separator unification rules; instead of the latter, try to combine neighbouring separators more generally: join column spans iff there is nothing in between (which also necessitates passing the region mask), and keep only one of every such redundant pair; add the top (of each page part) as full-span separator up front, and return separators already ordered by y - `return_boxes_of_images_by_order_of_reading_new`: - also pass regions with separators, so they do not have to be reconstructed from the separator coordinates, and also contain images and other non-text region types, when trying to elongate separators to maximize their span (without introducing overlaps) - determine connected components of the region mask, i.e. labels and their respective bboxes, in order to 1. gain additional multi-column separators, if possible 2. avoid cutting through regions which do cross column boundaries later on - whenever adding a new bbox, first look up the label map to see if there are any multi-column regions extending to the right of the current column; if there are, then advance not just one column to the right, but as many as necessary to avoid cutting through these regions - new core algorithm: iterate separators sorted by y and then column by column, but whenever the next separator ends in the same column as the current one or even further left, recurse (i.e. finish that span first before continuing with the top iteration)

- reduce `sigma` for smoothing of input to `find_peaks` (so we get deeper gaps between columns) - allow column boundaries closer to the margins (50 instead of 100 or 200 px, 170 instead of 370 px) - allow column boundaries closer to each other (300 instead of 400 px) - add a secondary `grenze` criterion for depth of gap (relative to lowest minimum, if that is smaller than the old criterion relative to lowest maximum) - for calls to `find_num_col` within parts of a page, do allow unbalanced column boundaries

(because the latter does not preserve coordinates; it scales, even when resizing the image; this caused coordinate problems when matching deskewed contours)

- `do_order_of_regions`: simplify aggregating per-box orders for paragraphs and headings to overall order passed to `xml_reading_order`; no need for `order_and_id_of_texts`, no need to return `id_of_texts_tot` - `do_order_of_regions_with_model`: no need to return `region_ids` - writer: no need to pass `id_of_texts_tot` in `build_pagexml`

bertsky · 2025-11-15T16:11:54Z

Done! Let me explain…

1a76ce1 is needed because sometimes contours centers in float were exactly "between" bboxes in integer, e.g. cy=400.3 where one box ends on 400 and the next starts on 401, preventing a match
95f7608 is merely cosmetic
4abc2ff turned out to be a much bigger change than I initially thought; using recursion now makes the box/column ordering much simpler (and thus easier to manage going forward); I'll illstrate how it works below
4475183 is a mixed bag of adapting various heuristics that control splitting each vertical strip into columns – the old rules often did not fit well;
in particular, they were too rigorous in not allowing cuts near the left and right page margin or near each other; also, the minimum for gaps was defined solely based on the "thickness" of the text columns, but not on the "thinness" of the gaps, which (in the presence of noise regions) sometimes yielded suboptimal results
3c15c4f and 5a77800 is a follow-up to new attempt at #173 (valid polygons, faster deskewing, various fixes) #192, where the matching between original contours and deskewed contours had been improved;
however, it turned out that
- I had confused and with or within the negation (so it would not look for a match on both sides, but on either side)
- the deskewing method (rotation_not_90_func or rotation_image_new) implicitly created a scaled (and not just rotated and translated) coordinate space, which I was unable to reproduce with the coordinate calculations in shapely, so the matching algorithm would have a hard time, esp. when joining or splitting
72d059f is a simplification for the aggregation of local box orderings for the writer, but also fixes a subtle bug which had caused left-right inversion within boxes

bertsky · 2025-11-15T17:23:18Z

Examples:

original	final RO boxes	final RO

For reference, I put the old version's final RO boxes result of the first example in the top row. There, notice that many regions were cut right through, which in turn gave the contour-box-matching algorithm a hard time, and made it necessary to have workarounds (like "assign the left-most box").

instead of tree without looking at the actual hierarchy (to prevent retrieving holes as separators)

when eroding the vertical separator mask (by slicing), avoid leaving 1px strips

- `x_width_smaller_than_acolumn_width` → `avg_col_width` - `len_lines_bigger_than_x_width_smaller_than_acolumn_width` → `nseps_wider_than_than_avg_col_width` - `img_in_hor` → `img_p_in_hor` (analogous to vertical)

- avoid unnecessary `fillPoly` (we already have the mask) - do not merge hseps if vseps interfere - remove old criterion (based on total length of hseps) - create new criterion (no x overlap and x close to each other) - rename identifiers: * `sum_dis` → `sum_xspan` * `diff_max_min_uniques` → `tot_xspan` * np.std / np.mean → `dev_xspan` - remove rule cutting around the center of crossing seps (which is unnecessary and creates small isolated seps at the center, unrelated to the actual crossing points) - create rule cutting hseps by vseps _prior_ to merging

(forgot to also flip `regions_with_separators` if right2left)

- when analysing regions spanning across columns, disregard tiny regions (smaller than half the median size) - if a region spans across columns just by a tiny fraction, and therefore is not good enough for a multi-col separator, then it should also not be good enough for a multi-col box maker

- when searching for multi-col box makers, pick the right-most allowable column, not the left-most

when searching for gaps between text regions, consider the vertical separator mask (if given): add the vertical sum of vertical separators to the peak scores (making column detection more robust if still slighly skewed or partially obscured by multi-column regions, but fg seps are present)

- `find_number_of_columns_in_document`: retain vertical separators and pass to `find_num_col` for each vertical split - `return_boxes_of_images_by_order_of_reading_new`: reconstruct the vertical separators from the segmentation mask and the separator bboxes; pass it on to `find_num_col` everywhere - `return_boxes_of_images_by_order_of_reading_new`: no need to try-catch `find_num_col` anymore - `return_boxes_of_images_by_order_of_reading_new`: when a vertical split has too few columns, * do not raise but lower the threshold `multiplier` responsible for allowing gaps as column boundaries * do not pass the `num_col_classifier` (i.e. expected number of resulting columns) of the entire page to the iterative `find_num_col` for each existing column, but only the portion of that span

when passing the text region mask, do not apply erosion only if there are more than 2 columns, but iff `not erosion_hurts` (consistent with `find_num_col`'s expectations and making it as easy to find the column gaps on 1 and 2-column pages as on multi-column pages)

after selecting the optimum angle on the original search range, narrow down around in the vicinity with half the range (adding computational costs, but gaining precision)

bertsky · 2025-11-28T18:04:25Z

@vahidrezanezhad thanks to your regression test, I was able to address remaining issues with another series of commits:

ee59a68 fixes a severe typo I made in 0.6 – me bad!
38d9167 just now surfaced as problematic: it allowed holes instead of exteriors to serve as separator candidates (I noticed uncritical use of cv2.RETR_TREE throughout the code base, but have not checked elsewhere if this is actually problematic)
06cb9d1 likewise created spurious separators (preventing merges of hseps)
5c12b6a is only cosmetic
a527d7a fixes various issues with the function responsible for merging horizontal separators (I never looked at this before)
b71bb80 is only relevent for right2left
5abf0c1 improves on the robustness of the new facility of multi-column region contours introduced in bertsky@4abc2ff
84d1096 fixes the criterion for making boxes span multiple columns (min→max, as originally intended)
4dd40c5 allows vertical separators to help finding columns (in addition to the other 2 criteria), increasing robustness
5a3de3b makes use of this in both callers; it also fixes 2 previously undetected bugs for the case when a part of the page has too few columns
adcea47 this is a rather radical change, but seems compelling in hindsight: IMHO it does not make sense to apply erosion of the text mask only for pages with 3 or more columns (because 2-column pages can be very tight as well, and the erosion_hurts / multiplier heuristics of the downward functions already expect a certain "spaciousness")
56e73bf this also a big change – we talked about it already, I find it very helpful (sacrificing a few percent in speed for gaining some in precision) for finding optimal column separators and reading order – in addition to all of the above

All of the regressions are gone, and I have not yet found any new ones (but I will search more intensively this time). Here are some diagnostic example of the above:

← This shows how a missed out horizontal separator merge caused suboptimal RO

← This shows how in certain cases (despite more or less correct deskewing, but in this case still without erosion) the gaps are just a little to weak to meet the grenze criterion. Hence the idea to utilise the foreground separators, which will make these gaps pass →

… other places)

…as in all other places)

bertsky · 2025-12-04T23:00:52Z

@vahidrezanezhad the new recursive RO algorithm is described briefly here, and its implementation follows in just a few lines.

Regarding the question of whether we should (as a general rule) A: "finish top mothers first" (as required by the anno_1.jpg example) …

… or B: "recurse into next mothers first" (as illustrated by the 4310063.jpg example) …

Here is the change needed to make my implementation behave effectively as yours did in that circumstance:

--- a/src/eynollah/utils/__init__.py
+++ b/src/eynollah/utils/__init__.py
@@ -1881,7 +1881,7 @@ def return_boxes_of_images_by_order_of_reading_new(
                                       y_mid[nxt]])
                         # dbg_plt(boxes[-1], "recursive column %d:%d box [%d]" % (column, last, len(boxes)))
                         column = last
-                        if last == x_ending[nxt] and x_ending[nxt] <= x_ending[cur] and nxt in args:
+                        if last == x_ending[nxt] and x_ending[nxt] <= x_ending[cur] and x_starting[nxt] >= x_starting[cur] and nxt in args:
                             # child – recur
                             # print("recur", nxt, y_mid[nxt], "%d:%d" % (x_starting[nxt], x_ending[nxt]))
                             args.remove(nxt)

And this is what it would then looks like:

… instead of the current result:

So let's collect more data on that, so perhaps we can make a better decision.

Robert Sachunsky added 21 commits October 20, 2025 17:40

binarization: add option --overwrite, skip existing outputs

086c188

(also, simplify `run` and separate `run_single`)

find_num_cols: re-sort peaks when cutting n-best num_col_classifier

184927f

find_num_col: simplify, add better plotting (but commented out)

48761c3

order_of_regions: filter out-of-image peaks

c43a825

order_of_regions: add better plotting (but commented out)

d3d599b

find_number_of_columns_in_document: simplify, rename line→seps

542d38a

find_number_of_columns_in_document: improve splitter rule

5a0e4c3

extend horizontal separators to full img width if they do not overlap any other regions (only as regards to returned `splitter_y` result, but without changing returned separators mask)

find_number_of_columns_in_document: split headings at top+baseline

cd35241

regarding `splitter_y` result, for headings, instead of cutting right through them via center line, add their toplines and baselines as if they were horizontal separators

return_boxes_of_images_by_order_of_reading_new: simplify

7c3e418

- enumeration instead of indexing - array instead of list operations - add better plotting (but commented out)

return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1

b2a79cc

when calculating `reading_order_type`, upper limit on column range (`x_end`) needs to be `+1` here as well

find_number_of_columns_in_document: simplify

acee4c1

contours_in_same_horizon: simplify

5d15941

- array instead of list operations - return array of index pairs instead of list objects

find_num_col: add better plotting (but commented out)

6cc5900

return_boxes_of_images_by_order_of_reading_new: indent

3ebbc2d

(by removing unnecessary conditional)

delete_separator_around: simplify, eynollah: identifiers

a2a9fe5

- use array instead of list operations - rename identifiers: - `pixel` → `label` - `line` → `sep`

return_boxes_of_images_by_order_of_reading_new: change arg order

3367462

bertsky force-pushed the ro-fixes branch from f45df87 to 19b2c3f Compare October 25, 2025 11:36

kba mentioned this pull request Oct 30, 2025

Documentation updates #209

Open

Robert Sachunsky added 6 commits November 14, 2025 13:08

do_order_of_regions: round contour centers

1a76ce1

(so we can be sure they do not fall through the "pixel cracks": bboxes are delimited by integers, and we do not want to assign contours between boxes)

rename some more identifiers:

95f7608

- `lines` → `seps` (to distinguish from textlines) - `text_regions_p_1_n` → `text_regions_p_d` (because all other deskewed variables are called like this) - `pixel` → `label`

back to rotate_image instead of rotation_image_new for deskewing

3c15c4f

(because the latter does not preserve coordinates; it scales, even when resizing the image; this caused coordinate problems when matching deskewed contours)

contour matching for deskewed image: ensure matches for both sides

5a77800

Robert Sachunsky added 2 commits November 15, 2025 14:34

fix typos found by ruff

49ab269

bertsky marked this pull request as ready for review November 15, 2025 16:12

adapt ocrd-sbb-binarize

028ed16

Robert Sachunsky added 14 commits November 16, 2025 16:32

fixup 72d059f: forgot to update other writer calls

406288b

ensure separators stay within image bounds

e428e7a

contours_in_same_horizon: fix 5d15941

ee59a68

combine_hor_lines_and_delete_cross_points: get external contours

38d9167

instead of tree without looking at the actual hierarchy (to prevent retrieving holes as separators)

combine_hor_lines_and_delete_cross_points: fix 1-off px bug

06cb9d1

when eroding the vertical separator mask (by slicing), avoid leaving 1px strips

combine_hor_lines_and_delete_cross_points: simplify and rename

5c12b6a

- `x_width_smaller_than_acolumn_width` → `avg_col_width` - `len_lines_bigger_than_x_width_smaller_than_acolumn_width` → `nseps_wider_than_than_avg_col_width` - `img_in_hor` → `img_p_in_hor` (analogous to vertical)

return_boxes_of_images_by_order_of_reading_new: fix 4abc2ff

b71bb80

(forgot to also flip `regions_with_separators` if right2left)

return_boxes_of_images_by_order_of_reading_new: improve

84d1096

- when searching for multi-col box makers, pick the right-most allowable column, not the left-most

deskewing: add a 2nd stage for precision

56e73bf

after selecting the optimum angle on the original search range, narrow down around in the vicinity with half the range (adding computational costs, but gaining precision)

bertsky mentioned this pull request Dec 2, 2025

Reduce complexity #210

Open

9 tasks

Robert Sachunsky added 3 commits December 2, 2025 16:35

writer: also ensure validity after scaling

43a9584

separate_lines/_vertical: gen cv2-like contours (w/ ndim=3, as in all…

ad8f816

… other places)

utils_ocr.return_textline_contour: gen cv2-like contours (w/ ndim=3, …

9fdae72

…as in all other places)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

improving heuristic reading order #206

improving heuristic reading order #206

Uh oh!

bertsky commented Oct 20, 2025 •

edited

Loading

Uh oh!

bertsky commented Oct 25, 2025

Uh oh!

bertsky commented Nov 15, 2025 •

edited

Loading

Uh oh!

bertsky commented Nov 15, 2025

Uh oh!

bertsky commented Nov 28, 2025

Uh oh!

bertsky commented Dec 4, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

improving heuristic reading order #206

Are you sure you want to change the base?

improving heuristic reading order #206

Uh oh!

Conversation

bertsky commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bertsky commented Oct 25, 2025

Uh oh!

bertsky commented Nov 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bertsky commented Nov 15, 2025

Uh oh!

bertsky commented Nov 28, 2025

Uh oh!

bertsky commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bertsky commented Oct 20, 2025 •

edited

Loading

bertsky commented Nov 15, 2025 •

edited

Loading

bertsky commented Dec 4, 2025 •

edited

Loading