improving heuristic reading order #206
base: main
(also, simplify `run` and separate `run_single`)
extend horizontal separators to full image width if they do not overlap any other regions (this only affects the returned `splitter_y` result; the returned separators mask is unchanged)
regarding `splitter_y` result, for headings, instead of cutting right through them via center line, add their toplines and baselines as if they were horizontal separators
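A minimal sketch of the heading rule above, assuming headings and horizontal separators are available as plain `(y_min, y_max)` tuples. The function name `collect_splitter_y` and this data layout are hypothetical and only illustrate the idea; they are not the actual eynollah code.

```python
def collect_splitter_y(hor_seps, headings, height):
    """Collect the y positions at which a page may be split vertically.

    hor_seps, headings: lists of (y_min, y_max) tuples (hypothetical layout).
    """
    ys = {0, height}  # page top and bottom always delimit a part
    for y_min, y_max in hor_seps:
        # horizontal separators contribute their center line
        ys.add((y_min + y_max) // 2)
    for y_min, y_max in headings:
        # headings contribute topline and baseline instead of their center,
        # so that no split cuts right through the heading
        ys.add(y_min)
        ys.add(y_max)
    return sorted(ys)
```

With one separator at `(100, 104)` and one heading at `(200, 260)` on a page of height 1000, the splits surround the heading instead of crossing it.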
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
- when handling lines without mother, and biggest line already accounts for all columns, but some are too close to the top and therefore must be removed, avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
simplify and document
- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
when calculating `reading_order_type`, upper limit on column range (`x_end`) needs to be `+1` here as well
- array instead of list operations
- return array of index pairs instead of list objects
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
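For readers unfamiliar with the `pairwise()` idiom mentioned in the last item, here is a small sketch comparing it with index-based iteration. The helper names `iter_slices` and `iter_slices_indexed` are made up for illustration, and the fallback definition is only an assumption for interpreters older than Python 3.10.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback yielding consecutive overlapping pairs."""
        items = list(iterable)
        return zip(items, items[1:])


def iter_slices(bounds):
    """Yield (top, bot) for each pair of neighbouring boundaries."""
    for top, bot in pairwise(bounds):
        yield top, bot


def iter_slices_indexed(bounds):
    """Index-based equivalent that the pairwise form replaces."""
    for i in range(len(bounds) - 1):
        yield bounds[i], bounds[i + 1]
```

Both variants yield the same pairs, but the `pairwise()` form needs no index arithmetic and cannot run past the end of the sequence.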
when y slice (`top:bot`) is not a significant part of the page, viz. less than 22% (as in `find_number_of_columns_in_document`), avoid forcing `find_num_col` to reach `num_col_classifier` (allows large headers not to be split up and thus better ordered)
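The 22% rule could look roughly like the following sketch. The function names and the convention of returning `None` to mean "do not force a column count" are assumptions made for this example only.

```python
def is_significant_part(top, bot, page_height, min_fraction=0.22):
    """True if the y slice top:bot covers at least min_fraction of the page."""
    return (bot - top) >= min_fraction * page_height


def expected_columns(top, bot, page_height, num_col_classifier):
    """Number of columns to enforce for this slice, or None for no constraint.

    Returning None is a hypothetical convention: it stands for letting the
    column detector decide freely on small slices such as large headers.
    """
    if is_significant_part(top, bot, page_height):
        return num_col_classifier
    return None
```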
(by removing unnecessary conditional)
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
- drop connected components analysis to test overlaps between horizontal separators and (horizontal) neighbours (introduced in ab17a92)
- instead of converting headings to topline and baseline during `find_number_of_columns_in_document` (introduced in 9f1595d7), add them to the matrix unchanged, but mark as extra type (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already span multiple columns, check if they would overlap (horizontal) neighbours by looking at successively larger (left and right) intervals of columns (and pick the largest elongation which does not introduce any overlaps)
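The elongation described in the last item can be sketched as a greedy growth over column indices. This assumes a precomputed set `occupied` of columns in which some other region overlaps the separator's y band; that input and the name `elongate_span` are hypothetical, and the real code works on masks rather than sets.

```python
def elongate_span(col_start, col_end, occupied, num_cols):
    """Grow the half-open column span [col_start, col_end) left and right.

    occupied: set of column indices where another region overlaps the
    separator's y band (hypothetical precomputed input).
    Returns the largest contiguous span that introduces no overlap.
    """
    # try successively larger intervals to the left
    while col_start > 0 and (col_start - 1) not in occupied:
        col_start -= 1
    # try successively larger intervals to the right
    while col_end < num_cols and col_end not in occupied:
        col_end += 1
    return col_start, col_end
```

For a separator spanning columns 1 to 2 on a five-column page where only column 0 is occupied, `elongate_span(1, 3, {0}, 5)` keeps the left end and extends the right end to the page edge.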
Sorry for the force-push! I had accidentally rebased back to a2a06a8 (which now became cd35241). Still have not addressed the big TODO (which is coming shortly), but found some more useful changes along the way:
There are also lots of new plotting directives (commented out). I'll upload some explanatory images next.
(so we can be sure they do not fall through the "pixel cracks": bboxes are delimited by integers, and we do not want to assign contours between boxes)
- `lines` → `seps` (to distinguish from textlines)
- `text_regions_p_1_n` → `text_regions_p_d` (because all other deskewed variables are called like this)
- `pixel` → `label`
- rename `return_x_start_end_mothers_childs_and_type_of_reading_order`
→ `return_multicol_separators_x_start_end`, and drop all the analysis
pertaining to mother/child relationships and full-span separators,
also drop the separator unification rules;
instead of the latter, try to combine neighbouring separators more
generally: join column spans iff there is nothing in between
(which also necessitates passing the region mask), and keep only
one of every such redundant pair;
add the top (of each page part) as full-span separator up front,
and return separators already ordered by y
- `return_boxes_of_images_by_order_of_reading_new`:
- also pass regions with separators, so they do not have to be
reconstructed from the separator coordinates, and also contain
images and other non-text region types, when trying to elongate
separators to maximize their span (without introducing overlaps)
- determine connected components of the region mask, i.e. labels
and their respective bboxes, in order to
1. gain additional multi-column separators, if possible
2. avoid cutting through regions which do cross column boundaries
later on
- whenever adding a new bbox, first look up the label map to see if
there are any multi-column regions extending to the right of the
current column; if there are, then advance not just one column
to the right, but as many as necessary to avoid cutting through
these regions
- new core algorithm: iterate separators sorted by y and then column
by column, but whenever the next separator ends in the same column
as the current one or even further left, recurse (i.e. finish that
span first before continuing with the top iteration)
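To illustrate the "advance as many columns as necessary" step from the list above, here is a simplified sketch. It replaces the label map lookup with a plain list of multi-column region spans; the function name `next_column` and the tuple layout are assumptions for this example, not the actual implementation.

```python
def next_column(column, top, bot, multicol_regions):
    """Return the exclusive end column of a box starting at `column`.

    multicol_regions: list of (col_start, col_end, y_min, y_max) for regions
    spanning more than one column, with col_end exclusive (hypothetical
    layout standing in for the label map and its bboxes).
    """
    last = column + 1  # by default, advance by exactly one column
    changed = True
    while changed:
        changed = False
        for col_start, col_end, y_min, y_max in multicol_regions:
            overlaps_y = y_min < bot and y_max > top
            # the region straddles the prospective right edge of the box
            if overlaps_y and col_start < last < col_end:
                last = col_end  # widen the box instead of cutting the region
                changed = True
    return last
```

The loop repeats until no region straddles the right edge any more, so chains of overlapping multi-column regions are handled as well.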
- reduce `sigma` for smoothing of input to `find_peaks` (so we get deeper gaps between columns)
- allow column boundaries closer to the margins (50 instead of 100 or 200 px, 170 instead of 370 px)
- allow column boundaries closer to each other (300 instead of 400 px)
- add a secondary `grenze` criterion for depth of gap (relative to lowest minimum, if that is smaller than the old criterion relative to lowest maximum)
- for calls to `find_num_col` within parts of a page, do allow unbalanced column boundaries
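As a rough illustration of the margin and distance limits listed above, the following sketch filters candidate boundary positions. It is not the logic of `find_num_col`: the function name is hypothetical, only the 50 px and 300 px values are taken from the list, and keeping the left-most of two boundaries that are too close is an assumption.

```python
def filter_boundaries(candidates, width, margin=50, min_dist=300):
    """Keep candidate column boundaries that respect both limits.

    candidates: x positions of detected gaps (hypothetical input).
    """
    kept = []
    for x in sorted(candidates):
        if x < margin or x > width - margin:
            continue  # too close to the left or right page margin
        if kept and x - kept[-1] < min_dist:
            continue  # too close to the previously kept boundary (assumed rule)
        kept.append(x)
    return kept
```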
(because the latter does not preserve coordinates; it scales, even when resizing the image; this caused coordinate problems when matching deskewed contours)
- `do_order_of_regions`: simplify aggregating per-box orders for paragraphs and headings to overall order passed to `xml_reading_order`; no need for `order_and_id_of_texts`, no need to return `id_of_texts_tot`
- `do_order_of_regions_with_model`: no need to return `region_ids`
- writer: no need to pass `id_of_texts_tot` in `build_pagexml`
Done! Let me explain…
instead of tree without looking at the actual hierarchy (to prevent retrieving holes as separators)
when eroding the vertical separator mask (by slicing), avoid leaving 1px strips
- `x_width_smaller_than_acolumn_width` → `avg_col_width`
- `len_lines_bigger_than_x_width_smaller_than_acolumn_width` → `nseps_wider_than_than_avg_col_width`
- `img_in_hor` → `img_p_in_hor` (analogous to vertical)
- avoid unnecessary `fillPoly` (we already have the mask)
- do not merge hseps if vseps interfere
- remove old criterion (based on total length of hseps)
- create new criterion (no x overlap and x close to each other)
- rename identifiers:
  * `sum_dis` → `sum_xspan`
  * `diff_max_min_uniques` → `tot_xspan`
  * np.std / np.mean → `dev_xspan`
- remove rule cutting around the center of crossing seps (which is unnecessary and creates small isolated seps at the center, unrelated to the actual crossing points)
- create rule cutting hseps by vseps _prior_ to merging
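The new merge criterion could be approximated as below. This is only a sketch on x intervals: the name `can_merge`, the `max_gap` parameter and the representation of vertical separators as a list of x positions are assumptions, and the actual code operates on the separator mask.

```python
def can_merge(sep_a, sep_b, max_gap, vsep_xs=()):
    """Decide whether two horizontal separators at similar y may be merged.

    sep_a, sep_b: (x_min, x_max) intervals; max_gap: largest x distance
    still considered "close" (hypothetical parameter); vsep_xs: x positions
    of vertical separators crossing the same y band.
    """
    left, right = sorted([sep_a, sep_b])
    gap = right[0] - left[1]
    if gap < 0:
        return False  # the separators overlap in x
    if gap > max_gap:
        return False  # too far apart
    # a vertical separator in the gap interferes with the merge
    return not any(left[1] <= x <= right[0] for x in vsep_xs)
```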
(forgot to also flip `regions_with_separators` if right2left)
- when analysing regions spanning across columns, disregard tiny regions (smaller than half the median size)
- if a region spans across columns just by a tiny fraction, and therefore is not good enough for a multi-col separator, then it should also not be good enough for a multi-col box maker
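The median-based size filter from the first item can be sketched as follows, assuming region sizes are given as a plain list of areas. The function name and the choice of area as the size measure are assumptions.

```python
from statistics import median


def significant_regions(areas):
    """Return indices of regions at least half as large as the median."""
    if not areas:
        return []
    threshold = median(areas) / 2
    return [i for i, area in enumerate(areas) if area >= threshold]
```

For areas `[100, 120, 10, 300, 59]` the median is 100, so only the region of size 10 falls below the threshold of 50 and is disregarded.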
- when searching for multi-col box makers, pick the right-most allowable column, not the left-most
when searching for gaps between text regions, consider the vertical separator mask (if given): add the vertical sum of vertical separators to the peak scores (making column detection more robust if still slightly skewed or partially obscured by multi-column regions, but fg seps are present)
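A toy version of this scoring, using nested lists instead of numpy arrays and leaving out smoothing and `find_peaks`. The function name `gap_scores` and the way text coverage is turned into a score are assumptions; only the idea of adding the column sums of the vertical separator mask comes from the commit message.

```python
def gap_scores(text_mask, vsep_mask=None):
    """Column-wise score that is high where a column gap is likely.

    text_mask, vsep_mask: equally sized 2D lists of 0/1 (rows of pixels).
    """
    height = len(text_mask)
    width = len(text_mask[0])
    scores = []
    for x in range(width):
        text = sum(row[x] for row in text_mask)
        score = height - text  # fewer text pixels means a more likely gap
        if vsep_mask is not None:
            # a vertical separator in this column raises the score
            score += sum(row[x] for row in vsep_mask)
        scores.append(score)
    return scores
```

In the test data below, a region crossing the gap in the middle row weakens the plain score of column 2, and the separator pixels above and below it compensate for that.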
- `find_number_of_columns_in_document`: retain vertical separators
and pass to `find_num_col` for each vertical split
- `return_boxes_of_images_by_order_of_reading_new`: reconstruct
the vertical separators from the segmentation mask and the separator
bboxes; pass it on to `find_num_col` everywhere
- `return_boxes_of_images_by_order_of_reading_new`: no need to
try-catch `find_num_col` anymore
- `return_boxes_of_images_by_order_of_reading_new`: when a vertical
split has too few columns,
* do not raise but lower the threshold `multiplier` responsible for
allowing gaps as column boundaries
* do not pass the `num_col_classifier` (i.e. expected number of
resulting columns) of the entire page to the iterative
`find_num_col` for each existing column, but only the portion
of that span
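The "lower instead of raise the `multiplier`" behaviour might be pictured like this. Everything here is hypothetical: the callback `detect`, the default values, the halving step and the stopping rule are assumptions chosen for illustration, not the values used in `return_boxes_of_images_by_order_of_reading_new`.

```python
def find_columns_with_fallback(detect, expected, multiplier=7.0,
                               factor=0.5, min_multiplier=1.0):
    """Retry column detection with a lower threshold if too few columns.

    detect: callable taking the multiplier and returning a list of column
    boundaries (stand-in for a call to `find_num_col`).
    expected: number of columns expected within this span only.
    """
    while True:
        boundaries = detect(multiplier)
        enough = len(boundaries) + 1 >= expected
        if enough or multiplier * factor < min_multiplier:
            return boundaries, multiplier
        multiplier *= factor  # lower the threshold to admit weaker gaps
```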
when passing the text region mask, do not apply erosion only if there are more than 2 columns, but iff `not erosion_hurts` (consistent with `find_num_col`'s expectations and making it as easy to find the column gaps on 1 and 2-column pages as on multi-column pages)
after selecting the optimum angle on the original search range, narrow down the search to its vicinity with half the range (adding computational costs, but gaining precision)
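A sketch of such a two-stage angle search, assuming some scoring function `score(angle)` that is higher for better deskewing. The function name, the default range and the number of steps are made up for this example.

```python
def refine_angle(score, center=0.0, half_range=2.0, steps=9):
    """Pick the best angle on a coarse grid, then search again nearby."""

    def best_in(lo, hi):
        # evaluate `steps` evenly spaced angles between lo and hi
        angles = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
        return max(angles, key=score)

    coarse = best_in(center - half_range, center + half_range)
    # second pass: same number of steps on half the range around the optimum
    return best_in(coarse - half_range / 2, coarse + half_range / 2)
```

The second pass doubles the number of score evaluations but halves the grid spacing around the coarse optimum.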
@vahidrezanezhad thanks to your regression test, I was able to address remaining issues with another series of commits:
All of the regressions are gone, and I have not yet found any new ones (but I will search more intensively this time). Here are some diagnostic examples of the above:
← This shows how a missed out horizontal separator merge caused suboptimal RO
← This shows how in certain cases (despite more or less correct deskewing, but in this case still without erosion) the gaps are just a little too weak to meet the
@vahidrezanezhad the new recursive RO algorithm is described briefly here, and its implementation follows in just a few lines. Regarding the question of whether we should (as a general rule) A: "finish top mothers first" (as required by the
… or B: "recurse into next mothers first" (as illustrated by the
Here is the change needed to make my implementation behave effectively as yours did in that circumstance:
--- a/src/eynollah/utils/__init__.py
+++ b/src/eynollah/utils/__init__.py
@@ -1881,7 +1881,7 @@ def return_boxes_of_images_by_order_of_reading_new(
y_mid[nxt]])
# dbg_plt(boxes[-1], "recursive column %d:%d box [%d]" % (column, last, len(boxes)))
column = last
- if last == x_ending[nxt] and x_ending[nxt] <= x_ending[cur] and nxt in args:
+ if last == x_ending[nxt] and x_ending[nxt] <= x_ending[cur] and x_starting[nxt] >= x_starting[cur] and nxt in args:
# child – recur
# print("recur", nxt, y_mid[nxt], "%d:%d" % (x_starting[nxt], x_ending[nxt]))
args.remove(nxt)
And this is what it would then look like:
… instead of the current result:
So let's collect more data on that, so that we can perhaps make a better decision.
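To make the effect of the one-line change easier to see in isolation, here is the column span part of the condition as a standalone predicate. The names `x_starting`, `x_ending`, `cur` and `nxt` are taken from the diff; the function `is_child` and its `check_start` switch are hypothetical, and the other two terms of the condition (`last == x_ending[nxt]` and `nxt in args`) are deliberately left out.

```python
def is_child(x_starting, x_ending, cur, nxt, check_start=True):
    """Does separator `nxt` lie within the column span of separator `cur`?

    check_start=False mimics the condition before the change (only the
    right end is compared), check_start=True the condition after it.
    """
    ends_within = x_ending[nxt] <= x_ending[cur]
    if not check_start:
        return ends_within
    return ends_within and x_starting[nxt] >= x_starting[cur]
```

A next separator that starts further left than the current one counts as a child only without the added start check.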
WIP, starting off with regressions from 0.5.0 and old issues (`IndexError` etc.)
TODO:
- `return_boxes_of_images_by_order_of_reading_new` such that it becomes mildly recursive, in order to avoid cutting through regions: if (for some y slice) some columns have much higher peaks than others, then pick those first and search for new y splitters within the others