LP#1947173: Speed up the symspell part of ingest

For certain data, and certain data set sizes, merging the suggestion
arrays used by the symspell algorithm is noticeably expensive. This is
the case for suggestion arrays containing many thousands of entries.
Such suggestion sets are not only slow to merge but also generally not
useful. We avoid creating such overly long suggestion sets by applying
several word filters that take advantage of our knowledge of the
incoming data to optimize for what is useful in a bibliographic
context. The mechanisms employed by this patch are:
- Omit suggestions longer than the maximum prefix key length when the
  prefix key length is less than or equal to the maximum prefix key
  length minus the maximum edit distance.
- Omit words that contain a run of 5 or more digits. This drops most
  identifiers from the dictionary while still allowing suggestions for
  year values.
- Omit empty keys from the dictionary. This should have been the case
  already but is now enforced directly.
- Add a small speedup to evergreen.text_array_merge_unique() by making
  it assume that the arrays passed to it contain no null values, which
  we intentionally avoid and protect against in other ways in this
  commit.
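
As a rough illustration only, the digit-run filter, the empty-key
filter, and the null-free merge described above can be sketched in
Python. This is not the code in this patch, which lives in database
functions. The names keep_word() and merge_unique() are hypothetical,
and preserving input order in the merge is a choice of this sketch
rather than a documented property of
evergreen.text_array_merge_unique(). The length-based filter is left
out because its exact condition depends on details not spelled out
here.

```python
import re

# Matches any run of 5 or more ASCII digits, e.g. most identifiers.
# A 4-digit year such as "1984" does not match.
_LONG_DIGIT_RUN = re.compile(r"[0-9]{5,}")


def keep_word(word):
    """Return True if a word should get a dictionary entry (hypothetical helper)."""
    if not word:
        # Empty keys are never stored.
        return False
    if _LONG_DIGIT_RUN.search(word):
        # Words containing a long digit run are treated as identifiers.
        return False
    return True


def merge_unique(a, b):
    """Merge two lists into one list of unique values (hypothetical helper).

    Assumes neither list contains None, mirroring the no-nulls
    assumption described above, so no null check is performed.
    """
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return list(dict.fromkeys(a + b))


if __name__ == "__main__":
    words = ["1984", "9780306406157", "", "tolkien"]
    print([w for w in words if keep_word(w)])
    print(merge_unique(["hobbit", "habit"], ["habit", "hobbits"]))
```

Filtering words before any suggestion keys are generated means the
long arrays are never built, rather than being trimmed after the fact.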
Besides improving reingest speed, this patch will also make the
search.symspell_dictionary table significantly smaller.
Signed-off-by: Mike Rylander <mrylander@gmail.com>
Signed-off-by: Galen Charlton <gmc@equinoxOLI.org>