Fix Unicode mangling in clean_marc function
author Dan Scott <dscott@laurentian.ca>
Sun, 4 Mar 2012 07:41:11 +0000 (02:41 -0500)
committer Jason Stephenson <jstephenson@mvlc.org>
Wed, 7 Mar 2012 20:57:41 +0000 (15:57 -0500)
commit a278aba2c4bbcb6aeaeac0c6af6d851fd8ad6d76
tree 122dabbc2e719c9f3d9502c1fa1f0877a8c5f555
parent 6237b1fafc84b76b2aa393f09c8a7aaca38d6ac5

Calling s/\p{Cc}//go; before entityize() resulted in xFFFD entities
being returned for all of the uppercase diacritic characters, which in
turn caused the new unit test to fail (yay unit tests). I added a
corresponding unit test for entityize() to ensure that the problem
wasn't coming from that function. Switching the order of the \p{Cc}
regex and entityize() calls resolved the corruption in the unit test.
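
The commit message does not spell out why the ordering matters. One plausible
explanation is that \p{Cc} also matches the C1 control range (U+0080 to
U+009F), so running it against undecoded UTF-8 bytes removes the second byte
of characters such as "É" (0xC3 0x89), and the later decode then substitutes
U+FFFD. A minimal sketch of that effect follows. toy_entityize() is a
hypothetical stand-in, not the real entityize(), and both wrapper functions
are illustrative names only:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for entityize(): decode UTF-8 bytes, then turn every
# non-ASCII character into a numeric entity. Not the Evergreen implementation.
sub toy_entityize {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # malformed input becomes U+FFFD
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# Old order: strip control characters from the raw bytes, then entityize.
sub strip_then_entityize {
    my ($bytes) = @_;
    $bytes =~ s/\p{Cc}//g;                # also removes bytes 0x80-0x9F
    return toy_entityize($bytes);
}

# New order: entityize first, then strip control characters.
sub entityize_then_strip {
    my ($bytes) = @_;
    my $text = toy_entityize($bytes);
    $text =~ s/\p{Cc}//g;
    return $text;
}

my $e_acute = "\xC3\x89";                 # UTF-8 bytes for U+00C9 (É)
print strip_then_entityize($e_acute), "\n";   # &#xFFFD;
print entityize_then_strip($e_acute), "\n";   # &#xC9;
```

Under these assumptions, only the old ordering loses the character, which is
consistent with the failure the unit test caught.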

This suggests that Vandelay may be introducing significant corruption to
imported records and that backporting of this commit to the inline
Vandelay variants from previous releases may be warranted.

Signed-off-by: Dan Scott <dscott@laurentian.ca>
Signed-off-by: Jason Stephenson <jstephenson@mvlc.org>
Open-ILS/src/perlmods/lib/OpenILS/Utils/Normalize.pm
Open-ILS/src/perlmods/t/01-OpenILS-Application.t
Open-ILS/src/perlmods/t/14-OpenILS-Utils.t