Goals

Glossary

Implementation steps

  1. Enforce declaration of encoding with charset corpus property in registry files
  2. Upgrade regular expression functions in cl/regopt.c to a multi-charset implementation
  3. All low-level CL functions must validate MBCs in UTF-8 input strings to ensure well-defined behaviour
  4. Extend code in cl/special_chars.c to support multiple character sets
  5. Revise escape codes for non-ASCII characters (cl_string_latex2iso())
  6. Unicode string normalisation
  7. CQP commands sort, count and tabulate should work out of the box, since they rely on CL string folding
  8. If CQP defines built-in functions for string processing (cqp/builtins.c), they must also be adapted to multiple character sets
  9. Proper handling of fixed-character context in KWIC output (cat) will require a major rewrite
  10. If a "heavy" Unicode support library (e.g. ICU) is used, a local installation has to be included in the binary distribution
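
As an illustration of step 3, the following is a minimal sketch of a UTF-8 well-formedness check in plain C. The function name utf8_is_valid() is hypothetical and not part of the CL API; the sketch only assumes NUL-terminated input and rejects truncated sequences, stray continuation bytes, overlong encodings, UTF-16 surrogates and code points above U+10FFFF.

```c
/* Hypothetical helper (not an existing CL function): returns 1 if the
 * NUL-terminated string is well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;

    while (*s) {
        unsigned int cp;   /* code point decoded so far */
        unsigned int min;  /* smallest code point allowed for this length */
        int n;             /* number of continuation bytes expected */

        if (*s < 0x80) {                       /* plain ASCII byte */
            s++;
            continue;
        }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; n = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; n = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; n = 3; min = 0x10000; }
        else
            return 0;                          /* stray continuation or invalid lead byte */
        s++;

        while (n-- > 0) {
            if ((*s & 0xC0) != 0x80)
                return 0;                      /* missing continuation (also catches NUL) */
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }

        if (cp < min)
            return 0;                          /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;                          /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                          /* UTF-16 surrogate */
    }
    return 1;
}
```

A low-level function could call such a check once on entry and refuse malformed input, so that later code (case folding, regex matching) can rely on every multi-byte sequence being complete.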

Unicode software options (external libraries)

Discussion of support libraries

Regex library benchmarking exercise