Although any collation software can compare texts character by character, the more common approach is to first split each text (or comparand) into segments, or tokens, and to compare the texts at the token level rather than the character level. This familiar (pre)processing step, called ‘tokenization’, is performed by a tokenizer and can operate at any level of granularity: syllables, words, lines, phrases, verses, paragraphs, text nodes in a normalized XML DOM instance, or any other unit suited to the texts at hand.
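
To make the idea of granularity concrete, here is a minimal sketch in Python of three very simple tokenizers applied to the same text. The function names and the sample witness are hypothetical, chosen for illustration only; they are not taken from any particular collation tool, and a real tokenizer would usually also have to deal with punctuation, markup, and normalization.

```python
def tokenize_characters(text):
    """Finest granularity: every character is its own token."""
    return list(text)


def tokenize_words(text):
    """Word-level tokens, assuming words are delimited by whitespace.

    Punctuation stays attached to the neighbouring word ("dog." is one token).
    """
    return text.split()


def tokenize_lines(text):
    """Line-level tokens: one token per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample witness, two lines long.
witness = "The quick brown fox\njumps over the dog."

word_tokens = tokenize_words(witness)
line_tokens = tokenize_lines(witness)
char_tokens = tokenize_characters(witness)

print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens:", word_tokens)
print(len(line_tokens), "line tokens:", line_tokens)
```

The same witness yields a different token sequence at each level, and it is that sequence, not the raw character string, that a token-based collation would then compare across witnesses.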

Contributed by Caroline.