token

A Single event which happens once and whose identity is limited to that one happening or a Single object or thing which is in some single place at any one instant of time, such event or thing being significant only as occurring just when and where it does, such as this or that word on a single line of a single page of a single copy of a book, I will venture to call a Token.

(Peirce 1906, 506)

Contributed by Caroline. View changelog.

The types referred to here are not to be confused with the datatypes of programming languages, nor with the types in Russell’s theory of types; they do, however, include the ‘types’ of the type/token distinction introduced by C. S. Peirce, for which the locus classicus appears to be (Peirce, 1906, pp. 423-24).

(Huitfeldt et al. 2008, 309, note 1)

By a document we understand an individual object containing marks. A mark is a perceptible feature of a document (normally something visible, e.g. a line in ink). Marks may be identified as tokens in so far as they are instances of types, and collections of marks may be identified as sequences of tokens in so far as they are instances of sequences of types. In other words, a mark is a token if, but only if, it is understood as instantiating a type.
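
The relation between marks, tokens, and types described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the marks are whitespace-separated word forms in a string; the function name `tokens_and_types` is hypothetical and is not taken from any of the quoted sources.

```python
from collections import Counter


def tokens_and_types(marks: str) -> tuple[list[str], Counter]:
    """Read a line of marks as word tokens and count the tokens per type.

    Assumption: a token is a whitespace-delimited word occurrence, and
    two tokens instantiate the same type when their spelling is identical.
    """
    tokens = marks.split()    # every occurrence is a separate token
    types = Counter(tokens)   # every distinct word form is one type
    return tokens, types


tokens, types = tokens_and_types("the cat sat on the mat")
print(len(tokens))    # 6 tokens
print(len(types))     # 5 types
print(types["the"])   # 2 tokens instantiate the type "the"
```

The same counting would work at the character level by replacing `marks.split()` with `list(marks)`; the choice of level is an assumption of the sketch, not something the definition prescribes.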

The model is agnostic about whether the types (and tokens) it is concerned with are those at the character level or those at the level of words and lexical items.

Although any collation software can compare texts on a character-by-character basis, in the more common use case, before collation each text (or comparand) is normally split up into segments or tokens and compared on the level of the token rather than on the character-level. This familiar step in text (pre)processing, called ‘tokenization’, is performed by a tokenizer and can happen on any level of granularity, for instance, on the level of syllables, words, lines, phrases, verses, paragraphs, text nodes in a normalized XML DOM instance, or any other unit suitable to the texts at hand.
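
As a rough sketch of the tokenization step described above, the following splits two comparands at a chosen level of granularity before comparing them. The function `tokenize`, its `level` parameter, and the sample witnesses are hypothetical, and the position-by-position comparison is a deliberate simplification: real collation software aligns token sequences rather than comparing them index by index.

```python
import re


def tokenize(text: str, level: str = "word") -> list[str]:
    """Split a comparand into tokens at the requested granularity."""
    if level == "character":
        return list(text)
    if level == "word":
        # Runs of word characters, plus each punctuation mark on its own.
        return re.findall(r"\w+|[^\w\s]", text)
    if level == "line":
        return [line for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported level: {level!r}")


witness_a = "The quick brown fox\njumps over the dog."
witness_b = "The quick red fox\njumps over the dog."

tokens_a = tokenize(witness_a, "word")
tokens_b = tokenize(witness_b, "word")

# Naive positional comparison (an assumption of this sketch, not an
# alignment algorithm): report positions where the tokens differ.
differences = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b))
    if a != b
]
print(differences)  # [(2, 'brown', 'red')]
```

Calling `tokenize(witness_a, "line")` instead yields two line-level tokens, so the same variant would be reported as a difference between whole first lines rather than between single words.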
