Whenever an author writes anything, he or she “marks it up.” For example, spaces between words indicate word boundaries, commas indicate phrase boundaries, and periods indicate sentence boundaries. This fact is widely ignored: indeed, markup is usually treated as an unfortunate requirement of using electronic text-processing systems, that is, as something to be avoided. A careful analysis, however, reveals that authors regularly use two types of markup in their manuscripts: punctuational, for example, placing periods at ends of sentences; and presentational, for example, numbering pages. Thus, markup cannot be escaped because our writing systems require it.
With the advent of text-processing systems came new types of markup and new types of processing. When prepared for reading, either on screen or on paper, documents are marked up scribally. But, when stored in electronic files, documents may be marked up scribally or with special electronic types of markup designed for processing by computer applications. One uses procedural markup to indicate the procedures that a particular application should follow (e.g., .sk to skip a line), descriptive markup to identify the entity type of the current token (e.g., <p> for paragraphs), referential markup to refer to entities external to the document (e.g., &mdash; for an em dash), and metamarkup to define or control the processing of other forms of markup (e.g., <!ENTITY acm "Association for Computing Machinery"> to define the referential markup &acm;).
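The interplay between metamarkup and referential markup can be illustrated with a minimal sketch. It assumes a much simplified declaration syntax (a name followed by a replacement string in straight double quotes); the patterns and the function name expand_entities are hypothetical and are not taken from any SGML or XML processor.

```python
import re

# Hypothetical, simplified patterns: real SGML/XML entity syntax is richer.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)">')
ENTITY_REF = re.compile(r"&(\w+);")


def expand_entities(document: str) -> str:
    """Read the metamarkup (declarations), then resolve referential markup."""
    # Metamarkup: collect every declared entity name and its replacement text.
    entities = dict(ENTITY_DECL.findall(document))
    # The declarations themselves are not part of the content, so drop them.
    body = ENTITY_DECL.sub("", document)
    # Referential markup: replace known references, leave unknown ones untouched.
    resolved = ENTITY_REF.sub(
        lambda m: entities.get(m.group(1), m.group(0)), body
    )
    return resolved.strip()


sample = (
    '<!ENTITY acm "Association for Computing Machinery">\n'
    "Published by the &acm;."
)
expanded = expand_entities(sample)
print(expanded)
```

In this sketch an undeclared reference such as &mdash; is passed through unchanged, since resolving it would require an external entity set that the sample does not declare.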
The word markup was originally used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples, familiar to proofreaders and others, include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the production of texts was automated, the term was extended to cover all sorts of special “markup codes” inserted into electronic texts to govern formatting, printing, or other processing.
Generalizing from that sense, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text. At a banal level, all printed texts are encoded in this sense: punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might all be regarded as a kind of markup, the function of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings, and syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from scriptio continua: a process of making explicit what is conjectural or implicit. It is a process of directing the user as to how the content of the text should be interpreted.
The term ‘markup’ appears to be a neologism, derived from the ‘mark-up’ instructions inserted by designers into manuscripts intended for printing (OED). Contrary to this etymology, Coombs et al. (1987), Sperberg-McQueen (1991) and Raymond et al. (1992) all claim that markup has been with us for centuries in the form of spaces between words and punctuation. By this they appear to mean that spaces and punctuation are a kind of markup distinct from markup in its purely computational sense. In XML, markup is clearly distinguished from the text: everything between and including pairs of angle brackets, and the white space used to format it, constitutes markup, while the rest of the document is content (Bray et al. 2008, Ch. 2.4). But they are also aware of the more formal definition: ‘Markup is the use of embedded codes, known as tags, to describe a document’s structure, or to embed instructions that can be used by a layout processor or other document management tools’ (Raymond et al., 1992, p. 1). ‘By markup I mean all the information in the document other than the “contents” of the document itself’.
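The XML distinction between markup and content can be approximated in a few lines. The following is only a rough sketch: it treats anything between a pair of angle brackets as markup and everything else as content, and it deliberately ignores comments, CDATA sections, entity references and formatting white space, all of which a real XML parser would have to handle. The name split_markup is an illustrative assumption.

```python
import re

# Simplification: one tag is "<", any run of characters other than ">", then ">".
TAG = re.compile(r"<[^>]*>")


def split_markup(document: str) -> tuple:
    """Return (list of tags, remaining character content)."""
    markup = TAG.findall(document)   # everything between and including brackets
    content = TAG.sub("", document)  # the rest of the document
    return markup, content


markup, content = split_markup("<p>Markup is <em>not</em> content.</p>")
print(markup)
print(content)
```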
Markup (all additional information that is added to a text) makes explicit for the computer what the human reader reads implicitly, and is therefore necessary for the creation of a machine-readable text. Historically, the word markup was used to denote annotations or other marks within a text that made clear to the scribe, compositor, or typist how the text should be printed, typed, or laid out. With the automation of text production, the word was adopted for all specific encoding that determines the appearance of a text.
Although Renear appears to reject Coombs’s idea that punctuation is a kind of markup, he still sees it as embodied in the formatting information inserted by WYSIWYG word-processors (Renear, 1997, p. 109). But here, too, a distinction must be drawn between the data structures employed by word-processor programs, which use text ranges with standoff binary attributes, and explicit markup languages such as HTML, in which the formatting codes are embedded directly in the text. In the early 1990s humanists may have preferred a more expansive definition of markup because they needed to overcome their colleagues’ resistance to its use, by arguing that it was only a variation on something they already used, such as punctuation and spaces, or word-processors. Since the historical discussion that follows is not bound by this constraint, this article will revert to the original definition of markup, like that provided by the OED, as embedded textual codes. References to ‘markup’ without further qualification also assume that markup is embedded in the text that it describes.
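The difference between standoff ranges and embedded codes can be made concrete with a small sketch. It assumes a hypothetical standoff model in which each formatting attribute is stored as a (start, end, tag) range over the plain text, and it assumes that ranges never overlap; actual word-processor file formats are considerably more elaborate and are not reproduced here.

```python
from typing import List, NamedTuple


class Range(NamedTuple):
    """A hypothetical standoff annotation over a plain-text string."""
    start: int  # index of the first character covered
    end: int    # index one past the last character covered
    tag: str    # formatting attribute, here reused as an HTML tag name


def embed(text: str, ranges: List[Range]) -> str:
    """Convert standoff ranges into embedded tags (ranges must not overlap)."""
    pieces = []
    position = 0
    for r in sorted(ranges):
        pieces.append(text[position:r.start])  # untouched text before the range
        pieces.append(f"<{r.tag}>{text[r.start:r.end]}</{r.tag}>")
        position = r.end
    pieces.append(text[position:])             # untouched text after the last range
    return "".join(pieces)


plain = "Markup cannot be escaped."
standoff = [Range(0, 6, "b"), Range(7, 13, "i")]
embedded = embed(plain, standoff)
print(embedded)
```

In the standoff form the string plain carries no codes at all, and the formatting lives in a separate structure. In the embedded form the codes are interleaved with the characters they describe, which is the sense of ‘markup’ adopted in this article.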