An introduction to the Text Encoding Initiative for the 2015 Digital Antiquarian Workshop at the American Antiquarian Society.

Digital Scholarly Editing with TEI

Dawn Childress   @kirschbombe
Digital Antiquarian Workshop   #daw2015

What is XML / TEI ?

XML is an “eXtensible Markup Language”; or,
a metalanguage for defining a set of tags. XML
creates smart documents that contain tags
describing their own structure and content.

There are many varieties of XML tag sets:
EAD,  XHTML,  MODS,  MARC XML,  &  TEI...

Schema:  defines the rules for encoding an XML document. 

So, the TEI Guidelines define a set of XML tags
for encoding text documents, and provide extensive
documentation on how to use these tags... 

...as well as allow us to describe various features
of a text, like...

  • Physical features (paper, condition, binding...)

  • Metadata (authorship, provenance, encoding rationale...)

  • Structure (title, chapter, paragraph, line...)

  • Presentation (italics, underlined, centered...)

  • Context (named entities, dates, references...)

  • Editorial (additions, deletions, marginalia, corrections...)

  • Linguistic (parts of speech)

Why use TEI ?

TEI embeds information about textual features within
the text itself and records this in an explicit, standard,
and machine-readable way, which enables us to
analyze, share, and preserve texts.

TEI is Scholarly

  • It creates a strategic representation of a text

  • It is not simple, neutral or objective

  • It involves analysis and decision-making

  • It requires specialized knowledge of the discipline, texts, and research methods

  • It is intrinsic to the scholarly research that will be performed on the text

Working with XML/TEI:

The basics

XML Components

Processing instructions

Elements (tags)

Attributes

Values

Comments

<?xml version="1.0" encoding="UTF-8"?>
<body>
    <!-- text goes here -->
    <head rend="bold">Heading of an XML example.</head>
    <p n="1">Paragraph of this example.</p>
</body>

XML Syntax

Basic XML element and attribute syntax:
<element attribute="value"> </element>
 

Every start tag has a closing tag:
<element> </element>

 

Tags must nest cleanly:
<publicationStmt><p>Not for distribution.</p></publicationStmt>

 

Tags are case sensitive, so the start and end tags below do not match:

<titlePage></Titlepage>
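
As a rough illustration of the syntax rules above, the fragment below is well-formed, while the variants listed in the comment would be rejected by an XML parser. The <hi> element and the wording are only added for the sake of the example.

```xml
<!-- Well-formed: every tag is closed, nesting is clean, and case matches -->
<publicationStmt>
    <p>Not for <hi rend="italic">distribution</hi>.</p>
</publicationStmt>

<!-- Not well-formed (kept inside a comment so this block stays parseable):
     overlapping tags:   <p>Not for <hi>distribution.</p></hi>
     mismatched case:    <titlePage></Titlepage>
     missing end tag:    <p>Not for distribution.
-->
```

In oXygen, errors like these are usually flagged as soon as they are typed, which makes them easy to catch early.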

TEI Document Structure

Every TEI document consists of a
TEI Header <teiHeader> and a Text <text> section,
both enclosed within the <TEI> element.

There are other required elements within the <teiHeader>
and <text> elements, which we'll discuss in the upcoming sections.

<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
          <!--  -->
    </teiHeader>
    <text>
          <!--  -->
    </text>
</TEI>

Categories of TEI Markup

Contextual

identification of names, titles, places, languages, emphasis, etc.

Editorial/analytic

annotation, explication, correction, normalization, etc.

Structural

divisions, chapters, lists, paragraphs, tables, line groups, lines, etc.

Presentational

typographic features like bold, italics, small caps, indentations, etc.

Structural markup

(and getting familiar with oXygen)

Three main sections within <text>

  • Front  <front>  *optional
    may contain <titlePage> and <div type="..."> ("preface", "contents", etc.)

  • Body  <body>  *required
    contains the main body of the text: <p>, <div>, <lg> and <l>, <speaker>, etc.

  • Back  <back>  *optional
    may contain <div type="..."> ("index", "appendix", "glossary", "editorial", "colophon", "bibliogr", etc.)

Within the <text> element, only the <body> element is required.

We end up with something like this:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
           <!-- Header goes here -->
      </teiHeader>
      <text>
          <front><!-- Front matter --></front>
          <body><!-- Main body of text --></body>
          <back><!-- Back matter --></back>
      </text>
</TEI>

Exercise 1

front, body, back

Mark up the Poe print handout with the major text divisions:  front, body, and back.  What other structural elements do we have? 

Some common structure tags

TEI has predefined tags for some structural elements. For other structural elements, we use the <div> element, with or without the type="" attribute...

  • <p> paragraph

  • <lg> line group

  • <l> line

  • <head> heading

  • <fw> forme work

  • <item> list item

  • <table> table

  • <list> list
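
A minimal sketch of how these structural tags can fit together inside one <div>. The type and n values and all of the text content are placeholders invented for illustration, not taken from the Poe handout.

```xml
<div type="chapter" n="1">
    <head>Chapter heading</head>
    <p>A prose paragraph.</p>
    <!-- Verse: a line group wrapping individual lines -->
    <lg type="stanza">
        <l>First line of verse,</l>
        <l>second line of verse.</l>
    </lg>
    <!-- A list wraps its items -->
    <list>
        <item>First list item</item>
        <item>Second list item</item>
    </list>
</div>
```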


Examples of <div type=" "> values

Front

  • "preface"

  • "ack"

  • "dedication"

  • "abstract"

  • "contents"

  • "frontispiece"

Body

  • "volume"

  • "book"

  • "chapter"

  • "section"

  • "part"

  • "letter"

  • "act"

  • "scene"

  • "poem"

  • "canto"

Back

  • "appendix"

  • "glossary"

  • "notes"

  • "bibliogr"

  • "index"

  • "colophon"


Exercise 2

now let's do this in oXygen!

Open the poe.txt.xml document in oXygen and encode the text structures you marked up on your printout. (See the TEI Exercises.pdf handout)

<front>
        <titlePage>
            <titlePart type="main">THE INCOMPLETE WORKS OF EDGAR ALLAN POE:</titlePart>
            <titlePart type="sub">A VERY BRIEF ANTHOLOGY</titlePart>

            <docImprint>
                <publisher>Association of College and Research Libraries</publisher>
                <docDate>2012</docDate>
            </docImprint>
        </titlePage>
        <pb/>
        <div type="contents">
            <head>TABLE OF CONTENTS.</head>

            <list>
                <item>The Raven ................................ 1</item>
                <item>The Angel of the Odd ..................... 2</item>
                <item>Scenes from “Politian”.................... 7</item>
                <item>Notes .................................... 10</item>
            </list>
        </div>
    </front>
    <pb n="1"/>
    <body>
        <div type="poem">
            <head>THE RAVEN.</head>

            <lg type="stanza">
                <l>Once upon a midnight dreary, while I pondered, weak and weary,</l>
                <l>Over many a quaint and curious volume of forgotten lore,</l>
                <l>While I nodded, nearly napping, suddenly there came a tapping,</l>
                <l>As of some one gently rapping, rapping at my chamber door.</l>
                <l>“'Tis some visiter,” I muttered, “tapping at my chamber door— </l>
                <l>Only this, and nothing more.”</l>
            </lg>
            <!-- remaining stanzas and texts continue here -->
        </div>
    </body>

The teiHeader

The <teiHeader> is mandatory and contains descriptive metadata about the document.  

<fileDesc> is required and must contain:

  • <titleStmt>: information about the title and the parties responsible for its content

  • <publicationStmt>: publication details about the digital text

  • <sourceDesc>: bibliographic details about the physical source of the digital text

Exercise 3

applying a TEI schema & the teiHeader

Create a new TEI Lite document, complete the teiHeader, and add our tagged text. (See the TEI Exercises.pdf handout)

 <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Incomplete Works of Edgar Allan Poe, Digital Edition</title>
        <respStmt>
          <resp>Encoded with basic TEILite tags</resp>
          <name>Dawn Childress</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <p>Produced for 'Introduction to TEI' at the Digital Antiquarian Workshop 2015.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Excerpted from electronic texts at the University of Virginia Library.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>

A few odds & ends

(milestones, presentation, semantics)

Milestones (self-closing tags)

Milestones help solve the problem of "overlapping hierarchies".

Milestones are used to mark up physical and presentational boundaries, such as pages, gatherings, and columns, that may not coincide with the structure of the text.

  • <pb /> page break

  • <cb /> column break

  • <lb /> line break

  • <gb /> gathering beginning

  • <milestone /> generic
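
A short sketch of milestones in use, assuming a paragraph that runs across a page boundary. The page numbers, the unit value, and the text are invented for the example. Because the milestone tags are empty, they can sit inside the paragraph without breaking its nesting.

```xml
<!-- A generic milestone needs @unit to say what kind of boundary it marks -->
<milestone unit="section" n="2"/>
<p>This paragraph begins near the foot of page 4
    <pb n="5"/>
    and continues at the top of page 5, where the printed line
    <lb/>
    breaks in the middle of the sentence.</p>
```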

Presentational markup

The <hi rend=" "> element is a generic tag for marking a word or phrase as typographically distinct

@rend (attribute), when used with any element, allows us to describe how text is rendered in the original document.

  • <head rend="center">

  • <emph rend="italic">

  • <p rend="indent">

  • <del rend="strikethrough">

  • <hi rend="blackletter">
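
As a hedged illustration, the fragment below combines several of these @rend values. The sentence is invented, and the rend values are free-text conventions that a project would document for itself, not a fixed TEI vocabulary.

```xml
<head rend="center">THE RAVEN.</head>
<p rend="indent">The title was printed in <hi rend="blackletter">black letter</hi>,
    and one word was <emph rend="italic">stressed</emph> in italics.</p>
```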

Language markup

Codes for the Representation of Names of Languages: 
www.loc.gov/standards/iso639-2/php/code_list.php

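@xml:lang (attribute) records the language of the text (or of other languages present in the text). Its value is a BCP 47 language tag, as required by the W3C XML specification: use the two-letter ISO 639-1 code where one exists, and the three-letter code otherwise.

  • "en" : : English

  • "ang" : : English, Old

  • "de" : : German

  • "cs" : : Czech

  • "ko" : : Korean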

<emph xml:lang="de">Schmetterling</emph>

Exercise 4

milestones, presentation, semantics

Add some milestones, @rend values, and tag a language. (See the TEI Exercises.pdf handout)

Contextual matters

(people, places, dates...)

Just as we can use TEI to represent the structure of our document, we can use it to define and provide contextual information for things mentioned in the text, like named entities, dates, geographic features, and interpretive information like themes or keywords. 

Encoding dates

The W3C standard date value is a machine-readable date (it can include a time) and must be constructed with accurate punctuation. For more info, visit:
http://www.w3.org/TR/NOTE-datetime

TEI uses the <date> element with one of the following attributes and W3C standard date values (yyyy-mm-dd) to encode dates.

  • @when

  • @notBefore

  • @notAfter

  • @from

<date when="1792-02-28">Feb. 28, 1792</date>
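
A few more hedged examples of the dating attributes. The dates and wording are invented for illustration, and @to, which is not listed above, is the TEI counterpart of @from.

```xml
<!-- A date known only to the month: the value can be shortened to yyyy-mm -->
<date when="1792-02">February 1792</date>

<!-- An uncertain date bounded by an earliest and a latest possible year -->
<date notBefore="1790" notAfter="1795">early in the 1790s</date>

<!-- A period of time with a known start and end -->
<date from="1792-03-01" to="1792-04-15">from the first of March to mid-April, 1792</date>
```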

Encoding named entities

The generic elements <rs> (referring string) and <name> can be used with @type to distinguish the type of entity being named...

<name type="person">Isaiah Thomas</name>

...but there are also specialized tags for many named entities:

  • <persName> personal name

  • <placeName> named place

  • <geogName> river, mountain, etc.

  • <orgName> name of organization

  • <roleName> social roles or titles

The TEI allows us to do more than simply tag a name or place; it provides structures for encoding the contextual information about them, for example in the form of "ographies".

"Ographies"

"Ographies" are structured lists that provide a place to define these named entities. Think of these as local authority files or lists that are created to give context to parts or all of the text. These can be as simple or complex as you need.

"Ographies" can be kept in your encoded file, usually in a <div> in the <back>, or in an external file that is referenced in your encoded file.

  • personography

  • placeography

  • orgography

  • bibliography

  • shipography

<div type="editorial">
            
  <listPlace>
      <place type="state" xml:id="l_rhode_island">
        <placeName>The State of Rhode Island and Providence Plantations</placeName>
        <country>United States of America</country>
        <region>New England</region>
      </place>
  </listPlace>

  <listOrg>
    <org xml:id="o_federal_reserve">
      <orgName>The Federal Reserve</orgName>
      <desc>Bank through which the US currency ... </desc>
    </org>
  </listOrg>
  
  <listBibl>
    <bibl xml:id="b_lee_1964">
      <author>Harper Lee</author>
      <title>To Kill a Mockingbird</title>
      <date>1964</date>
    </bibl>
  </listBibl>
  
  <list type="filmography">
    <item xml:id="raiders">
      <name>Raiders of the Lost Ark</name>
      <desc>First in a series of action-adventure movies 
        detailing the career of Indiana Jones ... </desc>
    </item>
  </list>
  
</div>

Referencing & the xml:id

The people, places, and other named entities defined in our "ographies" can be referenced in the text using the @ref and @xml:id attributes.

In the text:   <persName ref="#RLP">Richard L. Parker</persName>

In the "ography":   <person xml:id="RLP"><!--  --></person>

<text>
    <body>
        <p><persName ref="#P1234">Elder Edmond Lougee</persName></p>
    </body>
    <!-- .... -->
    <back>
        <div type="editorial">
            <!-- <person> entries are collected in a <listPerson> -->
            <listPerson>
                <person xml:id="P1234">
                    <p>Edmund or Edmond Lougee was born in Exeter Newmarket, Rockingham,
                    New Hampshire, USA in 1731 to John Lougee and Anne Gilman. He married
                    Hannah Lord and had 7 children. He passed away on 3 Jun 1807 in Loudon,
                    New Hampshire, USA.</p>
                </person>
            </listPerson>
        </div>
    </back>
</text>

Exercise 5

contextual matters

Tag a few people, places, or other named entities, then define and reference them using the provided "ographies" . (See the TEI Exercises.pdf handout)

Editorial matters

(document, edit, annotate...)

Encoding the writing process

TEI allows us to record deletions, additions, corrections, and other evidence of the writing process, whether by the author of a literary text or by a scribe copying out a manuscript.

  • <add> addition to the text

  • <del> deleted text

  • <subst> substitution: surrounds <add> & <del> tags when the added text is a substitution for the deleted text

  • <sic> an error in the text

  • <corr> a corrected error

  • <supplied> text supplied by the transcriber or editor; use with @resp to show something was added by the editor

These elements can carry attributes:

  • @resp : who's responsible? usually points to a defined xml:id

  • @place : supralinear, margin, above, below, etc.

  • @rend : strikethrough, etc.
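
A minimal sketch of these elements working together in one sentence. The text is invented, and #ed1 is a hypothetical xml:id for an editor who would be defined elsewhere in the file.

```xml
<p>The ship
    <!-- The writer struck out one word and wrote another above it -->
    <subst><del rend="strikethrough">arrived</del><add place="above">departed</add></subst>
    on
    <!-- The source's error and the editor's correction are both kept -->
    <choice><sic>Wensday</sic><corr resp="#ed1">Wednesday</corr></choice>
    morning, carrying
    <!-- Text the editor supplied where the source could not be read -->
    <supplied resp="#ed1" reason="illegible">forty</supplied> barrels.</p>
```

Pairing <sic> with <corr> inside <choice> is optional; either element can also be used on its own.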

Normalized vs. diplomatic

In some cases, there may be more than one transcription or encoding possibility to choose from, or the editor may want to normalize a part of the text; in these cases, TEI provides the  <reg> and <orig> elements with the <choice> wrapper element.

<l>My
    <choice>
        <reg>Mistress'</reg><orig>Mistres</orig>
    </choice> eyes are nothing like the
    <choice>
        <reg>sun</reg><orig>Sunne</orig>
    </choice>,
</l>
<l>
    <choice>
        <reg>Coral</reg><orig>Curral</orig>
    </choice> is far more red
    <choice>
        <reg>than</reg><orig>then</orig>
    </choice> her lips red,
</l>

Later, the interface can be designed to display either the normalized or the diplomatic transcription, or both.

Notes & annotations I

  • A note is any additional comment found in a text, marked in some way as being out of the main textual stream.

  • Notes can be marked using the <note> tag

  • Notes may be in a different location, hand or typeface, may be authorial or editorial, and may have been added later

  • It is important to distinguish between authorial and editorial notes; use @type for this

  • Headings or descriptive labels for a passage can use <label> instead of <note>

Use attributes to specify these and other characteristics of notes (@place, @resp, @type)
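
A hedged example of how @type, @place, and @resp might distinguish an authorial note from an editorial one. The sentence and both notes are invented, and #ed1 is a hypothetical xml:id for the editor.

```xml
<p>The meeting was held at the old tavern<note type="authorial" place="margin">Since
    demolished.</note> near the common.<note type="editorial" place="end"
    resp="#ed1">The tavern has not been identified.</note></p>
```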

Notes & annotations II

Notes can be encoded directly at the point of attachment...

<p>Why does the language-maven in the street 
   (or the senior common-room, or the bar at the Groucho
   Club <note>An establishment patronized by media folk
   in London (provided the club will have them as
   members).</note>) have such a low opinion of 
   linguists? Because...</p>

Notes & annotations III

...or can be added to a <div type="notes"> in the back matter, using a <ref> or <ptr> tag at the point of attachment that points to the note.

<div>
    <head>Beyond "anything goes"</head>
    <p> Why does the language-maven in the street
    (or the senior common-room, or the bar at the Groucho
    Club <ptr target="#note6"/>) have such a low opinion of
    linguists? Because...</p>
</div>
<back>
    <div type="notes">
        <head>Notes</head>
        <!-- other notes here -->
        <note xml:id="note6">An establishment patronized
        by media folk in London (provided the club will have
        them as members).
        </note> <!-- and here -->
    </div>
</back>

Exercise 6

editorial matters

Tag some additions, deletions, corrections, notes, etc. (See the TEI Exercises.pdf handout)

Publishing/transforming TEI

Get some help...

  • OxGarage

  • The TAPAS Project

  • JuxtaCommons / JuxtaEditions

  • TEI Boilerplate

  • VersioningMachine

  • Maps & Texts (in development)

  • EpiDoc

... or roll your own

  • XSLT

  • JavaScript

  • CSS
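
For the "roll your own" route, here is a minimal XSLT 1.0 sketch, assuming a TEI file shaped like the Poe example (a title in the teiHeader, and <div>, <head>, <p>, <lg>, and <l> in the body). The HTML element choices and the "stanza" class name are illustrative assumptions, not a recommended publishing setup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

    <xsl:output method="html" encoding="UTF-8"/>

    <!-- Start at the document root and build the HTML page -->
    <xsl:template match="/">
        <html>
            <head>
                <title><xsl:value-of select="//tei:titleStmt/tei:title"/></title>
            </head>
            <body>
                <xsl:apply-templates select="//tei:text/tei:body"/>
            </body>
        </html>
    </xsl:template>

    <!-- Map TEI structural elements onto rough HTML equivalents -->
    <xsl:template match="tei:div">
        <div><xsl:apply-templates/></div>
    </xsl:template>

    <xsl:template match="tei:head">
        <h2><xsl:apply-templates/></h2>
    </xsl:template>

    <xsl:template match="tei:p">
        <p><xsl:apply-templates/></p>
    </xsl:template>

    <!-- Verse: one block per stanza, one line break per line -->
    <xsl:template match="tei:lg">
        <div class="stanza"><xsl:apply-templates/></div>
    </xsl:template>

    <xsl:template match="tei:l">
        <xsl:apply-templates/><br/>
    </xsl:template>

    <!-- Presentational markup: italic highlights become <i> -->
    <xsl:template match="tei:hi[@rend='italic']">
        <i><xsl:apply-templates/></i>
    </xsl:template>

</xsl:stylesheet>
```

A stylesheet like this can be applied from oXygen's transformation scenarios or any XSLT 1.0 processor. Elements without a matching template fall through to XSLT's built-in rules, which output their text content.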

Keep encoding!

  • TEI P5 Guidelines

  • TEI by Example

  • Women Writers Project (resources)

  • JuxtaCommons / JuxtaEditions