
# ODIN 2.1

<!--
NOTE: this is a MarkDown document. It is plain text, but is best viewed
      in a MarkDown viewer (see: https://en.wikipedia.org/wiki/Markdown).
-->

ODIN, the Online Database of INterlinear glossed text, is a collection of
IGT extracted from linguistics documents on the web.

This release encodes the data into the
[Xigt](http://depts.washington.edu/uwcl/xigt/)
format, which provides several benefits:

  * The XML data format is more explicit than plain text, so it's more
    interpretable
  * The Xigt data model is extensible, so additional layers of annotation can
    be provided on top of the existing data
  * The Xigt project provides tools for querying and processing the data,
    as well as an API for working with the data programmatically

In addition, this release improves the original ODIN text data in some ways:

  * The original sentences have been cleaned and noise/errors from the
    PDF extraction process have been reduced
  * Basic alignments between the language, gloss, and translation lines
    have been created where possible
  * Enrichments to the data (POS tags, phrase structure, bilingual alignments,
    and dependencies) are provided, where possible, via the
    [INTENT project](http://intent-project.info/)

This release contains:

  * 2026 source documents
  * 158,007 IGTs
  * 1496 languages by ISO-639-3 code
    - 157,363 IGTs (1494 languages) with an identified code
    - 637 IGTs for language varieties without a code (given code `???`)
    - 7 IGTs for artificial languages (given code `*xxx*`)

## License and Attribution

The ODIN data are released under the [Creative Commons Attribution 4.0
International License](http://creativecommons.org/licenses/by/4.0/), requiring
only attribution. Users may fulfill the attribution requirement by linking to
the ODIN website, and/or by citing the ODIN publications.

When using individual IGT, such as in a linguistics paper, it is good practice
to also cite the original source. Each IGT in ODIN has a `doc-id` attribute,
which is an identifier for the document the IGT was extracted from. This
identifier can be looked up in the `citations.txt` file to find information
about the source document, such as the title, year, and author.

## Corpus Contents

The ODIN data exist under the `data/` subdirectory. There are six compressed
archives containing different levels of annotation and different views of the
data, as described below. The files are `tar` archives with `bzip2` compression.
On machines with the `tar` command-line program, the files can be extracted with
the following commands (e.g., using `data/by-doc-id/txt.tbz2`):

    cd data/by-doc-id/
    tar xf txt.tbz2

Alternatively, or on Windows systems, a program such as
[7-Zip](http://www.7-zip.org/) can extract the files.
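
Where neither tool is convenient, the archives can also be unpacked from a script. The following is a minimal sketch using only Python's standard `tarfile` module; the helper name `extract_archive` is ours and is not part of the ODIN or Xigt tooling.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a bzip2-compressed tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" tells tarfile to expect bzip2 compression.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Example call (path taken from the command-line example above):
# extract_archive("data/by-doc-id/txt.tbz2", "data/by-doc-id")
```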

There are three levels of annotation: (1) the original text data, (2) the data
imported into the Xigt XML format, and (3) the Xigt XML data augmented with
inferred and enriched annotation tiers. There are then two views on the data,
where the views group IGTs into corpus files by (a) the document ID or (b) the
subject language (as determined by their assigned ISO-639-3 code). There are
thus two subdirectories under `data/`, `data/by-doc-id/` and `data/by-lang/`,
and a total of six corpus collections:

1. `data/by-doc-id/`
  * `txt`
  * `xigt`
  * `xigt-enriched`
2. `data/by-lang/`
  * `txt`
  * `xigt`
  * `xigt-enriched`

The corresponding collections across the views (e.g., `by-doc-id/txt` and
`by-lang/txt`) will contain the same data. The only difference is how the IGTs
are grouped into corpus files (by document ID or by ISO-639-3 code).

Under each view's subdirectory (`by-doc-id/` or `by-lang/`) there is a
`languages.txt` file that lists, for each file, the languages used and the
number of IGTs for those languages. For example, the entry for the `ben` corpora
(`ben.txt` or `ben.xml`) under the `by-lang/` view shows the following:

    ben:
    231    Bangla (ben)
    117    Bengali (ben)
    4      Bengalisch (ben)
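
Entries like this one are straightforward to read programmatically. The sketch below assumes the whole file follows the layout of the excerpt above (a `key:` header line followed by `count  name (code)` lines); the function name `parse_languages` is hypothetical, and other parts of the real file may need extra handling.

```python
import re
from collections import defaultdict

# One entry line: an IGT count, a language name, and a code in parentheses.
ENTRY_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+\(([^()]*)\)\s*$")


def parse_languages(text):
    """Map each corpus key (e.g. 'ben') to a list of (count, name, code)."""
    corpora = defaultdict(list)
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = ENTRY_RE.match(line)
        if match:
            # Entry lines before any header are ignored.
            if current is not None:
                count, name, code = match.groups()
                corpora[current].append((int(count), name, code))
        elif stripped.endswith(":"):
            # A header line such as "ben:" starts a new corpus entry.
            current = stripped[:-1]
    return dict(corpora)


SAMPLE = """\
ben:
231    Bangla (ben)
117    Bengali (ben)
4      Bengalisch (ben)
"""

entries = parse_languages(SAMPLE)
total = sum(count for count, _, _ in entries["ben"])
print(total)  # 352
```

The three language names sum to 352 IGTs, which matches the IGT count reported for `ben.xml` in `summary.txt`.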

In each Xigt collection (`xigt` or `xigt-enriched`), there is a `summary.txt`
file that gives an overview of the number of items, tier types, etc. found in
each corpus file. The abbreviated entry for the same `ben` corpora is as
follows:

    ben.xml:
       352   IGTs
        36   source documents
         3   languages (by name)
         1   languages (by ISO-639-3 language code)
       127   IGTs with tiers: odin, odin, odin, phrases, translations, words, words, glosses, glosses, morphemes, pos, pos, bilingual-alignments, phrase-structure, dependencies
        89   IGTs with tiers: odin, odin, odin, phrases, translations, words, words, glosses, glosses, morphemes, pos, pos, pos, bilingual-alignments, phrase-structure, phrase-structure, dependencies, dependencies
        36   IGTs with tiers: odin, odin, odin, phrases, words, glosses, glosses, morphemes, pos
       ...   ...
      1056   tiers of type: odin
       632   tiers of type: pos
       606   tiers of type: glosses
       ...   ...

The "IGTs with tiers" lines show how many IGTs exist with the specified
collection of tier types. The "tiers of type" lines show how many tiers of the
specified type exist across all IGTs in the corpus. Differences across IGTs are
due to the available data in the original representation and how well we were
able to infer and enrich the data (we are able to enrich clean data better than
noisy data). The accompanying `enrichment_flowchart.pdf` file illustrates this
process in more detail. Also note that some tier types are repeated (e.g.,
`odin`, `words`, etc.). The repeated `odin` tiers hold the `raw`,
`cleaned`, and `normalized` versions of the text data (a `state` attribute on
those tiers indicates the level of cleaning done). For the other tier
types, a repeated tier represents either an alternative analysis (as in
`dependencies` or `pos`) or the same tier type applied to a different data
source (as in `words` for the language line and the translation line).
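
To see which tier types are repeated in a given "IGTs with tiers" line, the line can be split and counted. This is a small sketch that assumes the line format shown in the excerpt above; `parse_tier_signature` is a hypothetical helper name.

```python
from collections import Counter


def parse_tier_signature(line):
    """Split an 'IGTs with tiers' line into (igt_count, Counter of tier types)."""
    count_part, _, tiers_part = line.partition("IGTs with tiers:")
    tier_types = [t.strip() for t in tiers_part.split(",") if t.strip()]
    return int(count_part.strip()), Counter(tier_types)


line = ("36   IGTs with tiers: odin, odin, odin, phrases, words, "
        "glosses, glosses, morphemes, pos")
igt_count, tiers = parse_tier_signature(line)
# Keep only the tier types that occur more than once.
repeated = {tier: n for tier, n in tiers.items() if n > 1}
print(igt_count, repeated)  # 36 {'odin': 3, 'glosses': 2}
```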

## Xigt format

The [Xigt project](http://depts.washington.edu/uwcl/xigt/) has documentation
about the structure of the XML, but we provide a brief explanation here.

The data contains only four levels of nesting: the root element,
`<xigt-corpus>`, contains a list of `<igt>` elements, which contain `<tier>`
elements, which in turn contain `<item>` elements. The actual IGT data is
expressed in the `<item>` elements, and `<tier>` elements group `<item>`
elements of the same type (e.g. all glosses). In addition, `<metadata>`
elements may appear at the `<xigt-corpus>`, `<igt>`, or `<tier>` levels,
before the other kinds of child elements.

Here's a selected (and simplified) example:

```xml
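<igt id="igt1">
  <metadata>
    <meta>
      <dc:subject>Mandarin</dc:subject>
      <dc:language>English</dc:language>
    </meta>
  </metadata>
  <tier id="p" type="phrases">
    <item id="p1">hua kaishi hong le</item>
  </tier>
  <tier id="w" type="words" segmentation="p">
    <item id="w1" segmentation="p1[0:3]"/>
    <item id="w2" segmentation="p1[4:10]"/>
    <item id="w3" segmentation="p1[11:15]"/>
    <item id="w4" segmentation="p1[16:18]"/>
  </tier>
  <tier id="g" type="glosses" alignment="w">
    <item id="g1" alignment="w1">flower</item>
    <item id="g2" alignment="w2">begin</item>
    <item id="g3" alignment="w3">red</item>
    <item id="g4" alignment="w4">Prc</item>
  </tier>
  <tier id="t" type="translations" alignment="p">
    <item id="t1" alignment="p1">Flowers started to turn red.</item>
  </tier>
</igt>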
```
> (This example is taken from Wu, Jiun-Shiung. *Modeling temporal progression
> in Mandarin: aspect markers and temporal relations*. Diss. Tex. Austin, 2008.)
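
The four-level nesting can be walked with any XML library. Below is a minimal sketch using Python's standard `xml.etree.ElementTree` rather than the Xigt API; the `id` values in the sample string and the function name `items_by_tier_type` are illustrative assumptions, and real corpus files carry more attributes and tiers.

```python
import xml.etree.ElementTree as ET

# A cut-down corpus with the nesting described above:
# <xigt-corpus> > <igt> > <tier> > <item>. The id values are made up.
SAMPLE = """
<xigt-corpus>
  <igt id="igt1">
    <tier id="p" type="phrases">
      <item id="p1">hua kaishi hong le</item>
    </tier>
    <tier id="g" type="glosses">
      <item id="g1">flower</item>
      <item id="g2">begin</item>
      <item id="g3">red</item>
      <item id="g4">Prc</item>
    </tier>
    <tier id="t" type="translations">
      <item id="t1">Flowers started to turn red.</item>
    </tier>
  </igt>
</xigt-corpus>
"""


def items_by_tier_type(corpus_root):
    """Yield (igt_id, tier_type, [item texts]) for every tier in the corpus."""
    for igt in corpus_root.iter("igt"):
        for tier in igt.findall("tier"):
            texts = [item.text or "" for item in tier.findall("item")]
            yield igt.get("id"), tier.get("type"), texts


root = ET.fromstring(SAMPLE.strip())
for igt_id, tier_type, texts in items_by_tier_type(root):
    print(igt_id, tier_type, texts)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.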

As the Xigt-encoded ODIN corpus is automatically created from IGTs extracted
from PDFs, we try to avoid losing information during encoding by using a
pseudo-standoff annotation on the original extracted text. For instance, the
`phrases` tier in the above example would select its content from an `odin` text
tier like this:

```xml
...
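  <tier id="n" type="odin" state="normalized">
    <item id="n1">hua   kaishi hong    le</item>
    ...
  </tier>
  <tier id="p" type="phrases" content="n">
    <item id="p1" content="n1"/>
  </tier>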
...
```

Note that this `odin` tier has its `state` attribute set to `normalized`. The
original text would be in a first `odin` tier with `state` set to `raw` (not
shown). Some corruptions of the text extracted from the PDF can be automatically
recovered, so a second `odin` tier with `state` set to `cleaned` (also not
shown) follows the `raw` tier. The text form of an IGT often has portions that
are not part of the language, gloss, or translation lines, such as item numbers,
author or language names, or extraneous whitespace. The normalized tier
attempts to remove these portions. Finally, the inferred structure tiers
(`phrases`, `glosses`, etc.) annotate the data in the normalized tier, and the
enrichment tiers (`pos`, `phrase-structure`, etc.) annotate these structure
tiers.
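
To make the pseudo-standoff idea concrete, the sketch below resolves item references back to text using only Python's standard library. It assumes references of the form `id` or `id[start:end]` stored in `content` or `segmentation` attributes, and the `id` values are made up; any fuller reference syntax described in the Xigt documentation is not handled, and the Xigt project's own tools are the better choice for real work.

```python
import re
import xml.etree.ElementTree as ET

# A reference is an item id, optionally followed by one [start:end] span.
REF_RE = re.compile(r"^(?P<id>[^\[\]]+)(?:\[(?P<start>\d+):(?P<end>\d+)\])?$")

SAMPLE = """
<igt id="igt1">
  <tier id="n" type="odin" state="normalized">
    <item id="n1">hua kaishi hong le</item>
  </tier>
  <tier id="p" type="phrases" content="n">
    <item id="p1" content="n1"/>
  </tier>
  <tier id="w" type="words" segmentation="p">
    <item id="w1" segmentation="p1[0:3]"/>
    <item id="w2" segmentation="p1[4:10]"/>
    <item id="w3" segmentation="p1[11:15]"/>
    <item id="w4" segmentation="p1[16:18]"/>
  </tier>
</igt>
"""


def item_text(igt, item):
    """Return an item's own text, or the text its reference selects."""
    if item.text:
        return item.text
    ref = item.get("content") or item.get("segmentation")
    return resolve(igt, ref) if ref else ""


def resolve(igt, ref):
    """Resolve a reference such as 'p1' or 'p1[0:3]' within one IGT."""
    match = REF_RE.match(ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref!r}")
    items = {item.get("id"): item for item in igt.iter("item")}
    # Follow the chain of references until an item with its own text is found.
    text = item_text(igt, items[match["id"]])
    if match["start"] is not None:
        text = text[int(match["start"]):int(match["end"])]
    return text


igt = ET.fromstring(SAMPLE.strip())
words = [item_text(igt, item) for item in igt.find("tier[@type='words']")]
print(words)  # ['hua', 'kaishi', 'hong', 'le']
```

Because each word stores only character offsets into the phrase, and the phrase in turn points at the normalized `odin` item, the extracted text is stored once and every higher tier remains traceable to it.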

## Acknowledgments

Work on ODIN (and related projects) has been funded in part by the following
grants:

* National Science Foundation Grant Nos. BCS-1160274 and BCS-0748919. Any
  opinions, findings, and conclusions or recommendations expressed in this
  material are those of the author(s) and do not necessarily reflect the views
  of the National Science Foundation.
* Singapore Ministry of Education Tier 2 grant (grant number MOE2013-T2-1-016)