# Change Log

## [v2.1] - 2016.03.14

This release fixes the bugs from v2.0 and introduces enriched tiers
from the [INTENT](http://intent-project.info/) project.

### Added

Where possible, the following inferred tiers are added:

* phrases
* words
* morphemes
* glosses
* translations

And, where possible, the following enriched tiers are added:

* pos
* bilingual-alignment
* phrase-structure
* dependencies

Also, the `odin-xigt.rnc` RelaxNG schema is included under the `schema/`
directory for validating the XigtXML files.
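
As a rough way to see how often "where possible" applies, the sketch below counts how many `<igt>` elements in a XigtXML document carry each of the tier types listed above. It assumes tiers are `<tier>` elements with a `type` attribute and that `<igt>` and `<tier>` are not in an XML namespace; `tier_type_counts` is a hypothetical helper, not part of the ODIN or INTENT tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tier types named in this changelog entry.
INFERRED = ("phrases", "words", "morphemes", "glosses", "translations")
ENRICHED = ("pos", "bilingual-alignment", "phrase-structure", "dependencies")


def tier_type_counts(xml_text):
    """Count the <igt> elements that contain each inferred or enriched tier type.

    Assumes un-namespaced <igt> and <tier> elements (an assumption about the
    file layout, not something this changelog guarantees).
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for igt in root.iter("igt"):
        # Use a set so a tier type is counted at most once per IGT.
        types = {tier.get("type") for tier in igt.iter("tier")}
        counts.update(t for t in types if t in INFERRED + ENRICHED)
    return counts
```

Calling `tier_type_counts(open(path, encoding="utf-8").read())` on a file from one of the collections would give a per-file summary; types that never occur simply report a count of zero.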

### Changed

* IDs
  - `iX`-style IDs (e.g., `i2`) for `<igt>` elements are now `igtD-X`, where `D`
    is the doc-id (document ID) and `X` is the IGT number for that document (e.g.,
    `igt1260-7`)
  - Some `<igt>` elements had `.txt` at the end of the ID, which came from
    errors in the original text corpus. This suffix has been removed
    from both the text corpus and the IDs in the XML. (also see "Filenames"
    below)
  - All IDs with integers now begin from 1 instead of 0
  - IDs on `<meta>` are now of the form `metaX` (e.g., `meta1`)
* Metadata
  - The `odin-source` metadata is deprecated in favor of attributes on
    the `<igt>` elements:
    ```xml
    <igt id="igt123" tag-types="L G T" line-range="234-238" doc-id="1">
    ```
  - The `language` meta type is deprecated in favor of OLAC-style
    metadata, such as:
    ```xml
    <metadata>
      <meta id="meta1">
        <dc:subject xsi:type="olac:language" olac:code="...">...</dc:subject>
        <dc:language xsi:type="olac:language" olac:code="...">...</dc:language>
      </meta>
    </metadata>
    ```
  - Namespaces for the OLAC-style metadata are placed on `<xigt-corpus>`
  - Metadata in the ODIN text format are simplified for release
* Tiers
  - Cleaning and normalization of ODIN data are now done in separate
    tiers. Cleaning should only attempt to fix errors in the input, whereas
    normalization can alter the text (e.g., remove example numbers or
    rejoin lines).
  - The ODIN tiers are unified and distinguished with a `state` attribute:
    - `type="odin-raw"` becomes `type="odin" state="raw"`
    - `type="odin-clean"` becomes `type="odin" state="cleaned"` and
       `type="odin" state="normalized"`
* Judgments in the text for `L` or `T` lines are extracted and a `judgment`
  attribute is added to the `<item>`. Judgments are only extracted when
  one or more of `*`, `?`, or `#` appear at the beginning of the line.
  Note that this won't be 100% accurate, nor does it attempt to extract
  judgments from the middle of sentences (e.g. for alternations).
* Translation lines
  - Attempts are made to separate multiple translations into individual
    items, with the secondary ones getting tags like `+AL` for "alternate"
    and `+LT` for "literal"
  - Notes on translations (like `intended:` or `literally:`) get moved to
    a `note` attribute on the `<item>`.
* Filenames
  - Data subdirectories are now collected under a `data/` directory
  - Corpus collections of the same view are placed in a common subdirectory
    (e.g., `data/by-doc-id/` and `data/by-lang/`), and the collections are
    named by their format:
    - `data/by-doc-id/txt`
    - `data/by-doc-id/xigt`
    - `data/by-doc-id/xigt-enriched`
    - `data/by-lang/txt`
    - `data/by-lang/xigt`
    - `data/by-lang/xigt-enriched`
  - The `languages.txt` files are now grouped under a view directory (e.g.,
    `data/by-doc-id/languages.txt`), since they apply to all collections under
    that directory.
  - Some files had two extensions (`*.txt.txt`); these now have one (`*.txt`).
    (Also see "IDs" above)
  - Colons are not valid characters in Windows filenames, so the `by-lang`
    filenames like `aer:are.txt` are now hyphen-separated (`aer-are.txt`)
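
A few of the mechanical changes above (the new `<igt>` ID format, the removal of stray `.txt` suffixes, the filename fixes, and the extraction of line-initial judgments) can be sketched in Python. This is only an illustration of the rules as described in this entry, not the code used to build the release: all function names are hypothetical, and allowing whitespace before and after the judgment marker is an assumption.

```python
import re


def new_igt_id(doc_id, igt_number):
    """Format a v2.1-style <igt> ID from a doc-id and a 1-based IGT number."""
    return f"igt{doc_id}-{igt_number}"


def strip_txt_suffix(identifier):
    """Remove a stray '.txt' suffix from an ID, if present."""
    if identifier.endswith(".txt"):
        return identifier[: -len(".txt")]
    return identifier


def new_by_lang_filename(old_name):
    """Replace colons with hyphens and collapse a doubled '.txt' extension."""
    name = old_name.replace(":", "-")
    while name.endswith(".txt.txt"):
        name = name[: -len(".txt")]
    return name


# One or more of '*', '?', '#' at the beginning of the line.
# Assumption: surrounding whitespace is tolerated and dropped.
JUDGMENT_RE = re.compile(r"^\s*([*?#]+)\s*(.*)$")


def split_judgment(line):
    """Return (judgment, remaining_text); judgment is None if there is none.

    Markers in the middle of the line are deliberately left untouched.
    """
    match = JUDGMENT_RE.match(line)
    if match:
        return match.group(1), match.group(2)
    return None, line
```

For instance, `new_igt_id(1260, 7)` yields the `igt1260-7` form shown under "IDs", and `split_judgment` leaves a line such as `John *sleep` unchanged because the marker is not line-initial.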

### Removed

* The `full/` directory is removed


## [v2.0] - 2014.07.05

The 2.0 release of ODIN provides both the textual ODIN corpus and the
Xigt-encoded XML version.

### Overview

There are five subdirectories:

* `full/` - The whole corpus in one large XigtXML file
* `by-doc-id/` - A XigtXML file for each source document
* `by-lang/` - A XigtXML file for each language code
* `txt-by-doc-id/` - The original text corpus, split by source document
* `txt-by-lang/` - The original text corpus, split by language

The XigtXML subdirectories also contain two additional files:

* `summary.txt` - an overview of the counts of items, languages, etc.
   for each file
* `languages.txt` - a listing of the languages found in each XML file

### Known bugs

(fixed in [v2.1](#v21---20160314))

* Inferred "glosses" and "translations" tiers in the XigtXML files do
  not have the "alignment" reference attribute specified, even when
  their items do specify it
* Inferred "glosses" and "translations" tiers use the "content"
  reference attribute to refer to a non-existent "p1" item (when it
  should be "p0")
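
For anyone still working with v2.0 files, the second bug can be detected with a check along the following lines. It is a minimal sketch that assumes references live in `content` and `alignment` attributes, that bracketed spans (such as `[0:3]`) and `,`/`+` separators may appear in their values, and that elements are not namespaced; `dangling_references` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Reference attributes named in the bug descriptions above.
REF_ATTRS = ("content", "alignment")


def referenced_ids(value):
    """Split a reference value into bare IDs.

    Assumption: spans look like 'p1[0:3]' and multiple targets are
    joined with ',' or '+'.
    """
    without_spans = re.sub(r"\[[^\]]*\]", "", value)
    return [part for part in re.split(r"[,+\s]+", without_spans) if part]


def dangling_references(igt):
    """List (element id, attribute, missing id) for unresolved references.

    `igt` is an ElementTree element for one <igt>; only IDs defined
    inside that element are considered valid targets.
    """
    known = {el.get("id") for el in igt.iter() if el.get("id")}
    problems = []
    for el in igt.iter():
        for attr in REF_ATTRS:
            value = el.get(attr)
            if value is None:
                continue
            for ref in referenced_ids(value):
                if ref not in known:
                    problems.append((el.get("id"), attr, ref))
    return problems
```

An empty result for every `<igt>` in a file (for example, by looping over `ET.parse(path).getroot().iter("igt")`) suggests the file does not show this particular problem.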


[v2.0]: http://depts.washington.edu/uwcl/odin/
[v2.1]: http://depts.washington.edu/uwcl/odin/