# Change Log ## [v2.1] - 2016.03.14 This release fixes the bugs from v2.0 and introduces enriched tiers from the [INTENT](http://intent-project.info/) project. ### Added Where possible, the following inferred tiers are added: * phrases * words * morphemes * glosses * translations And, where possible, the following enriched tiers are added: * pos * bilingual-alignment * phrase-structure * dependencies Also, the `odin-xigt.rnc` RelaxNG schema is included under the `schema/` directory for validating the XigtXML files. ### Changed * IDs: - `iX` style IDs (e.g., `i2`) for `<igt>` elements are now `igtD-X` where `D` is the doc-id (document ID) and X is the IGT number for that document (e.g., `igt1260-7`) - Some `<igt>` elements had `.txt` at the end of the ID, which came from errors in the original text corpus. This suffix has been removed from both the text corpus and the IDs in the XML. (also see "Filenames" below) - All IDs with integers now begin from 1 instead of 0 - IDs on `<meta>` are now of the form `metaX` (e.g., `meta1`) * Metadata - The `odin-source` metadata is deprecated in favor of attributes on the `<igt>` elements: ```xml <igt id="igt123" tag-types="L G T" line-range="234-238" doc-id="1"> ``` - The `language` meta type is deprecated in favor of OLAC-style metadata, such as: ```xml <metadata> <meta id="meta1"> <dc:subject xsi:type="olac:language" olac:code="...">...</dc:subject> <dc:language xsi:type="olac:language" olac:code="...">...</dc:language> </meta> </metadata> ``` - Namespaces for the OLAC-style metadata are placed on `<xigt-corpus>` - Metadata in the ODIN text format are simplified for release * Tiers - The cleaning and normalizing of ODIN data is now done with separate tiers. Cleaning should only attempt to fix errors in the input, and normalization can alter text (e.g. remove example numbers, rejoin lines, etc.). - The ODIN tiers are unified and distinguished with a `state` attribute: - `type="odin-raw"` becomes `type="odin" state="raw"` - `type="odin-clean"` becomes `type="odin" state="cleaned"` and `type="odin" state="normalized"` * Judgments in the text for `L` or `T` lines are extracted and a `judgment` attribute is added to the `<item>`. Judgments are only extracted when one or more of `*`, `?`, or `#` appear at the beginning of the line. Note that this won't be 100% accurate, nor does it attempt to extract judgments from the middle of sentences (e.g. for alternations). * Translation lines - Attempts are made to separate multiple translations into individual items, with the secondary ones getting tags like `+AL` for "alternate" and `+LT` for "literal" - Notes on translations (like `intended:` or `literally:`) get moved to a `note` attribute on the `<item>`. * Filenames - Data subdirectories are now collected under a `data/` directory - Corpus collections of the same view are placed in a common subdirectory (e.g., `data/by-doc-id/` and `data/by-lang/`), and the collections are named by their format: - `data/by-doc-id/txt` - `data/by-doc-id/xigt` - `data/by-doc-id/xigt-enriched` - `data/by-lang/txt` - `data/by-lang/xigt` - `data/by-lang/xigt-enriched` - The `languages.txt` files are now grouped under a view directory (e.g., `data/by-doc-id/languages.txt`), since they apply to all collections under that directory. - Some files had two extensions (*.txt.txt); these now have one (*.txt). (Also see "IDs" above) - Colons are not valid characters in Windows filenames, so the "by-lang" filenames like "aer:are.txt" are now hyphen-separated ("aer-are.txt") ### Removed * The `full/` directory is removed ## [v2.0] - 2014.07.05 The 2.0 release of ODIN provides both the textual ODIN corpus and the Xigt-encoded version XML version. ### Overview There are five subdirectories: * `full/` - The whole corpus in one large XigtXML file * `by-doc-id/` - A XigtXML file for each source document * `by-lang/` - A XigtXML file for each language code * `txt-by-doc-id/` - The original text corpus, split by source document * `txt-by-lang/` - The original text corpus, split by language The XigtXML subdirectories also contain two additional files: * `summary.txt` - an overview of the counts of items, languages, etc. for each file * `languages.txt` - a listing of the languages found in each XML file ### Known bugs (fixed in [v2.1](#v21---20151106)) * Inferred "glosses" and "translations" tiers in the XigtXML files do not have the "alignment" reference attribute specified, even when their items do specify it * Inferred "glosses" and "translations" tiers use the "content" reference attribute to refer to a non-existent "p1" item (when it should be "p0") [v2.0]: http://depts.washington.edu/uwcl/odin/ [v2.1]: http://depts.washington.edu/uwcl/odin/