Linguistic corpora in Mehri and Jibbāli

Participants: all.

Coordinators: Sabrina Bendjaballah (LLING) & Clément Plancq (LaTTiCe).

Our aim is to create a corpus of recordings, along with their transcriptions and translations, based on fieldwork with native speakers. We concentrate on the elicitation of morphological paradigms, since this type of recording is not yet available for the Modern South Arabian languages. On the technical side, this corpus will be linked to the existing catalogues of databases on Afroasiatic languages.

Constitution of questionnaires

For both languages, data collection focusses on two domains: i) verb paradigms, ii) the noun phrase.

Mehri of Oman

  1. Verbal system: digitisation and organisation of verb paradigms, taken from the following three sources:
    • the Introduction of the Mehri Lexicon (Johnstone 1987), approximately 70 paradigms
    • the verb forms given in the lexical entries of the Mehri Lexicon
    • new paradigms collected during fieldwork, on the basis of the work conducted by S. Bendjaballah, A. Lonnet and P. Ségéral on the Introduction of the Mehri Lexicon (cf. Task 3).
  2. Nominal system: digitisation and organisation of the nominal forms given in the Mehri Lexicon, as well as of the data elicited during fieldwork. In particular, this database will make it possible to clarify the following issues, which are still poorly understood (cf. Task 4):
    • the organisation of the Mehri nominal templates
    • the well-known allomorphy (a- ~ h-/ħ- ~ Ø) of the morpheme analysed by Johnstone (1970) as a "definite article"
    • the paradigm of N+Possessive pronouns

Jibbāli

  1. Verbal system. There is no equivalent of the Introduction of the Mehri Lexicon for Jibbāli. We are systematically eliciting verb paradigms in order to collect a representative set. These paradigms will constitute a corpus comparable to the Mehri corpus described above.
  2. Nominal system. Much as in Mehri, Jibbāli nouns and adjectives sometimes bear a prefix that Johnstone (1970) analyses as a definite article. Both the shape of this morpheme and the contexts in which it surfaces are problematic.

Technical aspects

For each step, we follow the recommendations put forth by the TGIR Huma-Num (http://www.huma-num.fr/), formed by merging the former TGE Adonis and TGIR Corpus IR, and more specifically those of the IRCOM consortium (Corpus Oraux et Multimodaux, http://ircom.huma-num.fr/).

Encoding and structure of the data and the metadata

  1. Metadata. We adopt the metadata set defined in OLAC (Open Language Archives Community, http://www.language-archives.org/), which is the standard for linguistic data.
  2. Data structure. The paradigms are encoded in LMF (Lexical Markup Framework, ISO 24613:2008, XML). This format is well suited to encoding the paradigms of a single language. However, the structured representations of XML are less well suited to manipulating data from several languages at once, which is crucial for Task 5. For this reason, we also represent our data in the format developed by Gene Gragg (Univ. of Chicago) for his morphological database, the "Afroasiatic Morphological Archive" (AAMA, http://nelc.uchicago.edu/faculty/gragg).
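
To illustrate the kind of encoding meant here, the following is a minimal sketch in Python of how one paradigm could be serialised as an LMF-style lexical entry. It assumes the common LMF XML serialisation with LexicalEntry, Lemma and WordForm elements carrying feat att/val pairs. The helper names (feat, build_entry), the feature labels and the strings LEMMA, FORM_3MSG and FORM_3FSG are placeholders introduced for illustration only: they are neither the project's actual schema nor attested Mehri or Jibbāli forms.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> child to parent."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def build_entry(lemma, part_of_speech, cells):
    """Build a LexicalEntry whose WordForm children are the paradigm cells.

    cells is a list of (written_form, {feature: value}) pairs.
    """
    entry = ET.Element("LexicalEntry")
    feat(entry, "partOfSpeech", part_of_speech)
    lemma_el = ET.SubElement(entry, "Lemma")
    feat(lemma_el, "writtenForm", lemma)
    for written_form, features in cells:
        word_form = ET.SubElement(entry, "WordForm")
        feat(word_form, "writtenForm", written_form)
        for att, val in features.items():
            feat(word_form, att, val)
    return entry


# Placeholder strings only (assumption): not attested forms.
cells = [
    ("FORM_3MSG", {"person": "3", "gender": "masculine", "number": "singular"}),
    ("FORM_3FSG", {"person": "3", "gender": "feminine", "number": "singular"}),
]
entry = build_entry("LEMMA", "verb", cells)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Each cell of the paradigm becomes one WordForm, so a full paradigm is a single entry with as many WordForm children as there are person, gender and number combinations.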

Storage, long-term archival facilities, access to data

Our electronic corpora will join the Adonis "Grid of services". We will deposit our resources at CoCoON (http://cocoon.huma-num.fr), within a new collection specific to the project.

Interrogation, exploitation

A dedicated tool, in the form of a web application, will be developed and made available to the linguistic community. This involves the design of a query language and the design and development of a web interface for query entry and data display.
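
As a rough indication of what such a query language might look like, here is a minimal sketch in Python of a feature=value filter over paradigm cells stored as flat records. The query syntax, the function names (parse_query, search), the record keys and the placeholder values (LEMMA_A, FORM_1, and so on) are all assumptions made for illustration. The actual query language and data model remain to be designed.

```python
def parse_query(query):
    """Parse 'key=value key=value ...' into a dict of required features."""
    constraints = {}
    for token in query.split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed constraint: {token!r}")
        constraints[key] = value
    return constraints


def search(cells, query):
    """Return the cells that satisfy every constraint in the query."""
    constraints = parse_query(query)
    return [
        cell for cell in cells
        if all(cell.get(key) == value for key, value in constraints.items())
    ]


# Hypothetical flat records; lemmas and forms are placeholders, not data.
cells = [
    {"language": "Mehri", "lemma": "LEMMA_A", "person": "3",
     "number": "singular", "form": "FORM_1"},
    {"language": "Mehri", "lemma": "LEMMA_A", "person": "3",
     "number": "plural", "form": "FORM_2"},
    {"language": "Jibbāli", "lemma": "LEMMA_B", "person": "3",
     "number": "singular", "form": "FORM_3"},
]

hits = search(cells, "person=3 number=singular")
for hit in hits:
    print(hit["language"], hit["lemma"], hit["form"])
```

Because every record carries its language as an ordinary feature, the same query retrieves comparable cells from both languages in one pass, which is the kind of cross-language comparison the web interface is meant to support.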