UAIC-RoDiaDepTb: Diachronic Treebank of the Romanian Language

Access to this resource is restricted. To download it, please contact a member of the team.

Authors: Cătălina Mărănduc, Augusto Perez

A balanced treebank. Each document title carries one of four prefixes: CHAT (social media, 2 XML files, 2,500 sentences), CONT (contemporary, 9 XML files, 8,444 sentences), OLD (old language, 33 files, 20,000 sentences) and POP (folklore, 5 XML files, 25,000 sentences). Total size of the resource: 38,600 manually annotated sentences.
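Since the subcorpus is encoded in the document title, it can be read off a treebank `id`. The following is a minimal sketch in Python, assuming the prefix is the text before the first underscore, as in the two ids that appear in the examples on this page (`CONT_1984_orwel`, `OLD_XVI_CORESI_Prav_1560`). The function name and the label strings are illustrative and not part of the resource.

```python
# Subcorpus prefixes as listed in the description of the resource.
SUBCORPORA = {
    "CHAT": "social media",
    "CONT": "contemporary",
    "OLD": "old language",
    "POP": "folklore",
}


def subcorpus_of(treebank_id: str) -> str:
    """Return the subcorpus label for an id such as 'CONT_1984_orwel'.

    Assumption: the prefix is everything before the first underscore.
    """
    prefix = treebank_id.split("_", 1)[0]
    if prefix not in SUBCORPORA:
        raise ValueError(f"unknown subcorpus prefix: {prefix!r}")
    return SUBCORPORA[prefix]


print(subcorpus_of("CONT_1984_orwel"))           # contemporary
print(subcorpus_of("OLD_XVI_CORESI_Prav_1560"))  # old language
```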

Example 1. Contemporary corpus: 1984, George Orwell

<treebank id="CONT_1984_orwel">
 <sentence id="8" parser="" user="augusto" date="2016-05-27">
  <word id="1" form="Pe" lemma="pe" postag="Spsa" head="15" chunk="" deprel="c.c.l."/>
  <word id="2" form="fiecare" lemma="fiecare" postag="Di3-sr" head="3" chunk="" deprel="a.adj."/>
  <word id="3" form="palier" lemma="palier" postag="Ncmsrn" head="1" chunk="" deprel="prep."/>
  <word id="4" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
  <word id="5" form="așezată" lemma="așeza" postag="Vmp--sf-p--r" head="15" chunk="" deprel="el.pred."/>
  <word id="6" form="faţă în faţă" lemma="faţă_în_faţă" postag="Rg" head="5" chunk="" deprel="c.c.l."/>
  <word id="7" form="cu" lemma="cu" postag="Spsa" head="6" chunk="" deprel="c.c.soc."/>
  <word id="8" form="ușa" lemma="ușă" postag="Ncfsry" head="7" chunk="" deprel="prep."/>
  <word id="9" form="liftului" lemma="lift" postag="Ncmsoy" head="8" chunk="" deprel="a.subst."/>
  <word id="10" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
  <word id="11" form="figura" lemma="figură" postag="Ncfsry" head="15" chunk="" deprel="sbj."/>
  <word id="12" form="cea" lemma="cel" postag="Tdfsr" head="13" chunk="" deprel="det."/>
  <word id="13" form="enormă" lemma="enorm" postag="Afpfsrn" head="11" chunk="" deprel="a.adj."/>
  <word id="14" form="îl" lemma="el" postag="Pp3msa--------w" head="15" chunk="" deprel="c.d."/>
  <word id="15" form="privea" lemma="privi" postag="Vmii3s" head="0" chunk=""/>
  <word id="16" form="fix" lemma="fix" postag="Rg" head="15" chunk="" deprel="c.c.m."/>
  <word id="17" form="din" lemma="din" postag="Spca" head="15" chunk="" deprel="c.c.l."/>
  <word id="18" form="perete" lemma="perete" postag="Ncmsrn" head="17" chunk="" deprel="prep."/>
  <word id="19" form="." lemma="." postag="PERIOD" head="15" chunk="" deprel="punct."/>
 </sentence>
 ….
</treebank>
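Files in this format can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to list each word with its governor; it assumes, as the example suggests, that `head` holds the `id` of the governing word within the same sentence and that `head="0"` marks the root, which carries no `deprel`. The `SAMPLE` string is an abridged excerpt (words 11 to 16 of the sentence above), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Abridged excerpt of the sentence above (words 11-16 only).
SAMPLE = """<treebank id="CONT_1984_orwel">
 <sentence id="8" parser="" user="augusto" date="2016-05-27">
  <word id="11" form="figura" lemma="figură" postag="Ncfsry" head="15" chunk="" deprel="sbj."/>
  <word id="12" form="cea" lemma="cel" postag="Tdfsr" head="13" chunk="" deprel="det."/>
  <word id="13" form="enormă" lemma="enorm" postag="Afpfsrn" head="11" chunk="" deprel="a.adj."/>
  <word id="14" form="îl" lemma="el" postag="Pp3msa--------w" head="15" chunk="" deprel="c.d."/>
  <word id="15" form="privea" lemma="privi" postag="Vmii3s" head="0" chunk=""/>
  <word id="16" form="fix" lemma="fix" postag="Rg" head="15" chunk="" deprel="c.c.m."/>
 </sentence>
</treebank>"""


def dependency_edges(xml_text):
    """Return one (dependent form, head form or None, deprel or None) per word."""
    edges = []
    for sentence in ET.fromstring(xml_text).iter("sentence"):
        # Index the words of this sentence by their id attribute.
        words = {w.get("id"): w for w in sentence.iter("word")}
        for w in words.values():
            head = words.get(w.get("head"))  # None when head="0" (root)
            head_form = head.get("form") if head is not None else None
            edges.append((w.get("form"), head_form, w.get("deprel")))
    return edges


for form, head_form, deprel in dependency_edges(SAMPLE):
    print(f"{form} <- {head_form or 'ROOT'} ({deprel or 'root'})")
```

Indexing the words by `id` per sentence keeps the lookup local, since word ids restart in every `sentence` element.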

Example 2. Old-language corpus, 16th century: Pravila lui Coresi, 1560

<treebank id="OLD_XVI_CORESI_Prav_1560">
 <sentence id="2" parser="Victoria&apos;s parser" user="ugla" date="2020-27-23">
  <word id="1" form="Nu" lemma="nu" postag="Qz" head="2" chunk="" deprel="neg."/>
  <word id="2" form="priimeşti" lemma="priimeşti" postag="Vmip2s" head="0" chunk=""/>
  <word id="3" form="Dumnezeu" lemma="Dumnezeu" postag="Npmsrn" head="2" chunk="" deprel="sbj."/>
  <word id="4" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
  <word id="5" form="ce" lemma="ce" postag="Ccssp" head="2" chunk="" deprel="coord."/>
  <word id="6" form="priimeaşte" lemma="primi" postag="Vmip3s" head="5" chunk="" deprel="coord."/>
  <word id="7" form="Dumnezeul" lemma="Dumnezeu" postag="Npmsry" head="6" chunk="" deprel="sbj."/>
  <word id="8" form="acela" lemma="acela" postag="Dd3msr---o" head="6" chunk="" deprel="c.d."/>
  <word id="9" form="ce" lemma="ce" postag="Pw3--r" head="8" chunk="" deprel="a.vb."/>
  <word id="10" form="se" lemma="sine" postag="Px3--a--------w" head="11" chunk="" deprel="refl."/>
  <word id="11" form="roagă" lemma="ruga" postag="Vmip3s" head="9" chunk="" deprel="subord."/>
  <word id="12" form="bine" lemma="bine" postag="Rg" head="11" chunk="" deprel="c.c.m."/>
  <word id="13" form="." lemma="." postag="PERIOD" head="2" chunk="" deprel="punct."/>
 </sentence>
 ….
</treebank>
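In both examples every `word` points to its governor through `head`, and exactly one word per sentence has `head="0"`. A consistency check along those lines might look like the sketch below. It is not a tool shipped with the resource: the function name and the three checks (single root, existing heads, no cycles) are assumptions about what a well-formed sentence should satisfy. The `CORESI_HEADS` mapping copies the `id` and `head` values of the sentence above.

```python
# id -> head for the thirteen words of the Coresi sentence above.
CORESI_HEADS = {1: 2, 2: 0, 3: 2, 4: 5, 5: 2, 6: 5, 7: 6,
                8: 6, 9: 8, 10: 11, 11: 9, 12: 11, 13: 2}


def check_heads(heads):
    """Return a list of problems found in a {word_id: head_id} mapping.

    Assumption: head 0 marks the sentence root.
    """
    problems = []
    roots = [w for w, h in heads.items() if h == 0]
    if len(roots) != 1:
        problems.append(f"expected 1 root, found {len(roots)}")
    for w, h in heads.items():
        if h != 0 and h not in heads:
            problems.append(f"word {w}: head {h} does not exist")
    # Follow the head chain from every word; revisiting a node means a cycle.
    for start in heads:
        seen = set()
        node = start
        while node in heads:
            if node in seen:
                problems.append(f"word {start}: cycle through {node}")
                break
            seen.add(node)
            node = heads[node]
    return problems


print(check_heads(CORESI_HEADS))  # []
```

An empty list means the sentence forms a single tree rooted at the word whose `head` is 0.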

  1. Perez, C.-A., Linguistic Resources for Natural Language Processing, PhD thesis, Al. I. Cuza University, Iași, 2014.
  2. Perez, C.-A., A Syntactically Annotated Treebank Corpus for the Romanian Language, in the 14th International Conference of the Department of Linguistics, organized by the Faculty of Letters, University of Bucharest, 2014.
  3. Perez, C.-A., C. Mărănduc, R. Simionescu, Including Social Media, a Very Dynamic Style, in the Corpora for Processing Romanian Language, in Proceedings of EUROLAN 2015, Springer Publishing, Switzerland, pp. 139-153, 2016. https://link.springer.com/chapter/10.1007/978-3-319-32942-0_10
  4. Perez, C.-A., C. Mărănduc, R. Simionescu, Social Media – Processing Romanian Chats and Discourse Analysis, Computación y Sistemas 20, 3, pp. 404-414, 2016. http://dx.doi.org/10.13053/cys-20-3-2453
  5. Mărănduc, C., C.-A. Perez, A Resource for the Written Romanian: the UAIC Dependency Treebank, in Proceedings of ConsILR, Mălini, 27-29 Oct., pp. 79-90, 2016.
  6. Mărănduc, C., F. Hociung, V. Bobicev, Treebank Annotator for multiple formats and conventions, in Proceedings of the 4th Conference of Mathematical and Computer Science Society of the Republic of Moldova, pp. 529-534, 2017.
  7. Mărănduc, C., L. Malahov, C.-A. Perez, A. Colesnicov, RoDia project of a regional and historical corpus for Romanian, in Proceedings of MFOI, Chișinău, pp. 268-284, 2016.
  8. Mărănduc, C., V. Bobicev, R. Untilov, Morpho-Syntactic Regularities in UD_Romanian-Nonstandard Parsing, in Proceedings of ConsILR, Cluj, 18-20 Nov. 2019, Iași, Al. I. Cuza University Publishing House, 2019.
