Access to this resource is restricted. To download it, contact a member of the team.
Authors: Cătălina Mărănduc, Augusto Perez
A balanced treebank. Document titles carry one of four prefixes: CHAT (social media, 2 XML files), 2,500 sentences; CONT (contemporary, 9 XML files), 8,444 sentences; OLD (old language, 33 files), 20,000 sentences; and POP (folklore, 5 XML files), 25,000 sentences. Total size of the resource: 38,600 manually annotated sentences.
Example 1. Contemporary corpus: 1984, George Orwell
<treebank id="CONT_1984_orwel">
<sentence id="8" parser="" user="augusto" date="2016-05-27">
<word id="1" form="Pe" lemma="pe" postag="Spsa" head="15" chunk="" deprel="c.c.l."/>
<word id="2" form="fiecare" lemma="fiecare" postag="Di3-sr" head="3" chunk="" deprel="a.adj."/>
<word id="3" form="palier" lemma="palier" postag="Ncmsrn" head="1" chunk="" deprel="prep."/>
<word id="4" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
<word id="5" form="așezată" lemma="așeza" postag="Vmp--sf-p--r" head="15" chunk="" deprel="el.pred."/>
<word id="6" form="faţă în faţă" lemma="faţă_în_faţă" postag="Rg" head="5" chunk="" deprel="c.c.l."/>
<word id="7" form="cu" lemma="cu" postag="Spsa" head="6" chunk="" deprel="c.c.soc."/>
<word id="8" form="ușa" lemma="ușă" postag="Ncfsry" head="7" chunk="" deprel="prep."/>
<word id="9" form="liftului" lemma="lift" postag="Ncmsoy" head="8" chunk="" deprel="a.subst."/>
<word id="10" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
<word id="11" form="figura" lemma="figură" postag="Ncfsry" head="15" chunk="" deprel="sbj."/>
<word id="12" form="cea" lemma="cel" postag="Tdfsr" head="13" chunk="" deprel="det."/>
<word id="13" form="enormă" lemma="enorm" postag="Afpfsrn" head="11" chunk="" deprel="a.adj."/>
<word id="14" form="îl" lemma="el" postag="Pp3msa--------w" head="15" chunk="" deprel="c.d."/>
<word id="15" form="privea" lemma="privi" postag="Vmii3s" head="0" chunk=""/>
<word id="16" form="fix" lemma="fix" postag="Rg" head="15" chunk="" deprel="c.c.m."/>
<word id="17" form="din" lemma="din" postag="Spca" head="15" chunk="" deprel="c.c.l."/>
<word id="18" form="perete" lemma="perete" postag="Ncmsrn" head="17" chunk="" deprel="prep."/>
<word id="19" form="." lemma="." postag="PERIOD" head="15" chunk="" deprel="punct."/>
</sentence>
….
</treebank>
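As an illustration of how a file in this format might be read, the following minimal Python sketch loads a sentence with the standard library and lists the direct dependents of its root. The sample is an excerpt of Example 1 (words 11 to 16), retyped with straight ASCII quotes and plain hyphens in the `postag` values so that it parses as XML. The function names `read_sentences`, `find_root`, and `dependents` are hypothetical, and treating the word with `head="0"` as the sentence root is an assumption inferred from the examples, not a documented rule of the resource.

```python
import xml.etree.ElementTree as ET

# Excerpt of Example 1 (words 11-16 only), retyped with ASCII quotes.
SAMPLE = """<treebank id="CONT_1984_orwel">
<sentence id="8" parser="" user="augusto" date="2016-05-27">
<word id="11" form="figura" lemma="figură" postag="Ncfsry" head="15" chunk="" deprel="sbj."/>
<word id="12" form="cea" lemma="cel" postag="Tdfsr" head="13" chunk="" deprel="det."/>
<word id="13" form="enormă" lemma="enorm" postag="Afpfsrn" head="11" chunk="" deprel="a.adj."/>
<word id="14" form="îl" lemma="el" postag="Pp3msa--------w" head="15" chunk="" deprel="c.d."/>
<word id="15" form="privea" lemma="privi" postag="Vmii3s" head="0" chunk=""/>
<word id="16" form="fix" lemma="fix" postag="Rg" head="15" chunk="" deprel="c.c.m."/>
</sentence>
</treebank>"""


def read_sentences(xml_text):
    """Yield (sentence id, list of word attribute dicts) for each sentence."""
    root = ET.fromstring(xml_text)
    for sent in root.iter("sentence"):
        yield sent.get("id"), [dict(w.attrib) for w in sent.iter("word")]


def find_root(words):
    """Return the single word whose head is "0" (assumed to mark the root)."""
    roots = [w for w in words if w["head"] == "0"]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]


def dependents(words, head_id):
    """Return the words attached directly to head_id, in document order."""
    return [w for w in words if w["head"] == head_id]


for sent_id, words in read_sentences(SAMPLE):
    root = find_root(words)
    print(f"sentence {sent_id}: root = {root['form']}")
    for w in dependents(words, root["id"]):
        # The root itself has no deprel attribute in the samples, so use .get().
        print(f"  {w['form']} ({w.get('deprel', '')})")
```

Attribute values are kept as strings here, so ids and heads are compared as text; converting them to integers is only needed when they are used for arithmetic or sorting.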
Example 2. Old-language corpus, 16th century, Pravila lui Coresi, 1560
<treebank id="OLD_XVI_CORESI_Prav_1560">
…
<sentence id="2" parser="Victoria's parser" user="ugla" date="2020-27-23">
<word id="1" form="Nu" lemma="nu" postag="Qz" head="2" chunk="" deprel="neg."/>
<word id="2" form="priimeşti" lemma="priimeşti" postag="Vmip2s" head="0" chunk=""/>
<word id="3" form="Dumnezeu" lemma="Dumnezeu" postag="Npmsrn" head="2" chunk="" deprel="sbj."/>
<word id="4" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
<word id="5" form="ce" lemma="ce" postag="Ccssp" head="2" chunk="" deprel="coord."/>
<word id="6" form="priimeaşte" lemma="primi" postag="Vmip3s" head="5" chunk="" deprel="coord."/>
<word id="7" form="Dumnezeul" lemma="Dumnezeu" postag="Npmsry" head="6" chunk="" deprel="sbj."/>
<word id="8" form="acela" lemma="acela" postag="Dd3msr---o" head="6" chunk="" deprel="c.d."/>
<word id="9" form="ce" lemma="ce" postag="Pw3--r" head="8" chunk="" deprel="a.vb."/>
<word id="10" form="se" lemma="sine" postag="Px3--a--------w" head="11" chunk="" deprel="refl."/>
<word id="11" form="roagă" lemma="ruga" postag="Vmip3s" head="9" chunk="" deprel="subord."/>
<word id="12" form="bine" lemma="bine" postag="Rg" head="11" chunk="" deprel="c.c.m."/>
<word id="13" form="." lemma="." postag="PERIOD" head="2" chunk="" deprel="punct."/>
</sentence>
….
</treebank>
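Because every word points to its head by id, a sentence can be checked for structural consistency before further processing. The sketch below is one possible check, not part of the resource's tooling: it follows the head chain from each word of the Example 2 sentence up to `0` and reports how many arcs that takes. The sample is retyped with straight ASCII quotes and plain hyphens in the `postag` values; the names `load_heads` and `depth` are hypothetical, and reading `head="0"` as the root marker is an assumption drawn from the examples.

```python
import xml.etree.ElementTree as ET

# The sentence from Example 2, retyped with ASCII quotes.
SENTENCE = """<sentence id="2" parser="Victoria's parser" user="ugla" date="2020-27-23">
<word id="1" form="Nu" lemma="nu" postag="Qz" head="2" chunk="" deprel="neg."/>
<word id="2" form="priimeşti" lemma="priimeşti" postag="Vmip2s" head="0" chunk=""/>
<word id="3" form="Dumnezeu" lemma="Dumnezeu" postag="Npmsrn" head="2" chunk="" deprel="sbj."/>
<word id="4" form="," lemma="," postag="COMMA" head="5" chunk="" deprel="punct."/>
<word id="5" form="ce" lemma="ce" postag="Ccssp" head="2" chunk="" deprel="coord."/>
<word id="6" form="priimeaşte" lemma="primi" postag="Vmip3s" head="5" chunk="" deprel="coord."/>
<word id="7" form="Dumnezeul" lemma="Dumnezeu" postag="Npmsry" head="6" chunk="" deprel="sbj."/>
<word id="8" form="acela" lemma="acela" postag="Dd3msr---o" head="6" chunk="" deprel="c.d."/>
<word id="9" form="ce" lemma="ce" postag="Pw3--r" head="8" chunk="" deprel="a.vb."/>
<word id="10" form="se" lemma="sine" postag="Px3--a--------w" head="11" chunk="" deprel="refl."/>
<word id="11" form="roagă" lemma="ruga" postag="Vmip3s" head="9" chunk="" deprel="subord."/>
<word id="12" form="bine" lemma="bine" postag="Rg" head="11" chunk="" deprel="c.c.m."/>
<word id="13" form="." lemma="." postag="PERIOD" head="2" chunk="" deprel="punct."/>
</sentence>"""


def load_heads(xml_text):
    """Map each word id to its head id, both as integers."""
    sent = ET.fromstring(xml_text)
    return {int(w.get("id")): int(w.get("head")) for w in sent.iter("word")}


def depth(heads, word_id):
    """Count the arcs from word_id up to the artificial root 0.

    Raises ValueError if a head id is missing or the chain loops.
    """
    steps, seen = 0, set()
    while word_id != 0:
        if word_id in seen:
            raise ValueError(f"cycle through word {word_id}")
        if word_id not in heads:
            raise ValueError(f"head {word_id} is not a word in the sentence")
        seen.add(word_id)
        word_id = heads[word_id]
        steps += 1
    return steps


heads = load_heads(SENTENCE)
depths = {word_id: depth(heads, word_id) for word_id in heads}
roots = [word_id for word_id, head in heads.items() if head == 0]
print("root ids:", roots)
print("maximum depth:", max(depths.values()))
```

If `depth` returns for every word and exactly one id appears in `roots`, the annotation forms a single tree; a `ValueError` points to the word where the head chain breaks.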
- Perez, C.-A., Linguistic Resources for Natural Language Processing, PhD thesis, Al. I. Cuza University, Iași, 2014.
- Perez, C.-A., A Syntactically Annotated Treebank Corpus for the Romanian Language, in the 14th International Conference of the Department of Linguistics, organized by the Faculty of Letters, University of Bucharest, 2014.
- Perez, C.-A., C. Mărănduc, R. Simionescu, Including Social Media, a Very Dynamic Style, in the Corpora for Processing Romanian Language, in Proceedings at EUROLAN 2015. Springer Publishing, Switzerland, 139–153, 2016. https://link.springer.com/chapter/10.1007/978-3-319-32942-0_10
- Perez, C.-A., C. Mărănduc, R. Simionescu, Social Media – Processing Romanian Chats and Discourse Analysis, Computación y Sistemas 20, 3, 404–414, 2016. http://dx.doi.org/10.13053/cys-20-3-2453.
- Mărănduc, C., C.-A. Perez, A Resource for the Written Romanian: the UAIC Dependency Treebank, in Proceedings of ConsILR, Mălini, 27-29 Oct. pp. 79-90, 2016.
- Mărănduc, C., F. Hociung, V. Bobicev, Treebank Annotator for multiple formats and conventions, in Proceedings of the 4th Conference of Mathematical and Computer Science Society of the Republic of Moldova, pp. 529-534, 2017.
- Mărănduc, C., L. Malahov, C.-A. Perez, A. Colesnicov, RoDia project of a regional and historical corpus for Romanian, in Proceedings of MFOI, Chișinău, pp. 268-284, 2016.
- Mărănduc C., V. Bobicev, R. Untilov, Morpho-Syntactic Regularities in UD_Romanian-Nonstandard Parsing, in Proceedings of ConsILR, Cluj, 18-20 Nov. 2019, Iași, Al. I. Cuza University Publishing House, 2019.