Vive la Petite Différence!



Language Models

We used the KenLM language modelling toolkit [6] to create two separate 3-gram language models: one for male texts and one for female texts. During classification, the class whose model assigns the higher probability is simply chosen. Even though this method is substantially different from the Vowpal Wabbit classifier, the results were quite similar (67.98 %, {85317f}). That is why we decided to combine both methods with an additional neural network layer. This way, the best result so far was achieved: 71.06 % {12b6d8}.³
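The decision rule described above (one language model per class, pick the class with the higher probability) can be illustrated with a minimal sketch. It does not use KenLM: `ToyTrigramLM` is a hypothetical add-one-smoothed trigram model written only for this illustration, and the two training sentences are toy data, not taken from the corpus.

```python
import math
from collections import Counter


class ToyTrigramLM:
    """Tiny add-one-smoothed trigram model (a stand-in, NOT KenLM)."""

    def __init__(self, sentences):
        self.trigrams = Counter()
        self.bigrams = Counter()
        self.vocab = set()
        for sentence in sentences:
            tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens)
            for i in range(2, len(tokens)):
                self.trigrams[tuple(tokens[i - 2:i + 1])] += 1
                self.bigrams[tuple(tokens[i - 2:i])] += 1

    def log_prob(self, sentence):
        """Natural-log probability of the sentence under the model."""
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        vocab_size = len(self.vocab) + 1  # +1 leaves mass for unseen words
        total = 0.0
        for i in range(2, len(tokens)):
            tri = self.trigrams[tuple(tokens[i - 2:i + 1])]
            bi = self.bigrams[tuple(tokens[i - 2:i])]
            total += math.log((tri + 1) / (bi + vocab_size))
        return total


def classify(text, male_lm, female_lm):
    """Return the class whose language model scores the text higher."""
    return "M" if male_lm.log_prob(text) > female_lm.log_prob(text) else "F"


# Toy training data (assumption: real models are trained on the full corpus).
male_lm = ToyTrigramLM(["jestem bardzo zadowolony"])
female_lm = ToyTrigramLM(["jestem bardzo zadowolona"])
print(classify("jestem bardzo zadowolona", male_lm, female_lm))
```

With real data the two models would be trained with KenLM on the male and female halves of the training set; the comparison step stays the same.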


Morphosyntactic Tags Only

In order to completely avoid topical imbalance in vocabulary, we also trained

classifiers using only morphosyntactic tags. TreeTagger [13] with a Polish model

was used to tag the texts, i.e. each word was assigned its part of speech and

other tags such as person, number, gender, tense.

A classifier was trained with Vowpal Wabbit with the following features:

– part-of-speech 3-grams,

– morphosyntactic tags for a given word treated as one feature,

– morphosyntactic tags for a given word treated as separate features.

The accuracy on the test set for this classifier was 59.17 % {b4e142}.
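As a rough illustration of the three feature types listed above, the sketch below turns one tagged text into a line in Vowpal Wabbit's plain-text input format (`label |namespace feature ...`). The namespace names (`pos3`, `joined`, `single`), the tag strings and the helper `vw_line` are assumptions made for this example; the actual feature encoding used in the experiments is not specified in the text.

```python
def vw_line(label, tagged_tokens):
    """Build one Vowpal Wabbit example.

    label: 1 or -1 (e.g. the two gender classes).
    tagged_tokens: list of (pos, [other morphosyntactic tags]) per word.
    """
    pos_seq = [pos for pos, _ in tagged_tokens]
    # Feature type 1: part-of-speech 3-grams.
    trigrams = ["_".join(pos_seq[i:i + 3]) for i in range(len(pos_seq) - 2)]
    # Feature type 2: all tags of a word glued into a single feature.
    joined = ["+".join([pos] + tags) for pos, tags in tagged_tokens]
    # Feature type 3: every tag of a word as a separate feature.
    separate = [tag for pos, tags in tagged_tokens for tag in [pos] + tags]
    return "{} |pos3 {} |joined {} |single {}".format(
        label, " ".join(trigrams), " ".join(joined), " ".join(separate))


# Hypothetical tag names, used only to show the shape of the output.
example = [("ppron12", ["sg", "nom"]),
           ("fin", ["sg", "pri"]),
           ("adj", ["sg", "nom", "m1"])]
print(vw_line(1, example))
```

Feature names must not contain spaces, `:` or `|`, which is why the tags are joined with `_` and `+` here.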

A similar, but slightly better result (60.58 %) was obtained with 6-gram language models trained on morphosyntactic tags {edfdfc}.

We find these results quite satisfactory, as they were obtained on short texts using only morphosyntactic tags.


Back to the Corpus

The models described in Sect. 4 were analysed to find the most distinctive features. It turned out that a large number of them are actually “leaks” (expressions

which should have been identified as gender-specific and normalised when HSSS

was created but were not):

– gender-specific inflected forms of verbs absent from the lexicon of inflected forms,

– frequent inflected forms written with a spelling mistake (in particular, without

a diacritic),

– verb być (be) with a longer adjective phrase (e.g. jestem bardzo zadowolony/zadowolona = I am very glad),

– the word sam/sama (= myself), which has different masculine and feminine forms and which can also mean himself/herself.

Some other problems were also identified in the balanced corpus: automatically generated spam not filtered out and gender-specific forms found in the film



³ The output files and source code are available for inspection and reproduction in the Git repository git://gonito.net/petite-difference-challenge, branch submission-00085.



F. Graliński et al.


The paper presented research on gender classification performed on a corpus

created in a different way than in the work described so far. Gender annotations

in the corpus were obtained by exploiting certain linguistic features of the Polish

language, rather than by relying on meta-data. Furthermore, for the needs of the

experiments the corpus was balanced by websites, in order to minimize the effect

of gender and topic bias. Training data prepared in this manner is unique at least

for the Polish language.

The developed classification algorithm achieved a maximum gender prediction

accuracy of 71.06 %. The algorithm relied on language modelling (KenLM

toolkit) and the Vowpal Wabbit machine learning system. The two methods were

combined using a neural network. Classification results revealed some noise in

the training data that can and should be filtered out. Nonetheless, a prediction accuracy above 70 % can be viewed as a success, considering the competitiveness of the task.

Future work plans include further filtering of the corpus based on the information obtained during the gender classification task. Classification algorithms

themselves will also be further optimized. This work will be facilitated by the

Gonito.net platform.

Acknowledgements. Work supported by the Polish Ministry of Science and

Higher Education under the National Programme for Development of the

Humanities, grant 0286/NPRH4/H1a/83/2015: “50,000 słów. Indeks tematyczno-chronologizacyjny 1918–1939”.


1. Argamon, S., Koppel, M., Fine, J., Shimoni, A.R.: Gender, genre, and writing style

in formal written texts. TEXT 23, 321–346 (2003)

2. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Mining the blogosphere:

age, gender and the varieties of self-expression. First Mon. 12(9) (2007). http://


3. Bartle, A., Zheng, J.: Gender Classification with Deep Learning (2015)

4. Buck, C., Heafield, K., van Ooyen, B.: N-gram counts and language models from

the common crawl. In: Proceedings of the Language Resources and Evaluation

Conference, Reykjavík, Iceland, May 2014

5. Graliński, F., Borchmann, Ł., Wierzchoń, P.: “He said she said” – male/female corpus of Polish. In: Proceedings of the Language Resources and Evaluation Conference LREC 2016 (2016)

6. Heafield, K.: KenLM: faster and smaller language model queries. In: Proceedings of

the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh,

Scotland, UK pp. 187–197, July 2011. http://kheafield.com/professional/avenue/


7. Kivinen, J., Warmuth, M.K.: Additive versus exponentiated gradient updates for

linear prediction. In: Proceedings of the Twenty-seventh Annual ACM Symposium on Theory of Computing, STOC 1995, pp. 209–218. ACM, New York (1995).




8. Lakoff, R.: Language and woman’s place. Harper colophon books, Harper & Row

(1975). https://books.google.pl/books?id=0dFoAAAAIAAJ

9. Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. In:

Advances in Neural Information Processing Systems, NIPS 2008, vol. 21, pp. 905–

912 (2009)

10. Mukherjee, A., Liu, B.: Improving gender classification of blog authors. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language

Processing, EMNLP 2010, pp. 207–217. Association for Computational Linguistics, Stroudsburg (2010). http://dl.acm.org/citation.cfm?id=1870658.1870679

11. Sarawgi, R., Gajulapalli, K., Choi, Y.: Gender attribution: tracing stylometric evidence beyond topic and genre. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL 2011, pp. 78–86. ACL, Stroudsburg

(2011). http://dl.acm.org/citation.cfm?id=2018936.2018946

12. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of age and gender on blogging. In: Proceedings of AAAI Spring Symposium on Computational

Approaches for Analyzing Weblogs, March 2006

13. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, vol.

12, pp. 44–49 (1994)

14. Yan, X., Yan, L.: Gender classification of weblog authors. In: Proceedings of the

AAAI Spring Symposia on Computational Approaches, pp. 27–29 (2006)

Towards It-CMC: A Fine-Grained POS Tagset

for Italian Linguistic Analysis

Claudio Russo

Foreign Languages and Cultures Department, University of Turin,

Via Verdi 8, 10124 Turin, Italy



Abstract. The present work introduces It-CMC, a fine-grained POS

tagset that aims at combining linguistic accuracy and computational sustainability. It-CMC is tailored on Italian data from Computer-Mediated Communication (CMC) and, across the sections of the paper, a systematic comparison with the current tagset of the La Repubblica corpus

is provided. After an early stage of performance monitoring carried out

with Schmid’s TreeTagger, the tagset is currently involved in a workflow

that aims at creating an Italian parameter file for RFTagger.

Keywords: Morphological tagging · Syntactic tagging

POS tagset · Italian corpora · Linguistic analysis





Recent decades have witnessed a very strong commitment towards POS-tagging

tasks across a constantly increasing number of languages. Within this framework, Italian falls among the lucky well-documented languages: its very first

POS tagset recommendation came from Monachini [11] within the EAGLES

guidelines specification; over the years, some more tagsets have been proposed:

Stein [18] drafted the tagset used in TreeTagger’s first parameter files, a fine-grained tagset has been drafted by Barbera [3,4] for the Corpus Taurinense¹ and the first inductively generated tagset has been designed by [6]; more recently, Petrov et al. [12] included Italian in their universal set of progressively updated POS mappings generated inductively.

Among such resources, probably the best-known tagset among linguists is the one illustrated in Baroni et al. [5]: conceived to tag the La Repubblica newspaper corpus [9], it is currently used as the official Italian tagset of the Sketch

Engine platform as well as being included, sometimes with slight adaptations,

in many other linguistic studies. Despite its deserved success, this tagset still

seems to be too coarse-grained a resource to perform some particular linguistic

analysis: on the one hand, some of its generalizations can be tackled by specific

queries, where the lemma is explicitly specified; on the other hand, uniting several linguistic phenomena under the same tag makes searching the tagged corpus


¹ A POS-tagged corpus of ancient Italian.

© Springer International Publishing Switzerland 2016
P. Sojka et al. (Eds.): TSD 2016, LNAI 9924, pp. 62–73, 2016.
DOI: 10.1007/978-3-319-45510-5_8


a trickier task (although it is an effective solution against data sparseness): the

most evident example of such generalization is the CHE tag, which includes any

token of che, which, from a linguistic standpoint, falls under five different categories (namely interrogative pronoun, exclamative pronoun, relative pronoun,

indefinite pronoun and subordinative conjunction) according to the grammar

published by [17]².

The present work introduces a finer-grained POS tagset that combines linguistic accuracy and computational sustainability. As a tagset tailored on Italian

in Computer-Mediated Communication (CMC), some of its features may be

perceived as too specific: to prevent the computational backfire generated by

this increased complexity, an easy remapping to Petrov’s universal tagset has

been a condition pursued across the tagset’s development processes. The tagset’s

hierarchy-defined features (HDF) will be presented and compared with the tagset

derived from Baroni et al. [5] and currently used in the La Repubblica corpus;

in a second stage, its morpho-syntactic features (MSF) will be listed. The concluding section contains the results of two parallel tagging processes, carried out with two POS taggers: TreeTagger (Schmid [15]) and RFTagger (Schmid and Laws [16]).
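The remapping to the universal tagset mentioned above can be pictured as a plain lookup table. The sketch below is only an illustration: the It-CMC tag numbers are those quoted elsewhere in this paper, but the universal category chosen for each of them is an assumption of this example, not the mapping actually adopted for It-CMC, and `to_universal` is a hypothetical helper.

```python
# The twelve categories of the universal POS tagset by Petrov et al.
UNIVERSAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
                  "ADP", "NUM", "CONJ", "PRT", ".", "X"}

# Assumed, partial mapping from It-CMC tag numbers to universal tags.
IT_CMC_TO_UNIVERSAL = {
    "37": "PRON",  # personal P-D, nominative
    "38": "PRON",  # personal P-D, stressed oblique
    "39": "PRON",  # personal P-D, unstressed oblique
    "45": "ADV",   # general adverb
    "47": "ADV",   # connective adverb
    "51": "CONJ",  # subordinating conjunction
    "56": "ADP",   # simple preposition
    "59": "ADP",   # pseudo-preposition
    "60": "DET",   # definite article
    "61": "DET",   # indefinite article
    "68": "X",     # interjection
    "69": "X",     # emoticon
    "77": "X",     # formula
}


def to_universal(tag):
    """Map an It-CMC tag (with or without MSF slots) to a universal tag."""
    pos = tag.split(".")[0]  # drop the morpho-syntactic slots, if any
    return IT_CMC_TO_UNIVERSAL.get(pos, "X")  # unknown tags fall back to X


print(to_universal("59"), to_universal("45.0.0.0.0.0.0"))
```

Because several fine-grained tags collapse onto one universal tag, the mapping is many-to-one and cannot be inverted, which is the usual trade-off of such remappings.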


The It-CMC Tagset: Introduction and General

Comparison with La Repubblica

The It-CMC tagset counts 66 labels divided into 14 categories. For the sake of

brevity, the theoretical discussion about the terminological differences between

It-CMC and the La Repubblica tagset has been skipped; a mapping table

between the two tagsets is available in Appendix 6.

The two tagsets count 23 equivalent matchings on similar phenomena such

as, for instance, qualifying adjectives, conjunctions and articulated prepositions.

Some La Repubblica tags are missing in the It-CMC tagset, namely the negation

tag NEG, the extralinguistic category NOCAT, the seven VER:. . . :cli tags and

the tag CHE ; the abbreviation phenomena were originally included in Baroni’s

tagset but have been deleted from the current La Repubblica tagset: in this work,

they have been processed as MSFs, according to the workflow in which It-CMC is

involved. On the other hand, the It-CMC tags 77 (Formula) and 69 (Emoticon)

were added, clearly because they proved necessary to tag linguistic data from

CMC; the tag 36 (Relative Pronoun-Determiner) partly covers the CHE -tagged

phenomena: their treatment will be explained in a further, dedicated section of

this work. Only two La Repubblica tags, DET:wh and WH, have been united into

the coarser It-CMC tag 35 (Interrogative Pronoun-Determiners), but most of its

inherent linguistic items usually fall under several categories. La Repubblica’s

tags which underwent the heaviest modifications are the verbal tags VER:fin,

AUX:fin and VER2:fin: each of them has been divided into 9 tags, one for each


² Chosen over the monumental work by Renzi et al. [13] thanks to its wider consensuality, following the recommendations by Leech [10].



mood-tense combination available in Italian. Each of the tags ADV, ART, CON and PRE has been split in two, to deal with general vs. connective adverbs, determinative vs. non-determinative articles, coordinating vs. subordinating conjunctions and simple vs. pseudo-prepositions³.


Analysis of It-CMC and La Repubblica’s Tagset


One-to-One Correspondences and Partial Overlapping

As mentioned in the general presentation, both tagsets include dedicated tags

for 23 phenomena, namely:

Proper and common nouns;

Qualifying adjectives;

Indefinite and possessive pronouns;

Articulated prepositions;

Numeral adjectives/pronouns;

Final and non-final punctuation;

Loan words;


Present and past participle, infinitive forms and gerundive forms of main,

auxiliary and modal verbs.

While such tags correspond at a superficial reckoning, their usage is not

always equivalent across the two corpora: an instance of known discrepancy in

the annotation processes is represented by ordinal numerals (tagged as such with

the It-CMC and as ADJ in La Repubblica).



Pronouns and Determiners Across the Two Tagsets. The tagset presented in Baroni et al. [5], as well as its modified version currently used in the

La Repubblica corpus, categorizes pronouns and determiners in separate sets.

The It-CMC tagset follows a different path and keeps closer to the EAGLES

guidelines, so that pronouns and determiners are merged into the single Pronoun-Determiner category (P-D) [2].

Strong and Weak Demonstrative P-Ds. The It-CMC tagset includes two

contiguous tags for strong (30) and weak (31) demonstrative P-Ds. Such a distinction has been made because of the diastratic variation range of the corpus:

sometimes the informal weak forms ’sto, ’sta, ’sti and ’ste are preferred to their strong counterparts questo, questa, questi and queste. As a newspaper corpus, La Repubblica contains no examples of this particular phenomenon and its two tags, PRO:demo and DET:demo, suitably reflect this sociolinguistic feature.


³ [17] defines pseudo-prepositions as “words that are mainly adverbs used in prepositional function [...]”.


Personal P-Ds and Clitics. Four different tags for personal P-Ds and pronominal clitics have been included in the It-CMC tagset, namely 37 for P-Ds in

nominative case, 38 for stressed P-Ds in oblique case (i.e. any case but nominative), 39 for unstressed P-Ds in oblique case. As a matter of fact, tags 38

and 39 allow users to distinguish accusative (stressed) pronouns from dative

(unstressed) pronouns. According to this classification, the final user would also

be able to distinguish extended dative constructions (1) from contracted ones

(2) with one single query.

1. Questo è il mio cane. Lo affido a te.
This is my dog. I entrust it to you.

2. Questo è il mio cane. Te lo affido.
This is my dog. I entrust it to you.



Tag 37 is applied to P-Ds in nominative case, which do not suffer from

homography with other pronouns; the pronouns in (1) and (2) are tagged with

the broader tag PRO:pers in the La Repubblica corpus.

Ci, vi and ne are also potentially ambiguous because of their adverbial homographic counterparts (tagged in It-CMC with the label 46, particle adverb).

According to their context, these particles can either carry locative value as in

(3) or pronominal personal value, as in (4)

3. Ci [...]
There be-1SG.PRES go.PFV.M.SG yesterday
I went there yesterday

4. Ci [...]
[...] make-PFV.M.PL exit-INF
They let us out

Such phenomena are correctly tagged as CLI according to the tagset of

La Repubblica; nonetheless, such a tag simply supplies a distributional description, since clitics either appear before a verb or attached to it in final position.

Combining a clitic identification phase with the two It-CMC tags (46 and 39) would enable users to look for such distinct phenomena more effectively.

General and Connective Adverbs. Although [17] performed a deeper semantic analysis of the adverbial values (thus identifying seven categories), It-CMC

presents only two continuous tags for general (45) and connective (47) adverbs:

this choice was made mainly to prevent data sparseness, but the implementation

of such information may be taken into account in the near future.



Within the training corpus, adverbs with connective value used in sentence-initial position have been tagged with tag 47, as in (5):

5. Beninteso, [...] male: . . .
Of course nothing of evil: . . .
Of course, there’s nothing bad about it: . . .

Any other adverb has been tagged as general (45). Within La Repubblica’s

current tagset, adverbs are divided into adverbs with -mente endings and any

other adverb. The It-CMC tagset does not provide such a distinction at the

current stage, leaving the search of -mente adverbs to the user, for example by

means of regular expression operators and/or advanced CQL queries.
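The regular-expression route mentioned above can be sketched in a few lines. The example assumes a corpus already available as (word, tag) pairs in which adverbs carry the It-CMC tags 45 or 47; the placeholder tags `N` and `V` and the helper `mente_adverbs` are hypothetical. In a corpus query tool, a CQL pattern along the lines of `[word=".*mente" & tag="45.*"]` would play the same role, assuming the attributes are named `word` and `tag`.

```python
import re

# Words ending in -mente with at least one character before the suffix.
MENTE = re.compile(r"\w+mente", re.IGNORECASE)
ADVERB_TAGS = {"45", "47"}  # general and connective adverbs in It-CMC


def mente_adverbs(tagged_tokens):
    """Return the adverbs ending in -mente from (word, tag) pairs."""
    return [word for word, tag in tagged_tokens
            if tag.split(".")[0] in ADVERB_TAGS and MENTE.fullmatch(word)]


sample = [("Ovviamente", "47"), ("la", "60"), ("mente", "N"),
          ("lavora", "V"), ("lentamente", "45"), ("bene", "45")]
print(mente_adverbs(sample))
```

Filtering on the tag as well as on the suffix keeps nouns such as mente itself out of the results.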

Simple and Pseudo-Prepositions. Another slight modification to the tagset

used in La Repubblica involves the treatment of simple and pseudo-prepositions.

It-CMC tags simple prepositions with the tag 56 and pseudo-prepositions⁴ with the tag 59; within the La Repubblica corpus, instead, there is no such distinction and any non-articulated preposition is tagged as PRE. Tagging simple and

pseudo-prepositions separately may prove useful, as far as linguistic inquiry is

concerned: since pseudo-prepositions involve a much larger number of items than

their counterpart, a dedicated tag can let the user search them selectively by simply inserting the POS in the query. On the other hand, this adverb/preposition

overlapping is usually harder to detect by POS-tagging software and can require

a remarkable number of distinctive items within the training corpus.

Definite vs. Indefinite Articles. The It-CMC tagset includes two tags for

definite (60) and indefinite (61) articles. This higher degree of specification does not burden computational sustainability and, compared to La Repubblica’s single ART label, it allows users to perform quicker queries.

Finite Forms of Main, Auxiliary and Modal Verbs. The most prolific

splitting operation has been performed on La Repubblica’s three broad finite

verb tags. The labels for auxiliary, modal and main verbs’ finite forms (namely

tokens of the indicative, subjunctive, conditional and imperative moods) amount

to 24 distinct tags: each of the three verbal categories counts 4 tags for indicative

verbs, 2 for subjunctive, one for conditional, and one for imperative. On the one

hand, such complexity proves to be problematic, at least in the earliest tagging

stages, because of the formal correspondence of a number of noun and verb

endings in Italian; on the other hand, such label precision is priceless in terms of

linguistic analysis: a more detailed verbal tagging would enable sentence analysis

in terms of hypothetical constructions (whose configuration can use six different

verbal choices), temporal subordination and its narrative development (both


⁴ [17] defines pseudo-prepositions as “[...] words that are mainly adverbs but are used in prepositional function: [...]”.


progressive and regressive), finer analysis of morphological errors in learners’

corpora and many other potential topics.
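The count of 24 finite-verb tags given above follows directly from the per-mood figures; the short sketch below spells out the arithmetic. The label strings it generates (e.g. `main:indicative:1`) are hypothetical placeholders, since the actual It-CMC tag numbers for finite verbs are not listed in this section.

```python
VERB_CLASSES = ("main", "auxiliary", "modal")
# Number of tags per mood, as stated in the text.
TAGS_PER_MOOD = {"indicative": 4, "subjunctive": 2,
                 "conditional": 1, "imperative": 1}


def finite_verb_labels():
    """Enumerate one placeholder label per verb class, mood and tag slot."""
    labels = []
    for verb_class in VERB_CLASSES:
        for mood, count in TAGS_PER_MOOD.items():
            for slot in range(1, count + 1):
                labels.append(f"{verb_class}:{mood}:{slot}")
    return labels


per_class = sum(TAGS_PER_MOOD.values())    # 4 + 2 + 1 + 1
total = len(VERB_CLASSES) * per_class      # three verb classes
print(per_class, total, len(finite_verb_labels()))
```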

As far as modal verbs are concerned, choosing between a present conditional

verbal realization and a present indicative one carries solid social implications⁵:

a single tag VER2:fin, unfortunately, prevents such queries from being carried

out swiftly.

The CHE Tag and Its It-CMC Counterparts. From a linguistic perspective, the CHE tag in the La Repubblica corpus represents the most difficult label

to handle when searching the corpus. The It-CMC proposal includes five distinct

labels, based on [17]: 32 for che as indefinite P-D, 35 for interrogative P-D, 36

for relative P-D, 40 for exclamative P-D and 51 for subordinating conjunction;

some simple examples and the corresponding It-CMC tags are listed in (6)–(10).

6. Ha un che di familiare.
have.3SG.PRES a [...] of familiar.
(He/She) is somewhat familiar. - 32

7. Che [...]
What do.2SG.PRES
What do you do? - 35

8. Ti [...]
[...] thank.1SG.PRES for [...]
I thank you for the gift, which has been appreciated. - 36

9. Che [...]
What boredom!
How boring! - 40

10. Non ha [...] che sia [...]
Not have.3SG.PRES said.PFV.M.SG that be.CONJ.PRES.3SG [...]
(He/She) didn’t say that it would be profitable. - 51

Such variety of labeling surely requires a much larger amount of training for

any statistical POS-tagger, but also has remarkable potential in terms of linguistic queries and might improve the performance of Italian NLP tools mostly


⁵ See Austin [1].

related to parsing and anaphora resolution. The most difficult disambiguation

task lies in separating the tokens of che as a relative P-D from those with conjunctive function; nonetheless, such a distinction might be exploited by NLP

tools to improve the verbs’ referent identification rate, since the subject of the

main proposition is (most likely) different from the subject of the subordinate

one where che has subordinating value.
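The five-way split of CHE can be summarised as a lookup from the linguistic function of che to its It-CMC tag. The sketch below assumes that the function of each che token has already been decided (for instance by a human annotator); the function names and the helper `retag_che` are hypothetical, while the tag numbers are the ones listed above. The La Repubblica tags on the other tokens are illustrative.

```python
# It-CMC tags for the five functions of "che", as listed in the text.
CHE_FUNCTION_TO_TAG = {
    "indefinite": "32",
    "interrogative": "35",
    "relative": "36",
    "exclamative": "40",
    "subordinating_conjunction": "51",
}


def retag_che(tokens):
    """Replace the coarse CHE tag with the matching It-CMC tag.

    tokens: list of (word, tag, function) triples; function is only
    needed for tokens tagged CHE and may be None elsewhere.
    """
    retagged = []
    for word, tag, function in tokens:
        if tag == "CHE":
            tag = CHE_FUNCTION_TO_TAG[function]
        retagged.append((word, tag))
    return retagged


# Example (6) from the text; non-CHE tags are left untouched.
sentence = [("Ha", "VER:fin", None), ("un", "ART", None),
            ("che", "CHE", "indefinite"), ("di", "PRE", None),
            ("familiare", "ADJ", None)]
print(retag_che(sentence))
```

The hard part, deciding between the relative and the conjunctive reading automatically, is exactly what the tagger has to learn and is not addressed by this lookup.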

Excluded Tags. Eleven tags currently used in the La Repubblica corpus have

not been included in It-CMC: ten of such tags are covered by the set VER:. . . :cli, conceived to annotate verbs combined with clitic particles. As stated above, It-CMC is part of a workflow that involves a stage of clitic recognition and separation; after such splittings, clitics have been tagged with the weak personal P-D

tag 39. The remaining excluded tag is the residual tag NOCAT, originally used

to tag non-linguistic elements such as percentage signs, interjections, hyphens

and arithmetic signs.

Newly-Added Tags. It-CMC counts 2 newly added tags which are required

by the nature of the linguistic data collected in the corpus: the first tag, 69,

is dedicated to emoticons, whose notable presence is attested throughout the

corpus; the other one, 77, is used to label formulae, such as arithmetic signs and

bits of programming code. A tag for interjections (INT) had been included in

Baroni [5] and suppressed in the current La Repubblica tagset: as it is useful in

labeling CMC data, the interjection tag 68 has been included in the It-CMC tagset.



It-CMC Tagset: Morpho-Syntactic Features (MSF)

At the current stage, six categories of morphosyntactic information are represented in the It-CMC tagset. The morphosyntactic annotation occupies six slots in a fixed order, with the following structure:

entry POS.person.gender.number.degree.abbreviation.dialectalism

where values are separated by a full stop (.) and each position is filled with a

suitable value from the set presented in Table 1.

During the tagging process, each tag’s appropriate feature is filled in the

corresponding slot; non-pertinent features are marked with 0. The last two slots

(namely abbreviation and dialectalism) only admit values 1 or 0: only when

a particular item appears in an abbreviated form or comes from a dialectal

inventory, does the appropriate value shift to 1; this tagging strategy aims at

covering the ADJ:abbr, NOM:abbr, NPR:abbr and ADV:abbr tags in Baroni [5]⁶, in order to prevent the tagset from causing further computational burden.
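The slot structure described above lends itself to a simple parser. The following sketch assumes that a full tag is written as the POS code followed by the six slots, separated by full stops, with 0 for non-pertinent features and 0/1 flags in the last two slots. The class `MSFTag`, the function `parse_tag` and the sample tag strings are assumptions made for this example; only the compound values 4;5 and 6;7 are taken from the text.

```python
from dataclasses import dataclass

SLOTS = ("person", "gender", "number", "degree",
         "abbreviation", "dialectalism")


@dataclass(frozen=True)
class MSFTag:
    pos: str
    person: str
    gender: str
    number: str
    degree: str
    abbreviation: bool
    dialectalism: bool


def parse_tag(tag):
    """Split 'POS.person.gender.number.degree.abbr.dial' into its slots."""
    parts = tag.split(".")
    if len(parts) != 1 + len(SLOTS):
        raise ValueError(f"expected {1 + len(SLOTS)} fields, got {len(parts)}")
    pos, person, gender, number, degree, abbr, dial = parts
    for flag in (abbr, dial):
        if flag not in ("0", "1"):
            raise ValueError("abbreviation and dialectalism must be 0 or 1")
    return MSFTag(pos, person, gender, number, degree,
                  abbr == "1", dial == "1")


# An abbreviated general adverb (tag 45): no person, gender, number, degree.
print(parse_tag("45.0.0.0.0.1.0"))
# A strong demonstrative P-D (tag 30), common gender, invariant number.
print(parse_tag("30.0.4;5.6;7.0.0.0"))
```

Keeping compound values such as 4;5 as plain strings means the parser does not need to know the value inventory, which may still change as the tagset evolves.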

Sometimes, the mere morphology of some Italian words does not supply

enough information to specify their gender or number⁷: in such instances, the



⁶ Such tags have been suppressed in the current La Repubblica tagset.

⁷ In such cases, gender/number information can be inferred from the word’s context.


Table 1. MSF representation in the It-CMC tagset

Category      Value  Feature
Person        1      pers = 1
              2      pers = 2
              3      pers = 3
Gender        4      gend = masc
              5      gend = fem
              4;5    gend = c
Number        6      numb = sg
              7      numb = pl
              6;7    numb = n
Degree        8      degr = pos
              9      degr = comp
              10     degr = sup
Abbreviation  1      abbr = true
Dialectalism  1      dial = true




tags bear a value of 4;5 to represent common gender and the number slot is filled

with the invariant value 6;7, before undergoing manual disambiguation during

the training sessions. As for the superlative degree of adjectives, the value 10

does not cause any ambiguity, since it appears among MSF separators like any

other value.

An example of a fully-formed tag for a feminine, singular qualifying adjective

in its superlative degree would appear (in its explicit form) as



and in its numeric form as



Tagging Sessions and Tagset Comparison

A TreeTagger parameter file modeled on the La Repubblica tagset has been

used to tag a sub-corpus of general and informal CMC data: in this annotation

process, the output correctly tagged 94.9 % of the tokens. Tagging the same sub-corpus with It-CMC resulted in 93.9 % of the tokens being tagged correctly. Although 98.8 % of the

instances of che have been correctly identified, more tests on samples with wider

syntactic variation are still needed to reliably detect potential data sparseness.

For both taggers, the trickiest textual features are multiple orthographic realizations⁸ and diatopically marked lexical items.
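The percentages reported above are token-level accuracies, i.e. the share of tokens whose predicted tag equals the gold tag. A minimal sketch of that computation, including the per-word variant used for che, is given below; the function names and the toy data (with placeholder tags `V` and `N`) are assumptions for illustration and do not reproduce the evaluation script used in the experiments.

```python
def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def accuracy_for_word(words, gold, predicted, target):
    """Accuracy restricted to the occurrences of one word form."""
    pairs = [(g, p) for w, g, p in zip(words, gold, predicted)
             if w.lower() == target]
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy data: the second "che" is exclamative (40) but predicted as 36.
words = ["Che", "fai", "che", "noia"]
gold = ["35", "V", "40", "N"]
predicted = ["35", "V", "36", "N"]
print(tagging_accuracy(gold, predicted))
print(accuracy_for_word(words, gold, predicted, "che"))
```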


⁸ e.g. é, e and è vs. the standard form è.
