1 Morphological Analysis, Tagging, and Guessing
a named entity is left to the tagger. With this improvement, the number of
unknown words decreases; such words are processed by the guesser.
We evaluated the new processing pipeline on a small, “clean” (i.e. containing
mostly edited texts) corpus, DESAM. Originally, DESAM contained
1,173,835 tokens, including 722 multi-word tokens. We selected DESAM for two
reasons: it has annotation of MWEs, and it was the only resource with marked
foreign words. The disadvantage of DESAM is its cleanness – we expected
English mixing to occur less frequently in DESAM than in web corpora.
We retokenized the DESAM data with the new pipeline (tokenization including
MWEs, foreign chunk detection, analysis, tagging, guessing). We reduced the
number of tokens processed by the guesser from 28,120 to 10,649 (4,094 tokens
were found in gazetteers, 12,345 were detected as foreign, and 1,032 were
annotated as parts of MWEs).
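The reduction described above can be viewed as a filtering cascade applied to tokens unknown to the morphological analyzer. The following is a schematic sketch only; the function names, example tokens, and the foreign-word test are invented for illustration and do not reproduce the actual pipeline code.

```python
# Illustrative sketch of the unknown-token reduction cascade: tokens unknown
# to the analyzer are checked against gazetteers, a foreign-chunk detector,
# and MWE annotations; only the remainder reaches the guesser.

def route_unknown_tokens(tokens, gazetteers, is_foreign, mwe_spans):
    """Split analyzer-unknown tokens into resolution categories."""
    mwe_positions = {i for start, end in mwe_spans for i in range(start, end)}
    routed = {"gazetteer": [], "foreign": [], "mwe": [], "guesser": []}
    for i, tok in enumerate(tokens):
        if tok in gazetteers:
            routed["gazetteer"].append(tok)
        elif is_foreign(tok):
            routed["foreign"].append(tok)
        elif i in mwe_positions:
            routed["mwe"].append(tok)
        else:
            routed["guesser"].append(tok)  # only these reach the guesser
    return routed

routed = route_unknown_tokens(
    ["Brno", "download", "a", "priori", "xyzzy"],
    gazetteers={"Brno"},
    is_foreign=lambda t: t == "download",   # stand-in for the real detector
    mwe_spans=[(2, 4)],                     # "a priori" annotated as one MWE
)
print(routed["guesser"])  # only the truly unknown token remains
```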
We compared MWEs from the original (semi-automatically annotated)
DESAM (409 unique MWEs) with the newly annotated one (410 unique MWEs):
50 MWEs were the same. The disagreement is caused by the unclear definition
of a MWE. In the semi-automatically annotated DESAM, MWEs were law numbers,
foreign phrases, and multi-word named entities. In the newly annotated DESAM,
most MWEs were multi-word named entities.
In the next stage, we focused on frequent English–Czech homographs. Table 3
shows that our approach sometimes identifies a Czech word as foreign,
but in many cases (in bold in the table) it can identify English homographs with
100% accuracy. The number of tokens in DESAM processed with the standard
pipeline is sometimes higher than the number of tokens processed with our new
pipeline because some tokens became part of a MWE (e.g. a priori) and are no
longer annotated separately. We evaluated manually how many tokens annotated
as foreign were really foreign, and for some homographs we also evaluated how
many tokens annotated as native were really native.
Conclusion and Future Work
We used a complex approach to corpus annotation of Czech texts with language
mixing. We modiﬁed the standard processing pipeline in order to detect MWEs
and foreign language chunks. We used collections of MWEs and gazetteers to
reduce dramatically the number of unknown tokens.

Table 3. Number of interlingual homographs in DESAM annotated by the standard
pipeline (2nd column) and the new pipeline (3rd column), the number of detected
foreign tokens, and the number of correct annotations of foreign tokens (5th column)
and native tokens (7th column). The homographs are ordered by their length (2, 3,
and 4 letters). A dash means that the correct annotation was not evaluated.

Token | Std DESAM | New DESAM | Annotated as foreign | Foreign correct | Annotated as native | Native correct

This reduction can lead to a more appropriate use of the guesser. To our
knowledge, this work is the first
attempt to annotate Czech texts with language mixing. We plan to use our new
pipeline for the annotation of a web corpus of Czech, where we expect the
phenomenon of language mixing to be more significant than in DESAM.
Acknowledgments. The research leading to these results has received funding from
the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth
and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project
7F14047. This work has been partly supported by the Ministry of Education of CR
within the national COST-CZ project LD15066.
References

1. Auer, P.: From codeswitching via language mixing to fused lects. Int. J. Bilingualism 3(4), 309–332 (1999)
2. Alex, B.: Automatic detection of English inclusions in mixed-lingual data with an application to parsing. Ph.D. thesis, School of Informatics, University of Edinburgh, Edinburgh (2008)
3. Schäfer, R., Bildhauer, F.: Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Francisco (2013)
4. Alex, B., Dubey, A., Keller, F.: Using foreign inclusion detection to improve parsing
performance. In: Proceedings of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL), pp. 151–160 (2007)
5. Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Chlumská, L., Jelínek, T., Kováříková, D., Petkevič, V., Procházka, P., Skoumalová, H., Škrabal, M., Truneček, P., Vondřička, P., Zasina, A.: SYN2015: reprezentativní korpus psané češtiny [SYN2015: Representative Corpus of Written Czech] (2015)
6. Suchomel, V., Pomikálek, J.: Efficient web crawling for large text corpora. In: Kilgarriff, A., Sharoff, S. (eds.) Proceedings of the Seventh Web as Corpus Workshop (WAC7), Lyon, pp. 39–43 (2012)
7. Baldwin, T., Lui, M.: Language identiﬁcation: the long and the short of the matter.
In: Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, HLT 2010,
pp. 229–237. Association for Computational Linguistics, Stroudsburg (2010)
8. Lui, M., Baldwin, T.: langid.py: an off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25–30. Association for Computational Linguistics, Stroudsburg (2012)
9. Eskander, R., Al-Badrashiny, M., Habash, N., Rambow, O.: Foreign words and the automatic processing of Arabic social media text written in Roman script. In: Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 1–12. Association for Computational Linguistics, Doha (2014)
10. Ahmed, B.U.: Detection of foreign words and names in written text. Ph.D. thesis.
Pace University, New York, NY, USA (2005). AAI3172339
11. Hlaváčová, J.: Morphological guesser of Czech words. In: Matoušek, V., Mautner, P., Mouček, R., Tauser, K. (eds.) TSD 2001. LNCS (LNAI), vol. 2166, pp. 70–75. Springer, Heidelberg (2001)
12. Hana, J., Zeman, D., Hajič, J., Hanová, H., Hladká, B., Jeřábek, E.: Manual for morphological annotation, revision for the Prague Dependency Treebank 2.0. Technical report TR-2005-27, ÚFAL MFF UK, Prague, Czech Rep. (2005)
13. Šmerk, P., Sojka, P., Horák, A.: Towards Czech morphological guesser. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008, pp. 1–4. Masarykova univerzita, Brno (2008)
14. Nevěřilová, Z.: Annotation of multi-word expressions in Czech texts. In: Horák, A., Rychlý, P. (eds.) Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 103–112. Tribun EU, Brno (2015)
15. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V.: The TenTen corpus family. In: 7th International Corpus Linguistics Conference, CL 2013, pp. 125–127. Lancaster (2013)
16. Šmerk, P.: Fast morphological analysis of Czech. In: Proceedings of the RASLAN Workshop 2009. Masarykova univerzita, Brno (2009)
17. Šmerk, P.: K morfologické desambiguaci češtiny [Towards morphological disambiguation of Czech]. Thesis proposal, Masaryk University (2008)
18. Rychlý, P., Šmerk, P., Pala, K.: DESAM – morfologicky označkovaný korpus českých textů [DESAM – morphologically annotated corpus of Czech texts]. Technical report, Masaryk University (2010)
Evaluation and Improvements in Punctuation
Detection for Czech
Vojtěch Kovář1(B), Jakub Machura1, Krist…á1, and Michal Rott2
1 NLP Centre, Faculty of Informatics, Masaryk University,
Botanická 68a, 602 00 Brno, Czech Republic
2 Institute of Information Technology and Electronics,
Technical University of Liberec, Studentská 2, 461 17 Liberec, Czech Republic
Abstract. Punctuation detection and correction is among the hardest
automatic grammar checking tasks for the Czech language. The paper
compares the available grammar and punctuation correction programs on
several data sets. It also describes a set of improvements to one of the
available tools, leading to significantly better recall as well as precision.
Punctuation detection is one of the important tasks in automatic grammar
checking, especially for the Czech language. However, it is also one of the most
difficult, unlike e.g. correcting simple spelling errors.
Czech, a free-word-order language with rich morphology, has a complex
system of writing commas in sentences – partly because the language norm
defines it in a very complicated and unintuitive way, partly because commas
often affect the semantics of the utterances. Comma placement is based on the
linguistic structure of the sentences, and even native speakers of Czech,
including educated people, often have problems with it.
There are several automatic tools that partly solve the problem: two commercial
grammar checkers for Czech (which also try to correct different types of
errors, but here we exploit their punctuation correction features only) and some
academic projects mainly focused on correcting punctuation. We list and briefly
describe them in the next section.
This paper contains two important results: the first is a significant accuracy
improvement of one of the open-source academic tools, the SET system
[6,7]; the other is a comprehensive comparison and evaluation of all the
available tools, which has been missing for this task so far.
The structure of the paper is as follows: the next section briefly describes
past work on the punctuation detection problem. Then we describe punctuation
detection in the SET tool and our improvements to it. Section 5
presents a comparison and evaluation of all the available tools.
© Springer International Publishing Switzerland 2016
P. Sojka et al. (Eds.): TSD 2016, LNAI 9924, pp. 287–294, 2016.
DOI: 10.1007/978-3-319-45510-5_33
The two mentioned commercial grammar checking systems are:
– Grammar checker built into Microsoft Office, developed by the Institute
of the Czech Language [9,11],
– Grammaticon checker created by the Lingea company [8].
Both systems aim at manually describing erroneous constructions and at minimizing the number of false alerts. From the available comparisons [1,10,11], it
seems that the Grammar checker generally outperforms Grammaticon; however,
the testing data are rather small and present only general results, whereas
we are interested purely in punctuation corrections.
The following systems emerged from the academic world:
– Holan et al. [3] proposed using automatic dependency parsing for punctuation
detection, but the result is only a prototype and not usable in practice.
– Jakubíček and Horák [5] exploit the grammar of the Synt parser [4] to detect
punctuation, with both precision and recall slightly over 80 percent. The tool
is unfortunately not operational; otherwise, it would be very interesting to
include it in our comparison.
– Boháč et al. [2] used punctuation detection for post-processing of automatic
speech recognition. In their approach, a set of N-gram rules (with N up to 4,
including the commas) was statistically induced from training Czech corpora of news texts and books. Application of these rules is implemented via
weighted finite-state transducers1 which enable both inserting a comma and
suppressing the insertion by a more specific rule. We have included two versions
of this system in our comparison – the original one, as referenced above, and
a recent one – further referred to as FST 1 and FST 2.
– Kovář [6] reported on using the open-source SET parser [7] for punctuation
detection, with promising results. Building on the existing grammar, we have
significantly improved it. In the next sections, we describe the principle of
punctuation detection within the SET parser and some of our important
improvements.
As for the comparison presented in Sect. 5, there is no similar work, to the
best of our knowledge.
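The N-gram rule application used by the FST systems above can be illustrated with a simplified sketch. A plain dictionary stands in for the weighted finite-state transducers of the real system, and all the rules shown are invented examples; only the "longer rule suppresses shorter rule" behaviour mirrors the description.

```python
# Simplified sketch of N-gram comma insertion. A rule maps a word n-gram
# ending at the current word to a decision: insert a comma before that word
# or not. Longer (more specific) rules override shorter ones, mirroring the
# "suppress the insertion by a more specific rule" behaviour.
RULES = {
    ("vím",): False,
    ("vím", "že"): True,            # "vím , že" – comma before "že"
    ("řekl", "vím", "že"): False,   # hypothetical more specific suppression
}

def insert_commas(words, rules, max_n=4):
    out = []
    for i, w in enumerate(words):
        # try the longest context first; the first matching rule wins
        for n in range(max_n, 0, -1):
            if i + 1 < n:
                continue
            context = tuple(words[i + 1 - n:i + 1])
            if context in rules:
                if rules[context]:
                    out.append(",")
                break
        out.append(w)
    return out

print(insert_commas(["vím", "že", "přijde"], RULES))
```

With the suppression rule present, the sequence "řekl vím že" gets no comma, while "vím že" alone does.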
Punctuation Detection with SET – Initial State
The SET Parser
The SET system2 is an open-source pattern matching parser designed for natural languages [7]. It contains manually created sets of patterns (“grammars”)
2 SET is an abbreviation of “syntactic engineering tool”.
for Czech, Slovak, and English – in the process of analysis, these patterns are
searched for within the sentence and compete with each other to build a syntactic
tree. The primary output of the parser is a so-called “hybrid tree” that combines
dependency and phrase structure features.
Before the syntactic analysis, the text has to be tokenized and tagged – for
this purpose, we used the unitok tokenizer [13] and the desamb tagger for Czech
[14] in all our experiments.
Punctuation Detection with SET
For the purpose of punctuation detection, a special grammar has been developed
by Kovář [6], producing pseudo-syntactic trees that only mark where a comma
should, or should not, be placed (by hanging the particular word under a comma
or no-comma node), as illustrated in Fig. 1.
Fig. 1. One of the existing punctuation detection rules in SET, and its realization on
a sample Czech sentence: “Neví na jaký úřad jít.” (missing comma before “na” –
“(He) does not know what bureau to go to.”) The rule matches a preposition (k7) followed
by a relative pronoun (k3.*y[RQ]) or adverb (k6.*y[RQ]), not preceded by a preposition
or conjunction (k8) or a relative pronoun/adverb and a few other selected words (the tag
not and word not lines express a negative condition – the token must not match any of the
listed items). The Ajka morphological tagset is used [12]; the example is borrowed from [6].
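The rule in Fig. 1 can be approximated as pattern matching over morphological tags. The sketch below is only an approximation: the real SET grammar has its own rule format, and the concrete Ajka tags attached to the sample words are illustrative assumptions following the conventions described in the caption.

```python
import re

# Approximation of the Fig. 1 rule as tag-pattern matching. Each token is a
# (word, tag) pair; the tags imitate the Ajka tagset from the caption.

def comma_before(tokens, i):
    """Suggest a comma before position i: a preposition followed by a
    relative pronoun/adverb, not preceded by k7/k8 or another relative."""
    if i == 0 or i + 1 >= len(tokens):
        return False
    prev_tag = tokens[i - 1][1]
    tag, next_tag = tokens[i][1], tokens[i + 1][1]
    is_rel = lambda t: re.match(r"k[36].*y[RQ]", t) is not None
    if not tag.startswith("k7") or not is_rel(next_tag):
        return False
    # negative condition: the preceding token must not match either pattern
    if prev_tag.startswith(("k7", "k8")) or is_rel(prev_tag):
        return False
    return True

# "Neví na jaký úřad jít" – comma suggested before "na" (tags are invented)
sent = [("Neví", "k5eAp3nS"), ("na", "k7c4"), ("jaký", "k3c4yRgMnS"),
        ("úřad", "k1gInSc4"), ("jít", "k5eAmF")]
print(comma_before(sent, 1))  # True
```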
Punctuation Detection with SET – Improvements
The original grammar was rather simplistic – it contained only about
10 rules, and despite relatively good results [6], it was clear that there was room
for improvements. In this section, we describe our main changes to the original
grammar and the motivation behind them.
The original set of rules is still available in the SET project repository for
comparison.
All of the following changes were tested on a development set of texts
before being included into the final grammar, and all of them proved to have a
positive effect on precision and/or recall on this development set. Thanks to their
good results also on the evaluation test sets (see Sect. 5), they were accepted into
the SET parser repository and are also available online.4 For this reason, we do
not include the particular rules here in the paper – some of them are rather
extensive and would take up an unnecessary amount of space.
Adjusting to the Tagger Imperfections
Like every tagger, desamb makes mistakes. For instance, the relative pronouns
co and což are sometimes tagged as particles and are therefore not covered by the
original grammar. We have added a special rule covering these particular words,
independently of morphological tags. Also, several inconsistencies in the tagset
were detected and fixed; e.g., the tag k3.*y[RQ] from Fig. 1 was sometimes
recorded as k3.*x[RQ], so we added these variants.
The original grammar did not handle commas connected to conjunctions like ale,
jako, or než, as their behaviour is complicated in this respect: ale is preceded by
a comma unless it is in the middle of a clause (which is hard to detect). Jako and
než are preceded by a comma only if they connect clauses (rather than phrases).
In addition, all of these words can function as (and be tagged as) particles.
The rule covering ale relies on the tagger's ability to distinguish the conjunction
from the particle – in this case, we ignore the particles. We also approximate
the middle-of-clause position by listing a few stop words that often co-occur with
ale in this position.
In the case of jako and než, we have extended the rule for general conjunctions
and place a comma before them only if there is a finite verb later in the sentence.
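The jako/než heuristic just described can be sketched as follows. The tag test for a finite verb is an illustrative simplification of the Ajka tagset, and the sample tags are invented; only the overall logic (comma only when a finite verb follows, i.e. the conjunction likely connects clauses) reflects the text.

```python
# Sketch of the jako/než heuristic: place a comma before the conjunction
# only if a finite verb appears later in the sentence, i.e. the conjunction
# likely connects clauses rather than phrases.

def is_finite_verb(tag):
    # simplified: verb tag (k5) that is not an infinitive (mF)
    return tag.startswith("k5") and "mF" not in tag

def comma_before_conjunction(tagged, i):
    word = tagged[i][0].lower()
    if word not in ("jako", "než"):
        return False
    return any(is_finite_verb(tag) for _, tag in tagged[i + 1:])

# "Je vyšší než jeho bratr" – no finite verb after "než", so no comma
phrase = [("Je", "k5eAp3nS"), ("vyšší", "k2c1d2"), ("než", "k8xS"),
          ("jeho", "k3xOp3"), ("bratr", "k1gMnSc1")]
print(comma_before_conjunction(phrase, 2))  # False
```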
Commas After Inserted Clauses
One of the punctuation rules in Czech requires delimiting inserted clauses with
commas on both sides. The left side is usually easier, as it contains a conjunction or a relative pronoun. We implemented an approximate detection of the
right side by two groups of rules:
Evaluation and Improvements in Punctuation Detection for Czech
– There are two finite verbs close to each other – then the comma is placed
before the second one.
– There is a clitic in Wackernagel's position – this means that the inserted
clause is the first constituent in the sentence, and the comma belongs right
before the clitic.
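The two right-boundary heuristics above can be sketched as a single function. The tag tests, the window size, and the clitic list are illustrative simplifications, not the actual SET rules.

```python
# Sketch of the two right-boundary heuristics for inserted clauses:
# (a) two finite verbs close to each other -> comma before the second one;
# (b) a clitic deeper than Wackernagel's second position -> comma before it.

CLITICS = {"se", "si", "by", "jsem", "jsi"}  # illustrative subset

def right_boundary(tagged, window=3):
    """Return the index before which a comma should be placed, or None."""
    verbs = [i for i, (_, t) in enumerate(tagged) if t.startswith("k5")]
    for a, b in zip(verbs, verbs[1:]):
        if b - a <= window:
            return b                  # comma goes before the second verb
    for i, (w, _) in enumerate(tagged):
        if w.lower() in CLITICS and i >= 2:
            return i                  # comma goes right before the clitic
    return None

# two finite verbs close together -> boundary before the second one
toks = [("muž", "k1gMnSc1"), ("který", "k3yR"), ("přišel", "k5eAp3"),
        ("řekl", "k5eAp3"), ("pravdu", "k1gFnSc4")]
print(right_boundary(toks))  # 3 – comma before "řekl"
```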
We also experimented with other options (including detecting clauses
by the full SET grammar for Czech), but they did not improve the results –
in more complicated cases, the placement of commas usually depends on semantics,
and a pattern-based solution was not able to describe it sufficiently.
We have also introduced a number of rather small or technical improvements.
We do not include the full list here, but the resulting grammar can be easily
accessed in the SET repository (as referenced above).
Evaluation

In this section, we present a thorough evaluation of the currently operational
systems mentioned in the paper so far, namely Grammaticon, Grammar
checker, FST 1, FST 2, the original SET punctuation grammar (SET:orig), and
our improved set of rules (SET:new).
Our aim was to conduct a rigorous evaluation of the available tools, so we put
quite a lot of eﬀort into selecting and processing the testing data. No part of the
data was used for any part of the development.
We have collected eight sets of Czech texts of different text types, as summarized in Table 1. The text types are:
– examples on punctuation from an official internet education portal,
– original Czech fiction (2 authors, one classic, one contemporary),
– Czech translations of English fiction (2 authors),
– basic and high school dictation transcriptions (focused on punctuation, with
real mistakes recorded).
Blogs and horoscopes were manually corrected by three independent correctors,
so we can be very certain that there are no mistakes in these data sets. The
agreement rate among the correctors was very high: all three correctors agreed in
96.3 % (blogs) and 98.2 % (horoscopes) of cases. The other texts come from
high-quality sources, so they were probably carefully corrected before publication
and do not contain a significant amount of errors.
Unfortunately, Grammaticon and Grammar checker do not offer any type of
batch-processing mode, and each correction must be manually accepted. This
makes it impossible to test them on large texts; for this reason, only three
data sets were used for the comparison of all the available programs: blogs,
language reference examples, and dictations. The rest could be used only for
evaluation of the newly introduced changes to the SET punctuation grammar
and the FSTs.
Table 1. Data sets used for testing (# words, # commas): the Internet Language
Reference Book, original Czech fiction (selected novels), J. K. Rowling – Harry
Potter 1 (translation), and Neil Gaiman – The Graveyard Book (translation,
55,444 words), among others.
Different people make different errors, and it is not clear how a grammar checker
should be properly evaluated when we do not have a huge corpus of target user
errors. At this point, the fairest method is probably the one already used
by Horák and Jakubíček [5]: to remove all the commas and measure how many
can be properly placed back by a checker. We used this method with all the data
sets except the dictations.
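The evaluation protocol just described can be made concrete with a short sketch: strip all commas from the gold text, let the checker re-insert them, and score the inserted commas against the gold positions. The helper names and the toy sentences are illustrative, not part of the actual evaluation code.

```python
# Sketch of the evaluation protocol: compare comma positions in the system
# output against the gold text using standard precision and recall.

def comma_positions(tokens):
    """Indices of words (comma tokens excluded) directly followed by a comma."""
    pos, idx = set(), -1
    for tok in tokens:
        if tok == ",":
            pos.add(idx)
        else:
            idx += 1
    return pos

def precision_recall(gold_tokens, system_tokens):
    gold = comma_positions(gold_tokens)
    system = comma_positions(system_tokens)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gold = ["Vím", ",", "že", "přijde", ",", "ale", "nevím", "kdy"]
system = ["Vím", ",", "že", "přijde", "ale", "nevím", ",", "kdy"]
print(precision_recall(gold, system))  # (0.5, 0.5)
```

One of the two system commas matches a gold position, and one of the two gold commas is recovered, hence precision and recall of 0.5 each.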
The dictation transcriptions do offer a small corpus of real people's mistakes,
so we are able to measure the real power of correcting commas, at least to
some extent. Of course, the data is rather small, and such an evaluation is very
specific to the text type and to the user type (basic school students), but it is
an interesting number for comparison. We performed such a measurement (i.e.
kept the original commas and let the tools correct the text as it is) on the
dictations data set.
We use standard precision and recall on detected commas or fixed errors,
respectively.
The results of the comparison are presented in Table 2. We can clearly see that
the Grammar checker outperforms Grammaticon, namely in recall (the precisions
are comparable). Similarly, our new SET grammar is always better than the
original one, in both precision and recall.
Comparing the Grammar checker to SET:new shows that SET has significantly
higher recall, whereas the Grammar checker wins in precision. If we measured
F-score, SET would be better; on the other hand, precision is more important
in the case of grammar checkers, as we want to minimize the number of false
alerts, so it is difficult to pick a winner from these two. Similarly, SET
outperforms both finite-state transducers in terms of F-score, although they
reached higher precision on some data sets.
Table 2. Results of the comparison. P, R stand for Precision, Recall (in percent).
“Gr..con” stands for Grammaticon, “GCheck” is Grammar checker. Note that the
dictations testing set uses a different method (fixed errors rather than detected
commas), as explained in Sect. 5.2. The best precision and recall result for each
testing set is marked in bold. The column pairs are, left to right: Gr..con,
GCheck, FST 1, FST 2, SET:orig, SET:new.

97.5 10.8 97.3 28.3 89.0 48.8 88.8 49.4 86.8 42.7 89.7 58.2
Language ref. 89.1 9.8 92.0 19.4 78.1 27.3 78.1 28.3 75.8 22.5 87.3 36.2
93.5 53.8 92.2 54.3 89.5 46.7 94.1 64.8
84.1 30.0 75.1 32.4 85.6 32.3 87.2 34.6
86.8 47.8 86.2 49.2 82.1 51.0 84.0 53.8
J.K. Rowling – 90.0 47.7 82.8 49.0 87.8 50.2 89.7 53.4
91.5 41.3 80.5 42.0 88.0 47.8 89.4 48.4
93.3 14.6 60.9 18.2 51.0 17.4 68.2 14.7 78.7 27.3
Clearly, some text types are easier than others – common internet texts and
translations seem generally easier than original Czech fiction. The language
reference examples are hard as well – but that was expected, as the set contains
examples of all the borderline cases that the automatic procedures cannot
cover yet.
In general, the results do not indicate that the tools will soon be able to
correct all mistakes in a purely automatic way; the error rates are still too high
and the recalls too low. But the Grammar checker, the FSTs, and the new SET
grammar are probably all able to productively assist people in correcting their
texts.
Conclusions

In the paper, we described an improvement to an existing punctuation correction
tool for Czech and performed a thorough evaluation of all the available
tools that are able to detect punctuation in Czech sentences automatically. The
evaluation shows that our changes significantly improved the accuracy of the
tool. Currently, it has lower precision and significantly higher recall than the
state-of-the-art commercial grammar checker.
However, the ﬁgures also indicate that the results are still not good enough
for purely automatic corrections of punctuation. Further research will be needed,
probably aimed at exploiting more complicated syntactic and semantic features
of the language.
Acknowledgments. This work has been partly supported by the Grant Agency of CR
within the project 15-13277S. The research leading to these results has received funding
from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education,
Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT
Project 7F14047. This work was also partly supported by the Student Grant Scheme
2016 of the Technical University of Liberec.
References

1. un, D.: Kontrola české gramatiky pro MS Office – konec korektorů v Čechách? [Czech grammar checking for MS Office – the end of proofreaders in Bohemia?]
2. Boháč, M., Blavka, K., Kuchařová, M., Škodová, S.: Post-processing of the recognized speech for web presentation of large audio archive. In: 2012 35th International Conference on Telecommunications and Signal Processing (TSP), pp. 441–445 (2012)
3. Holan, T., Kuboň, V., Plátek, M.: A prototype of a grammar checker for Czech. In: Proceedings of the 5th Conference on Applied Natural Language Processing, pp. 147–154. Association for Computational Linguistics (1997)
4. Horák, A.: Computer Processing of Czech Syntax and Semantics. Librix.eu, Brno
5. Jakubíček, M., Horák, A.: Punctuation detection with full syntactic parsing. Res. Comput. Sci., Spec. Issue: Nat. Lang. Process. Appl. 46, 335–343 (2010)
6. Kovář, V.: Partial grammar checking for Czech using the SET parser. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2014. LNCS, vol. 8655, pp. 308–314. Springer, Heidelberg (2014)
7. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis using finite patterns: a new parsing system for Czech. In: Vetulani, Z. (ed.) LTC 2009. LNCS, vol. 6562, pp. 161–171. Springer, Heidelberg (2011)
8. Lingea s.r.o.: Grammaticon (2003). www.lingea.cz/grammaticon.htm
9. Oliva, K., Petkeviˇc, V., Microsoft s.r.o.: Czech Grammar Checker (2005). http://
10. Pala, K.: Pište dopisy konečně bez chyb – Český korektor pro Microsoft Office [Finally write letters without errors – a Czech proofing tool for Microsoft Office]. Computer, 13–14 (2005)
11. Petkevič, V.: Kontrola české gramatiky (český grammar checker) [Czech grammar checking (Czech grammar checker)]. Studie z aplikované lingvistiky – Stud. Appl. Linguist. 5(2), 48–66 (2014)
12. Sedláček, R., Smrž, P.: A new Czech morphological analyser ajka. In: Matoušek, V., Mautner, P., Mouček, R., Tauser, K. (eds.) TSD 2001. LNCS (LNAI), vol. 2166, pp. 100–107. Springer, Heidelberg (2001)
13. Suchomel, V., Michelfeit, J., Pomikálek, J.: Text tokenisation using unitok. In: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 71–75. Tribun EU, Brno (2014)
14. Šmerk, P.: Unsupervised learning of rules for morphological disambiguation. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 211–216. Springer, Heidelberg (2004)