1 Morphological Analysis, Tagging, and Guessing


284

Z. Nevěřilová



a named entity is left to the tagger. With this improvement, the number of

unknown words decreases. Such words are processed by the guesser.



4 Evaluation



We evaluated the new processing pipeline on a small, “clean” (i.e. containing mostly edited texts) corpus – DESAM [18]. Originally, DESAM contained

1,173,835 tokens, including 722 multi-word tokens. We selected DESAM for two

reasons: it has annotation of MWEs and it was the only resource with marked

foreign words. The disadvantage of DESAM is its cleanness – we expected that

English mixing occurs less frequently in DESAM than in web corpora.

4.1 Multi-word Expressions



We retokenized the DESAM data with the new pipeline (tokenization including
MWEs, foreign chunk detection, analysis, tagging, guessing). We reduced the

number of tokens processed by the guesser from 28,120 to 10,649 (4,094 tokens

were found in gazetteers, 12,345 were detected as foreign, 1,032 were annotated

as parts of MWEs).
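The reduction reported above can be checked arithmetically: the guesser's input shrinks by exactly the tokens caught by the gazetteers, the foreign-chunk detection, and the MWE annotation. A quick sanity check of the figures:

```python
# Sanity check of the token-reduction figures reported above.
unknown_before = 28_120          # tokens formerly sent to the guesser
found_in_gazetteers = 4_094
detected_as_foreign = 12_345
annotated_as_mwe_parts = 1_032

unknown_after = unknown_before - (found_in_gazetteers
                                  + detected_as_foreign
                                  + annotated_as_mwe_parts)
print(unknown_after)             # 10,649 tokens remain for the guesser
reduction = 1 - unknown_after / unknown_before
print(f"{reduction:.1%}")        # roughly a 62% reduction
```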

We compared MWEs from the original (semi-automatically annotated)
DESAM (409 unique MWEs) with the new one (410 unique MWEs): 50 MWEs
were the same. The disagreement is caused by the unclear definition of an MWE.
In the semi-automatically annotated DESAM, MWEs were law numbers, foreign
phrases, and multi-word named entities. In the newly annotated DESAM, most
MWEs were multi-word named entities.

4.2 Interlingual Homographs



In the next stage, we focused on frequent English–Czech homographs. Table 3
shows that our approach sometimes identifies a Czech word as foreign, but in
many cases it identifies English homographs with 100 % accuracy (the rows where
every token annotated as foreign is correct). The number of tokens in DESAM
processed with the standard pipeline is sometimes higher than the number of
tokens in DESAM processed with our new pipeline because some tokens became
part of an MWE (e.g. a priori) and are no longer annotated separately. We
manually evaluated how many tokens annotated as foreign were really foreign,
and for some homographs we also evaluated how many tokens annotated as
native were really native.
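The manual evaluation just described amounts to a per-homograph precision of the foreign (or native) annotation: correct annotations divided by all annotations of that kind. A minimal sketch over a few rows of Table 3 (counts transcribed from the table; the helper function is illustrative):

```python
# Per-homograph precision of the "foreign" annotation, from Table 3.
# Each entry: (tokens annotated as foreign, of which really foreign).
foreign_counts = {
    "on": (9, 7),
    "set": (8, 8),
    "top": (4, 4),
    "for": (13, 13),
}

def foreign_precision(annotated, correct):
    """Fraction of foreign-annotated tokens that are really foreign."""
    return correct / annotated if annotated else 0.0

for token, (ann, corr) in foreign_counts.items():
    print(f"{token}: {foreign_precision(ann, corr):.0%}")
# "set", "top" and "for" reach the 100 % accuracy mentioned above.
```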



5 Conclusion and Future Work



We used a complex approach to corpus annotation of Czech texts with language

mixing. We modified the standard processing pipeline in order to detect MWEs

and foreign language chunks. We used collections of MWEs and gazetteers to
dramatically reduce the number of unknown tokens. This reduction can lead to



Annotation of Czech Texts with Language Mixing



285



Table 3. Number of interlingual homographs in DESAM annotated by the standard
pipeline (2nd column) and the new pipeline (3rd column), number of tokens annotated
as foreign (4th column), number of correct foreign annotations (5th column), number
of tokens annotated as native (6th column) and correct native annotations (7th
column). The homographs are ordered by their length (2 letters, 3 letters, 4 letters).
A dash means that correct annotation was not evaluated.

Token  Std DESAM  New DESAM  Annotated as foreign  Foreign correct  Annotated as native  Native correct
a         23,057     22,992                    47                2               22,945               -
I            427        426                    47               42                  379               -
do         5,012      5,001                    10                0                4,991               -
to         4,290      4,282                    23               10                4,259               -
on            84         82                     9                7                   73              73
let          898        898                     0                0                  898             898
her           90         90                     2                0                   88              88
set           87         87                     8                8                   79              79
top           14          4                     4                4                    0               0
for           14         14                    13               13                    1               0
not            3          3                     3                0                    0               0
had            3          3                     2                0                    1               1
list          73         73                     6                1                   67              67
post          19         18                    18                0                    0               0
most           7          7                     1                0                    6               6



more appropriate use of the guesser. To our knowledge, this work is the first
attempt to annotate Czech texts with language mixing. We plan to use our new
pipeline for annotation of a web corpus of Czech, where we expect the phenomenon
of language mixing to be more significant than in DESAM.

Acknowledgments. The research leading to these results has received funding from

the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth

and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project

7F14047.

This work has been partly supported by the Ministry of Education of CR within

the national COST-CZ project LD15066.



References

1. Auer, P.: From codeswitching via language mixing to fused lects. Int. J. Bilingualism 3(4), 309–332 (1999)
2. Alex, B.: Automatic detection of English inclusions in mixed-lingual data with an application to parsing. Ph.D. thesis, School of Informatics, University of Edinburgh, Edinburgh (2008)
3. Schäfer, R., Bildhauer, F.: Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, San Francisco (2013). http://dx.doi.org/10.2200/S00508ED1V01Y201305HLT022
4. Alex, B., Dubey, A., Keller, F.: Using foreign inclusion detection to improve parsing performance. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 151–160 (2007)
5. Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Chlumská, L., Jelínek, M., Kovaříková, D., Petkevič, V., Procházka, P., Skoumalová, H., Škrabal, M., Truneček, P., Vondřička, P., Zasina, A.: SYN2015: reprezentativní korpus psané češtiny [SYN2015: Representative Corpus of Written Czech] (2015)
6. Suchomel, V., Pomikálek, J.: Efficient web crawling for large text corpora. In: Kilgarriff, A., Sharoff, S. (eds.) Proceedings of the Seventh Web as Corpus Workshop (WAC7), Lyon, pp. 39–43 (2012)
7. Baldwin, T., Lui, M.: Language identification: the long and the short of the matter. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT 2010, pp. 229–237. Association for Computational Linguistics, Stroudsburg (2010)
8. Lui, M., Baldwin, T.: langid.py: an off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25–30. Association for Computational Linguistics, Stroudsburg (2012)
9. Eskander, R., Al-Badrashiny, M., Habash, N., Rambow, O.: Foreign words and the automatic processing of Arabic social media text written in Roman script. In: Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 1–12. Association for Computational Linguistics, Doha (2014)
10. Ahmed, B.U.: Detection of foreign words and names in written text. Ph.D. thesis, Pace University, New York, NY, USA (2005). AAI3172339
11. Hlaváčová, J.: Morphological guesser of Czech words. In: Matoušek, V., Mautner, P., Mouček, R., Tauser, K. (eds.) TSD 2001. LNCS (LNAI), vol. 2166, pp. 70–75. Springer, Heidelberg (2001)
12. Hana, J., Zeman, D., Hajič, J., Hanová, H., Hladká, B., Jeřábek, E.: Manual for morphological annotation, revision for the Prague Dependency Treebank 2.0. Technical report TR-2005-27, ÚFAL MFF UK, Prague, Czech Republic (2005)
13. Šmerk, P., Sojka, P., Horák, A.: Towards Czech morphological guesser. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008, pp. 1–4. Masarykova univerzita, Brno (2008)
14. Nevěřilová, Z.: Annotation of multi-word expressions in Czech texts. In: Horák, A., Rychlý, P. (eds.) Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 103–112. Tribun EU, Brno (2015)
15. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V.: The TenTen corpus family. In: 7th International Corpus Linguistics Conference, CL 2013, pp. 125–127. Lancaster (2013)
16. Šmerk, P.: Fast morphological analysis of Czech. In: Proceedings of the RASLAN Workshop 2009. Masarykova univerzita, Brno (2009)
17. Šmerk, P.: K morfologické desambiguaci češtiny [Towards morphological disambiguation of Czech]. Thesis proposal, Masaryk University (2008)
18. Rychlý, P., Šmerk, P., Pala, K.: DESAM – morfologicky označkovaný korpus českých textů [DESAM – a morphologically annotated corpus of Czech texts]. Technical report, Masaryk University (2010)



Evaluation and Improvements in Punctuation Detection for Czech

Vojtěch Kovář1(B), Jakub Machura1, Kristýna Zemková1, and Michal Rott2

1 NLP Centre, Faculty of Informatics, Masaryk University,
Botanická 68a, 602 00 Brno, Czech Republic
xkovar3@fi.muni.cz, {382567,415795}@mail.muni.cz
2 Institute of Information Technology and Electronics,
Technical University of Liberec, Studentská 2, 461 17 Liberec, Czech Republic
michal.rott@tul.cz

Abstract. Punctuation detection and correction belongs to the hardest
automatic grammar checking tasks for the Czech language. The paper
compares available grammar and punctuation correction programs on
several data sets. It also describes a set of improvements to one of the
available tools, leading to significantly better recall as well as precision.

1 Introduction



Punctuation detection is one of the important tasks in automatic grammar
checking, especially for the Czech language. However, it is also one of the most
difficult, unlike e.g. correcting simple spelling errors.
Czech, a free-word-order language with rich morphology, has a complex
system of writing commas in sentences – partly because the language norm
defines it in a very complicated and unintuitive way, partly because commas often
affect the semantics of the utterances. Comma placement is based on the linguistic
structure of the sentences, and even native speakers of Czech, including educated
people, often have problems with the correct placement of punctuation.

There are several automatic tools that partly solve the problem: Two commercial grammar checkers for Czech (which also try to correct different types of

errors, but here we exploit their punctuation correction features only) and some

academic projects mainly focused on correcting punctuation. We list and briefly

describe them in the next section.

This paper contains two important results: The first is a significant accuracy improvement of one of the open-source academic tools, the SET system
[6,7]; the other is a comprehensive comparison and evaluation of all the available tools, which has so far been missing for this task.

The structure of the paper is as follows: The next section briefly describes
past work on the punctuation detection problem. Then we describe punctuation
detection in the SET tool and our improvements to it. Section 5 presents a
comparison and evaluation of all the available tools.

© Springer International Publishing Switzerland 2016
P. Sojka et al. (Eds.): TSD 2016, LNAI 9924, pp. 287–294, 2016.
DOI: 10.1007/978-3-319-45510-5_33



288

V. Kovář et al.

2 Related Work



The two mentioned commercial grammar checking systems are:

– Grammar checker built into Microsoft Office, developed by the Institute of the Czech Language [9,11],
– Grammaticon checker created by the Lingea company [8].

Both systems aim at manually describing erroneous constructions and minimizing the number of false alerts. From the available comparisons [1,10,11] it
seems that the Grammar checker generally outperforms Grammaticon; however,
the testing data are rather small and present only general results, whereas
we are interested purely in punctuation corrections.

The following systems emerged from the academic world:

– Holan et al. [3] proposed using automatic dependency parsing for punctuation detection, but the result is only a prototype and not usable in practice.
– Jakubíček and Horák [5] exploit the grammar of the Synt parser [4] to detect punctuation, with both precision and recall slightly over 80 percent. The tool is unfortunately not operational, otherwise it would be very interesting to include it in our comparison.
– Boháč et al. [2] used punctuation detection for post-processing of automatic speech recognition. In their approach, a set of N-gram rules (with N up to 4, including the commas) was statistically induced from training Czech corpora of news texts and books. Application of these rules is implemented via weighted finite-state transducers¹, which enable both inserting a comma and suppressing the insertion by a more specific rule. We have included 2 versions of this system in our comparison – the original one, as referenced above, and a recent one – further referred to as FST 1 and FST 2.
– Kovář [6] reported on using the open-source SET parser [7] for punctuation detection, with promising results. Building on the existing grammar, we have significantly improved it. In the next sections, we describe the principle of punctuation detection within the SET parser and some of our important improvements.
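The actual system of Boháč et al. composes weighted finite-state transducers (via OpenFst); purely as a toy illustration of the longest-match behaviour described above (with hypothetical rules, not theirs), n-gram comma decisions can be sketched like this:

```python
# Toy illustration only: n-gram comma rules where a longer (more specific)
# context overrides a shorter one. The rules below are hypothetical; the
# real system induces them statistically and applies them as weighted FSTs.
RULES = {
    ("že",): True,             # hypothetical: comma before "že"
    ("i", "že"): False,        # hypothetical longer rule suppressing it
}

def comma_before(words, i, max_n=3):
    """Decide whether a comma precedes words[i]; longest context wins."""
    for n in range(max_n, 0, -1):          # most specific context first
        ctx = tuple(words[max(0, i - n + 1): i + 1])
        if ctx in RULES:
            return RULES[ctx]
    return False

print(comma_before(["myslím", "že", "přijde"], 1))   # True: ("že",) fires
print(comma_before(["i", "že", "přijde"], 1))        # False: suppressed
```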

As for the comparison presented in Sect. 5, there is, to the best of our knowledge,
no similar work.



3 Punctuation Detection with SET – Initial State

3.1 The SET Parser



The SET system² is an open-source pattern-matching parser designed for natural languages [7]. It contains manually created sets of patterns ("grammars")
for Czech, Slovak and English – in the process of analysis, these patterns are
searched for within the sentence and compete with each other to build a syntactic
tree. The primary output of the parser is a so-called "hybrid tree" that combines
dependency and phrase-structure features.

Before the syntactic analysis, the text has to be tokenized and tagged – for
this purpose, we used the unitok tokenizer [13] and the desamb tagger for Czech
[14] in all our experiments.

¹ http://openfst.org/.
² SET is an abbreviation of "syntactic engineering tool".

3.2 Punctuation Detection with SET



For the purpose of punctuation detection, a special grammar has been developed
by Kovář [6], producing pseudo-syntactic trees that only mark where a comma
should, or should not, be placed (by hanging the particular word under a
dedicated comma or no-comma node), as illustrated in Fig. 1.



Fig. 1. One of the existing punctuation detection rules in SET, and its realization on
a sample Czech sentence: "Neví na jaký úřad má jít." (missing comma before "na" –
"(He) does not know which bureau to go to.") The rule matches a preposition (k7)
followed by a relative pronoun (k3.*y[RQ]) or adverb (k6.*y[RQ]), not preceded by a
preposition, a conjunction (k8), a relative pronoun/adverb, or a few other selected
words (the "tag not" and "word not" lines express a negative condition – the token
must not match any of the listed items). The ajka morphological tagset is used [12];
the example is borrowed from [6].
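Read procedurally, such a rule scans a sequence of ajka tags for a preposition followed by a relative pronoun/adverb, blocked when the preceding token is itself a preposition, conjunction, or relative word. A rough sketch (simplified: the real SET rule also lists word-level exceptions, omitted here):

```python
import re

# Sketch of the Fig. 1 rule over a sequence of ajka tags; simplified,
# without the "word not" exception list of the real SET rule.
PREP = re.compile(r"k7")                    # preposition
REL = re.compile(r"k[36].*y[RQ]")           # relative pronoun (k3) or adverb (k6)
BLOCK = re.compile(r"k7|k8|k[36].*y[RQ]")   # preceding contexts that suppress it

def missing_comma_positions(tags):
    """Indices of prepositions before which a comma should be inserted."""
    hits = []
    for i in range(len(tags) - 1):
        if not (PREP.match(tags[i]) and REL.match(tags[i + 1])):
            continue
        if i > 0 and BLOCK.match(tags[i - 1]):
            continue      # preceded by preposition/conjunction/relative word
        hits.append(i)
    return hits

# "Neví na jaký úřad má jít" with simplified tags: comma belongs before "na"
print(missing_comma_positions(["k5eAa", "k7c4", "k3yRgMnSc4", "k1gInSc4", "k5eAa"]))  # [1]
```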



4 Punctuation Detection with SET – Improvements



The original grammar was rather simplistic – it contained only about
10 rules, and despite relatively good results [6] it was clear that there was room
for improvement. In this section, we describe our main changes to the original
grammar, and the motivation behind them.

The original set of rules is still available in the SET project repository for
reference.³
All of the following changes were tested on a development set of texts
before being included in the final grammar, and all of them had a positive effect
on precision and/or recall on this development set. Thanks to their good results
also on the evaluation test sets (see Sect. 5), they were accepted into the SET
parser repository and are also available online.⁴ For this reason, we do not include
the particular rules here in the paper – some of them are rather extensive and
would take up an unnecessary amount of space.

4.1 Adjusting to the Tagger Imperfections



Like every tagger, desamb makes mistakes. For instance, the relative pronouns co
and což are sometimes tagged as particles and therefore not covered by the original
grammar. We have added a special rule covering these particular words, independently of their morphological tags. Also, several inconsistencies in the tagset were
detected and fixed, e.g. the tag k3.*y[RQ] from Fig. 1 was sometimes recorded
as k3.*x[RQ], so we added these variants.

4.2 Special Conjunctions



The original grammar did not handle commas connected to conjunctions like ale,
jako or než, as their behaviour is complicated in this respect: ale is preceded by
a comma unless it is in the middle of a clause (which is hard to detect); jako and
než are preceded by a comma only if they connect clauses (rather than phrases).
In addition, all of these words can function as (and be tagged as) particles.
The rule covering ale relies on the tagger's ability to distinguish the conjunction
from the particle – in this case, we ignore the particles. We also approximate
the mid-clause position by listing a few stop words that often co-occur with
ale in this position.
In the case of jako and než, we have extended the rule for general conjunctions
and place a comma before them only if there is a finite verb later in the sentence.
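The extended rule can be sketched as follows; the finite-verb test is simplified to a "k5" tag prefix, and the (word, tag) pairs are hypothetical examples:

```python
# Sketch of the extended rule for "jako"/"než": place a comma only when a
# finite verb follows later in the sentence, i.e. when the conjunction
# connects clauses rather than phrases. Verb test simplified to tag "k5".
def comma_before_conjunction(tokens, i):
    """tokens: (word, tag) pairs; i: index of the conjunction."""
    if tokens[i][0].lower() not in ("jako", "než"):
        return False
    return any(tag.startswith("k5") for _, tag in tokens[i + 1:])

# phrase comparison, no verb after "než" -> no comma
print(comma_before_conjunction([("větší", "k2"), ("než", "k8"), ("dům", "k1")], 1))   # False
# clause comparison, verb follows -> comma
print(comma_before_conjunction([("větší", "k2"), ("než", "k8"), ("jsem", "k5"), ("čekal", "k5")], 1))  # True
```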

4.3 Commas After Inserted Clauses



One of the punctuation rules in Czech requires delimiting inserted clauses with
commas on both sides. The left side is usually easier, as it contains a conjunction or a relative pronoun. We implemented an approximate detection of the
right side by two groups of rules:



³ http://nlp.fi.muni.cz/trac/set/browser/punct.set.
⁴ http://nlp.fi.muni.cz/trac/set/browser/punct2.set.



– There are two finite verbs close to each other – then the comma is placed

before the second one.

– There is a clitic in the Wackernagel position – meaning that the inserted
clause is the first constituent of the sentence, and the comma belongs right
before the clitic.
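A sketch of the two heuristics; the "closeness" window and the clitic list are illustrative assumptions, not the actual SET rules:

```python
# Sketch of the two right-boundary heuristics above; the closeness window
# and the clitic list are illustrative assumptions, not the actual rules.
CLITICS = {"se", "si", "jsem", "by", "mi", "ti"}

def right_boundary(tokens):
    """Return the index before which the closing comma belongs, else None.

    tokens: (word, tag) pairs; finite verbs approximated by tag "k5".
    """
    verb_positions = [i for i, (_, tag) in enumerate(tokens) if tag.startswith("k5")]
    # Rule 1: two finite verbs close to each other -> comma before the second.
    for a, b in zip(verb_positions, verb_positions[1:]):
        if b - a <= 3:               # assumed "close" window
            return b
    # Rule 2: clitic in the Wackernagel position after an opening clause ->
    # the comma belongs right before the clitic.
    for i, (word, _) in enumerate(tokens):
        if word in CLITICS and i > 1:
            return i
    return None
```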

We also experimented with other options (including detecting clauses with
the full SET grammar for Czech), but they did not improve the results –
in more complicated cases, the placement of commas usually depends on semantics,
and a pattern-based solution was not able to describe it sufficiently.
We have also introduced a number of rather small or technical improvements.
We do not include the full list here, but the resulting grammar can be easily
accessed in the SET repository (as referenced above).



5 Evaluation



In this section we present a thorough evaluation of the currently operational
systems mentioned in the paper so far: namely Grammaticon, Grammar
checker, FST 1, FST 2, the original SET punctuation grammar (SET:orig) and
our improved set of rules (SET:new).

5.1 Data



Our aim was to conduct a rigorous evaluation of the available tools, so we put

quite a lot of effort into selecting and processing the testing data. No part of the

data was used for any part of the development.

We have collected 8 sets of Czech texts of different text types, as summarized in Table 1. The text types are:















– blog texts
– examples on punctuation from an official internet education portal
– horoscopes
– original Czech fiction (2 authors, one classic, one contemporary)
– Czech translations of English fiction (2 authors)
– basic and high school dictation transcriptions (focused on punctuation, with
real mistakes recorded)



Blogs and horoscopes were manually corrected by 3 independent correctors,
so we can be very certain that there are no mistakes in these data sets. The
agreement rate among the correctors was very high: all 3 correctors agreed in
96.3 % (blogs) and 98.2 % (horoscopes) of cases. The other texts come from
high-quality sources, so they were probably carefully corrected before publication
and do not contain a significant amount of errors.

Unfortunately, Grammaticon and Grammar checker do not offer any type of
batch-processing mode, and each correction must be manually accepted. This
makes it impossible to test them on large texts; for this reason, only 3
data sets were used for the comparison of all the available programs: blogs,
language reference examples and dictations. The rest could be used only for
evaluation of the newly introduced changes to the SET punctuation grammar
and the FSTs.



Table 1. Data sets used for testing.

Testing set                                       # words  # commas
Selected blogs                                     20,883     1,805
Internet Language Reference Book                    3,039       417
Horoscopes 2015                                    57,101     5,101
Karel Čapek – selected novels                      46,489     5,498
Simona Monyová – Ženu ani květinou                 33,112     3,156
J.K. Rowling – Harry Potter 1 (translation)        74,783     7,461
Neil Gaiman – The Graveyard Book (translation)     55,444     5,573
Dictation transcriptions                           17,520     2,092

5.2 Method



Different people make different errors, and it is not clear how a grammar checker
should be properly evaluated when we do not have a huge corpus of target-user
errors. In this situation, the fairest method is probably the one already used
by Horák and Jakubíček [5]: to remove all the commas and measure how many
can be properly placed back by a checker. We used this method with all the data
sets except the dictations.
The dictation transcriptions do offer a small corpus of real people's mistakes,
so we are able to measure the real power of correcting commas, at least to
some extent. Of course, the data is rather small and such an evaluation is very
specific to the text type and to the user type (basic school students), but it is
an interesting number for comparison. We performed such a measurement (i.e. kept
the original commas and let the tools correct the text as it is) on the dictations
testing set.
We use standard precision and recall on detected commas or fixed errors,
respectively.
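With the commas stripped and re-inserted by a tool, the measurement reduces to a set comparison of comma positions; a minimal sketch:

```python
# The measurement sketched above: strip the commas, let a tool re-insert
# them, then compare inserted positions with the original (gold) ones.
def precision_recall(gold, predicted):
    """gold, predicted: sets of token positions carrying a comma."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# e.g. 4 gold commas; a tool re-inserts 3, of which 2 are correct:
p, r = precision_recall({3, 8, 15, 20}, {3, 8, 11})
print(f"P = {p:.1%}, R = {r:.1%}")   # P = 66.7%, R = 50.0%
```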

5.3 Results



The results of the comparison are presented in Table 2. We can clearly see that
Grammar checker outperforms Grammaticon, namely in recall (the precisions are
comparable). Similarly, our new SET grammar is always better than the original
one, in both precision and recall.
Comparing Grammar checker to SET:new shows that SET has significantly
higher recall, whereas Grammar checker wins in precision. If we measured F-score,
SET would be better; on the other hand, precision is more important in the case
of grammar checkers, as we want to minimize the number of false alerts, so it is
difficult to pick a winner from these two. Similarly, SET outperforms both
finite-state transducers in terms of F-score, although they reached higher precision in
several cases.
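The F-score claim can be verified directly from the Table 2 figures, e.g. on the blogs testing set:

```python
# Checking the F-score claim on the blogs row of Table 2 (values in percent).
def f1(p, r):
    return 2 * p * r / (p + r)

gcheck = f1(97.3, 28.3)     # Grammar checker: high precision, low recall
set_new = f1(89.7, 58.2)    # improved SET grammar
print(round(gcheck, 1), round(set_new, 1))   # 43.8 70.6
assert set_new > gcheck     # SET:new wins on F-score despite lower precision
```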



Table 2. Results of the comparison. P, R stand for precision and recall (in percent).
"Gr..con" stands for Grammaticon, "GCheck" for Grammar checker. Note that the
dictations testing set uses a different method (fixed errors rather than detected
commas), as explained in Sect. 5.2. In the original layout, the best precision and
recall result for each testing set is highlighted.

Testing set    Gr..con      GCheck       FST 1        FST 2        SET:orig     SET:new
               P     R      P     R      P     R      P     R      P     R      P     R
Blogs          97.5  10.8   97.3  28.3   89.0  48.8   88.8  49.4   86.8  42.7   89.7  58.2
Language ref.  89.1   9.8   92.0  19.4   78.1  27.3   78.1  28.3   75.8  22.5   87.3  36.2
Horoscopes     -     -      -     -      93.5  53.8   92.2  54.3   89.5  46.7   94.1  64.8
K. Čapek       -     -      -     -      84.1  30.0   75.1  32.4   85.6  32.3   87.2  34.6
S. Monyová     -     -      -     -      86.8  47.8   86.2  49.2   82.1  51.0   84.0  53.8
J.K. Rowling   -     -      -     -      90.0  47.7   82.8  49.0   87.8  50.2   89.7  53.4
N. Gaiman      -     -      -     -      91.5  41.3   80.5  42.0   88.0  47.8   89.4  48.4
Dictations     96.4   5.7   93.3  14.6   60.9  18.2   51.0  17.4   68.2  14.7   78.7  27.3



Clearly, some text types are easier than others – common internet texts and
translations seem generally easier than original Czech fiction. The language reference
examples are hard as well – but that was expected, as they contain examples of
all the borderline cases that the automatic procedures cannot cover (yet) for
various reasons.
In general, the results do not indicate that the tools will soon be able to
correct all mistakes in a purely automatic way; the error rates are still too high
and the recalls too low. But the Grammar checker, the FSTs and the new
SET grammar are probably all able to productively assist people in correcting their texts.



6 Conclusions



In the paper, we described an improvement to an existing punctuation correction tool for Czech, and performed a thorough evaluation of all the available
tools that are able to detect punctuation in Czech sentences automatically. The
evaluation shows that our changes significantly improved the accuracy of the
tool. Currently, it has lower precision and significantly higher recall than the
state-of-the-art commercial grammar checker.
However, the figures also indicate that the results are still not good enough
for purely automatic correction of punctuation. Further research will be needed,
probably aimed at exploiting more complicated syntactic and semantic features
of the language.

Acknowledgments. This work has been partly supported by the Grant Agency of CR

within the project 15-13277S. The research leading to these results has received funding

from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education,
Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT

Project 7F14047. This work was also partly supported by Student Grant Scheme 2016

of Technical University of Liberec.



References

1. Běhún, D.: Kontrola české gramatiky pro MS Office – konec korektorů v Čechách? (2005). https://interval.cz/clanky/kontrola-ceske-gramatiky-pro-ms-office-konec-korektoru-v-cechach
2. Boháč, M., Blavka, K., Kuchařová, M., Škodová, S.: Post-processing of the recognized speech for web presentation of large audio archive. In: 2012 35th International Conference on Telecommunications and Signal Processing (TSP), pp. 441–445 (2012)
3. Holan, T., Kuboň, V., Plátek, M.: A prototype of a grammar checker for Czech. In: Proceedings of the 5th Conference on Applied Natural Language Processing, pp. 147–154. Association for Computational Linguistics (1997)
4. Horák, A.: Computer Processing of Czech Syntax and Semantics. Librix.eu, Brno (2008)
5. Jakubíček, M., Horák, A.: Punctuation detection with full syntactic parsing. Res. Comput. Sci., Spec. Issue: Nat. Lang. Process. Appl. 46, 335–343 (2010)
6. Kovář, V.: Partial grammar checking for Czech using the SET parser. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2014. LNCS, vol. 8655, pp. 308–314. Springer, Heidelberg (2014)
7. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis using finite patterns: a new parsing system for Czech. In: Vetulani, Z. (ed.) LTC 2009. LNCS, vol. 6562, pp. 161–171. Springer, Heidelberg (2011)
8. Lingea s.r.o.: Grammaticon (2003). www.lingea.cz/grammaticon.htm
9. Oliva, K., Petkevič, V., Microsoft s.r.o.: Czech Grammar Checker (2005). http://office.microsoft.com/word
10. Pala, K.: Pište dopisy konečně bez chyb – Český gramatický korektor pro Microsoft Office. Computer, 13–14 (2005)
11. Petkevič, V.: Kontrola české gramatiky (český grammar checker) [Czech grammar checking]. Studie z aplikované lingvistiky – Stud. Appl. Linguist. 5(2), 48–66 (2014)
12. Sedláček, R., Smrž, P.: A new Czech morphological analyser ajka. In: Matoušek, V., Mautner, P., Mouček, R., Tauser, K. (eds.) TSD 2001. LNCS (LNAI), vol. 2166, pp. 100–107. Springer, Heidelberg (2001)
13. Suchomel, V., Michelfeit, J., Pomikálek, J.: Text tokenisation using unitok. In: Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 71–75. Tribun EU, Brno (2014)
14. Šmerk, P.: Unsupervised learning of rules for morphological disambiguation. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 211–216. Springer, Heidelberg (2004)


