CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered



O. Bojar et al.

Table 1. CzEng 1.6 data size.

                          Sentence Pairs   Czech
[…]                       39.44 M          286.70 M   211.49 M   325.20 M   208.78 M
EU legislation            10.18 M          296.19 M   219.79 M   324.11 M   200.47 M
[…]                        6.06 M           80.65 M    57.68 M    89.37 M    54.20 M
Parallel web pages         2.35 M           37.08 M    28.07 M    41.26 M    26.45 M
Technical documentation    2.00 M           12.92 M    10.10 M    13.82 M     9.65 M
[…]                        1.53 M           22.30 M    16.67 M    23.08 M    15.29 M
PDFs from web              0.64 M            9.64 M     7.51 M    10.32 M     6.61 M
[…]                        0.26 M            5.65 M     4.20 M     6.22 M     3.93 M
[…]                       35.29 k          501.01 k   371.70 k   566.33 k   352.23 k
[…]                        0.52 k            9.55 k     6.97 k    10.19 k     6.78 k
Total                     62.49 M          751.65 M   555.88 M   833.96 M   525.73 M

and punctuation symbols, with English sentences being about 10 % longer due to articles and other auxiliaries.

CzEng 1.6 is shuffled at the level of sequences of not more than 15 consecutive

sentences. The original texts cannot be reconstructed but some inter-sentential

relations are retained for automatic processing of coreference (Sect. 3.3). Only

sentences aligned 1-1 and passing the threshold of 0.3 of our filtering pipeline

were included, leading to (indicated) gaps in the sequences. The filtering pipeline

reduces the overall corpus size from 75 M sentence pairs to the 62 M reported in

Table 1.
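The block-level shuffling and score filtering described above can be illustrated with a minimal sketch. It assumes the input is already restricted to 1-1 aligned pairs, that "passing the threshold" means a score of at least 0.3, and that blocks are cut before filtering so that dropped pairs leave gaps inside a block. The function name, the tuple layout, and the toy scores are hypothetical and are not taken from the CzEng tooling.

```python
import random


def filter_and_shuffle(pairs, threshold=0.3, max_block=15, seed=0):
    """pairs: list of (position, score, czech, english) in document order."""
    # Cut the document into blocks of at most max_block consecutive pairs.
    blocks = [pairs[i:i + max_block] for i in range(0, len(pairs), max_block)]
    # Drop low-scoring pairs; the surviving positions reveal the gaps.
    blocks = [[p for p in block if p[1] >= threshold] for block in blocks]
    blocks = [block for block in blocks if block]
    # Shuffle whole blocks, keeping sentence order inside each block.
    random.Random(seed).shuffle(blocks)
    return blocks


# Toy document: every seventh pair gets a low (hypothetical) filter score.
doc = [(i, 0.1 if i % 7 == 0 else 0.9, f"cs {i}", f"en {i}") for i in range(40)]
blocks = filter_and_shuffle(doc)
```

Within each returned block the positions stay in ascending order, so short-range relations between neighbouring sentences remain usable even though the order of blocks no longer reflects the original text.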

We prefer to de-duplicate each source at the level of documents, where available, which necessarily leads to duplicated sentences. Comparing the overall size

with the previous release (de-duplicated in the same manner), we see a substantial growth in size: 62 M vs. 15 M sentence pairs.
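A small sketch may help to show why document-level de-duplication still leaves duplicated sentences. It assumes a document is simply a list of (Czech, English) pairs and that two documents count as duplicates only when their full content is identical; the function name and the hashing choice are illustrative assumptions, not the procedure actually used for CzEng.

```python
import hashlib


def dedup_documents(documents):
    """documents: list of documents, each a list of (czech, english) pairs."""
    seen = set()
    unique = []
    for doc in documents:
        # Fingerprint the whole document, not its individual sentences.
        payload = "\n".join(f"{cs}\t{en}" for cs, en in doc)
        digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    [("Ahoj.", "Hello."), ("Děkuji.", "Thank you.")],
    [("Ahoj.", "Hello."), ("Děkuji.", "Thank you.")],  # exact duplicate document
    [("Ahoj.", "Hello."), ("Prosím.", "Please.")],     # shares one sentence pair
]
unique_docs = dedup_documents(docs)
```

The exact duplicate is removed, but the pair ("Ahoj.", "Hello.") still occurs twice because it belongs to two different documents.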

The largest portions of CzEng 1.6 come from movie subtitles, European legislation and fiction, as was the case in the past. In this release, we also attempt to improve the coverage of the medical domain. In the following, we list changes since the last release for specific data sources:

European legislation was previously based on DGT-Acquis in one of its preliminary versions, spanning the years 2004–2010 of the Official Journal of the European Union. Since there is no recent update of the corpus, we now use the search facility of EUR-Lex to get access to more recent documents. The added benefit is that we can also obtain other documents in the collection, not only issues of the Official Journal. Particularly interesting are the Summaries of EU legislation, which are written in a less formal style and intended for a general audience.







CzEng 1.6: Czech-English Parallel Corpus with Tools Dockered


Movie subtitles are available in the OPUS corpus [4], and since the previous CzEng release the collection has been significantly extended with OpenSubtitles. Very recently, OPUS released yet another update, but it did not arrive in time to be included in CzEng 1.6.

Subtitles of educational videos can be obtained from other sources and represent a rather different genre than movie subtitles. Not only are the topics slightly different (and mostly, there is one clear topic for each video), but the register is different: the sentences are longer and there is almost no dialogue. CzEng 1.6 includes translated subtitles coming from Khan Academy and TED talks.

The medical domain is of special interest to several European research and innovation projects. We try to extend CzEng in this direction by specifically crawling some parallel health-related web sites using Bitextor [5] and also by re-crawling the EMEA (European Medicines Agency) corpus, because its OPUS version suffers from tokenization issues (e.g., decimal numbers split) and is probably smaller than what can currently be obtained from the database.


3 Rich Annotation in CzEng 1.6

As in previous versions, CzEng is automatically annotated in the multi-purpose NLP processing framework Treex [6], based on the theory of Functional Generative Description [7]. The core of the platform is available on CPAN and the various NLP models get downloaded automatically.

Treex integrates many processing tools including morphological taggers, lemmatizers, named entity recognizers, dependency and constituency parsers, coreference resolvers, and dictionaries of various kinds. Many of these tools are

well-known third-party solutions, such as McDonald’s MST parser [8]; Treex

wraps them into a unified shape. The heart of the integration is the Treex data

format, where each of the processing modules (called “blocks” in Treex terminology) modifies the common data. Complete NLP applications such as a dialogue system or a transfer-based MT system are implemented using sequences

of processing blocks, called “scenarios”.
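The relationship between blocks, scenarios and the shared data can be pictured with a toy sketch. Treex itself is a Perl framework with its own data format; the Python names below (`Document`, `tokenize`, `lowercase_lemmas`, `run_scenario`) are purely hypothetical and only mirror the idea that every block modifies one common document and that a scenario is an ordered sequence of blocks.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]          # stand-in for the shared data format
Block = Callable[[Document], None]    # a block modifies the document in place


def tokenize(doc: Document) -> None:
    # Toy block: whitespace tokenization.
    doc["tokens"] = str(doc["text"]).split()


def lowercase_lemmas(doc: Document) -> None:
    # Toy block: relies on the output of the previous block.
    doc["lemmas"] = [token.lower() for token in doc["tokens"]]


def run_scenario(scenario: List[Block], doc: Document) -> Document:
    # A scenario is just an ordered list of blocks applied to the same data.
    for block in scenario:
        block(doc)
    return doc


result = run_scenario([tokenize, lowercase_lemmas], {"text": "Hello World"})
```

Because every block reads and writes the same structure, third-party tools only need a thin wrapper to take part in a scenario.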

Figure 1 illustrates the core annotation of CzEng. The left-hand trees represent the Czech (top) and English (bottom) sentences at the surface-syntactic layer of representation (analytical, a-tree), and include morphological analysis (shown in teal). The dashed links between the trees show one of the automatic word alignments provided in the data. The right-hand pair of trees are the deep-syntactic (tectogrammatical, t-tree) analyses and again include cross-language alignment. The blue arrows are co-reference links. The rightmost tree (n-tree) captures named entities in the sentence, as annotated by NameTag [9].

Fig. 1. One sentence pair from CzEng with the core parts of the automatic annotation.

Khan Academy: http://www.khanacademy.org/ and http://www.khanovaskola.cz/.

Treex: http://ufal.mff.cuni.cz/treex, a web demo is available at http://lindat.mff.cuni.cz/

For the purposes of CzEng 1.6, we introduced several improvements into the

pipeline, mostly on the deep-syntactic layer. They concern semantic role labeling

(Sect. 3.1), word sense disambiguation (Sect. 3.2), and co-reference (Sect. 3.3).


3.1 Semantic Role Labels – Functors

The t-tree annotation involves assigning to each node its semantic role, functor

[10], similar to PropBank [11] labels (shown in capitals in t-trees in Fig. 1). Functor assignment in CzEng 1.0 was handled by a logistic regression classifier [12]

based on the LibLINEAR package [13]. Using Prague Dependency Treebank 2.5

[10] for Czech and Prague Czech-English Dependency Treebank (PCEDT) 2.0

[14] for English, we trained a new linear classifier using the VowpalWabbit library

[15] and features based on syntactic and topological context. Automatic analysis with projected gold-standard labels is used for training to make the method more robust. We achieve an accuracy gain of about 2 % over the previous method; classification accuracy on the evaluation sections of the treebanks used is 80.16 % and 78.12 % for Czech and English, respectively.


3.2 Verbal Word Sense Disambiguation

CzEng 1.6 includes word sense disambiguation in verbs [16], providing links to

senses in valency lexicons for Czech and English. The setup used in parallel

analysis exploits information from both languages (lemmas of aligned nodes)

and the CzEngVallex bilingual valency lexicon [17] to gain better annotation



3.3 Coreference Resolution

All the documents also contain annotation of coreference. In Czech, this is performed by the same Treex internal resolvers that were used in annotating CzEng 1.0. It covers coreference of pronouns and zeros. For English, coreference annotation has been provided mostly by BART 2.0 [18,19]. BART is a modular end-to-end toolkit for coreference resolution. It covers almost all kinds of anaphoric expressions including nominal phrases. Only relative pronouns must be processed by the Treex internal resolver. To keep the processing pipeline running smoothly, we set limits on BART’s time and memory usage, which may cause some documents to be excluded from coreference annotation. However, this happens in only around 1 % of CzEng documents. In addition, anaphoricity detection by the NADA tool [20] is applied to instances of the pronoun it. For an instance declared as non-anaphoric, a possible coreferential link assigned by BART is deleted.
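The post-filtering step for the pronoun it can be sketched as follows. The sketch assumes coreference links are (anaphor index, antecedent index) pairs over a token list, and it replaces the NADA decision with a hypothetical caller-supplied predicate; it calls neither BART nor NADA, and the toy rule is only there to make the example executable.

```python
def drop_non_anaphoric_it(tokens, links, is_anaphoric):
    """Remove links whose anaphor is an 'it' judged non-anaphoric.

    tokens: list of word forms
    links: list of (anaphor_index, antecedent_index) pairs
    is_anaphoric: stand-in for the anaphoricity detector's decision
    """
    kept = []
    for anaphor, antecedent in links:
        if tokens[anaphor].lower() == "it" and not is_anaphoric(tokens, anaphor):
            continue  # discard the link proposed by the resolver
        kept.append((anaphor, antecedent))
    return kept


tokens = "The dog barked because it was raining and it was scared".split()
links = [(4, 1), (8, 1)]  # both instances of "it" linked to "dog"


def toy_detector(toks, i):
    # Toy rule standing in for a real detector: weather "it" is not anaphoric.
    return toks[i + 1:i + 3] != ["was", "raining"]


filtered = drop_non_anaphoric_it(tokens, links, toy_detector)
```

Only the link from the second "it" survives; the weather "it" loses its antecedent.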

Furthermore, coreferential expressions are exploited to improve word alignment. We use the approach presented in [21]. It is based on a supervised model

trained on 1,000 sentences from PCEDT 2.0 [14] with manual annotation of

word alignment for coreferential expressions. The only difference is that we do

the analysis of PCEDT completely automatically in order to obtain the features

distributed similarly to CzEng. Using this approach the alignment quality rises

from 78 % to 85 % and from 71 % to 85 % for English and Czech coreferential

expressions, respectively.


Analysis Dockered

While the whole Treex platform is in principle freely available and a lot of effort

is invested in lowering the entry barrier, it is still not very easy to get the pipeline

running, especially with all the processing blocks utilized in CzEng.

If a part of the CzEng rich annotation is used as training data for an NLP task, CzEng users naturally want to apply such models to new sentences. Indeed, we have received several requests to analyze some data with the same pipeline that was used to annotate CzEng.



With the current release, we want to remove this unfortunate obstacle, and we release CzEng 1.6 with the complete monolingual analysis pipeline (for both Czech and English) wrapped as a Docker container. Docker is a software bundle designed to provide a standardized environment for software development and execution through container virtualization. Docker’s standardized containers allow us to make Treex installation automatic, even without knowledge of the user’s physical machine environment. This should make it very easy to replicate the analysis on most operating systems.

An important added benefit is that the whole processing pipeline will be

frozen in the version used for CzEng. This strong replicability is very hard to

achieve even with current solid versioning, because some of the processing blocks

depend on models that cannot be included in the repository for space or licensing reasons. Dockering the CzEng analysis will thus also help us.

We release two Treex-related Docker images: ufal/treex, which creates a container with the latest release of Treex, and ufal/czeng16, which contains Treex frozen at the revision that was used to process CzEng 1.6. Both images are aimed at simplifying the process of Treex installation; the latter also provides the means to easily reproduce the CzEng 1.6 monolingual analysis scenario.

The Dockerfile that is used to build the ufal/czeng16 image simply specifies all the dependencies that have to be installed to run the Treex modules correctly, then it clones the Treex repository from GitHub and configures the necessary environment variables. It also downloads and installs dependencies that are not available in the repository (mainly the Morče tagger and the NADA tool).

To run the analysis pipeline, you just need to pull the CzEng 1.6 Docker image

from the Docker Hub repository and follow the instructions in the Readme file.

The pipeline is able to process data as a filter (read standard input and write

into standard output) as well as process multiple input files and save the results

into all CzEng 1.6 export formats.



CzEng 1.6 is freely available for educational and research non-commercial uses

and can be downloaded from the following website:


It is available in the following file formats; the first three are identical with the

previous release, the last one is new.

Plain text format is very simple and has only four tab-separated columns on

each line: sentence pair ID, filter score, Czech and English sentence.











Treex XML format is the format used internally in the Treex NLP toolkit. It can be viewed and searched with the TrEd tool.

Export format is a line-oriented format that contains most of the annotation. It

uses one sentence per line, with tab-separated data columns (see the website

and [1] for details).

CoNLL-U is a new text-based format with one token per line, introduced within the Universal Dependencies project and further enriched with word alignment and multilingual support within the Udapi project. The a-trees were automatically converted to the Universal Dependencies style; the t-trees are not included in CoNLL-U.
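Two of the formats listed above are simple enough to read with a few lines of code. The sketch below assumes the four-column plain text layout exactly as described (ID, filter score, Czech, English) and the ten standard CoNLL-U columns as defined by Universal Dependencies; the CzEng-specific alignment enrichment is not handled. The function names, the sample ID and the sample sentences are made up for illustration.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_plain_line(line):
    # Four tab-separated columns: pair ID, filter score, Czech, English.
    pair_id, score, czech, english = line.rstrip("\n").split("\t")
    return {"id": pair_id, "score": float(score), "cs": czech, "en": english}


def parse_conllu_sentence(block):
    # One token per line; comment lines start with '#'.
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    return tokens


# Hypothetical sample data, not taken from the corpus.
plain = "\t".join(["example-001", "0.87", "Psi štěkají.", "Dogs bark."])
pair = parse_plain_line(plain)

conllu = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])
sentence = parse_conllu_sentence(conllu)
```

With the plain text reader, selecting a stricter subset of the corpus reduces to comparing `pair["score"]` against a chosen threshold.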



We introduced a new release of the Czech-English parallel corpus CzEng, version 1.6. We hope that the new release will follow the success and popularity of

the previous version, CzEng 1.0.

CzEng 1.6 is enlarged, contains a slightly improved and extended linguistic annotation, and the whole annotation pipeline is now available for simple

installation using the Docker tool. This makes it much easier to annotate other

data than what is already provided in the corpus, which has been one of the

major drawbacks of the previous CzEng release. We hope that by wrapping the

analysis pipeline as a Docker container, we make an important step to making

the annotation widely usable.

Acknowledgement. We would like to thank Christian Buck for providing us with

raw crawled PDFs. This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21)

and 644402 (HimL), and also from FP7-ICT-2013-10-610516 (QTLeap), GA15-10472S

(Manyla), GA16-05394S, GAUK 338915, and SVV 260 333. This work has been using

language resources and tools developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech

Republic (project LM2015071).



1. Bojar, O., Žabokrtský, Z., Dušek, O., Galuščáková, P., Majliš, M., Mareček, D., Maršík, J., Novák, M., Popel, M., Tamchyna, A.: The joy of parallelism with CzEng 1.0. In: LREC, Istanbul, Turkey, pp. 3921–3928 (2012)

2. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., Zaidan, O.: Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In: Joint WMT and MetricsMATR, pp. 17–53 (2010)

3. Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., Turchi, M.: Findings of the 2015 workshop on statistical machine translation. In: WMT, Lisboa, Portugal, pp. 1–46 (2015)

4. Tiedemann, J.: News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: RANLP, pp. 237–248 (2009)

5. Esplà-Gomis, M., Forcada, M.L.: Combining Content-based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites With Bitextor. Prague Bulletin of Mathematical Linguistics, vol. 93. Charles University (2010)

6. Popel, M., Žabokrtský, Z.: TectoMT: modular NLP framework. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) IceTAL 2010. LNCS, vol. 6233, pp. 293–304. Springer, Heidelberg (2010)

7. Sgall, P., Hajičová, E., Panevová, J.: The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague (1986)

8. McDonald, R., Pereira, F., Ribarov, K., Hajič, J.: Non-projective dependency parsing using spanning tree algorithms. In: HLT/EMNLP, pp. 523–530 (2005)

9. Straková, J., Straka, M., Hajič, J.: Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In: Proceedings of ACL: System Demonstrations, Baltimore, Maryland, pp. 13–18. ACL (2014)

10. Bejček, E., Panevová, J., Popelka, J., Straňák, P., Ševčíková, M., Štěpánek, J., Žabokrtský, Z.: Prague dependency treebank 2.5 a revisited version of PDT 2.0. In: Coling, pp. 231–246 (2012)

11. Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31, 71–106 (2005)

12. Mareček, D., Dušek, O., Rosa, R.: Progress report on translation with deep generation. Project FAUST deliverable D5.5 (2012)

13. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. JMLR 9, 1871–1874 (2008)

14. Hajič, J., Hajičová, E., Panevová, J., Sgall, P., Bojar, O., Cinková, S., Fučíková, E., Mikulová, M., Pajas, P., Popelka, J., Semecký, J., Šindlerová, J., Štěpánek, J., Toman, J., Urešová, Z., Žabokrtský, Z.: Announcing Prague Czech-English dependency treebank 2.0. In: LREC, Istanbul, Turkey (2012)

15. Langford, J., Li, L., Strehl, A.: Vowpal Wabbit online learning project. Technical report (2007)

16. Dušek, O., Fučíková, E., Hajič, J., Popel, M., Šindlerová, J., Urešová, Z.: Using parallel texts and lexicons for verbal word sense disambiguation. In: Depling, Uppsala, Sweden, pp. 82–90 (2015)

17. Urešová, Z., Dušek, O., Fučíková, E., Hajič, J., Šindlerová, J.: Bilingual English-Czech valency lexicon linked to a parallel corpus. In: LAW IX, Denver, Colorado, pp. 124–128 (2015)

18. Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., Moschitti, A.: BART: a modular toolkit for coreference resolution. In: ACL-HLT, pp. 9–12 (2008)

19. Uryupina, O., Moschitti, A., Poesio, M.: BART goes multilingual: the UniTN/Essex submission to the CoNLL-2012 shared task. In: EMNLP-CoNLL (2012)

20. Bergsma, S., Yarowsky, D.: NADA: a robust system for non-referential pronoun detection. In: Hendrickx, I., Lalitha Devi, S., Branco, A., Mitkov, R. (eds.) DAARC 2011. LNCS, vol. 7099, pp. 12–23. Springer, Heidelberg (2011)

21. Nedoluzhko, A., Novák, M., Cinková, S., Mikulová, M., Mírovský, J.: Coreference in Prague Czech-English dependency treebank. In: LREC (2016)


Using Alliteration in Authorship Attribution

of Historical Texts

Lubomir Ivanov

Computer Science Department, Iona College,

715 North Avenue, New Rochelle, NY 10801, USA


Abstract. The paper describes the use of alliteration, by itself or in

combination with other features, in training machine learning algorithms

to perform attribution of texts of unknown/disputed authorship. The

methodology is applied to a corpus of 18th century political writings,

and used to improve the attribution accuracy.

Keywords: Authorship attribution · Machine learning · Lexical stress



Authorship attribution is an interdisciplinary field, aimed at developing methodologies for identifying the writer(s) of texts of unknown or disputed authorship. Historically, authorship attribution has been carried out by human experts.

Based on their knowledge, experience, and, sometimes, intuition, the experts put

forth hypotheses, and attempt to prove them through an exhaustive analysis of

the literary style and techniques used by the author, as well as by analyzing the

content of the work in the context of the historical and political realities of the

specific time period and the philosophical, political, and socio-economic views of

the author. Human expert attribution, however, is tedious and difficult to perform on large texts or corpora. It often misses subtle nuances of authors’ styles,

and may be tainted by the personal beliefs of the attribution expert. With the

rapid advance of data mining, machine learning, and natural language processing, novel authorship attribution methods have been developed, which can perform accurate computer analyses of digitized texts. There are many advantages to automated attribution: analysis of large texts/corpora can be carried out significantly faster and in greater depth, uncovering inconspicuous stylistic features used by authors consistently and without conscious thought. Automated

attribution is not influenced by the subjective beliefs of the attribution expert,

and the results can be verified independently.

Automated authorship attribution has found widespread application in areas as diverse as literature, digital rights, plagiarism detection, forensic linguistics, and anti-terrorism investigation [1–5]. A variety of techniques have emerged to handle specific types of attribution: poetry, long and short prose, historical documents, electronic communication (e.g. email, Twitter messages), etc. The best-known authorship attribution success is the work of Mosteller and Wallace on the Federalist papers [6]. Other notable attributions include works by William Shakespeare [7,8], Jane Austen [9], and Greek prose [10]. A number of important attribution studies of American and British writings of the late 18th century include works of Thomas Paine, Anthony Benezet, and other political writers from the time of the American and French Revolutions [11–13].

© Springer International Publishing Switzerland 2016
P. Sojka et al. (Eds.): TSD 2016, LNAI 9924, pp. 239–248, 2016.
DOI: 10.1007/978-3-319-45510-5_28

A new direction in authorship attribution is the use of prosodic features. This

paper focuses on the use of alliteration. We present an algorithm for extracting alliteration patterns from written text and using them for training machine

learning algorithms to perform author attribution. We also present results from

combining alliteration-based attribution with traditional methods. The results

are analyzed, pointing out the strengths and weaknesses of alliteration-based

attribution in contrast to lexical-stress attribution. Finally, the paper presents

future directions based on the use of other prosodic features.




Motivation and Definition

When analyzing the writings of Thomas Paine, a colleague commented that

“there is a distinct rhythm – almost a melody – to his writing that nobody else

has”. This casual observation led to our effort to quantify the notion of “melody

in text” and to use it for authorship attribution. In [12] we demonstrated how

lexical stress can be used for authorship attribution of 18th century historical

texts. Lexical stress, however, is confined to individual words, and, thus, has a

relatively small influence on the flow of melody in text. Lexical stress also requires significant literary sophistication on the part of the author for expressing emotion in text. Other prosodic features – intonation, alliteration – are more commonly

employed for providing emotional emphasis.

Alliteration is a prosodic literary technique used to emphasize and strengthen

the emotive effect of a group of words in a phrase, a sentence, or a paragraph.

Among the different definitions of alliteration [14–16], the most linguistically appropriate one is that alliteration is the repetition of the same consonant sound

in the primary-stressed syllables of nearby words. Some resources [17] indicate

that it is also appropriate to consider the repetition of initial vowel sounds in

nearby words as alliteration.

Simple examples of alliteration are common tongue-twisters such as “Peter

Piper Picked a Peck of Pickled Peppers” and “But a better butter makes a

batter better” as well as “catchy” company names like “Dunkin’ Donuts” and

“Bed, Bath, and Beyond”. Aside from children’s rhymes and popular advertising,

alliteration has traditionally been used both in poetry and prose, dating as far back as the 8th century and possibly much earlier.

For a technique so widely used in literature, alliteration has been analyzed

very little with modern computer-based methodologies. One study of alliteration

in “Beowulf” is presented in [18]. More recently, alliteration was mentioned in the context of forensic linguistics [19]. An analysis of alliteration’s usefulness as a stand-alone measure or in combination with other features was not presented.

The goal of this research is to investigate the appropriateness of using alliteration

for attribution. We consider alliteration with different factors of interleaving and

analyze the effectiveness of combining alliteration with other features for training

classifiers to perform author attribution.


Extracting Alliteration from Text

For alliteration to be usable as an attribution feature, we must show that authors

tend to use alliteration consistently and uniquely relative to other authors. Our

first task was to extract all alliteration patterns from our corpus. At present, our

text collection consists of 224 attributed (Table 1) and 50 unattributed documents.

The documents’ attribution is fairly certain, although the age and nature of the

material must allow for a percentage of misattributions. Other issues include the

unequal number of documents per author, the different document lengths (950

to 20,000 words), and problems with the digitization of the original manuscripts.

Table 1. Authors of attributed documents

Author                            Num. of docs   Author                              Num. of docs
John Adams                                       James Mackintosh
Joel Barlow                                      William Moore
Anthony Benezet                                  William Ogilvie
James Boswell                                    Thomas Paine
James Burgh                                      Richard Price
Edmund Burke                                     Joseph Priestley
Charles Carroll                                  Benjamin Rush
John Cartwright                                  George Sackville
Cassandra (pseud. of J. Cannon)   4              Granville Sharp
Earl of Chatham (W. Pitt Sr.)                    Earl of Shelburne (William Petty)   3
John Dickinson                                   Thomas Spence
Philip Francis                                   Charles Stanhope
Benjamin Franklin                                Sir Richard Temple
George Grenville                                 John Horne Tooke
Samuel Hopkins                                   John Wesley
Francis Hopkinson                                John Wilkes
Thomas Jefferson                                 John Witherspoon
Marquis de Lafayette                             Mary Wollstonecraft
Thomas Macaulay                                  John Woolman

To extract the alliteration patterns we use the Carnegie Mellon University (CMU) pronunciation dictionary [20]. Each word in the dictionary is transcribed for pronunciation into syllables with lexical stress indicated by three numeric values: 0 – no stress, 1 – primary stress, 2 – secondary stress. We use the CMU dictionary to extract the main stress syllable, and, from it, its leading consonant, for every word in each text in our corpus. Connective function words are not processed. For example, the following excerpt from [21]: “Up the aisle, the moans and screams merged with the sickening smell of woolen black clothes worn in summer weather and green leaves wilting over yellow flowers.” (1) yields the string "U A M S M S S W B K W S W G L W O Y F". Each input file produces a similar string of main-stress leading consonants. These strings are then processed for alliteration using the algorithm described below.
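The extraction step can be approximated with a short sketch. It assumes pronunciations written in the CMU dictionary's notation, where each vowel carries a stress digit, and it approximates the onset of the main-stress syllable by walking left from the primary-stressed vowel over consecutive consonants. That shortcut can absorb coda consonants of a preceding syllable, and it returns phoneme symbols rather than the single letters shown in the example string above, so it is not the paper's exact procedure. `MINI_DICT` is a hand-typed stand-in for the real dictionary, and all function names are hypothetical.

```python
# Hand-typed entries in CMU-dictionary style (assumed, not loaded from the file).
MINI_DICT = {
    "moans": "M OW1 N Z",
    "screams": "S K R IY1 M Z",
    "summer": "S AH1 M ER0",
    "weather": "W EH1 DH ER0",
    "aisle": "AY1 L",
}


def leading_sound(pronunciation):
    phones = pronunciation.split()
    # Locate the vowel carrying primary stress (stress digit 1).
    stressed = next((k for k, p in enumerate(phones) if p.endswith("1")), None)
    if stressed is None:
        return None
    # Walk left over consonants; only vowels end in a stress digit.
    start = stressed
    while start > 0 and not phones[start - 1][-1].isdigit():
        start -= 1
    return phones[start].rstrip("012")


def stress_string(words, skip=frozenset()):
    sounds = []
    for word in words:
        word = word.lower()
        if word in skip or word not in MINI_DICT:
            continue  # function words and unknown words are not processed
        sound = leading_sound(MINI_DICT[word])
        if sound:
            sounds.append(sound)
    return " ".join(sounds)


example = stress_string(["the", "moans", "and", "screams"], skip={"the", "and"})
```

For words whose stressed syllable starts with a vowel, the sketch returns the vowel phoneme itself (for instance "AY" for "aisle"), loosely mirroring the vowel entries in the example string.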


Alliteration Algorithm

Notice that the excerpt (1) above contains alliteration, which is far more

complex than the simple tongue-twister alliterations presented earlier. The

“moans...merged” alliteration is interleaved with the “screams...sickening smell

...summer” alliteration, which, in turn, is interleaved with the “worn...weather

...wilting” alliteration. It is possible for one or more alliteration patterns to begin

in the middle of another alliteration pattern. The new pattern may be entirely

contained within the original pattern (a special case is referred to as symmetric

alliteration, e.g. “...better find friends brave...”), or may continue past the end of

the original pattern. To handle complex alliteration interleaving, we developed

the alliteration extraction algorithm below:

Create an empty list of <letter, alliterationCount> pairs
Create a queue, Q, for search start positions and enqueue position 0
while (Q is not empty) {
    Dequeue Q: get next search position, i, and set it to "processed"
    Set count = 1
    for (all position j = i+1 to end of the "Text" string) {
        if (Text[j] equals Text[i]) {
            count++
            Set position j as "processed"
        }
        else {
            Enqueue j as a new search position in Q
            for (all positions p = j+1 to j+skipRange) {
                if (Text[p] equals Text[i])
                    Set i=p-1
            }
            if (no letter[i] found in the skip range) {
                Add to the alliteration list
                Continue with the next iteration of the main while loop
            }
        }
    }
    if (not already added) Add to the alliter. list
}
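The idea behind the listing can be illustrated with a simplified, runnable sketch. It is not a line-by-line transcription of the algorithm above: it replaces the queue with a left-to-right scan over unprocessed positions, interprets the skip range as the maximum number of non-matching letters allowed between two matching ones, and reports only letters that occur at least twice. The function name and these interpretations are assumptions.

```python
def alliteration_counts(letters, skip_range=2):
    """Return (letter, count) pairs for interleaved alliteration runs."""
    processed = [False] * len(letters)
    runs = []
    for i, letter in enumerate(letters):
        if processed[i]:
            continue  # already consumed by an earlier run
        processed[i] = True
        count, last = 1, i
        for j in range(i + 1, len(letters)):
            if j - last - 1 > skip_range:
                break  # too many non-matching letters since the last match
            if letters[j] == letter and not processed[j]:
                processed[j] = True
                count += 1
                last = j
        if count > 1:
            runs.append((letter, count))
    return runs


text = "U A M S M S S W B K W S W G L W O Y F".split()
narrow = alliteration_counts(text, skip_range=2)
wide = alliteration_counts(text, skip_range=4)
```

On the example string, a skip range of 2 finds the interleaved M, S and W runs but stops the S run before "summer"; widening the range to 4 lets the S run reach it, which shows how the skip range controls the amount of interleaving that is tolerated.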



Starting with the character in position 0, the algorithm searches for alliterated

characters in the input string “Text”. One by one, the subsequent characters of the

string are examined. If the next character matches the current search character,

the alliteration count is incremented. Otherwise either the alliteration sequence is

complete or the next alliterated word is not immediately adjacent. The skip range value indicates the maximum distance by which words with the same main-stressed-syllable leading consonant can be apart before they are no longer considered alliterated. Thus, when a different character is encountered, its position is enqueued
