
7.7 Listeners’ perception and acoustic measurements






formants LTF2 and LTF3, filled pauses, syllable rate and articulation rate. These parameters were chosen because they are typically analyzed in forensic phonetic contexts (cf. Jessen 2012) and because they are expected to correlate best with the perceptual feedback provided by the listeners of the present study (see section 7.7.2).



Figure 15: Sighted listeners’ responses in target-present voice lineups. Responses from lineups in studio, cell phone or whispered quality are displayed separately.

Fundamental frequency (F0)

All F0 measurements were done in Praat (version 5.4.01). Praat's Unvoice function was used to delete obviously incorrect measurement points produced by the automatic pitch-tracking algorithm. In cell phone quality recordings in particular, many such corrections were necessary in order to obtain valid results. Where possible, obviously incorrect measurement points were adjusted instead of deleted. Fundamental frequency was only measured in sections in which modal voice was used; sections containing creaky voice, for example, were excluded from the analysis. Results are shown in Table 5.
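Praat's actual pitch algorithm (autocorrelation-based, with post-processing) is not reproduced here, but the core of such an F0 measurement can be sketched as a minimal autocorrelation peak-picker. The sampling rate, the 75-300 Hz search range, and the synthetic 120 Hz tone below are illustrative assumptions, not material from the study:

```python
import numpy as np

def estimate_f0(signal, fs, fmin=75.0, fmax=300.0):
    """Crude F0 estimate: pick the autocorrelation peak in the
    lag range corresponding to [fmin, fmax] Hz."""
    x = signal - np.mean(signal)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / fmax)   # shortest period considered
    lag_max = int(fs / fmin)   # longest period considered
    best_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / best_lag

# Synthetic stand-in for a stretch of modal voice: a 120 Hz tone.
fs = 16000
t = np.arange(int(0.25 * fs)) / fs
tone = np.sin(2 * np.pi * 120.0 * t)
f0 = estimate_f0(tone, fs)   # close to 120 Hz
```

A production pitch tracker adds voicing decisions and octave-error correction on top of this; the manual unvoicing and adjustment described above correspond to discarding or correcting frames where such an estimate is clearly wrong.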






7 Results

Voice sample         mean F0    SD       Varco
target studio        113.71     16.88    14.84
target cell phone    114.00     17.30    15.16
A studio             100.07     15.23    15.21
A cell phone          98.61     14.37    14.57
B studio             126.36     27.62    21.86
B cell phone         128.09     29.12    22.73
C studio             135.23     19.37    14.32
C cell phone         134.04     16.35    12.20
D studio             112.37     15.12    13.46
D cell phone         112.02     15.26    13.26
E studio             110.96     17.13    15.44
E cell phone         111.50     17.32    15.33
F studio             149.88     24.60    16.41
F cell phone         153.47     26.87    17.51
G studio             132.59      9.29     7.01
G cell phone         136.24     10.44     7.66
H studio             129.96     18.19    14.00
H cell phone         129.54     18.97    14.64
I studio             140.67     21.22    15.08
I cell phone         143.99     23.36    16.22

Table 5: Mean fundamental frequency (F0), standard deviation (SD) of F0, and varco of F0 measured for all voice samples in studio and cell phone quality. The first two rows refer to the target speaker's voice samples from the familiarization; all remaining rows refer to voice samples from speakers whose voices appeared in the voice lineups. Varco = 100 * (F0 SD / F0 mean) (cf. Jessen 2012, p. 86).
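The varco values in Table 5 follow directly from the definition in the caption. As a minimal sketch, recomputing them from the table's own mean F0 and SD values (studio quality) shows why speaker G stands out as monotonous:

```python
def varco(mean_f0, sd_f0):
    """Varco = 100 * (F0 SD / F0 mean), cf. Jessen (2012, p. 86)."""
    return 100.0 * sd_f0 / mean_f0

# Mean F0 and SD taken from Table 5 (studio quality rows).
samples = {
    "target studio": (113.71, 16.88),
    "A studio": (100.07, 15.23),
    "G studio": (132.59, 9.29),
}
varcos = {name: round(varco(m, sd), 2) for name, (m, sd) in samples.items()}
# Speaker G's varco is by far the lowest of the lineup, matching the
# listeners' perception of a monotonous voice (see section 7.7.2).
```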

Long-term formants LTF2 and LTF3

Long-term formants LTF2 and LTF3 were measured in WaveSurfer (version 1.8.5; formant tracking based on linear predictive coding) in all studio quality voice samples. The same settings as described in Moos (2008) were used, and the same criteria for the selection of voice material were applied (included: vowels and approximants with a clear formant structure as well as filled pauses; excluded: nasalized vowels and nasals, because of the presence of antiformants32, as well as all further consonants). Errors of the automatic formant tracker were corrected manually. Errors occurred, for instance, in formant measurements of vowels in which the first formant (F1) and the second formant (F2) are close to each other, e.g. [ɔ] (cf. Ladefoged and Johnson 2011, p. 194); in these cases, the formant tracking algorithm frequently mistook the third formant (F3) for the second (cf. Künzel 2001). LTF1 was not measured because the first formant (F1) is strongly affected by signal degradation in telephone speech and also contains less speaker-specific information (cf. section 3.2.1; cf. Clermont 1996). The results are illustrated in Figure 16.
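WaveSurfer's LPC tracking itself is not reproduced here, but the aggregation step that turns frame-wise formant tracks into long-term formant values can be sketched as follows. The per-frame values and the use of NaN to mark frames excluded by the selection criteria (nasals, nasalized vowels, other consonants, unclear formant structure) are illustrative assumptions:

```python
import numpy as np

def long_term_formant(track_hz):
    """Average a frame-wise formant track, ignoring frames that were
    excluded from measurement (marked NaN)."""
    track = np.asarray(track_hz, dtype=float)
    return float(np.nanmean(track))

# Hypothetical per-frame F2/F3 values (Hz) for a short stretch of speech;
# NaN marks a frame excluded by the selection criteria above.
f2_track = [1450.0, 1480.0, np.nan, 1500.0, 1470.0]
f3_track = [2500.0, 2520.0, np.nan, 2550.0, 2510.0]
ltf2 = long_term_formant(f2_track)
ltf3 = long_term_formant(f3_track)
```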



Figure 16: Measurements of LTF2 and LTF3 for all speakers. The results from the target speaker are marked with two circles (T = voice sample from the familiarization, D = voice sample from the lineup).

32 "…[A]ntiresonances or antiformants are frequency regions in which the amplitudes of the source signal are attenuated because the nasal cavities absorb energy from the sound wave" (Kerstens et al. 2001).






Filled pauses

Phonetically, filled pauses consist of a vowel, [ə] or [ε], which can optionally be followed by [m]; the latter can also occur in isolation (Wiese 1983, p. 127). All three types of filled pauses (usually referred to as äh, ähm and m in German) were observed in the speech material of the present study. Since cell phone quality recordings were re-recordings of the voice samples in studio quality (i.e. the content was the same in both conditions), filled pauses were analyzed only in studio quality recordings as well as in all voice samples of whispered speech (see Figures 17 and 18). All voice samples were approximately 55 sec. in length.
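The counts in Figures 17 and 18 amount to tallying tokens of the three filled-pause types per (roughly 55 sec.) sample. A minimal sketch over a hypothetical annotation tier (the token list below is invented for illustration):

```python
from collections import Counter

FILLED_PAUSES = {"äh", "ähm", "m"}  # the three German types noted above

def count_filled_pauses(tokens):
    """Count filled-pause tokens of each type in an annotated sample."""
    counts = Counter(t for t in tokens if t in FILLED_PAUSES)
    return counts, sum(counts.values())

# Hypothetical annotation of a short stretch of speech:
tokens = ["äh", "also", "ähm", "ich", "m", "äh", "denke"]
by_type, total = count_filled_pauses(tokens)
```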



Figure 17: Number of filled pauses for all speakers measured in studio quality recordings. The voice samples were approximately 55 sec. in length. T = target speaker's voice sample from the familiarization; D = target speaker's voice sample from the voice lineup.






Figure 18: Number of filled pauses for all speakers measured in whispered voice samples. The voice samples were approximately 55 sec. in length. T = target speaker's voice sample from the familiarization; D = target speaker's voice sample from the voice lineup.

Syllable rate (SR) and articulation rate (AR)

Syllable rates33 and articulation rates34 were measured in Praat (version 5.4.01) on all voice samples in studio and whispered quality. Again, voice samples in cell phone quality were not considered for analysis because they were, apart from signal transmission, identical to the studio quality recordings. Phonetic syllables were counted, i.e. only syllables which were actually produced by the speaker. AR was measured in several (fluently spoken) parts of a speaker's utterance in order to calculate the mean as well as the standard deviation of AR (Jessen 2012, p. 140). Calculating the standard deviation of AR has the advantage that a speaker's individual variation in AR can also be analyzed. The results are shown in Table 6.
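The definitions in footnotes 33 and 34 can be written out directly. The syllable count, utterance length, and dysfluency duration below are hypothetical numbers chosen only to illustrate the two formulas:

```python
def syllable_rate(n_syllables, total_sec):
    """SR = syllables / total utterance length in sec. (footnote 33)."""
    return n_syllables / total_sec

def articulation_rate(n_syllables, total_sec, dysfluent_sec):
    """AR = syllables / (total length - dysfluencies); dysfluencies
    include pauses, syllable lengthening, slips of the tongue (footnote 34)."""
    return n_syllables / (total_sec - dysfluent_sec)

# Hypothetical sample: 132 phonetic syllables in a 55 s recording,
# of which 31 s are pauses and other dysfluencies.
sr = syllable_rate(132, 55.0)             # 2.4 syllables/sec
ar = articulation_rate(132, 55.0, 31.0)   # 5.5 syllables/sec
```

AR is always at least as high as SR, since subtracting the dysfluent time shrinks the denominator; the gap between the two reflects how much of the sample is pausing rather than articulation.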

33 Syllable rate (SR) = number of syllables / total length of the utterance [in sec.]
34 Articulation rate (AR) = number of syllables / (total length of the utterance [in sec.] – all dysfluencies); dysfluencies: e.g. pauses, lengthening of syllables, slips of the tongue, etc. (cf. Jessen 2012, p. 140-141)




Voice sample         SR     AR     SD AR
target studio        2.31   5.40   0.70
target whispered     2.56   5.07   0.56
A studio             2.91   5.69   0.74
A whispered          2.76   5.72   0.72
B studio             2.28   6.14   0.82
B whispered          2.97   6.04   1.02
C studio             3.04   5.80   0.96
C whispered          2.66   5.51   0.46
D studio             3.17   5.62   0.74
D whispered          3.47   5.42   0.47
E studio             2.81   5.52   0.63
E whispered          2.80   5.19   0.84
F studio             3.16   6.29   1.16
F whispered          3.33   6.03   0.63
G studio             2.91   5.61   0.76
G whispered          3.20   5.68   0.77
H studio             2.76   4.50   0.29
H whispered          2.31   4.57   0.37
I studio             3.37   5.37   0.68
I whispered          4.36   5.72   0.55

Table 6: Syllable rates (SR), articulation rates (AR), and standard deviation of AR measured on all voice samples in studio and whispered quality. The first two rows refer to the target speaker's voice samples from the familiarization; all remaining rows refer to voice samples from speakers whose voices appeared in the voice lineups.

7.7.2 Perceptual feedback (listeners)

Since listeners were asked to provide reasons for their choice in the voice lineup (i.e. why they did or did not pick a particular speaker), perceptual data were also available for analysis. The perceptual data were compared to the acoustic data from section 7.7.1 in order to explore (a) why certain distractor speakers were mistaken for the target speaker more often than others and (b) why exactly the target voice was correctly identified by some listeners. In order to avoid getting lost in too much detail, only the perceptual and acoustic data of those speakers who were assumed to be the target speaker by at least eight listeners were analyzed in more depth (cf. Figures 12-15).

In whispered quality target-absent lineups, blind listeners frequently mistook speaker G for the target (cf. Figure 12). Unfortunately, the listeners' perceptual feedback did not reveal any clear reason for this choice, since nearly all blind listeners provided different reasons for why they had chosen speaker G (see first bar of Figure 19). For speaker A, who was frequently mistaken for the target speaker in studio and cell phone quality target-absent lineups, a somewhat clearer pattern was found: blind listeners most often indicated that they picked this speaker because of his characteristic, frequent use of filled pauses and because of the timbre of his voice. Under cell phone quality conditions, blind listeners furthermore referred more frequently to the speaker's (rather low) pitch (cf. bars 2-3 of Figure 19).

Speaker A's lower-pitched voice and his timbre were also the main reasons why sighted listeners picked this speaker in target-absent (studio quality) as well as target-present lineups (studio and cell phone quality); cf. bars 4, 5, and 6 of Figure 19. A comparison with the relevant acoustic measurements (section 7.7.1) shows that speaker A and the target speaker both have a lower mean fundamental frequency than most of the other speakers in the lineups (see Table 5), which would explain the listeners' perception of a lower-pitched voice. Furthermore, both speakers have roughly similar long-term formants (see Figure 16), which could, at least partially, explain why speaker A and the target speaker were said to have a similar vocal timbre (cf. Cleveland 1977). Additionally, both speakers made more frequent use of filled pauses (see Figure 17), which could explain the blind listeners' feedback regarding this aspect.

When listeners' perceptual feedback regarding correct identifications is considered, the most frequently stated reasons for choosing speaker D (i.e. the target speaker's voice) were rather similar between blind and sighted listeners. In whispered lineups, speaker D was most frequently chosen because of the timbre of his voice and his use of pauses. Furthermore, sighted listeners referred rather often to speaker D's assumed pitch (cf. bars 7 and 8 of Figure 19). Under studio and cell phone quality conditions, listeners' main reasons for choosing speaker D were his use of (filled) pauses as well as the pitch of his voice. Additionally, blind listeners referred rather often to speaker D's characteristic intonation, while sighted listeners referred to the timbre of the target speaker's voice (cf. bars 9-12 of Figure 19).

Generally, listeners' perceptual feedback35, which gave clues as to why they did or did not mistake a particular distractor speaker for the target, was quite mixed. The feedback from blind and sighted listeners was rather similar, but overall the perceptual data were only partially in line with the data obtained from the acoustic measurements. One of the most frequent reasons why distractor speaker G was not chosen as the target speaker was the perception that he spoke too slowly (stated by 10% of the listeners who named a specific reason). However, the acoustic measurements of speaker G's syllable rate and articulation rate (see Table 6) provide no indication of such a slow speaking rate. Since speaker G's variation in F0 is very small (see Table 5) and 8.9% of the listeners who stated a reason indicated correctly that speaker G spoke too monotonously, it is possible that some listeners confused these two parameters.

A confusion between two different parameters also seems to have occurred with regard to distractor speaker E: 32.5% of listeners' feedback indicates that speaker E was perceived as having a voice that was too low in pitch compared to the target speaker's. 24.5% of the listeners who provided feedback stated that speaker E's voice had a different timbre from the target speaker's, with most of these listeners describing speaker E's voice as sounding too dark or dull. The acoustic measurements (see Figure 16) give stronger support to the perceived dark timbre than to the perceived low-pitched voice, since speaker E's mean fundamental frequency is only 2-3 Hz lower than the target speaker's (see Table 5). Interestingly, speaker A, whose mean F0 is considerably lower than the target speaker's (100.07 Hz vs. 113.71 Hz), far less frequently received feedback that his voice sounded too deep.

For distractor speaker H, 12.8% of all listeners who gave perceptual feedback noticed correctly that he spoke considerably more slowly than the target speaker, which is confirmed by speaker H's low articulation rate compared to all other speakers (see Table 6). Furthermore, 20.2% of all listeners who provided perceptual feedback indicated correctly that speaker H's pitch is considerably higher than that of the target speaker; this can be confirmed by comparing the perceived pitch to the mean F0, which is the acoustic correlate of pitch (see Table 5). Information on the perceptual feedback regarding the remaining distractor speakers can be found in appendix C.



35 Percentages of listeners who provided perceptual feedback for the nine speakers: A = 54.6%; B = 39.5%; C = 41.8%; D = 63.4%; E = 39.5%; F = 33.7%; G = 35.9%; H = 35.0%; I = 35.9%.



Figure 19: Listeners' perceptual feedback regarding the selection of a particular speaker in the voice lineup. Gray areas: feedback which was less frequently provided by the listeners; not further specified. b = blind listeners; s = sighted listeners; TA = target-absent lineup; TP = target-present lineup; wh = whispered condition; ce = cell phone quality condition; st = studio quality condition; A, D, G = labels for particular speakers in the voice lineups.






Despite the fact that some interesting patterns emerge between listeners' perceptual feedback and the acoustic measurements of the voice samples, it should be emphasized that listeners' overall feedback was, if it was provided at all (cf. footnote 35), still quite variable. It often contained rather broad descriptions of the (target) speaker's voice (cf. also Yarmey 2003). Therefore, listeners' perceptual feedback can only serve as a rough indicator of why a particular speaker was selected in the voice lineup.

7.8 Summary 2

In studio quality target-absent lineups, 13 out of 25 blind listeners, but only 3 out of 26 sighted listeners, indicated (correctly) that the target speaker's voice was not present in the lineup. In target-absent lineups with whispered voice samples, blind listeners frequently mistook speaker G for the target speaker, but the listeners' reasons for this choice were manifold. A clearer pattern emerged for the confusion with distractor speaker A: he was frequently mistaken for the target speaker by blind as well as sighted listeners under studio and cell phone conditions. For blind listeners, the false identification of speaker A occurred mainly in target-absent lineups. Sighted listeners, however, frequently misidentified speaker A in target-present as well as (studio quality) target-absent lineups.

A comparison between listeners' perceptual feedback and the acoustic measurements performed on the voice samples revealed a weak pattern: several listeners noticed (correctly) that both the target speaker and distractor speaker A, who was frequently mistaken for the target under studio and cell phone conditions, had lower-pitched voices than most of the other speakers in the lineups and were rather similar with regard to the timbre of their voices. Both speakers also made frequent use of filled pauses when speaking in a normal tone of voice.



8 Interview with visually impaired forensic audio analysts

In order to address the question of whether blind individuals are particularly suited to working in the field of forensic phonetics, it was considered necessary to first look at the current situation of blind individuals who are already employed36 in law enforcement institutions. (For a discussion of the topic, see section 9.10 and Chapter 10.)

Several press releases reported that, in 2007, the Belgian police set up a unit of blind officers who work as forensic audio analysts (Cleemput 2007; Bilefsky 2007; van Veen 2007). The police's website states that these special positions were offered exclusively to blind and visually impaired citizens because of their acute sense of hearing (Jobpol 2015). The idea of hiring blind people was put into practice after changes in legislation allowed for it (Federale Politie 2015).

Within the scope of the present project, it was possible to arrange an interview with a blind and a partially blind officer from the Federal Belgian Police in Antwerp (for the original interview, see appendix B). The following information is based on this interview.

Apart from the two visually impaired police officers in Antwerp, four others are based in Liège and Brussels. All of them are police officers with limited authority. Originally, the plan was to raise the total number of visually impaired police officers in Belgium to 33 (cf. Cleemput 2007); however, this was never put into practice. One reason might be that the visually impaired audio analysts were never able to set up wiretaps themselves, as it turned out that their screen readers were not compatible with the surveillance equipment at the police station. Currently, the primary task of the visually impaired employees in Antwerp is to transcribe the content of audio files which they receive from their sighted colleagues on a USB flash drive. When a phonetic transcription is needed, the Latin alphabet is used to describe the perceived sounds as accurately as possible. About 80% of the analyzed audio material

36 In Germany, only about 30% of the blind of employable age (20-60 years) hold a job (BSVSB 2014). Therefore, exploring new job opportunities for blind individuals, for example in the field of forensic phonetics, would be beneficial in many respects.



© Springer Fachmedien Wiesbaden 2016

A. Braun, The Speaker Identification Ability of Blind and Sighted Listeners,

DOI 10.1007/978-3-658-15198-0_8


