Perceptual Objective Listening Quality Analysis (POLQA) was the working title of an ITU-T standard that covers a model to predict speech quality by means of analyzing digital speech signals.[1] The model was standardized as Recommendation ITU-T P.863 (Perceptual objective listening quality assessment) in 2011. The second edition of the standard appeared in 2014, and the third, currently in-force edition was adopted in 2018 under the title Perceptual objective listening quality prediction.[2]
POLQA covers a model that predicts speech quality[3][4] by analyzing digital speech signals. The predictions of such objective measures should come as close as possible to the quality scores obtained in subjective listening tests. Usually, a Mean Opinion Score (MOS) is predicted. POLQA uses real speech as a test stimulus for assessing telephony networks.
POLQA is the successor of PESQ (Recommendation ITU-T P.862).[5] POLQA avoids weaknesses of the current P.862 model and is extended towards handling of higher bandwidth audio signals. Further improvements target the handling of time-scaled signals and signals with many delay variations. Like P.862, POLQA supports measurements in the common telephony band (300–3400 Hz), but in addition it has a second operational mode for assessing HD-Voice in wideband and super-wideband speech signals (50–14000 Hz). POLQA also targets the assessment of speech signals recorded acoustically by an artificial head with mouth and ear simulators.
The POLQA activities started in ITU-T in early 2006 under the working title P.OLQA. In mid-2009, a competition was started to evaluate several candidate models. In May 2010, ITU-T selected candidate models from three companies (OPTICOM, SwissQual / Rohde & Schwarz and TNO (Netherlands Organisation for Applied Scientific Research)). The three companies merged their approaches to one single model, which was adopted as Recommendation ITU-T P.863.
ITU-T’s family of full reference objective voice quality measurements started in 1997 with Recommendation ITU-T P.861 (PSQM), which was superseded by ITU-T P.862 (PESQ) in 2001. P.862 was later complemented with Recommendations ITU-T P.862.1[6] (mapping of PESQ scores to a MOS scale), ITU-T P.862.2[7] (wideband measurements) and ITU-T P.862.3[8] (application guide). The first edition of ITU-T P.863 (POLQA) entered into force in 2011. An Application guide for Recommendation ITU-T P.863 was approved in 2019 and published as ITU-T P.863.1.[9]
In addition to the above listed full reference methods, the list of ITU-T’s objective voice quality measurement standards also includes ITU-T P.563[10] (no-reference algorithm).
POLQA, like P.862 PESQ, is a Full Reference (FR) algorithm that rates a degraded or processed speech signal in relation to the original signal. It compares each sample of the reference signal (talker side) to each corresponding sample of the degraded signal (listener side). Perceptual differences between both signals are scored as differences. The psycho-acoustic model is based on models of human perception similar to those used in MP3 or AAC. Basically, the signals are analysed in the frequency domain (in critical bands) after applying masking functions. Unmasked differences between the two signal representations are counted as distortions. Finally, the accumulated distortions in the speech file are mapped onto a quality scale from 1 to 5, as is usual for MOS tests. FR measurements deliver the highest accuracy and repeatability but can only be applied for dedicated tests in live networks (e.g. drive test tools for mobile network benchmarks).
POLQA is a full-reference algorithm and analyzes the speech signal sample-by-sample after a temporal alignment of corresponding excerpts of reference and test signal. POLQA can be applied to provide an end-to-end (E2E) quality assessment for a network, or characterize individual network components.
POLQA results principally model mean opinion scores (MOS) that cover a scale from 1 (bad) to 5 (excellent).
The inputs to the algorithm are two waveforms represented by two data vectors containing 16-bit PCM samples. The first vector contains the samples of the (undistorted) reference signal, whereas the second vector contains the samples of the degraded signal. The POLQA algorithm consists of a temporal alignment block, a sample rate estimator and a sample rate converter, which are used to compensate for differences in the sample rate of the input signals, and the actual core model, which performs the MOS calculation. In a first step, the delay between the two input signals is determined and the sample rate of the two signals relative to each other is estimated. The sample rate estimation is based on the delay information calculated by the temporal alignment. If the sample rate differs by more than approximately 1%, the signal with the higher sample rate is downsampled. After each step, the results are stored together with an average delay reliability indicator, which is a measure of the quality of the delay estimation. The result from the resampling step that yielded the highest overall reliability is finally chosen. Once the correct delay is determined and the sample rate differences have been compensated, the signals and the delay information are passed on to the core model, which calculates the perceptibility as well as the annoyance of the distortions and maps them to a MOS scale. A much more detailed and comprehensive description of the algorithm can be found in. The next few sections are only intended to give an overview of the basics of POLQA’s internal structure.
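The sample rate check described above can be illustrated with a minimal sketch. It assumes that the temporal alignment has already produced delay estimates (in samples) at several positions of the reference signal, and that a sample rate mismatch shows up as a linear drift of that delay. The function names, the least-squares fit and the exact 1% threshold are illustrative assumptions and are not taken from Recommendation P.863.

```python
def estimate_rate_ratio(positions, delays):
    """Estimate the degraded/reference sample rate ratio from delay drift.

    positions: sample indices in the reference signal (hypothetical input).
    delays:    delay of the degraded signal at those positions, in samples.
    A least-squares slope of delay over position equals (ratio - 1).
    """
    n = len(positions)
    mean_p = sum(positions) / n
    mean_d = sum(delays) / n
    num = sum((p - mean_p) * (d - mean_d) for p, d in zip(positions, delays))
    den = sum((p - mean_p) ** 2 for p in positions)
    return 1.0 + num / den


def resampling_decision(ratio, threshold=0.01):
    """Decide which signal to downsample (assumed threshold of about 1%)."""
    if abs(ratio - 1.0) <= threshold:
        return "none"
    # The signal with the higher sample rate is the one that is downsampled.
    return "downsample degraded" if ratio > 1.0 else "downsample reference"


# Delay grows by 20 samples per 1000 reference samples: a 2% mismatch.
ratio = estimate_rate_ratio([0, 1000, 2000, 3000], [40, 60, 80, 100])
decision = resampling_decision(ratio)
```

In this example the drift corresponds to a ratio of about 1.02, so the degraded signal would be downsampled. A constant delay gives a ratio of 1.0 and no resampling.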
The main element of the core model is the perceptual model which is calculated four times using different parameters in order to cope with different major distortion types. Those distortion types can be split into additive distortions and subtracted distortions. For both types a further distinction is made between very strong and weaker effects. The inputs to the perceptual models are waveforms and the delay information. The output is the Disturbance Density, which is a measure for the perceptibility of distortions in the signals. The perceptual model for the main branch also produces indicators for Frequency distortions, Noise and Reverberation distortions. A subsequent switch which is triggered by a detector for very strong distortions reduces the four Disturbance Density values down to two, one for added and one for subtracted distortions. So far the Disturbance Density is an indicator for the perceptibility of distortions only and cognitive effects are not yet taken into account. Cognitive aspects are however important when human beings are asked to score the quality of what they can perceive. Essentially they convert the perceptibility measure Disturbance Density into an annoyance measure. This conversion is performed by correcting the Disturbance Density values for situations with:
Two further indicators, one for spectral flatness and one for level variations are also calculated in this step.
So far all operations were performed on frames with a duration of between approximately 32 ms and 43 ms (depending on the sample rate and using an overlap of 50%) and for each Bark band separately. In a final step all indicators are integrated over time and frequency in order to compute the final MOS-LQO value.
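The relation between frame duration, sample rate and overlap can be made concrete with a short sketch. The window lengths below (256, 512 and 2048 samples at 8, 16 and 48 kHz) are an assumption chosen to reproduce the 32 ms to 43 ms range quoted above; they are not stated in this article, and the helper names are hypothetical.

```python
# Assumed window lengths in samples per sample rate (illustrative only).
ASSUMED_WINDOW_SAMPLES = {8000: 256, 16000: 512, 48000: 2048}


def frame_duration_ms(sample_rate):
    """Duration of one analysis frame in milliseconds."""
    return 1000.0 * ASSUMED_WINDOW_SAMPLES[sample_rate] / sample_rate


def frame_starts(num_samples, sample_rate):
    """Start indices of all complete frames, using a 50% overlap."""
    window = ASSUMED_WINDOW_SAMPLES[sample_rate]
    hop = window // 2  # 50% overlap: advance by half a window
    return list(range(0, num_samples - window + 1, hop))


durations = {fs: frame_duration_ms(fs) for fs in ASSUMED_WINDOW_SAMPLES}
starts = frame_starts(1024, 8000)
```

Under these assumptions a frame lasts 32 ms at 8 kHz and 16 kHz and about 42.7 ms at 48 kHz, and a 1024-sample signal at 8 kHz yields seven overlapping frames.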
The key concept inside the perceptual model is idealization. The idea behind this is that POLQA is supposed to simulate Absolute Category Rating (ACR) tests. In an ACR test, however, subjects have no comparison to the actual reference signal when they score a speech signal. Instead, it is assumed that subjects have an understanding of what an ideal signal sounds like and they use this as their own reference. Consequently, if they are asked to score a reference signal which is not absolutely perfect (e.g. it has the wrong volume or contains too much timbre, noise or reverberation), it will be scored worse than perfect. In its idealization step POLQA therefore corrects small imperfections of the reference signals in order to derive the same ideal reference for the comparison to the degraded signal as human subjects would use in their minds. Similar to the idealization of the reference signal, some distortions present in the degraded signal which are hardly perceptible in an ACR test will be partially compensated (e.g. small pitch shifts, linear frequency distortions).
The perceptual model starts by scaling the reference signal to an ideal average active speech level of approximately -26 dBov. No such scaling is performed on the degraded signal. It is assumed that any deviation of the level of the degraded signal from the ideal -26 dBov is to be scored as a degradation of the signal. Next, the spectra of both signals are computed using an FFT with 50% overlapping frames with a duration of between 32 ms and 43 ms (depending on the sample rate). Subsequently, small pitch shifts of the degraded signal are eliminated (Frequency Dewarping). The spectra are then transformed to a psychoacoustically motivated pitch scale by combining individual spectral lines (FFT bins) into so-called critical bands. The pitch scale used is similar to the Bark scale with an average resolution of 0.3 Bark per band. The result is the Pitch Power Density.
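The level scaling of the reference signal can be sketched as follows. This is a simplified illustration, not the procedure of the standard: it uses the overall RMS level of the whole signal instead of a true active speech level measurement, and it takes the 16-bit overload point (32768) as 0 dBov. The function names are hypothetical.

```python
import math

FULL_SCALE = 32768.0  # 16-bit PCM overload point, taken here as 0 dBov


def level_dbov(samples):
    """Overall RMS level in dBov (simplification: no speech activity detection)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / FULL_SCALE)


def scale_to_level(samples, target_dbov=-26.0):
    """Apply one linear gain so that the RMS level reaches target_dbov."""
    gain = 10.0 ** ((target_dbov - level_dbov(samples)) / 20.0)
    return [s * gain for s in samples]


# A quiet square-like test signal at roughly -30.3 dBov, raised to -26 dBov.
reference = [1000.0, -1000.0] * 100
scaled_reference = scale_to_level(reference)
```

Only the reference signal would be passed through such a function. The degraded signal is left at its original level, so that a level deviation remains visible to the comparison. The sketch does not handle an all-zero input, for which the logarithm is undefined.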
At this stage the first three distortion indicators for frequency response distortions, additive noise and room reverberations are calculated. After this, the excitation of each band is derived. This includes the modeling of masking effects in the frequency as well as in the temporal domain. The result is, for each frame of each signal, a head-internal representation which indicates roughly how loud each frequency component would be perceived. Now, a further idealization step of the reference signal takes place by filtering out excessive timbre and low-level stationary noise. At the same time, linear frequency distortions and stationary noise are partially removed from the degraded signal. A subtraction of the idealized excitations finally leads to the Disturbance Density, which is a measure of the audibility of distortions.
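As a toy illustration of the final subtraction step, the sketch below takes two per-band excitation vectors for a single frame and splits their signed difference into an added part and a subtracted part, matching the distinction between additive and subtracted distortions made earlier. It is an assumption-laden simplification: the actual computation in the standard involves further weighting and processing that is not reproduced here, and the function name and input values are hypothetical.

```python
def split_disturbance(ref_excitation, deg_excitation):
    """Split the per-band difference of two excitation vectors.

    Returns (added, subtracted): energy present only in the degraded
    signal, and energy missing from the degraded signal, per band.
    """
    added, subtracted = [], []
    for ref, deg in zip(ref_excitation, deg_excitation):
        diff = deg - ref
        added.append(max(diff, 0.0))        # degraded louder than reference
        subtracted.append(max(-diff, 0.0))  # degraded quieter than reference
    return added, subtracted


# Hypothetical excitation values for four bands of one frame.
added, subtracted = split_disturbance([1.0, 2.0, 3.0, 4.0],
                                      [1.5, 2.0, 2.0, 4.0])
```

Here the first band contributes an added disturbance of 0.5, the third band a subtracted disturbance of 1.0, and the two unchanged bands contribute nothing.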
A paper which uses POLQA to investigate the impact of tone language and non-native listening on speech quality measurement can be found in.[11]