Document Type : Original Article
Author
Mechanical Engineering and Engineering Science, University of North Carolina at Charlotte, Charlotte, USA
Abstract
Graphical Abstract
Keywords
Various signal processing methods have been used to process signals and remove noise from them, such as the Butterworth low-pass digital filter, the Fast Fourier Transform (FFT), and the two-dimensional Fourier transform [1]. The Hilbert-Huang Transform can be used to determine the dynamic characteristics of a system: Kelareh et al. used it to determine the dynamic characteristics of an eight-story structure under acquisition noise and different loads, applying an adaptive extended Kalman filter in the proposed algorithm to reduce the noise [2]. Hamed et al. showed that a state-feedback controller with an extended Kalman filter can decrease signal noise significantly [3]. Mohamed et al. implemented the Fast Fourier Transform to analyze the signals extracted from PZT in a structural health monitoring application [4,5].
The two-dimensional Fourier transform is widely used in processes such as noise removal in image processing and postprocessing in tomographic Particle Image Velocimetry [6,7], and many features and image descriptors are extracted in this domain. In speech processing, by contrast, valuable information is usually extracted from the amplitude of the short-time spectrum of the signal, which is displayed as a spectrogram image. Although each point of the image gives the spectrum amplitude of a speech frame at a known frequency, the image cannot completely reconstruct the input time signal; from an auditory point of view, however, the reconstructed signal has the same quality as the input signal. In some other applications, such as system identification and FDIR methods, the FFT is a convenient tool for extracting the features of the system [8,9]. The extracted features are the key indicators for classifying different faults and errors. The combination of the FFT and a multi-layer neural network has also been used to detect electromechanical faults in RW [9]. The time needed to generate biomedical samples is a vital factor for ambulatory devices, since information should reach the physician as soon as possible. In addition, some wearable ECG recorders have limited power and may only be capable of running simple algorithms. In these cases, too, signal characterization in biomedical applications can be based on the DFT [10].
Figure 2: Block diagram of the general process of the proposed method
In recent years, there has been growing interest in spectrogram image processing for spoken information analysis [11]. In a process suggested by Quatieri et al. [11-14], speech information is sought in the texture of the spectrogram image using windowed two-dimensional Fourier transforms of the spectrogram [12]. As the spectrogram image axes represent time (frame number) and frequency, the axes of its two-dimensional transform represent the frequency of changes along successive frames (rate) and along frequency (scale), and the new plane is called the grating compression transform (GCT) of the speech signal.
In the one-dimensional Fourier transform of the speech signal, each harmonic of the step (pitch) frequency appears on the spectrogram plane as one of a set of parallel or diagonal lines [11-12]. If the two-dimensional Fourier transform of the spectrogram plane multiplied by a local window function, such as a rectangular or Hamming window, is considered, the result in the rate-scale domain is the convolution of the Fourier transform of the window with that of the harmonic lines. For a window containing parallel lines, as in Figure 1, this yields two clusters that represent the harmonic spacing and the slope with which the harmonic spacing changes over time. This idea is used to extract the step frequency [13,14], and the process has been turned into a noise-resistant step extraction algorithm by means of image clustering [15]. Rafieipour et al. presented a self-organized clustering method based on the Low Energy-Adaptive Clustering Hierarchy (LEACH) algorithm, which considers the frequencies [16]. Wang et al. also proposed a GCT-based method for extracting formants from speech with high step frequencies [26].
Figure 3: The process of adding dc to the reconstructed signal by shifting the clusters to the origin
Figure 4: GCT of a specific part of the STFT signal and removal of dc
To separate speech from noise, we use the Weighted K-means (WK-means) clustering method to cluster the information in the GCT space, delineate the clusters, and extract the cluster centers [19]. In this paper, the number of clusters is obtained by testing, and the initial cluster centers are chosen randomly from the samples. Then the mean and variance of all samples with respect to these centers are calculated, and each sample is compared with the cluster centers and assigned to the cluster whose center is closest. The samples are the spectra obtained from the two-dimensional Fourier transform of the spectrogram. Finally, the weighted means of the samples are calculated as the new cluster centers.
The WK-means method is used as a precise method for determining the step frequency of speech. In this paper, we use it to determine the cluster centers and, from them, an accurate mask of the clusters, so that the speech signal is improved efficiently through a proper time-frequency mask based on a Gaussian model.
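The clustering steps described above can be sketched as follows, assuming NumPy. This is only an illustration of a generic weighted K-means, not the authors' implementation: the function name `weighted_kmeans`, the iteration limit, and the use of GCT magnitudes as weights are assumptions made for the example.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Cluster `points` (N, d) with non-negative `weights` (N,) into k clusters."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Initial centers are drawn at random from the samples.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(c):
        # Index of the closest center (Euclidean distance) for every sample.
        dist = np.linalg.norm(points[:, None, :] - c[None, :, :], axis=2)
        return dist.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            member = labels == j
            if weights[member].sum() > 0:
                # The weighted mean of the members becomes the new center.
                new_centers[j] = np.average(
                    points[member], axis=0, weights=weights[member]
                )
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers, assign(centers)
```

In a GCT setting, `points` would hold the (rate, scale) coordinates of the GCT bins and `weights` their magnitudes, so that strong peaks pull the centers toward themselves.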
The rest of the article is organized as follows. Section 2 deals with the GCT formulation and how to analyze and reconstruct the signal in the GCT domain. Section 3 presents the proposed method for improving speech signals in the GCT domain. Section 4 describes the experiments and results, evaluating the performance of the algorithm against different types of noise at different SNRs, and Section 5 concludes the article.
Figure 5: GCT display and determination of the cluster centers using the WK-means algorithm, with the dc value removed, for 25 frames of the male speaker
Figure 6: Determination of the cluster centers (left) and determination of the mask by thresholding the Gaussian function (right)
Figure 7: Gaussian mask applied to the GCT range
For GCT extraction, the spectrogram of the speech signal is extracted first. Equation 1 gives the short-time Fourier transform (STFT) [14,15,37,38]:
$$X(n,\omega)=\sum_{m=-\infty}^{\infty} x[m]\,w[n-m]\,e^{-j\omega m} \qquad (1)$$
where $x[m]$ is the signal value and $w[n-m]$ is the analysis window positioned at time $n$; the window length is chosen according to the short-time interval required. In this paper, a Hamming window is used.
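The short-time Fourier transform with a Hamming window can be sketched as follows, assuming NumPy. The frame length of 320 samples (20 ms at 16 kHz), the hop of 160 samples, and the 512-point DFT are assumptions chosen for illustration, as is the function name `stft`.

```python
import numpy as np

def stft(x, frame_len=320, hop=160, n_fft=512):
    """Short-time Fourier transform with a Hamming window.

    Returns a complex array of shape (n_fft // 2 + 1, n_frames).
    """
    x = np.asarray(x, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided DFT of every frame; transpose so that rows are frequency bins.
    return np.fft.rfft(frames, n=n_fft, axis=1).T
```

The magnitude of the returned array is the spectrogram on which the GCT analysis operates, and its angle is the phase that is kept for reconstruction.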
If we denote the time and frequency axes of the spectrogram plane by $n$ and $m$, respectively, the modulated components can be represented by a static two-dimensional sinusoidal model $s[n,m]$ with spatial frequency $\omega_s$, where $K$ is the dc value of the signal [1]. In Equation 2, $\theta$ represents the angle of the spectral lines on the spectrogram. In practice, the modulated components are analyzed in small sections of the spectrogram plane selected by a multiplicative window $w[n,m]$, and the analysis continues on the windowed area $s_w[n,m]$ [12-3].
$$s_w[n,m]=w[n,m]\left(K+\cos\big(\omega_s(n\sin\theta+m\cos\theta)\big)\right) \qquad (2)$$
After taking the two-dimensional Fourier transform of the area $s_w[n,m]$, we have:
$$S_w(\nu,\Omega)=K\,W(\nu,\Omega)+\tfrac{1}{2}W(\nu-\omega_s\sin\theta,\ \Omega-\omega_s\cos\theta)+\tfrac{1}{2}W(\nu+\omega_s\sin\theta,\ \Omega+\omega_s\cos\theta) \qquad (3)$$
In Equation 3, $S_w(\nu,\Omega)$ is the two-dimensional Fourier transform of the windowed STFT signal, $W(\nu,\Omega)$ is the transform of the window, and the peaks produced by the sinusoidal harmonic lines are used to determine the step frequency [13,14]. In Equation 4, $f_s$ is the sampling frequency, $N_{STFT}$ is the number of DFT points, and $\omega_s\cos\theta$ is the vertical distance of the peak from the origin of the GCT domain. These GCT parameters determine the step frequency $f_0$:
$$f_0=\frac{2\pi f_s}{N_{STFT}\,\omega_s\cos\theta} \qquad (4)$$
The STFT representation of a specific segment of the signal, together with its GCT, is shown in Figure 1. In the figure, the distance between two adjacent STFT harmonic lines is $2\pi/\omega_s$ [9,11,22,23].
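The step-frequency estimate from the GCT peak can be sketched as follows, assuming NumPy. The sketch relies on the relation that a peak at a vertical distance of ω radians per bin from the GCT origin corresponds to a harmonic spacing of 2π/ω DFT bins, each bin being f_s/N_STFT Hz wide. The function name, the zero-padding size `n_gct`, and the mean subtraction used to suppress dc are illustrative assumptions.

```python
import numpy as np

def gct_step_frequency(patch, fs, n_stft, n_gct=512):
    """Estimate the step frequency from one spectrogram patch.

    patch: magnitude spectrogram region, shape (n_freq_bins, n_frames).
    Returns (f0 in Hz, peak frequency along the frequency axis,
    peak frequency along the time axis), the last two in rad/sample.
    """
    patch = np.asarray(patch, dtype=float)
    patch = patch - patch.mean()  # suppress the dc term
    win = np.outer(np.hamming(patch.shape[0]), np.hamming(patch.shape[1]))
    gct = np.abs(np.fft.fft2(patch * win, s=(n_gct, n_gct)))
    # The GCT of a real patch is symmetric, so half of it is enough;
    # row 0 is skipped so that residual dc cannot be picked as the peak.
    half = gct[1 : n_gct // 2, :]
    kf, kt = np.unravel_index(half.argmax(), half.shape)
    kf += 1
    omega_f = 2 * np.pi * kf / n_gct                  # along the frequency axis
    omega_t = 2 * np.pi * np.fft.fftfreq(n_gct)[kt]   # along the time axis
    spacing_bins = 2 * np.pi / omega_f                # harmonic spacing in bins
    return spacing_bins * fs / n_stft, omega_f, omega_t
```

A nonzero `omega_t` indicates tilted harmonic lines, that is, a step frequency that changes over the frames of the patch.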
At this stage, after the GCT is determined, the dc value is removed and the result is demodulated to reconstruct the STFT domain and its phase. After the amplitude and phase are combined, the dc value is added back to the reconstructed signal [18-29].
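Removing the dc value from a GCT block and adding it back later can be sketched as follows, assuming NumPy. The function name `remove_dc` and the `radius` of the neighborhood around the origin that is treated as dc are assumptions for illustration; the paper does not state the size of the removed region.

```python
import numpy as np

def remove_dc(gct, radius=1):
    """Split a GCT block into its dc-free part and its dc part.

    The dc part is the set of bins within `radius` of the origin,
    with distances wrapped around as in the FFT layout.
    """
    gct = np.array(gct, dtype=complex)
    n0, n1 = gct.shape
    r = np.minimum(np.arange(n0), n0 - np.arange(n0))
    c = np.minimum(np.arange(n1), n1 - np.arange(n1))
    dc_region = (r[:, None] <= radius) & (c[None, :] <= radius)
    dc = np.where(dc_region, gct, 0)
    return gct - dc, dc
```

After masking the dc-free part, the stored `dc` array can simply be added back before the inverse transform is taken.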
GCT is a two-dimensional analysis of speech signals that is effective in estimating the step frequency for improving speech and separating mixed speech. Finding the exact position of the centers in the frequency spectrum is particularly important in calculating the speech step frequency [20]. For this purpose, the Weighted K-means algorithm is used to determine the speech step frequency accurately [19]. After determining the speech step frequency, we obtain the exact centers of the clusters in an automated and unsupervised way from the training samples: the initial centers are selected randomly from the samples, the mean and variance of all samples are then computed with respect to these centers, and each sample is assigned by comparison with the centers. To determine the centers of the clusters precisely and to simplify the calculations, the extra samples and dc values are removed. After the cluster centers are specified and the dc value is removed, the energy of the clusters is used to determine the mask for each cluster. Owing to the symmetry of the GCT, we select the appropriate part of the time-frequency spectrum and zero its dc part. Then the mask for each cluster is determined using the time-frequency binary mask, which is obtained from the following formula:
$$M_b(f)=\begin{cases}1, & |\hat{y}_1(f)|>|\hat{y}_2(f)|\\ 0, & \text{otherwise}\end{cases} \qquad (5)$$
where $\hat{y}_1$ and $\hat{y}_2$ are predicted outputs that contain the different sources of the signal spectrum, and $f$ denotes the different frequencies. Then, using the Gaussian distribution function for each cluster, its Gaussian model is obtained. After the Gaussian distribution function is determined and applied to the GCT domain, the dc value is added to it, and the reconstructed signal is obtained. The block diagram of Figure 2 shows the general flow of the proposed method.
Table 1: PESQ results of the two experiments for male and female speakers | ||
Speaker | Experiment 1: binary mask (PESQ) | Experiment 2: Gaussian distribution function (PESQ) |
Male | 4.21 | 4.39 |
Female | 3.91 | 4.006 |
Table 2: PESQ comparison of the two experiments under white noise at different SNRs for the male (M) and female (F) speakers. Experiment 1 removes dc and adds it back using an appropriate threshold; Experiment 2 shifts the clusters to the center to obtain the dc value | ||||||
SNR | 0 dB | 2 dB | 4 dB | 6 dB | 10 dB | Clean signal |
PESQ_ex2(M) | 1.79 | 1.92 | 2.16 | 2.35 | 2.43 | 4.39 |
PESQ_ex1(M) | 1.55 | 1.71 | 1.92 | 2.05 | 2.33 | 4.26 |
PESQ_ex2(F) | 1.71 | 1.87 | 1.95 | 2.23 | 2.41 | 4.28 |
PESQ_ex1(F) | 1.53 | 1.68 | 1.90 | 2.01 | 2.31 | 4.17 |
Figure 8: Quality improvement for male and female speakers with the proposed method and the spectral subtraction method
Figure 9: Spectrogram of the speech signal with SNR = 10 dB (left) and the reconstructed speaker signal in the spectrogram space (right)
Figure 10: Input and output PESQ results for a speaker signal: reconstructed signal versus SNR (red) and noisy signal versus SNR (blue)
In this method, the steps are the same as in the previous one. The only difference comes after the Gaussian mask has been applied to the GCT range: the masked clusters are shifted to the center of the time-frequency axes, and their sum is used as the dc part of the signal. In this case, unlike the previous method, all parts of the spectrum are reconstructed with good quality. Figure 3 shows the process of shifting the clusters to the center.
The experiments use 16-bit, 16 kHz speech signals from the TIMIT corpus and were performed for a large number of male and female speakers. Results for some of the speakers are reported here for the different experiments. Each experiment uses different speakers with different sentences.
In the first experiment, after the GCT is computed and dc is removed, a suitable threshold is selected, the exact centers of the clusters are determined, and the binary mask of the clusters is obtained from the energy of the clusters: a mask appears for each cluster at the high-energy locations. After the mask is determined, it is applied to the GCT range, and the significant clusters are identified. Figure 4 shows this process for a speech signal.
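An energy-based binary mask of this kind can be sketched as follows, assuming NumPy. The relative threshold of 0.5 and the function name `binary_energy_mask` are assumptions for the example; the paper only states that a suitable threshold is selected.

```python
import numpy as np

def binary_energy_mask(gct_mag, rel_threshold=0.5):
    """Return 1 where the GCT energy reaches a fraction of its maximum, else 0."""
    energy = np.asarray(gct_mag, dtype=float) ** 2
    return (energy >= rel_threshold * energy.max()).astype(float)
```

Multiplying the dc-free GCT by this mask keeps only the high-energy clusters and zeroes the remaining bins.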
In the second experiment, after the GCT is determined and the step frequency is found accurately with the WK-means algorithm, we determine the centers of the clusters in the time-frequency spectrum [19]. From these centers, the Gaussian function gives the probability distribution function of the clusters. Then, with an appropriate threshold, the mask corresponding to each cluster is obtained. The bivariate Gaussian function is expressed in Equation 6.
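$$p(\mathbf{x})=\frac{1}{2\pi|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \qquad (6)$$
where $\mathbf{x}$ is a point in the GCT plane, $\boldsymbol{\mu}$ is the cluster center, and $\boldsymbol{\Sigma}$ is the covariance matrix, with: (7)
In other words, the covariance matrix is estimated as a weighted covariance over the member elements of the clusters:
$$\boldsymbol{\Sigma}=\frac{\sum_i w_i(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^{T}}{\sum_i w_i} \qquad (8)$$
where $w_i$ is the weight of point $\mathbf{x}_i$ in the GCT. The step frequency is obtained from Equation 4, and the cluster centers are found using the WK-means algorithm after deleting the dc value, which speeds up and simplifies the computation; a suitable threshold is selected to improve the clarity of the peaks. Figure 5 shows the GCT analysis for all blocks of a speech signal from the male speaker, where each short-time block is multiplied by a 20 ms Hamming window.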
After the centers of the clusters are determined, the mask corresponding to each cluster is determined, and the Gaussian mask function is then obtained. Figure 6 shows these images for one block of a speech signal.
After the mask corresponding to the Gaussian function is determined, each mask is applied to the GCT range; the result is shown in Figure 7.
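A thresholded Gaussian mask for one cluster can be sketched as follows, assuming NumPy. The sketch estimates a weighted center and a weighted covariance from the cluster members and evaluates a bivariate Gaussian on the GCT grid. Scaling the Gaussian to a peak value of 1, the threshold of 0.1, the small regularization term, and the function name are assumptions made for the example.

```python
import numpy as np

def gaussian_mask(shape, points, weights, threshold=0.1):
    """Soft mask on a grid of `shape` built from one cluster's weighted points.

    points:  (N, 2) array of (row, col) GCT coordinates in the cluster.
    weights: (N,) GCT magnitudes at those coordinates.
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mu = np.average(points, axis=0, weights=weights)  # weighted center
    diff = points - mu
    # Weighted covariance of the cluster members.
    outer = weights[:, None, None] * diff[:, :, None] * diff[:, None, :]
    cov = outer.sum(axis=0) / weights.sum()
    cov += 1e-6 * np.eye(2)  # keep the matrix invertible
    rows, cols = np.indices(shape)
    d = np.stack([rows - mu[0], cols - mu[1]], axis=-1)
    maha = np.einsum("...i,ij,...j->...", d, np.linalg.inv(cov), d)
    g = np.exp(-0.5 * maha)  # equals 1 at the cluster center
    # Values below the threshold are set to zero.
    return np.where(g >= threshold, g, 0.0), mu, cov
```

Summing the masks of all clusters and multiplying the result with the GCT gives a smooth alternative to a binary mask.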
At this stage, we apply the phase to the mask resulting from the Gaussian function. The following equations represent the signal phase relationships.
(9)
In the above equations, is the conversion of SGT from the signal , and is the angle of rotation of the clusters from the GCT.
After this step, we reconstruct the signal. First, the inverse of the GCT is obtained using the inverse two-dimensional Fourier transform. Then the spectrogram phase is added to the inverse GCT, and the inverse short-time Fourier transform (ISTFT) is taken. The criterion used to compare speech quality improvement is PESQ, whose value is 4.5 for full agreement with the reference audio and -0.5 at minimum, in line with the MOS standard. The PESQ score is interpreted like the MOS criterion, a listening score of sound quality between 1 and 5, from poor to excellent quality.
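The first two reconstruction steps can be sketched as follows, assuming NumPy. The sketch treats the whole spectrogram as a single block, whereas the method works on local windows; the function name and the clipping of negative magnitudes are assumptions for illustration. The final ISTFT step is not shown.

```python
import numpy as np

def reconstruct_stft(gct_masked, phase):
    """Invert a (masked) GCT and attach the stored spectrogram phase.

    gct_masked: 2-D Fourier transform of the spectrogram magnitude.
    phase:      phase of the original STFT, same shape.
    """
    mag = np.real(np.fft.ifft2(gct_masked))
    mag = np.maximum(mag, 0.0)  # a magnitude cannot be negative
    return mag * np.exp(1j * phase)
```

With an all-pass mask, the function returns the original complex STFT, which is a convenient sanity check before a real mask is applied.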
Table 1 shows the PESQ results for the male and female speaker signals, obtained under the same conditions. In the first experiment, a binary mask was used to determine the clusters; in the second, the mask was derived from the Gaussian distribution function obtained from the cluster centers.
Table 2 shows the quality improvement for a male speaker and a female speaker with the two methods described in Section 3. In the first experiment, speech quality is improved by removing and adding dc: a suitable threshold is selected to delete the extra values and dc samples, and after the algorithm is run, the deleted dc value is added back before the signal is reconstructed. In this experiment, the cluster mask is obtained using the Gaussian distribution function. The tests are performed with added white noise at different SNRs and with the clean signal, for female and male speakers.
As can be seen from Table 2, the second experiment, in which the dc value is obtained by shifting the clusters to the origin and the mask is selected using the Gaussian distribution function, is more accurate, including at the lower SNRs. The PESQ score drops at low SNRs, but the audio quality remains good. In Figure 8, ambient noise is added acoustically to the speech signals of both male and female speakers. In this figure, the blue, red, green, and purple bars show, respectively, the improvement of the proposed method for male and female speakers and the improvement of the spectral subtraction method for male and female speakers. The proposed algorithm is compared with the spectral subtraction method for four noise types (babble, car, office, and exhibition) taken from the NoiseX database [36]. Spectral subtraction is one of the first and most widely used methods for improving noisy speech signals. The noise spectrum is usually estimated during a period of silence. Assuming that this estimate holds for the whole noisy signal, that is, that the noise is stationary and uncorrelated with the original speech signal, the noise power spectrum can be subtracted from that of the noisy speech signal [33,34].
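The spectral subtraction baseline can be sketched as follows, assuming NumPy. This is a basic power-domain variant and not necessarily the exact configuration used in the comparison: the number of leading noise-only frames, the spectral floor, and the function name are assumptions for the example.

```python
import numpy as np

def spectral_subtraction(noisy_stft, n_noise_frames=5, floor=0.01):
    """Basic power spectral subtraction on an STFT of shape (bins, frames).

    The noise power is estimated from the first `n_noise_frames` frames,
    which are assumed to contain no speech.
    """
    power = np.abs(noisy_stft) ** 2
    noise_power = power[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate, keeping a small floor to avoid negatives.
    clean_power = np.maximum(power - noise_power, floor * power)
    # Reuse the noisy phase for the enhanced signal.
    return np.sqrt(clean_power) * np.exp(1j * np.angle(noisy_stft))
```

Because the noise estimate is fixed, the method degrades when the noise is not stationary, which is consistent with the negative improvement reported for some environments.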
Figure 8 shows the improvement in speech quality when ambient noise is added to the speaker's speech. The horizontal axis indicates the type of ambient noise, and the vertical axis indicates the speech improvement for male and female speakers. Speech improvement is the difference between the PESQ of the reconstructed signal and the PESQ of the noisy signal.
According to this figure, the proposed method is more successful than the spectral subtraction method for both female and male speakers, and it has acceptable quality in terms of reconstruction and listening. It achieves a reasonable improvement in different noisy environments at an SNR of about 5 dB. The spectral subtraction method, in contrast, has a negative improvement in some noisy environments. As a result, spectral subtraction has not been successful at low SNRs and has not improved speech reconstruction.
Also, because the female voice is lower and thinner than the male voice, the speech quality for the female speaker is lower. The main reason is the overlap with environmental noise and its impact on the speakers.
Figure 9 shows the STFT of a speaker's signal and its reconstruction using the proposed method in a noisy environment. Figure 9 (left) shows the STFT of the speech signal with added white noise at SNR = 10 dB, and Figure 9 (right) shows the STFT of the reconstructed speech signal. The figure shows that the proposed method separates the speech information from the noise well.
Figure 10 shows the output and input PESQ diagrams for the proposed method. In this experiment, the dc value of the speech signal is obtained by shifting the clusters to the center. The output of the diagram compares the reconstructed signal with the original signal, and its input is the ratio of the original signal to the noise signal. The figure shows the improvement in speech quality under white noise at different SNRs. For noisy signals with an SNR below 10 dB, the PESQ is below 2.5, but at high SNRs the PESQ is above 3.5.
In this paper, a new method for improving speech quality was presented, in which the step frequency is first calculated by the WK-means method. After obtaining the speech step frequency, we identify the exact centers of the clusters in the GCT space; then, based on these cluster centers, we fit a Gaussian distribution to each cluster, and by applying the resulting mask in the GCT space, the signal is separated from the noise. In cases where the GCT is not clear, it is possible to go directly to the reconstruction stage and apply the mask only to the blocks in which the GCT has significant clustering. In future work, this method can be used to separate two speakers whose speech has been mixed, and it can also be used to separate speakers in noisy conditions.
Conflict of Interest
The authors declare that there is no financial or commercial conflict of interest.