Cochannel interference between speech signals is a common practical problem, particularly in tactical communications. Ideally, separation of the individual speech signals is desired; however, it is known that when two signals of equal bandwidth are added, such a separation is not possible. We therefore examine the problem of identifying temporal regions, or frames, as containing either one-speaker or two-speaker speech. This identification is important for making automatic speaker and speech recognition systems more robust, and is based on feature extraction followed by classification, as in pattern recognition. The research addresses both the closed-set problem, where the identities of the two interfering speakers are known a priori, and the more difficult open-set (speaker-independent) problem, where the identities are not known. For the feature extraction step, we propose a new pitch prediction feature (PPF), which is compared with the linear predictive cepstral coefficients (LPCC) and the mel-frequency cepstral coefficients (MFCC). The features are computed and classified on a frame-by-frame basis. We compare the performance of two classifiers, namely the neural tree network (NTN) and the vector quantizer (VQ). The results show that in both the closed- and open-set cases, (1) the VQ is the better classifier and (2) the PPF outperforms both the MFCC and LPCC features. The superiority of the PPF comes with the added benefits of a scalar feature, as opposed to the 12-dimensional LPCC and MFCC feature vectors, and a smaller VQ codebook size.
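The frame-by-frame VQ classification described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`train_codebook`, `classify_frame`), the use of plain Lloyd's k-means for codebook training, the Euclidean distortion measure, and the two-class (one-speaker vs. two-speaker) codebook comparison are all assumptions for illustration; the actual PPF, LPCC, or MFCC feature computation is omitted.

```python
# Illustrative sketch (assumed details): train one VQ codebook per class,
# then label each frame by which codebook yields the lower distortion.
import numpy as np

def train_codebook(frames, k, iters=20, seed=0):
    """Assumed codebook training via plain Lloyd's k-means.

    frames: (n_frames, dim) array of per-frame feature vectors.
    Returns a (k, dim) codebook of centroids.
    """
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each frame to its nearest codeword (Euclidean distance).
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each codeword to the mean of its assigned frames.
        for j in range(k):
            members = frames[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def classify_frame(frame, cb_one, cb_two):
    """Label one frame by the codebook with the lower minimum distortion."""
    d_one = np.min(np.linalg.norm(cb_one - frame, axis=1))
    d_two = np.min(np.linalg.norm(cb_two - frame, axis=1))
    return "one-speaker" if d_one < d_two else "two-speaker"
```

In this sketch a smaller codebook (smaller `k`) directly reduces both storage and per-frame distance computations, which is the practical benefit the abstract attributes to the scalar PPF relative to the 12-dimensional LPCC and MFCC vectors.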
All Science Journal Classification (ASJC) codes
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence