
Automatic Censorship

An EECS 351 Project

We are embarking on a journey to apply basic digital signal processing theory to solve real-world challenges. We are designing a low-level speech recognition system that detects and censors key words in a speech clip.


Overall Problem Definition

We are attempting to create a system, with no human in the loop, that can detect and censor certain words in spoken audio and then play back the censored clip.
Early on, we hypothesized that the best approach would be to break the task into two subproblems that, together, solve the overall problem. The first is detecting where individual words are spoken in an audio clip and separating each word into its own clip; the second is determining whether a given clip of an individual word contains the word to be censored. Solving and combining these two pieces becomes our overall system.


Challenges of Voice Recognition

In designing the speech recognition subsystem of our automatic censorship system, we must address several universal challenges of speech-recognition systems.

  1. “Special” words can increase difficulty in recognition.

    1. Recognizing nearly imperceptible words. Prepositions have very short durations (usually <100 ms) and are almost imperceptible in speech, so they are often merged with or mistaken for the preceding word. Realizing this level of recognition requires a classifier that can discover and distinguish the characteristics of prepositions from environmental noise.

    2. Recognizing the meanings of individual words. Homophones and homonyms have nearly identical pronunciations and can only be distinguished by their meanings. Realizing this level of recognition requires an identifier that learns from its input which meaning is the target of censorship.

    3. Recognizing the contextual meaning of a word. Sometimes we do not want to censor a word every time it occurs in a sentence; we only want to censor a specific sense of the word in context. Realizing this level of recognition may require an NLP tool bundle (parser, part-of-speech tagger, relation extractor, and so on).

  2. Environmental noise can blur and distort the acoustic characteristics of a spoken word. Furthermore, a speaker who senses background noise may modify their speaking style in an attempt to stay intelligible over the noisy channel. Ensuring that the undistorted characteristics of words are recognized requires a noise-elimination subsystem.

  3. Different accents in English can cause many problems in interpreting what the speaker wants to express and dramatically increase the difficulty of speech recognition. Handling this requires studying and classifying the characteristics of English words spoken with different accents, so the system can be designed to accommodate them.

  4. Different speech habits in English can also cause problems in interpreting what the speaker wants to express, especially with fast speakers, who often eliminate the gap between two words and run them together as if they were one. Handling this requires a system that can reduce the speaking pace without distorting the characteristics of the spoken words.


Speech Recognition Techniques

A speech recognition system has two fundamental stages: feature extraction and classification. For feature extraction, there are two leading families of techniques:

  1. Time-domain speech analysis -- Linear Prediction Coefficients (LPC) method [1]:

    1. A parametric technique that recognizes isolated words through calculation of a minimum prediction residual. For each word to be recognized, a reference pattern is stored as a series of LPCs over time, and an input word is matched to the reference whose LPCs minimize the total log prediction residual computed from the input's autocorrelation coefficients.

    2. The LPC method models the resonant characteristics of human speech reasonably well [2]. However, it is limited in representing speech because it assumes the signal is stationary within each analysis frame, so it cannot accurately capture localized events. It is also limited in capturing weak and nasalized sounds accurately [3]. (A minimal sketch of an autocorrelation-based LPC computation appears after this list.)

    3. Linear Predictive Analysis (LPC)

    4. Linear Predictive Cepstral Coefficients (LPCC)

    5. Perceptual Linear Predictive Coefficients (PLP)

  2. Frequency-domain speech analysis -- Mel-frequency cepstral coefficients (MFCC) [4], with variants including:

    1. Mel-scale cepstral analysis (MEL)

    2. Relative spectra filtering of log domain coefficients (RASTA)
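
To make the LPC discussion above concrete, here is a minimal Python sketch (our illustrative choice of language and parameters, not code from the references) of the autocorrelation method for computing the LPCs and the prediction-residual energy of a single windowed frame:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Autocorrelation-method LPC for one (windowed) speech frame."""
    # Biased autocorrelation r[0..order]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:] (Levinson-Durbin style)
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    # Prediction-error (residual) energy, the quantity Itakura's distance measure is built on
    residual = r[0] - np.dot(a, r[1:])
    return a, residual
```

In a recognizer like [1], the residual computed between an input frame's autocorrelation and each stored reference's LPCs is what gets minimized over the reference patterns.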


For feature classification, commonly used approaches include:

  1. k-Nearest Neighbors Algorithm (k-NN)

  2. Support Vector Machine (SVM)

  3. Particle Swarm Optimization (PSO)

  4. Neural Networks

    1. Artificial Neural Network (ANN)

    2. Back-Propagation Neural Networks (BPNN)

References


[1] F. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing 23(1) (1975), 67–72.

[2] A.V. Haridas, R. Marimuthu, and V.G. Sivakumar, A critical review and analysis on techniques of speech recognition: The road ahead, International Journal of Knowledge-Based and Intelligent Engineering Systems 22(1) (2018), 39–57.

[3] L. Rabiner and B.H. Juang, Fundamentals of Speech Recognition (Prentice-Hall Inc, Englewood Cliffs, NJ, 1993).

[4] S.B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4) (1980), 357–366.


First Iteration: Word Separation

Silence Thresholding

At first glance, this problem seems relatively simple: look for the spacing (silence) between words and split the clip at those points. However, it is not quite that trivial. When speaking at a slow to regular pace, there will at best be a dip in volume between words, not complete silence, which is where silence thresholding comes in. We define a threshold value based on the average amplitude of the audio clip and consider anything below this threshold to be silence. The algorithm then searches the clip for regions that dip below the threshold, removes them, and returns the segments between silences as individual audio clips. To learn more about how this works, click the link below.
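
A minimal sketch of this silence-thresholding separator, written in Python for illustration (our project code may differ), is shown below; the smoothing window, threshold fraction, and minimum word length are assumed values rather than our tuned parameters:

```python
import numpy as np
from scipy.io import wavfile

def separate_words_by_silence(wav_path, threshold_frac=0.1, min_word_ms=80):
    """Split a clip into word segments wherever the smoothed envelope drops below a threshold."""
    sr, x = wavfile.read(wav_path)
    x = x.astype(float)
    if x.ndim > 1:                                   # mix stereo down to mono
        x = x.mean(axis=1)

    # Smooth |x| with a ~20 ms moving average so zero crossings inside a word
    # are not mistaken for silence
    win = max(1, int(0.02 * sr))
    env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")

    # Threshold defined relative to the clip's average amplitude
    threshold = threshold_frac * env.mean()
    voiced = env > threshold

    # Return contiguous above-threshold runs as individual word clips
    words, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if (i - start) >= min_word_ms * sr / 1000:
                words.append(x[start:i])
            start = None
    if start is not None:
        words.append(x[start:])
    return sr, words
```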

Findings

This technique works well for sentences spoken more slowly than normal speech. However, if we want it to work for someone speaking at an everyday pace, we will need to refine our approach.


First Iteration: Word Recognition

To recognize the individual words once they were separated, we chose a k-nearest neighbors classifier and tried various preprocessing approaches to maximize its accuracy. These included the averaged spectrogram, the magnitude of the discrete Fourier transform, and the periodogram estimate of the spectral density. We chose these techniques because they were frequency-domain methods we were familiar with and because they were relatively simple to implement.
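
As an illustration, the sketch below (Python, with scikit-learn as our assumed k-NN implementation) shows the averaged-spectrogram variant of this pipeline; the DFT-magnitude and periodogram variants would simply swap out the feature function:

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.neighbors import KNeighborsClassifier

def averaged_spectrogram_feature(clip, sr, nperseg=256):
    """Average the spectrogram over time, giving one fixed-length vector per word clip."""
    f, t, Sxx = spectrogram(clip, fs=sr, nperseg=nperseg)
    return Sxx.mean(axis=1)

def train_word_classifier(train_clips, train_labels, sr, k=3):
    """Fit a k-NN classifier on averaged-spectrogram features of separated word clips."""
    X = np.array([averaged_spectrogram_feature(c, sr) for c in train_clips])
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, train_labels)
    return knn

# Usage sketch (train_clips/test_clips would come from the word separator):
# knn = train_word_classifier(train_clips, train_labels, sr)
# X_test = np.array([averaged_spectrogram_feature(c, sr) for c in test_clips])
# accuracy = knn.score(X_test, test_labels)
```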


Findings

These approaches had mixed results. Reasons for the less-than-optimal results include the fact that the techniques were not specific enough to human speech to reliably extract meaningful features, and that the data fed into the classifier was strictly one-dimensional.


Second Iteration: Word Separation

Because silence thresholding limited our word separator to slow, clear speech, we looked at other ways to separate words. We arrived at a power thresholding method, which breaks the audio clip into predetermined sections and cuts out the audio whose spectral power falls below a chosen threshold. This method was more successful at separating words in faster speech; however, polysyllabic words were sometimes split into their syllables.
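
A minimal sketch of this power-thresholding idea is shown below; the section length and power fraction are illustrative assumptions, not the thresholds we actually used:

```python
import numpy as np

def separate_words_by_power(x, sr, section_ms=30, power_frac=0.15):
    """Split a clip into fixed-length sections and keep those above a power threshold."""
    n = max(1, int(sr * section_ms / 1000))
    sections = [x[i:i + n].astype(float) for i in range(0, len(x) - n + 1, n)]

    # By Parseval's theorem the energy in each section can be computed directly
    # from its samples; no FFT is needed to threshold on spectral power.
    powers = np.array([np.mean(s ** 2) for s in sections])
    keep = powers > power_frac * powers.mean()

    # Merge runs of consecutive kept sections into candidate word clips
    words, run = [], []
    for s, k in zip(sections, keep):
        if k:
            run.append(s)
        elif run:
            words.append(np.concatenate(run))
            run = []
    if run:
        words.append(np.concatenate(run))
    return words
```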

Findings

This is an improvement on silence thresholding, but to make our system work with regular to fast speech, we believe the best approach is to remove word separation entirely.


Second Iteration: Word Recognition

For word classification in our second iteration, we found a technique more specific to human speech: Mel-frequency cepstral coefficients (MFCC). It is a feature extraction technique built around the mel scale, which models the logarithmic nature of human hearing. We also changed how our algorithm is trained and how it predicts so that it could work with two-dimensional data, letting us get the most out of the MFCC features. Additionally, we chose to train with data from the same speakers to improve the accuracy of our approach. For more information on this, click the link below.
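
The sketch below illustrates this approach, with librosa as an assumed MFCC implementation and scikit-learn for k-NN; the number of coefficients, the fixed frame count, and the padding scheme are our illustrative choices rather than the project's exact settings:

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

N_MFCC = 13      # number of cepstral coefficients per frame (assumed value)
N_FRAMES = 40    # pad/trim every word clip to a fixed number of frames (assumed value)

def mfcc_feature(clip, sr):
    """2-D MFCC matrix (coefficients x frames), padded/trimmed to a fixed size and flattened."""
    m = librosa.feature.mfcc(y=clip.astype(float), sr=sr, n_mfcc=N_MFCC)
    if m.shape[1] < N_FRAMES:
        m = np.pad(m, ((0, 0), (0, N_FRAMES - m.shape[1])))
    return m[:, :N_FRAMES].ravel()

def train_mfcc_knn(train_clips, train_labels, sr, k=3):
    """Fit a k-NN classifier on the flattened 2-D MFCC features of each word clip."""
    X = np.array([mfcc_feature(c, sr) for c in train_clips])
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, train_labels)
    return knn
```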


Findings

We were able to obtain 97.5% accuracy in the worst case using the same data set as the previous iteration, which is about a 30% improvement over the averaged spectrogram. The MFCC is a far better pre-processing technique than any of the generic frequency-domain techniques we tried earlier.


Third Iteration: Classification Scanning

In our final iteration of this project, we decided to avoid word separation entirely. Detecting where words begin and end introduces errors in finding and segmenting them properly. If words are skipped, combined, or only partially extracted and then fed into the classifier, the results are often incorrect: these incomplete audio clips are misclassified, and the error propagates. We decided that scanning the audio signal with a predefined window (a small segment) would give us the desired accuracy. This method did increase word detection accuracy, but it also increased computation time.


Our scanning method systematically steps through the original audio signal to find the censored word. A window of the signal is fed into the classifier, which produces a probability that it matches the censored word. If the probability is high enough, the window is considered to contain the censored word and is edited out of the original audio signal. The window is then moved forward in time and the process repeats. This allows multiple occurrences of the censored word to be detected and reduces the complexity and error of the program; however, it requires the program to perform many more classification steps than before. For more information on this, click the link below.
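
A minimal sketch of the scanning step is shown below; it reuses the hypothetical mfcc_feature function and trained knn classifier from the previous sketch, and the window length, hop size, and probability threshold are illustrative values rather than our tuned parameters:

```python
import numpy as np

def censor_by_scanning(x, sr, knn, target_label, win_s=0.5, hop_s=0.1, p_min=0.8):
    """Slide a window over the clip, classify it, and silence windows that match the target word."""
    y = x.astype(float).copy()
    win, hop = int(win_s * sr), int(hop_s * sr)
    for start in range(0, len(x) - win + 1, hop):
        feat = mfcc_feature(x[start:start + win], sr).reshape(1, -1)
        probs = knn.predict_proba(feat)[0]             # posterior probability for each trained word
        idx = list(knn.classes_).index(target_label)
        if probs[idx] >= p_min:
            y[start:start + win] = 0.0                 # replace the matched window with silence
    return y
```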


Final Results

Below are two examples of our program: the first is iteration 2 and the second is iteration 3. The original audio clip says, “You have to drive through Ohio to get to Florida. Ohio has so many farms. Ohio is such a boring state.” It is visually difficult to see the location of each word, which is what made this clip difficult to censor. When we implemented scanning in our program, we were able to censor the word “Ohio” each time it appeared. The audio clips found below demonstrate the finished product and the future potential of our program.


Censoring with Word Separation

[Image: “Ohio” audio clip censored using word separation]

Audio Files

Click to Download

Original

Censored (fail)


Censoring with Scanning

[Image: “Ohio” audio clip censored using scanning]

Audio Files

Click to Download

Original

Censored


DSP Techniques

We used a variety of DSP techniques throughout this project. Some were learned in UM's EECS 351, a DSP course; in-class techniques we used include windowing, the spectrogram, Parseval's theorem, and k-NN classification. In addition to these, we used techniques from outside the course, such as mel-frequency cepstral coefficients (MFCC) and posterior probability (score).


Further Improvements

Our system is currently designed for post-processing because of the computational intensity and the time it takes to censor an audio clip. We would like to optimize the system and reduce its run time. Currently, that means either implementing a more advanced word separation algorithm or finding scanning and classification parameters that balance detection accuracy against computation time.


We would like to expand our feature vector so we can more accurately detect words with fewer voiced sounds. Currently, our feature vector is composed only of MFCCs. There are other approaches to classification and other features, such as pitch, that could be added to increase the robustness of our classification.


We would also like to eliminate manually set absolute thresholds. Our program uses several thresholds that are set through experimentation and can change depending on the training data or the audio clip being censored. We would like to find a correlation, or use machine learning, to create a flexible threshold that adapts to the training data and the input audio signal.

To increase recognition accuracy by exploiting the acoustic characteristics of a separated word, we would also like to develop a subsystem that recognizes the vowel in a given spoken word. We believe this could enable more accurate classification of spoken words. We currently have a preliminary vowel-recognition system inspired by the idea presented in [1]. We take a spectrogram-analysis approach, assuming the input word is relatively constant in amplitude and spoken deliberately slowly. We preprocess the input by detrending the signal with a low-order polynomial, windowing it into a series of non-overlapping frames (of user-defined length), fitting a transfer function to each frame, and finding the peaks of that transfer function to match against the closest formant pairs from the formant table reported in [2]. This idea is limited by its requirement of clear enunciation, since we naturally vary pitch while speaking, and it only works for vowels with a reasonably long duration. It is still at a very preliminary stage, and recognition is only accurate for distinctly different vowels (such as the vowel in “head”). In future iterations, we hope to fully utilize this idea in our speech recognition system and evaluate the improvement in recognition accuracy.
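
The sketch below outlines this preliminary idea in Python; it substitutes LPC-root angles for the transfer-function peak-picking step described above, and the formant table entries are placeholders to be replaced with values from [2]:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Placeholder (F1, F2) pairs in Hz; real values should come from the formant table in [2]
FORMANT_TABLE = {"head": (550, 1850), "hod": (700, 1100), "who'd": (300, 900)}

def estimate_formants(frame, sr, order=10):
    """Rough F1/F2 estimate from the roots of an LPC all-pole fit to one frame."""
    n = np.arange(len(frame))
    frame = frame - np.polyval(np.polyfit(n, frame, 2), n)   # detrend with a low-order polynomial
    frame = frame * np.hamming(len(frame))
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])               # LPC via the autocorrelation method
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]                         # keep one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[:2]                                          # lowest two resonances ~ F1, F2

def match_vowel(frame, sr):
    """Match the estimated formant pair to the nearest entry in the formant table."""
    f1, f2 = estimate_formants(frame, sr)
    return min(FORMANT_TABLE, key=lambda v: np.hypot(f1 - FORMANT_TABLE[v][0], f2 - FORMANT_TABLE[v][1]))
```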

References

[1] https://cnx.org/contents/sO5OIfhH@3/Vowel-Recognition-using-Matlab

[2] http://ec-concord.ied.edu.hk/phonetics_and_phonology/wordpress/learning_website/chapter_2_vowels_new.htm


Applications

We are all connected in this age of technology, and information spreads faster and wider than ever. Because of this, information is often spread that it is irresponsible to spread or to receive. This is where our system is applicable. Any content creator with advertising obligations could use our product: from podcasts to shows to streaming platforms, our system can be used in any situation where certain words must be omitted from the final product.
On the listener side, our system could be used to “child-proof” the media that listeners consume. Explicit music, movies, and videos, among other things, can be subdued to the point where your little one can enjoy them. Looking ahead, we envision a product placed in the ear canal that censors the language your child hears in real time.


Ethics of Censorship

We would like to take a brief moment to discuss the ethics of censorship. While our project may not be on the scale of some engineering applications of censorship, we recognize that our system, and other systems like it, can be used for negative purposes: controlling what the public can and cannot see or hear, as well as what the public can and cannot say. It is up to engineers like ourselves, who are most familiar with these projects and applications, as well as citizens and voters, to hold not only ourselves but also those with the power to use this technology on a grander scale accountable for the way it is used.


Download Our Code Here
