These notes will help you get the most out of Voice Trap, the vocal eliminator / isolator plug-in. DirectX and VST versions are available.
General controls
The Mode buttons switch between removing the vocal track and isolating it.
The Gain knob adjusts the level of the output. Each of the separation engines can mess with the levels, so you may need to compensate.
FFT Center Channel controls
The Engage button engages or bypasses the FFT-based center-channel separation engine. This is based on the idea that in most cases the vocal track is panned dead center. In Remove mode, this portion of the signal is removed. Isolate mode does the converse - it leaves the center portion intact, and gets rid of the rest of the signal - in most cases, the backing arrangement.
The Bass Cutoff control lets you decide not to mess with the bass frequencies. In Remove mode, this means that bass guitar and kick drum (which are also panned center) are not removed. In Isolate mode, it means that the bass frequencies are not considered part of the signal that we are trying to isolate.
The Treble Cutoff control lets you decide not to mess with the treble frequencies. In Remove mode, this means that high hats and cymbals (if they are panned center) are not removed. In Isolate mode, it means that those selected treble frequencies are not considered part of the signal that we are trying to isolate.
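If you like to think in terms of FFT bins, the cutoffs simply mark off a range of bins that get left alone. Here's a rough Python sketch of the idea (the values and names are made up for illustration - this isn't Voice Trap's actual code):

    # Illustrative only: map the Bass / Treble Cutoff frequencies to a range of
    # FFT bin indices. Bins outside the range are passed through untouched.
    def cutoff_bins(bass_cutoff_hz, treble_cutoff_hz, fft_size, sample_rate):
        bin_width = sample_rate / fft_size       # Hz covered by each FFT bin
        lo = int(bass_cutoff_hz / bin_width)     # first bin we may modify
        hi = int(treble_cutoff_hz / bin_width)   # last bin we may modify
        return lo, hi

    lo, hi = cutoff_bins(120.0, 8000.0, 8192, 44100)   # example settings
    # Only bins lo..hi are candidates for removal / isolation.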
The Center Width knob controls just how narrow or wide the "slot" in the center is. If you choose a low setting, only signal that is absolutely, mathematically dead center will be selected for removal or isolation. A higher setting will select signal that is close to the center.
The Center Profile knob lets you adjust the width of the center "slot" based on frequency. The lowest setting /\ has your specified width applying at the bottom of the frequency spectrum, but narrows to zero at the top. In the middle, the || setting has no frequency-based adjustment - your chosen width applies across all frequencies. The Maximum setting \/ applies zero width at the lowest frequencies, but opens up to your chosen width at the top of the spectrum.
The Phase Window control further controls which portions of the center signal to select. What we are talking about here is how the sound waves in the left and right channels "line up". A low setting will only select frequencies that are perfectly lined up in both channels. A higher setting is less strict, and will also select frequencies that are only more or less lined up.
The Phase Profile knob lets you adjust how your chosen Phase Window is applied across the frequency spectrum. The lowest setting /\ has your specified window applying at the bottom of the frequency spectrum, but narrows to zero at the top. In the middle, the || setting has no frequency-based adjustment - your chosen phase window applies across all frequencies. The Maximum setting \/ applies zero phase window at the lowest frequencies, but opens up to your chosen value at the top of the spectrum.
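For the mathematically curious, here is a rough Python sketch of how the Width, Phase Window and Profile settings could combine into a per-bin "is this center?" test. It's a simplified illustration of the general idea, not the actual Voice Trap maths:

    import numpy as np

    # profile values: -1 is the /\ setting (full amount at the bass end, zero at
    # the treble end), 0 is || (uniform), +1 is \/ (zero at bass, full at treble).
    def shaped(amount, profile, f):
        # f runs from 0.0 (lowest bin) to 1.0 (highest bin)
        if profile <= 0:
            return amount * (1.0 + profile * f)          # -1 -> amount * (1 - f)
        return amount * (1.0 - profile * (1.0 - f))      # +1 -> amount * f

    def center_mask(L, R, width, width_profile, phase_window, phase_profile):
        f = np.linspace(0.0, 1.0, len(L))
        # 0 = same level in both channels, 1 = completely one-sided
        mag_l, mag_r = np.abs(L), np.abs(R)
        level_diff = np.abs(mag_l - mag_r) / (mag_l + mag_r + 1e-12)
        # 0 = perfectly lined up, 1 = completely out of phase
        phase_diff = np.abs(np.angle(L * np.conj(R))) / np.pi
        return ((level_diff <= shaped(width, width_profile, f)) &
                (phase_diff <= shaped(phase_window, phase_profile, f)))

In Remove mode the bins where this test comes back true would be attenuated; in Isolate mode everything else would be attenuated instead.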
Cepstral Lifter controls
The Engage button engages or bypasses the Cepstral Lifter
separation engine. Cepstral liftering (yes, liftering - it's not a spelling mistake)
is an advanced mathematical technique for identifying and modifying harmonically rich groups of frequencies,
in a way that cannot be achieved with FFT-based techniques alone.
By the way, "cepstrum" is an anagram of "spectrum", and "lifter" is an anagram of "filter". Oh those crazy mathematicians!
Anyway, the idea is that if we find a prominent, harmonically rich but essentially monophonic
portion of the signal, and it is in the vocal frequency range, it is most likely to be the lead vocal.
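For those who want to see the general technique in code, here is a minimal Python sketch of cepstral tracking: the real cepstrum is the inverse FFT of the log magnitude spectrum, and a harmonically rich, essentially monophonic source shows up as a peak at the quefrency matching its pitch. This is a textbook illustration, not the Voice Trap engine:

    import numpy as np

    def track_vocal_quefrency(chunk, sample_rate, low_hz, high_hz):
        # The real cepstrum: inverse FFT of the log magnitude spectrum.
        spectrum = np.fft.rfft(chunk * np.hanning(len(chunk)))
        cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
        # Quefrency is measured in samples; a pitch of f0 Hz shows up near
        # quefrency sample_rate / f0.
        q_lo = int(sample_rate / high_hz)    # highest pitch -> shortest quefrency
        q_hi = int(sample_rate / low_hz)     # lowest pitch  -> longest quefrency
        peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
        return peak, sample_rate / peak      # quefrency (samples) and pitch (Hz)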
The Track Low and Track Range controls define the vocal range in which Voice Trap will search for a lead vocal component. When you modify these settings, the vocal range display will be updated, showing the range in note names and frequencies.
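The note-name display is nothing more exotic than a logarithm. A quick illustrative sketch, assuming 12-tone equal temperament with A4 = 440 Hz (the example frequencies are made up):

    import math

    NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    def note_name(freq_hz):
        midi = round(69 + 12 * math.log2(freq_hz / 440.0))   # nearest MIDI note number
        return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)    # MIDI 60 -> "C4"

    print(note_name(82.4), note_name(329.6))    # E2 E4 (example range endpoints)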
The Node Width knob controls how modifications are applied. The mathematics get pretty hairy here, so I'll make some generalisations. A low setting will apply modifications to those portions of the signal that exactly match what we were looking for. Unfortunately, this will probably not be enough to do the job. A high setting will apply modifications all over the place - hopefully to portions of the signal that are close or related to what we matched. This will therefore have much more of an effect at suppressing or isolating the vocal; however, it will also generate more artifacts. Somewhere in the middle you will hopefully find a setting that is effective and does not create too many artifacts.
The Long Nodes, Track Node and Short Node knobs boost or cut the selected portion of the signal. In Remove mode, these knobs progressively cut out the selected portions of the signal, and at higher settings even invert them. This may seem odd, but remember that we're describing something mathematically deep, so the description doesn't quite fit. Put simply, higher values will attempt to remove more of the shape of the spectrum of the signal than is actually there. This will be necessary in some cases, though it will generate artifacts. In Isolate mode, these knobs behave slightly more sensibly, and boost the selected portions of the signal. However (once again), since we are modifying spectral signatures rather than actual signal, even modest amounts of boost can have a dramatic effect.
The Long Nodes control modifies a portion of the spectrum having lower-frequency relationships with the tracked vocal signal.
The Track Node control modifies a portion of the spectrum that was identified as the tracked vocal signal.
The Short Node control modifies a portion of the spectrum having higher-frequency relationships with the tracked vocal signal.
What does that all mean? Why three controls? One answer is that if the tracking misses the vocal,
having "related nodes either side" means the mistake will be less obvious. Another answer is that cepstral tracking
does not appear to catch the full spectral signature of a vocal track, so by modifying "related nodes either side"
we can get more useful results.
In practice, it means that you should probably have the Track Node somewhere in the middle, with Long Nodes
and Short Node at slightly lower settings.
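For the curious, here is a very loose Python sketch of what "liftering with three nodes" could look like: cut (Remove mode) or boost (Isolate mode) the cepstrum around the tracked quefrency and around related nodes on either side, then turn the result back into a per-bin spectral gain. The node placement and the arithmetic are guesses for illustration only - the real engine is considerably hairier:

    import numpy as np

    def lifter_gains(cepstrum, track_q, node_width, long_amt, track_amt, short_amt):
        lifter = np.zeros_like(cepstrum)
        nodes = [(track_q * 2,  long_amt),    # Long Nodes: longer quefrency (lower-frequency relation)
                 (track_q,      track_amt),   # Track Node: the tracked vocal itself
                 (track_q // 2, short_amt)]   # Short Node: shorter quefrency (higher-frequency relation)
        for q, amount in nodes:
            lo, hi = max(q - node_width, 1), q + node_width + 1
            lifter[lo:hi] += amount           # negative = cut, positive = boost
        # The liftered cepstrum, taken back to the frequency domain, is a
        # log-magnitude correction - in other words a per-bin gain.
        correction = np.fft.rfft(cepstrum * lifter).real
        return np.exp(correction)             # multiply into the FFT magnitudes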
A word about cepstral liftering
Cepstral liftering is a cutting-edge mathematical technique. Ha!
It would perhaps be more accurate to call it a black art: "We are dealing with forces we do not understand."
So if you run into the odd spot of frustration, please be patient.
You're right at the bleeding edge.
The technique has a lot of promise, and can do things not possible otherwise.
A word about FFT-based center-channel voice reduction / isolation
Voice Trap works by dividing a small "chunk" of sound into a large number of frequency-based components. This is done using a method called "fast Fourier transform", or FFT for short. After the sound is divided, each component is examined, and the question is asked: "Is this component both the same volume in the left and right channels, and also lined up the same?" If the answer is yes, then we either remove that component (for Remove mode), or remove everything else (for Isolate mode).
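In code form, the general idea for one chunk looks something like this illustrative Python sketch (it reuses the center_mask() function sketched earlier, and the threshold values are made up - this isn't the actual plugin code):

    import numpy as np

    def process_chunk(left, right, mode="remove"):
        window = np.hanning(len(left))
        L = np.fft.rfft(left * window)
        R = np.fft.rfft(right * window)
        # "Is this component the same volume in both channels, and lined up the same?"
        center = center_mask(L, R, width=0.2, width_profile=0.0,
                             phase_window=0.2, phase_profile=0.0)
        keep = ~center if mode == "remove" else center   # which components survive
        out_l = np.fft.irfft(np.where(keep, L, 0.0))
        out_r = np.fft.irfft(np.where(keep, R, 0.0))
        return out_l, out_r

A real plug-in would process overlapping chunks and crossfade them (overlap-add) so the joins between chunks don't click.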
Your results will vary from mix to mix!
When removing or isolating the center track from a mix, you are always going to be treading a fine line. If you select too much signal to remove / isolate, you will hear FFT/cepstral artifacts, and if not enough signal is selected, you don't get enough removal / isolation. You will find that some mixes will separate nicely, others will be impossible.
Buffers and Latency
The FFT (fast Fourier transform) that Voice Trap uses is a
powerful mathematical technique, but it needs to work on rather large chunks of
audio data at once (around 200 milliseconds). However, if you like to use VSTi
and DXi soft synths, you probably have your PC and audio host app set up for
low latency – which means that processing plugins like Voice Trap are only
passed very small chunks of data (sometimes less than 10 milliseconds).
If you are running the DirectX version of Voice Trap, and it decides
that the buffers are too small, it switches to Buffer Accumulate Mode.
When this happens, Voice Trap will introduce a short delay between the input
and output, because it has to "accumulate" a full buffer's worth before it
can process, meaning that the output data is 200 ms "behind the
times".
There are two things you can do about this: (1) go into your host app’s audio
settings window, and increase the latency/buffer sizes, or (2) drag the audio
data 200 ms to the left after processing it.
The VST version of Voice Trap won’t exhibit this delay under most hosts,
because VST is able to "tell" the host that there is a delay, and the
host can compensate by "nudging" things forward a bit.
Glitches during playback
If the latency is very low, the CPU utilisation of Voice Trap will
become erratic. This is easily explained: for most of the time (perhaps four
out of every five "little" buffers passed in) the plugin is doing
very little work, just "passing on" data that it has already
processed. But then once a buffer has accumulated, it does a processing
operation on the accumulated "big" chunk. If this uneven CPU usage
causes problems (i.e. glitching during playback), you may need to increase your
latency settings. For most apps this is done with an audio settings dialog, but
for Steinberg products like Cubase and WaveLab, you have to go to your
soundcard's ASIO settings dialog.
When trying to track down and solve this problem, here are the sort of things
you'll need to check (depending on your audio interface and host application):
* Soundcard / audio interface: latency
* Soundcard / audio interface: number of buffers
* Soundcard / audio interface: buffer size / length / samples
* Host application: latency
* Host application: number of buffers
* Host application: buffer size / length / samples
This problem will not occur if you process your tracks destructively.
All content and software Copyright 2007 Trevor Magnusson