I have been able to experimentally replicate a very important property, and in fact one of the main reasons WBVPM was developed: shape-invariance. All else being equal, the shape of each pulse in the voice waveform stays roughly the same regardless of pitch. The reason for this property is phase-coherence. At the start of each voice pulse (when the glottis closes), the phases of all the harmonics within each formant (where each 'formant' is a spectral region that the vocal tract affects differently) are roughly the same. Since each harmonic's phase advances at a rate proportional to its frequency, the harmonics drift apart from that point onward, and their phases soon become very different from one another. Away from the voice pulse onset, the harmonics therefore interfere constructively and destructively in the time domain, and that interference pattern is what gives the pulse its shape. If the frequencies of all the harmonics are scaled by the same factor, every phase advances faster or slower, but by the same factor for all of them. This gives rise to shape-invariance, since the pattern of interference stays the same, only stretched or compressed in time.
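To make this concrete, here is a minimal sketch in plain Python. It is not the WBVPM code, just an idealised model that I am assuming for illustration: a single formant, ten equal-amplitude harmonics, and every harmonic at phase zero on the pulse onset. The function name `pulse` and all the parameter values are made up for the example.

```python
import math

def pulse(f0_hz, n_harmonics, sample_rate, n_samples):
    """Sum of harmonics that all have phase 0 at sample 0 (the assumed pulse onset)."""
    return [
        sum(math.cos(2 * math.pi * k * f0_hz * n / sample_rate)
            for k in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]

sample_rate = 48000
high = pulse(200.0, 10, sample_rate, 240)  # one period at 200 Hz is 240 samples
low = pulse(100.0, 10, sample_rate, 480)   # one period at 100 Hz is 480 samples

# Shape-invariance: the 100 Hz pulse is the 200 Hz pulse stretched by two,
# so every second sample of `low` should match `high`.
max_diff = max(abs(low[2 * n] - high[n]) for n in range(240))
print(max_diff)
```

The printed difference should be zero up to floating-point rounding, because scaling every harmonic frequency by the same factor only rescales the time axis of the interference pattern.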
If you apply a transform relative to a point that is not a voice pulse onset, the phases at that point will not be flat. The transform can account for how the phases change *from* the point it started at, but it does NOT account for the phase offsets each harmonic has already accumulated because that point is not a voice pulse onset. Of course, if no transform is applied, there is no issue. But if one is, say a pitch transposition, then the starting phases the signal would have had if it had actually been produced at the target pitch differ considerably from the observed ones, since the observed ones are measured *at a different pitch*. This breaks shape-invariance, produces a noticeable 'phasiness', and makes the result sound un-human.
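The same idealised model (every harmonic at phase zero on the onset, which is my assumption and not a measurement) shows why the analysis point matters. The helper `harmonic_phases` and the 1 ms offset are hypothetical values picked for the example.

```python
import math

def harmonic_phases(f0_hz, n_harmonics, t_seconds):
    """Wrapped phase of harmonics 1..n, t_seconds after an onset where every phase is 0."""
    two_pi = 2 * math.pi
    return [(two_pi * k * f0_hz * t_seconds + math.pi) % two_pi - math.pi
            for k in range(1, n_harmonics + 1)]

at_onset = harmonic_phases(200.0, 5, 0.0)     # analysis point exactly on the onset
off_onset = harmonic_phases(200.0, 5, 0.001)  # analysis point 1 ms after the onset
target = harmonic_phases(100.0, 5, 0.001)     # same point if the voice really were at 100 Hz

# Spread of the phases across harmonics: 0 would mean they are flat.
spread = max(off_onset) - min(off_onset)
# Largest gap between the phases observed at 200 Hz and those expected at 100 Hz.
mismatch = max(abs(a - b) for a, b in zip(off_onset, target))
print(at_onset, spread, mismatch)
```

On the onset every phase is zero, so there is nothing to correct. One millisecond later the phases are spread across the harmonics, and they no longer match what a voice at the target pitch would show at that same instant.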
Here is an image of 500 samples from the original signal:
https://files.catbox.moe/223l7p.png
Now here are 1000 samples from a one-octave-down pitch transposition using a naive approach (a fixed window and hop size, with a 1024-point Hann window):
https://files.catbox.moe/jxgtg0.png
Notice that not only is the waveform unrecognizable compared to the original, it even varies considerably between individual voice pulses!
Now compare to 1000 samples from the WBVPM approach:
https://files.catbox.moe/6hpf8l.png
Notice how the waveform is almost identical to the original, only with the period doubled, and how it varies much less from pulse to pulse.
You may be wondering, couldn't we just downsample or upsample the signal and play it back at the same sample rate to get the same result? Well, the difference is that we have independent control over pitch and time. In the example, I lowered the pitch by a factor of two but kept the duration the same, so the result contains the same number of samples as the original. Additionally, the analysis and resynthesis separate the signal into individual voice pulses. It isn't just scaling them: it is generating new voice pulses in the frequency domain and inserting them at positions that were also generated.
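As a rough sketch of what "positions that were also generated" can mean, here is one simple way to place pulse onsets for a constant target pitch over a fixed duration. This is an assumption for illustration only, not the actual placement logic; `onset_positions` is a hypothetical helper.

```python
def onset_positions(f0_hz, sample_rate, n_samples):
    """Sample indices of pulse onsets for a constant pitch over a fixed duration."""
    period = sample_rate / f0_hz  # samples between consecutive onsets
    positions = []
    k = 0
    while k * period < n_samples:
        positions.append(round(k * period))
        k += 1
    return positions

sample_rate = 48000
n_samples = 48000  # one second, the same length for both pitches

original = onset_positions(200.0, sample_rate, n_samples)
octave_down = onset_positions(100.0, sample_rate, n_samples)
print(len(original), len(octave_down))
```

Both onset lists cover the same one second of audio. The octave-down version has half as many pulses, spaced twice as far apart, which is what separates this from resampling, where the duration would double instead.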
Here is an amplitude envelope of the latter half of the original audio:
https://files.catbox.moe/6zw5v5.png
Now here is an amplitude envelope of the latter half of the pitch-transposed audio:
https://files.catbox.moe/2a3utu.png
Notice how they are roughly the same. If the audio had just been downsampled, the second would be stretched out by a factor of two, but it is not.
I also implemented timbre scaling, although I have not tested it. Fun fact: I actually implemented it by accident. I was trying to implement the pitch transposition, got a bit confused, and realized I had also accidentally implemented timbre scaling.