Over several weeks, XSICHT has been trained to match faces to audio. With a training set of tens of thousands of frames, the AI has learned to construct a human face from any given audio input. What happens when we abstract the input? This is the question XSICHT tries to answer.
Since a neural network is nothing more than a complex concatenation of intertwined non-linear functions that are amplified or dampened, its complexity is often hard to grasp, which is why the internals of an AI are called hidden layers or a black box.
XSICHT doubles the unpredictability by feeding the network not the voices it was trained on, but music, leading to unexpected results when it is confronted with various genres or instruments. Harmonic piano music, for example, more often leads to the recreation of female faces, while bassline-driven techno mostly produces faces resembling male speakers.
A brief technical overview can be split into data and network architecture.
The former is given to XSICHT in the form of a 0.2-second-long spectrogram, calculated using the Short-Time Fourier Transform. To enhance the spatial representation of lower frequencies, the spectrogram is rescaled logarithmically to resemble human pitch perception; the result is called a mel spectrogram.
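The transformation from audio clip to mel spectrogram can be sketched as follows. This is an illustrative implementation, not XSICHT's actual preprocessing; the sample rate, FFT size, hop length, and number of mel bands are assumptions chosen for the example.

```python
import numpy as np

def mel_spectrogram(audio, sr=16000, n_fft=512, hop=128, n_mels=64):
    """Log-mel spectrogram of a mono clip (sketch; parameters are assumed)."""
    # Short-Time Fourier Transform: windowed frames -> power spectra
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Mel filterbank: triangular filters spaced evenly on the mel scale,
    # which stretches low frequencies like human pitch perception
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:  # rising slope of the triangle
            fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:  # falling slope
            fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)

    mel = spec @ fb.T           # project spectra onto mel bands
    return np.log(mel + 1e-6)   # log compression of amplitudes

# a 0.2 s clip at 16 kHz is 3200 samples
clip = np.random.default_rng(0).standard_normal(3200)
m = mel_spectrogram(clip)  # shape: (frames, mel bands)
```

In practice a library routine such as `librosa.feature.melspectrogram` would replace this hand-rolled filterbank; the sketch only makes the logarithmic frequency rescaling explicit.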
The latter takes this input and convolves it down to a 1×1-pixel latent space, from which the compressed information is deconvolved back up. This is called a U-shaped architecture or, more commonly, an image-to-image GAN, but here it is used without skip connections between the convolution and deconvolution paths. During training, the counterpart of this generator, the discriminator, works in a patch-based manner.
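A rough sketch of how such a pipeline collapses the input to a 1×1 latent: the arithmetic below assumes a 128×128 input and the 4×4-kernel, stride-2, padding-1 blocks common in image-to-image GANs (both are assumptions, not XSICHT's actual settings).

```python
def down(size, kernel=4, stride=2, pad=1):
    """Spatial size after one strided convolution block (halves the size)."""
    return (size + 2 * pad - kernel) // stride + 1

def up(size, kernel=4, stride=2, pad=1):
    """Size after the matching transposed convolution (doubles the size)."""
    return (size - 1) * stride - 2 * pad + kernel

# Encoder: repeated strided convolutions down to the 1x1 latent space
size, encoder = 128, []
while size > 1:
    size = down(size)
    encoder.append(size)
# encoder -> [64, 32, 16, 8, 4, 2, 1]: seven blocks reach 1x1

# Decoder: transposed convolutions mirror the encoder back to 128x128
decoder = []
while size < 128:
    size = up(size)
    decoder.append(size)
```

Without skip connections, the 1×1 latent is the only path from input to output, so every detail of the generated face must pass through that bottleneck.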
XSICHT receives its input from a live dialog between synthesizers and acoustic instruments performed by Timo Dufner, a voice, or prerecorded sounds that harmonize with the visualization.