Play Bach: let a neural network play for you. Part 1.

pascal boudalier
Published in Nerd For Tech
7 min read · May 17, 2021

I do not know how to play music. But I can still play with music.

source twinlk.com

It all started with a long-standing regret: my parents did not offer me music lessons when I was a kid.

So I do not know how to play music, but I can still play with music.

How? My idea is to use deep learning to automatically generate music. This music will be inspired by a given composer and will hopefully sound interesting.

Even though the generated music will resemble that of the actual composer, it will be shaped by my personal tuning of the neural network, and will therefore be unique. In that way, it is also my own music.

So time for some tinkering with neural networks.

I picked Bach as my sandbox because I love his music a lot and I expect it has some underlying mathematical structure which the neural network should be able to leverage.

This is part of a series of articles that will explore many aspects of this project, including static MIDI file generation, real-time streaming, TensorFlow/Keras sequential and functional models, LSTM, overfitting and underfitting, attention mechanisms, embedding layers, multi-head models, probability distributions, conversion to TensorFlow Lite, use of a TPU/hardware accelerator, and running the application on multiple platforms (Raspberry Pi, edge devices).

See Part 2, Part 3 and Part 4.

The entire code is available on GitHub.

A first example of generated music. More will follow in the next articles.

Deep learning in one sentence (or two)

Essentially, deep learning is a magic black box which converts input data into a prediction about ‘something’ contained in the data. The beaten-to-death example: the input data is a picture, and the prediction is ‘is this a dog or a cat?’ (as if anyone cares).

First, one has to ‘train’ the network by feeding it many, many ‘examples’ (e.g. thousands, tens of thousands, or millions of pictures of dogs and cats). During the training process, the network automatically adjusts its internal parameters to become better and better at ‘guessing it right’. This phase requires a lot of training data and a lot of computing power.

Once the training is complete, the neural network is nothing more than a few million (or even billions of) parameters, smartly organized.

Then the magic happens. If one feeds the network a new picture of a cat (a picture never used for training), the network should guess it right. The model is able to ‘generalize’ to new data. The billions of parameters inside the network work together to figure out that a picture with two triangles pointing upward (the ears), two circles in the middle (the eyes), and a series of horizontal lines (the whiskers) looks like a cat.

This network is just a bunch of numbers, and has absolutely no understanding or awareness of what a cat is. It looks at the pixels in an image and compares them to patterns learned from the numerous training examples.

What we will build

Our deep learning application will use segments of original music as input (a series of notes composed by Bach), and will be trained to predict what note should follow.

The end result will be a few seconds of original music followed by a string of notes that have been generated automatically. Will it sound nice? Weird? Uninspiring? Boring? WTF? We shall see.

How does a computer represent music?

As you know, (and if you didn’t, no problem, you now do), neural networks require training data to be represented as numbers.

For pictures this is not a problem: they are already represented as a bunch of numbers (each pixel is 3 numbers, i.e. the intensity of Red, Green and Blue). For other types of inputs, such as text, each word can easily be converted to a number (an index in a dictionary of all possible words). But what about music?

In the world of computers, music can be represented in different ways:

  • the traditional audio format (such as .mp3, .wav, .flac, etc.). This is essentially a representation of the way air vibrates when carrying sound. This air vibration (a sound wave) is converted into an electrical wave (using a microphone) and this electrical wave is represented as an array of numbers (sampling).
  • MIDI: This is a ‘computer’ language, where each note is represented by a single integer (the MIDI code). Since in MIDI format the music is already coded in numbers, let’s use it for our neural network.
The MIDI table. Note A4 has a MIDI code of 69.

Public domain MIDI music can easily be found online. A MIDI file has a .mid extension.

Some very basic music concepts useful for our purpose (very simplified)

Music consists of notes. Each note is an air vibration of a given frequency. For example, in the MIDI table above, the note A4 corresponds to a 440 hertz air vibration and is represented by the MIDI code 69.

Each note has numerous characteristics, among them its frequency (also called pitch), duration, volume, etc. A4 is the note’s pitch.

Over the centuries and across geographical regions, notes’ pitches have been written differently. We will use the Anglo-Saxon notation. C, C#, D, D#, E, F, F#, G, G#, A, A#, B.

In my country, France, notes are written Do, Re, Mi, Fa, Sol, La, Si. There is a historical context to this, but it would be too much of a digression. Feel free to refer to this article.

Notes are arranged in octaves. An octave is a series of 12 notes with increasing frequencies (as you can see above, there are 12 notes from C to B).
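To make the link between note names and MIDI codes concrete, here is a minimal sketch. It assumes the numbering convention in which A4 is 69 (so C4 is 60), and octave numbers from 0 to 9. The helper names `note_to_midi` and `midi_to_note` are mine, for illustration only; they are not taken from the project's code.

```python
# The 12 note names of one octave, in increasing pitch order.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_to_midi(name: str) -> int:
    """Convert a note such as 'A4' or 'F#2' to its MIDI code.

    Assumes a single-digit octave and the convention where A4 == 69.
    """
    pitch, octave = name[:-1], int(name[-1])
    return 12 * (octave + 1) + NOTE_NAMES.index(pitch)


def midi_to_note(code: int) -> str:
    """Convert a MIDI code back to a note name such as 'A4'."""
    octave, index = divmod(code, 12)
    return f"{NOTE_NAMES[index]}{octave - 1}"


if __name__ == "__main__":
    print(note_to_midi("A4"))   # 69
    print(midi_to_note(60))     # C4
```

Each octave simply adds 12 to the code, which is why the arithmetic is so short.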

If you are puzzled by the fact that ‘oct’ means 8 and an octave has 12 notes, don’t worry: there is a LOT of historical background in all this.

An octave corresponds to a doubling of frequency. For example, note A in octave 5 (A5) has a frequency of 880 hertz. That is twice the frequency of A4 (note A in octave 4).
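The doubling rule can be turned into a small formula. The sketch below assumes equal temperament (each of the 12 steps in an octave multiplies the frequency by the same ratio), which is my assumption and not something this article depends on. The function name `midi_to_frequency` is also just illustrative.

```python
def midi_to_frequency(code: int, a4_hz: float = 440.0) -> float:
    """Frequency in hertz of a MIDI code, assuming equal temperament.

    A4 (MIDI code 69) is the reference at 440 Hz; moving up 12 codes
    (one octave) doubles the frequency.
    """
    return a4_hz * 2 ** ((code - 69) / 12)


if __name__ == "__main__":
    for code in (57, 69, 81):  # A3, A4, A5
        print(code, round(midi_to_frequency(code), 2))
```

With this formula, code 81 (A5) comes out at twice the frequency of code 69 (A4), as described above.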

Multiple notes can be played together. This is a chord. A silence is a special thing. It is called a rest.

Notes (or chords or rests) can be played longer or shorter. This is the duration. Some instruments can hold a note forever (an organ), whereas on other instruments, like a piano, notes will decay.

Finally, a note (or chord) can be played louder or softer. In MIDI, the fancy name for volume is velocity.

Let the music flow in our neural network

For now, let’s build a network that focuses on pitch only. We will look at duration and velocity and a lot of other aspects in the next articles of this series.

Below is a segment of music, i.e. a sequence of notes, chords (multiple notes separated by a period, e.g. ‘A2.G#3’) and rests (‘R’):

‘B2’, ‘A2’, ‘G2’, ‘F#2’, ‘E2’, ‘D2’, ‘R’, ‘C#2’, ‘A2.G#3’, ‘E3’, ‘F#3’, ‘G3’, ‘F#3’, ‘G3.A4’
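Since the network only eats numbers, each distinct element of the segment has to become an integer, just like words indexed in a dictionary. Here is a minimal sketch of one way to do it, treating every note, chord and rest as a single token. The variable names are hypothetical, and the actual code may encode things differently.

```python
# The segment shown above; a chord such as 'A2.G#3' is kept as one token.
segment = ["B2", "A2", "G2", "F#2", "E2", "D2", "R", "C#2",
           "A2.G#3", "E3", "F#3", "G3", "F#3", "G3.A4"]

# Build a dictionary of all distinct tokens, sorted for reproducibility.
vocab = sorted(set(segment))
token_to_index = {token: i for i, token in enumerate(vocab)}
index_to_token = {i: token for token, i in token_to_index.items()}

# Encode the music as integers, and decode it back to check nothing is lost.
encoded = [token_to_index[token] for token in segment]
decoded = [index_to_token[i] for i in encoded]

if __name__ == "__main__":
    print(len(vocab), "distinct tokens")
    print(encoded)
```

The reverse dictionary is what lets us turn the network's numeric predictions back into notes.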

We will train our deep learning network to predict the next element of a sequence. Below are 3 training samples. The network input consists of a sequence of 5 notes/chords, and what we want the network to learn to predict is the next element in the sequence:

‘B2’, ‘A2’, ‘G2’, ‘F#2’, ‘E2’ => D2
‘A2’, ‘G2’, ‘F#2’, ‘E2’, ‘D2’ => R
‘G2’, ‘F#2’, ‘E2’, ‘D2’, ‘R’ => C#2
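The training samples above come from sliding a window over the music, one element at a time. A minimal sketch of that idea follows; the function name `make_samples` is hypothetical and only meant to illustrate the principle.

```python
from typing import List, Tuple


def make_samples(tokens: List[str], seq_len: int) -> Tuple[List[List[str]], List[str]]:
    """Slide a window of seq_len elements over tokens.

    Each window is one network input; the element right after the
    window is the target the network should learn to predict.
    """
    inputs, targets = [], []
    for start in range(len(tokens) - seq_len):
        inputs.append(tokens[start:start + seq_len])
        targets.append(tokens[start + seq_len])
    return inputs, targets


if __name__ == "__main__":
    segment = ["B2", "A2", "G2", "F#2", "E2", "D2", "R", "C#2",
               "A2.G#3", "E3", "F#3", "G3", "F#3", "G3.A4"]
    inputs, targets = make_samples(segment, seq_len=5)
    for x, y in list(zip(inputs, targets))[:3]:
        print(x, "=>", y)
```

A segment of N elements gives N minus seq_len samples, so even a modest amount of music produces a lot of training data.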

In the actual code, the input sequence is longer (40 notes/chords) to capture enough ‘structure’ in the data, and we use thousands, even millions of training samples.

Once a network is fully trained (more on that concept in the next articles), it can be used to make predictions based on real-life data (making a prediction is called inference). In our case, inference will look like this:

  • start with a sequence of notes, typically extracted from a real MIDI file
  • input the sequence into the neural network. It will generate a prediction for the following note/chord
  • update the input sequence with a shift: drop the oldest element and append the prediction
  • repeat the steps above for as long as you like

For instance:

input: ‘B2’, ‘A2’, ‘G2’, ‘F#2’, ‘E2’ and get xx as prediction
input: ‘A2’, ‘G2’, ‘F#2’, ‘E2’, ‘xx’ and get yy as prediction
input: ‘G2’, ‘F#2’, ‘E2’, ‘xx’, ‘yy’ and get zz as prediction

At the end, play ‘B2’, ‘A2’, ‘G2’, ‘F#2’, ‘E2’, ‘xx’, ‘yy’, ‘zz’, … This is your own generated music.

If the network is correctly trained, the music should sound familiar, but it will NOT be an exact copy of the training set. The network will always introduce some variation and randomness, which makes the entire process interesting (well... I guess you can call this interesting).
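The generation loop can be sketched in a few lines. Since the real network is only built in the next article, I use a dummy stand-in (`dummy_predict`, which simply echoes the oldest note in the window) in place of the trained model. Both function names are hypothetical.

```python
from typing import Callable, List


def generate(seed: List[str],
             predict_next: Callable[[List[str]], str],
             n_steps: int) -> List[str]:
    """Extend seed by n_steps predicted elements.

    predict_next takes the current input window and returns the next
    note/chord; in the real application this would be the trained network.
    """
    window = list(seed)      # current network input
    generated = list(seed)   # everything that will be played
    for _ in range(n_steps):
        nxt = predict_next(window)
        generated.append(nxt)
        window = window[1:] + [nxt]  # shift: drop oldest, append prediction
    return generated


def dummy_predict(window: List[str]) -> str:
    """Placeholder for the trained network: echoes the oldest element."""
    return window[0]


if __name__ == "__main__":
    seed = ["B2", "A2", "G2", "F#2", "E2"]
    print(generate(seed, dummy_predict, n_steps=3))
```

The window always keeps the same length, so the network sees the same input shape at every step.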
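One common way to get this kind of variation is to treat the network output as a probability distribution over possible next notes and to draw from it, instead of always taking the most likely note. The sketch below assumes that approach, with a "temperature" knob; this is my assumption for illustration, and the function name `sample_next` is hypothetical.

```python
import math
import random
from typing import Sequence


def sample_next(probabilities: Sequence[float],
                temperature: float = 1.0,
                rng=random) -> int:
    """Draw the index of the next note from a probability distribution.

    temperature < 1 makes the choice more conservative (closer to always
    picking the most likely note); temperature > 1 makes it more random.
    """
    # Rescale in log space; the small constant avoids log(0).
    logits = [math.log(p + 1e-9) / temperature for p in probabilities]
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(probabilities)), weights=weights, k=1)[0]


if __name__ == "__main__":
    probs = [0.1, 0.7, 0.2]  # made-up output for three candidate notes
    print([sample_next(probs, temperature=1.0) for _ in range(10)])
```

With a very low temperature the most likely note wins almost every time, so the same seed tends to produce the same music; raising it adds surprise, at the risk of wrong notes.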

In the next article we will build the actual deep neural network, and it will look like the thing below. Stay tuned !!!!

And this will become very clear…

pascal boudalier
Nerd For Tech

Tinkering with Raspberry PI, ESP32, RiscV, Solar, LifePo4, IoT, Zigbee, energy harvesting, Python, MicroPython, Keras, Tensorflow, tflite, TPU. Ex Intel and HP