
Music Composition Using Deep Learning

  • Bhumika Dutta
  • Nov 01, 2021



Music is something that everyone in the world appreciates and listens to. It provides an environment in which individuals may connect based on cultural similarities or similar tastes. 


Another sector that has extended its branches all across the world is technology. When technology and music come together, each enriches the other.


On the one hand, technology gives artists new sounds, new interfaces to engage with their compositions, and new ways of creating.


On the other hand, the music produced by these new technologies motivates the human intellect to develop new tools to engage with these new creations.


Deep learning for music generation is primarily concerned with music analysis, discovery, and recommendation. The major challenges lie in representing digital audio signals, processing them, and modeling an effective machine learning system.


Many research attempts have been made in the literature that can lead to music analysis. Among these initiatives is the identification of musical properties such as styles, instruments, and genres. 


The features used by a prediction or analysis system depend on its algorithm. In this article, we are going to understand how deep learning is used for music generation.


(Related read: Top 10 Deep Learning Algorithms)


Elements of Music:


Music is composed of notes and chords, and has three constituent elements:


  • Notes: A note is a sound generated by a single key.

  • Chords: A chord is the sound created when two or more keys are played at the same time. Most chords have at least three key tones in them.

  • Octave: An octave is the interval over which the pattern of keys repeats. On a piano, each octave has seven white and five black keys.
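These elements can be made concrete numerically. The sketch below (a simplified illustration, assuming standard equal-temperament tuning with A4 = 440 Hz) shows that moving up an octave doubles a note's frequency, and that a chord is simply several notes sounded together:

```python
import math

# Equal-temperament tuning: each semitone multiplies frequency by 2**(1/12),
# so an octave (12 semitones) exactly doubles it. A4 = 440 Hz is the reference.
def note_freq(semitones_from_a4):
    return 440.0 * 2 ** (semitones_from_a4 / 12)

a4 = note_freq(0)        # the note A4
a5 = note_freq(12)       # one octave up: frequency doubles to 880 Hz
# C4, E4 and G4 (9, 5 and 2 semitones below A4) played together form a C major chord
c_major = [note_freq(s) for s in (-9, -5, -2)]

print(round(a4, 2), round(a5, 2))           # 440.0 880.0
print([round(f, 2) for f in c_major])       # [261.63, 329.63, 392.0]
```

The same doubling rule explains why the keyboard pattern repeats every octave: notes an octave apart sound alike because their frequencies are in a 2:1 ratio.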


Understanding Automatic Music Generation:


Automatic Music Generation is a method for creating a brief piece of music with minimal human input. An early example dates back to 1787, when Mozart presented a dice game for generating random sound selections, personally composing roughly 272 musical fragments for it.


Musical grammar, which encompasses the rules for the correct arrangement and combination of musical sounds and the proper performance of musical works, was also used to create music.


In the early 1950s, Iannis Xenakis employed the notions of statistics and probability to produce what became known as Stochastic Music. He described music as a chance succession of elements and formalized it using stochastic theory, with the random selection of elements based entirely on mathematical concepts.


Deep Learning architectures have recently become the standard for Automatic Music Generation. We'll mainly talk about deep learning architectures that can aid with music production.


Studying Deep Learning Architectures for Music Composition:


We went through a very detailed article published on Towards Data Science on deep learning architectures. Among deep learning models that deal with music composition and texture, the most common is WaveNet.


Wavenet is a generative waveform architecture developed by DeepMind, a London-based artificial intelligence firm, in 2016. It operates primarily through convolutional neural networks. 


WaveNet's major goal was to synthesize each audio sample using convolutional filters that operate directly on the waveform domain; this is why it is called a generative model. WaveNet is similar to an NLP language model.


It was mostly used for speech synthesis and text-to-speech tasks. This is why the vast majority of open-source implementations rely on training on the VCTK Dataset or something similar, which is a collection of recordings of English speakers.


(Recommended read: Music Genre Classification Using Machine Learning)


Another project that generates music is Google Magenta. It's an open-source research project that explores the role of machine learning as a creative tool.


Magenta features over 20 cutting-edge deep learning models that may be utilized for a variety of musical tasks, such as humanizing drums, generating piano parts, continuing melodies, chord accompaniment conditioned to melodies, and interpolation between measures using variational autoencoders, among others. It also features additional models that do tasks like picture stylization and vectorized sketch-like drawing production. 


PerformanceRNN is an LSTM-based recurrent neural network in Magenta that uses a stream of MIDI events with learned onset, duration, velocity, and pitch to simulate polyphonic music with expressive timing and dynamics. 


There's also the Piano Transformer, which is an autoregressive model capable of learning long-term patterns and delivering expressive piano performances.


What is the Long Short Term Memory (LSTM) Model?


The Long Short Term Memory Model, or LSTM for short, is a kind of Recurrent Neural Network (RNN) that can capture long-term relationships in an input sequence. Speech Recognition, Text Summarization, Video Classification, and other Sequence-to-Sequence modeling problems can all benefit from LSTM. 


An amplitude value is supplied into the Long Short Term Memory cell at each timestep, which then computes the hidden vector and passes it on to the next timestep. The current hidden vector h_t is computed from the current input x_t and the previous hidden vector h_(t-1).
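This recurrence can be sketched in a few lines. For readability the example below uses a plain RNN step with scalar values and made-up weights (a full LSTM wraps this core idea in input, forget, and output gates), but the flow is the same: each timestep combines the current amplitude with the previous hidden state:

```python
import math

# Minimal scalar recurrence h_t = tanh(w_x * x_t + w_h * h_prev + b):
# each timestep combines the current input with the previous hidden state.
# An LSTM adds gates around this core idea; the weights here are illustrative.
def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    return math.tanh(w_x * x_t + w_h * h_prev + b)

amplitudes = [0.1, -0.3, 0.7, 0.2]   # toy sequence of amplitude values
h = 0.0                              # initial hidden state
for x in amplitudes:
    h = rnn_step(x, h)               # hidden vector passed to the next timestep
print(round(h, 4))
```

Because h is threaded through every step, information from early amplitudes can influence predictions much later in the sequence.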


(Read further on this topic: How do LSTM and GRU work in deep learning?)


Working of WaveNet:


Wavenet is a deep learning-based generative model for raw audio and was developed by Google DeepMind. 


As an input, WaveNet takes a piece of a raw audio wave. The representation of a wave in the time-series domain is referred to as a raw audio wave. WaveNet tries to anticipate the next amplitude value given the series of amplitude values.
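To predict the next amplitude, WaveNet treats the problem as classification: raw amplitudes are companded with the mu-law transform and quantized to 256 bins, so the network outputs a softmax over 256 classes instead of a raw real number. A minimal sketch of that encoding step (mu = 255, as in the original WaveNet setup):

```python
import math

# Mu-law companding: amplitudes in [-1, 1] are compressed non-linearly and
# then quantized to 256 integer bins, making next-sample prediction a
# 256-way classification problem for the network.
MU = 255

def mu_law_encode(x):
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((compressed + 1) / 2 * MU + 0.5)   # map [-1, 1] -> {0..255}

print(mu_law_encode(-1.0), mu_law_encode(0.0), mu_law_encode(1.0))  # 0 128 255
```

The non-linear compression spends more of the 256 bins on quiet amplitudes, which matches how hearing perceives loudness.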


The model's input and output sequences are depicted in the figure below:

[Figure: Convolution layers of WaveNet]

Causal Dilated 1D Convolution layers are the foundation of WaveNet. Understanding the notion of convolution and why it is employed in WaveNet is critical. 


A mathematical procedure that combines two functions is known as convolution. Convolution is used for a variety of reasons, one of which is to extract characteristics from an input. Let us learn about all of them one by one:


  • 1D Convolution:


The goal of 1D convolution is comparable to that of an LSTM: it is used for the same kinds of sequence-modeling problems.


A kernel or a filter moves in just one direction in 1D convolution. The output of a convolution is determined by the kernel size, input shape, padding type, and stride. 


1D convolution has a disadvantage for this task. With standard (centered) "same" padding, the output at timestep t is convolved not only with preceding timesteps such as t-1 but with future timesteps such as t+1 as well. As a result, it violates the autoregressive principle.
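This leakage is easy to demonstrate. In the toy convolution below (a pure-Python sketch with an illustrative all-ones kernel), an input "event" that occurs only at the last timestep already shows up in the output one step earlier:

```python
# A centered ("same"-padded) 1D convolution lets output[t] see input[t+1],
# which breaks the autoregressive rule that y_t may depend only on x_<=t.
def conv1d_same(x, kernel):
    k = len(kernel)                     # assume an odd kernel size
    pad = k // 2                        # equal zero-padding on both sides
    padded = [0.0] * pad + x + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

x = [0.0, 0.0, 0.0, 1.0]               # an "event" arrives only at the last step
y = conv1d_same(x, [1.0, 1.0, 1.0])
print(y)   # [0.0, 0.0, 1.0, 1.0] -- the spike appears at index 2, before it happens
```

An autoregressive generator cannot be allowed to peek at the sample it is supposed to predict, which is exactly what index 2 does here.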


  • 1D Causal Convolution: 


Causal convolutions are defined so that the output at time t is convolved only with elements from time t and earlier in the preceding layer. In layman's terms, the only difference between a normal and a causal convolution is the padding.


To retain the autoregressive concept, zeroes are added to the left of the input sequence in causal convolution. 


A causal convolution, however, can only look a small, fixed number of timesteps back in the sequence. As a result, causal convolution has a relatively small receptive field.
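Repeating the earlier toy experiment with left-only padding (again a pure-Python sketch with an illustrative kernel) shows the fix: the output now reacts to an event only at or after the timestep where it occurs:

```python
# Causal 1D convolution: zeros are prepended on the left only, so output[t]
# depends solely on inputs at times <= t, preserving the autoregressive property.
def conv1d_causal(x, kernel):
    k = len(kernel)
    padded = [0.0] * (k - 1) + x        # k-1 zeros added to the left only
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

x = [0.0, 0.0, 0.0, 1.0]               # the same event at the last timestep
y = conv1d_causal(x, [1.0, 1.0, 1.0])
print(y)   # [0.0, 0.0, 0.0, 1.0] -- the spike first appears at index 3, never earlier
```

The trade-off is the small receptive field noted above: with a kernel of size 3, each layer sees only two steps further into the past.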


  • Dilated 1D Causal Convolution:


Dilated 1D convolution is a causal 1D convolution layer with gaps or spaces between the values of the kernel. The dilation rate determines the number of spaces to be added, and it governs the network's receptive field.


A kernel of size k with a dilation rate of d contains d-1 holes between each value in kernel k. The dilated 1D convolution network expands the receptive field by raising the dilation rate exponentially at each hidden layer.
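The growth in receptive field is easy to quantify. For a stack of dilated causal layers with kernel size k, each layer with dilation d widens the field by (k - 1) * d, so the total is 1 + (k - 1) * (sum of dilations). The sketch below (illustrative layer counts) contrasts undilated layers with dilations doubled per layer:

```python
# Receptive field of stacked dilated causal convolutions: each layer with
# dilation d adds (k - 1) * d timesteps of context, so doubling the dilation
# per layer (1, 2, 4, 8, ...) grows the field exponentially with depth.
def receptive_field(kernel_size, dilations):
    return 1 + (kernel_size - 1) * sum(dilations)

plain   = receptive_field(2, [1] * 10)                     # 10 undilated layers
dilated = receptive_field(2, [2 ** i for i in range(10)])  # dilations 1, 2, ..., 512
print(plain, dilated)   # 11 vs. 1024
```

Ten undilated layers see only 11 samples of context, while the same ten layers with exponentially increasing dilation see 1024 — enough to start capturing musical structure in raw audio.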


(Also read: 7 Neural Network Programs/Software)


Workflow of WaveNet:


According to Analytics Vidhya, the workflow of WaveNet is given as follows:


  • The input is first passed through a causal 1D convolution layer.

  • After that, the output is fed into two dilated 1D convolution layers with sigmoid and tanh activations.

  • The two activation outputs are multiplied element-wise; this gated output forms a skip connection.

  • The residual is obtained by adding the skip-connection output and the output of the causal 1D convolution element-wise.
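The steps above can be sketched as a single residual block. This is a simplified scalar illustration with made-up weights (a real WaveNet block applies learned dilated convolutions to vectors), but it shows the tanh/sigmoid gating and how the residual and skip outputs are formed:

```python
import math

# One WaveNet-style residual block (scalar sketch, illustrative weights):
# the dilated-convolution outputs feed a tanh "filter" and a sigmoid "gate";
# their element-wise product is the gated activation, which produces both the
# skip output and, added back to the input, the residual for the next layer.
def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def residual_block(x, w_filter=0.7, w_gate=0.4):
    gated = math.tanh(w_filter * x) * sigmoid(w_gate * x)  # element-wise product
    skip = gated                    # collected into the skip-connection sum
    residual = x + gated            # passed as input to the next block
    return residual, skip

res, skip = residual_block(0.5)
print(round(res, 4), round(skip, 4))
```

Stacking many such blocks and summing their skip outputs gives the network both depth (via residuals) and direct access to every layer's features (via skips).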





Deep Learning may be used in a variety of ways in our daily lives. Understanding the issue statement, articulating it, and establishing the architecture to address the problem are the most important aspects of solving any challenge. Many approaches to automatic music generation have been discussed in this article. 


Python may be used to implement these models. music21, created at MIT, is a Python toolkit for analyzing music data. A typical format for storing music files is MIDI.


(Also read: Top 10 Deep Learning Applications)


MIDI is the abbreviation for Musical Instrument Digital Interface. MIDI files contain playback instructions rather than the actual audio, so they take up very little memory. As a result, MIDI is frequently used when transferring files.
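The "instructions, not audio" distinction is concrete: a MIDI event names a pitch as a 7-bit note number (69 corresponds to A4 under the standard mapping f = 440 * 2 ** ((n - 69) / 12)), and a synthesizer turns that into sound on playback. A small sketch:

```python
# MIDI stores note instructions (e.g. "note-on, pitch 60, velocity 80"),
# not sampled audio. Note number 69 is A4; the standard conversion to
# frequency is f = 440 * 2 ** ((n - 69) / 12).
def midi_to_freq(note_number):
    return 440.0 * 2 ** ((note_number - 69) / 12)

print(round(midi_to_freq(69), 2))   # A4 -> 440.0 Hz
print(round(midi_to_freq(60), 2))   # middle C -> 261.63 Hz
```

This also explains the memory savings: one second of CD-quality audio is 44,100 16-bit samples, while the same second of music may be only a handful of note events.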


In this article, we have learned about automatic music generation, deep learning architectures, WaveNet, and the LSTM model. 
