
OpenAI Triton Programming Language for Neural Networks

  • Vanshika Kaushik
  • Aug 24, 2021

Who would have thought GPU programming would ever become mainstream? It did, thanks to Nvidia's CUDA. First released in 2007, the parallel computing platform became the talk of the town and every developer's first choice for performance, though the length and complexity of the code it requires has been openly criticised.

 

Developers were in desperate need of an open-source programming language that could break the arduous task of GPU code development into simple, smaller steps. Keeping tabs on these limitations of CUDA, OpenAI designed and released its open-source programming language, Triton 1.0.

 

There is more to Triton than just programming; to learn the nitty-gritty details of this language, read on.

 

 

What is Triton?

 

Triton 1.0 is an open-source, Python-like programming language that can redefine the coding process. Triton focuses on reducing code size: it can perform the same functions as other GPU programming environments with far fewer lines of code. You may want to revisit the basics of Python via the First Step Towards Python blog.

 

An FP16 matrix multiplication kernel can be written in Triton in under 25 lines of code. OpenAI reports that kernels produced using Triton can be up to 2x faster than equivalent Torch implementations. Triton can also provide a stable interface for DNN transcompilers.
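 

To give a flavour of how compact Triton kernels are, here is a minimal vector-addition sketch written in the style of the official Triton tutorials (exact API details may differ slightly between Triton versions):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements               # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(4096, 1024),)             # number of program instances
add_kernel[grid](x, y, out, 4096, BLOCK_SIZE=1024)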

 

(Must Check: Working With Random Numbers in Python: Random Probability Distributions)

 

Advantages of Triton

 

  • Triton simplifies the development of specialised kernels that can be faster than those in general-purpose libraries.

  • Triton's compiler simplifies code; it also automates and parallelizes it, and converts it into an executable format.

  • Developers with little or no CUDA experience can also write kernels in Triton.

  • Triton increases the chances of reaching the hardware's peak performance with less effort.

 

Programmers with little GPU programming knowledge get a better frontend platform through Triton. Learning the syntax of GPU programming might be easy, but porting algorithms to utilise GPUs efficiently is not.

 

Before diving into the basics of Triton, let's look at the difficulties in GPU programming.

 

Shortcomings in GPU Programming

 

The architecture of modern GPUs is built around three components: DRAM, SRAM and ALUs. All three must be considered when optimizing CUDA code:

 

  • Memory transfers from DRAM must be coalesced into large transactions to leverage the wide bus of modern memory interfaces.

  • Data must be manually stored to SRAM before being re-used, and managed so as to minimize shared memory bank conflicts during retrieval.

  • Computation must be carefully partitioned and scheduled within each Streaming Multiprocessor to promote instruction- and thread-level parallelism and to leverage special-purpose arithmetic logic units.

 

GPUs are also optimized for training deep learning models. (Deep learning, a subset of machine learning, is based on artificial neural networks for representation learning.)

 

GPUs can process multiple computations simultaneously; this feature allows them to form the basis of model training. GPUs have an extremely large number of cores, which enables them to run many parallel processes at once.

 

Triton is specifically designed for GPU-powered neural networks. Neural networks, including convolutional neural networks, are algorithms that recognize underlying relationships in data by mimicking processes of the human brain. GPU programming is therefore closely tied to deep learning.

 

 

Syntax of the Triton-C Programming Language

 

Triton's original programming structure, Triton-C, is based on ANSI C (specifically CUDA C). Modifications were made to the CUDA C structure to simplify code execution and to accommodate the semantics of Triton's tile-based programming model.

 

The following changes are integrated into Triton-C:

 

Tile declaration: A special syntax enables the declaration of multidimensional arrays, e.g. int tile[16, 16]. The special syntax highlights the semantic differences from nested arrays. Tile shapes must be constant at compile time, but they can be made parametric via the tunable keyword. One-dimensional integer tiles can be initialized using ellipses, e.g. int range[16] = 0 ... 16;

 

Built-in functions: dot, trans and get_global_range are some of the built-in functions added to support tile semantics. Commonly used element-wise array operators (+, -, *, %) are also part of Triton's infrastructure.

 

Broadcasting: An N-dimensional tile can be broadcast using the newaxis keyword and slicing syntax, e.g. int broadcast[9, 9] = range[:, newaxis];

 

Predication: The prefix "@" is used for predicating control flow within tile operations.
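 

For comparison, the same ideas surface in Triton's modern Python API. The fragment below is a hedged sketch meant to run inside a @triton.jit kernel; ptr and n are hypothetical kernel arguments:

import triton.language as tl   # fragment assumes use inside a @triton.jit kernel

rng = tl.arange(0, 16)                          # one-dimensional integer tile
tile = tl.zeros((16, 16), dtype=tl.float32)     # tile shape fixed at compile time
grid2d = rng[:, None] * 16 + rng[None, :]       # newaxis-style broadcasting
vals = tl.load(ptr + grid2d, mask=grid2d < n)   # masking plays the role of @-predication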

 

(Related Blog: Matrix Functions in R)

 

Triton's Numeric Functions and Operations

 

(Please note: the data for these functions is taken from the Triton documentation. A combined usage sketch follows the list.)

 

1. triton.language.zeros

 

Returns a block filled with the scalar value 0 for the given shape and dtype.

  • shape: shape of the new array, e.g. (8, 16) or (8, )

  • dtype: data type of the new array

 

2. triton.language.broadcast_to

 

Tries to broadcast the given block to a new shape. 

  • Input: the input block

  • Shape: the desired shape

 

3. triton.language.reshape

 

Tries to reshape the given block to a new shape

 

  • Input: the input block

  • Shape: the desired shape

 

4. triton.language.ravel

 

Returns a contiguous flattened view of X. 

  • X (Block)- the input block

 

5. triton.language.program_id

Returns the id of the current program instance along the given axis.

  • axis: the axis of the 3D launch grid; values (0, 1, 2)

  • builder: IR builder for code generation

 

6. triton.language.num_programs

Returns the number of program instances launched along the given axis.

  • axis: the axis of the 3D launch grid; values (0, 1, 2)

  • builder: IR builder for code generation

 

7. triton.language.arange

 

Returns contiguous values within the half-open interval [start, end).

  • start: start of the interval; must be a power of two

  • stop: end of the interval; must be a power of two greater than start

  • builder: IR builder for code generation

 


 

8. triton.language.exp

  • X (Block): the input values

  • builder:  IR builder to generate code
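 

To see several of these primitives working together, here is a hedged sketch of a row-wise softmax kernel (the kernel name and arguments are illustrative, and N must be a power of two because of tl.arange):

import torch
import triton
import triton.language as tl

@triton.jit
def softmax_row(x_ptr, out_ptr, N: tl.constexpr):
    row = tl.program_id(0)                     # which program instance am I?
    cols = tl.arange(0, N)                     # contiguous values in [0, N)
    x = tl.load(x_ptr + row * N + cols)        # load one row of the matrix
    e = tl.exp(x - tl.max(x, axis=0))          # numerically stable exponentials
    tl.store(out_ptr + row * N + cols, e / tl.sum(e, axis=0))

x = torch.randn(8, 128, device="cuda")
out = torch.empty_like(x)
softmax_row[(8,)](x, out, N=128)               # one program instance per row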

 

(Must Check: DATA TYPES in Python)

 

 

Programming Model

 

CUDA's GPU execution follows the SPMD (single program, multiple data) programming model, in which each kernel instance (here, a kernel is the function executed on the GPU, not the core of an operating system) is associated with an identifiable thread block in the launch grid. Triton follows a similar programming model, but kernels in Triton are single-threaded, processing one instruction stream at a time.

 

Kernels in Triton are automatically parallelized and associated with a range of global indices. Automatic parallelization leads to simpler kernels in which CUDA-style intra-kernel concurrency is absent. The get_global_range function is used to query the global ranges associated with a kernel.
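 

As a sketch of this model in the modern Python API (where tl.program_id plays the role of get_global_range; matmul_kernel, A, B, C and the block sizes are hypothetical names, and this is a fragment rather than a complete program):

# Launch: one single-threaded program instance per output tile.
grid = (triton.cdiv(M, MB), triton.cdiv(N, NB))
matmul_kernel[grid](A, B, C, M, N, K)

# Inside the kernel, each instance queries its position in the launch grid:
pid_m = tl.program_id(axis=0)
pid_n = tl.program_id(axis=1)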

 

(Suggested Blog: OpenAI’s GPT-2 (Generative Pre-Trained Transformer-2) : "AI that is too Dangerous to Handle.")

 

Triton JIT Compiler

 

Triton's intermediate representation (Triton-IR) enables the compiler to perform basic optimizations automatically. For example, the operands of computationally intensive block-level operations can be automatically stashed in shared memory, then allocated and synchronized using standard liveness-analysis techniques.

 

Triton programs are automatically parallelized across SMs through the concurrent execution of different kernel instances, and within SMs by analyzing the iteration space of each block-level operation and partitioning it across SIMD (single instruction, multiple data) units.

 


(Figure: Automatic parallelization in Triton)


 

(Related Blog: Working With Python JSON Objects)


 

Nvidia CUDA v/s Triton 

 

Triton and CUDA, though they share some similarities, also differ in a variety of ways. Let's look at the basic differences between the two.

 

(Please note: the points of difference below are taken from OpenAI.)

 

Modern GPUs comprise three main components: DRAM (dynamic random access memory), SRAM (static random access memory) and ALUs (arithmetic logic units). All three must be accounted for when optimizing CUDA code.

 

  • Data must be manually stashed to SRAM before re-use, and managed so as to minimize shared memory bank conflicts during retrieval.

 

  • Memory transfers from DRAM must be coalesced into large transactions to take advantage of modern memory interfaces.

 

  • Computations must be partitioned and scheduled within and across streaming multiprocessors (SMs) to promote thread-level parallelism and to utilise special-purpose ALUs such as tensor cores.

 

Reasoning about all of these factors is difficult; even programmers with extensive CUDA experience struggle with such optimizations. Triton automates these basic performance optimizations to reduce developer workload, so that developers can focus on the high-level logic of their parallel code.

 

(Suggested Blog: OpenAI Jukebox: AI That Generates Complete Songs)

 

CUDA Programming Model (Scalar Program, Blocked Threads):

#pragma parallel
for (int m = 0; m < M; m++)
#pragma parallel
for (int n = 0; n < N; n++) {
  float acc = 0;
  for (int k = 0; k < K; k++)
    acc += A[m, k] * B[k, n];
  C[m, n] = acc;
}

Triton Programming Model (Blocked Program, Scalar Threads):

#pragma parallel
for (int m = 0; m < M; m += MB)
#pragma parallel
for (int n = 0; n < N; n += NB) {
  float acc[MB, NB] = 0;
  for (int k = 0; k < K; k += KB)
    acc += A[m:m+MB, k:k+KB]
         @ B[k:k+KB, n:n+NB];
  C[m:m+MB, n:n+NB] = acc;
}

 

   

(Source: Triton)
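 

For a concrete feel, the blocked pseudocode above might translate into Triton's Python API roughly as follows. This is a sketch, assuming row-major contiguous tensors, dimensions divisible by the block sizes, and tile sizes that satisfy tl.dot's minimum-size requirements:

import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  MB: tl.constexpr, NB: tl.constexpr, KB: tl.constexpr):
    pid_m = tl.program_id(0)                   # which MB x NB tile of C
    pid_n = tl.program_id(1)
    rm = pid_m * MB + tl.arange(0, MB)         # row indices of the tile
    rn = pid_n * NB + tl.arange(0, NB)         # column indices of the tile
    acc = tl.zeros((MB, NB), dtype=tl.float32)
    for k in range(0, K, KB):
        rk = k + tl.arange(0, KB)
        a = tl.load(a_ptr + rm[:, None] * K + rk[None, :])
        b = tl.load(b_ptr + rk[:, None] * N + rn[None, :])
        acc += tl.dot(a, b)                    # block-level matrix product
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)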

 

 

                           CUDA      Triton

Memory Coalescing          Manual    Automatic

Shared Memory Management   Manual    Automatic

Scheduling (Within SMs)    Manual    Automatic

Scheduling (Across SMs)    Manual    Manual

(Source: OpenAI)

 

(Must Catch: About OpenAI GPT-3 Language Model)


Issues in Triton

 

Some issues that users have reported in Triton include:

  • Random number generation

  • Small suggestions for code syntax

  • Dot product computes wrong values

  • Invalid PTX code in the matrix multiplication example

  • Non-defined constant expressions

  • Poor matmul-square-nn bench performance

 

 

Last Note

 

"Truth can only be found in one place: the code."

― Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship

 

C++, Java, Python and Lua are some of today's popular programming languages. As coding for the graphics processing unit (GPU) gains popularity, software developers are rushing to find a platform that simplifies code optimization.

 

In the manual coding world, Triton instils a fresh ray of hope among programmers. With major automatic optimizations, this newly devised Python-like programming language has the calibre to tackle it all. With smaller code and better optimizations, this platform may well be the future of GPU programming.
