Who thought GPU programming would ever become mainstream? It did, with Nvidia's CUDA. The parallel computing platform became the talk of the town with its stable release in May 2020, and although it turned into many developers' first choice for performance, its lengthy code was openly criticised.
Developers were in desperate need of an open-source programming language that could break the arduous task of GPU code development into simple, smaller fragments. Keeping tabs on the limitations of CUDA, OpenAI designed and released its own open-source programming language, Triton 1.0.
There is more to Triton than just programming; to learn the nitty-gritty details of this language, read on.
Triton 1.0 is an open-source, Python-like programming language that can redefine the coding process. Triton focuses on reducing code size: it can perform the same functions as other programming languages in far fewer lines of code. You may want to revisit the basics of Python via the First Step Towards Python blog.
It can express an FP16 matrix multiplication kernel in under 25 lines. Kernels produced with Triton can be up to 2 times more efficient than equivalent Torch implementations, and Triton can provide a stable interface for DNN transcompilers.
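To get a feel for how compact Triton code is, here is a minimal vector-addition kernel written with Triton's Python frontend. It is a sketch for illustration only: the tensor sizes, the 1024-element block size and the name add_kernel are arbitrary choices, and API details can vary slightly between Triton releases.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                # which block of the output this instance handles
    offs = pid * BLOCK + tl.arange(0, BLOCK)   # element offsets covered by this instance
    mask = offs < n_elements                   # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10000, device='cuda')
y = torch.randn(10000, device='cuda')
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)         # one program instance per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)

The whole kernel, including its launch, fits in roughly fifteen lines; there is no explicit thread or shared-memory management.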
(Must Check: Working With Random Numbers in Python: Random Probability Distributions)
Triton simplifies the development of specialized kernels that can run faster than those found in general-purpose libraries.
Triton’s compiler simplifies the code, automatically optimizes and parallelizes it, and then converts it into an executable format.
Developers with little or no CUDA experience can also write code in Triton.
Triton increases the chances of reaching the hardware’s peak performance with less effort.
Programmers with little GPU programming knowledge get a friendlier frontend through Triton. Learning the syntax of GPU programming may be easy, but porting algorithms so that they use GPUs efficiently is not.
Before diving into the basics of Triton, let’s learn about the difficulties in GPU programming.
The architecture of modern GPUs is built around DRAM, SRAM and ALUs. All three must be considered when optimizing CUDA code:
Memory transfers from DRAM must be coalesced into large transactions to leverage the width of modern memory interfaces.
Data must be manually stashed to SRAM before being re-used, and managed so as to minimize shared memory bank conflicts on retrieval.
Computation must be partitioned and scheduled carefully within each Streaming Multiprocessor to promote instruction- and thread-level parallelism and to keep the arithmetic logic units busy.
GPUs are also optimized for training deep learning models. (Deep learning, a subset of machine learning, is based on artificial neural networks and representation learning.)
GPUs can process many computations simultaneously; this unique feature is what makes them the basis of model training. GPUs have an extremely large number of cores, which lets them perform computation for many parallel processes.
Triton is specifically designed for GPU-powered neural networks. Neural networks, including convolutional neural networks, are families of algorithms that recognize underlying relationships in data by loosely mimicking the way the human brain works, which is why GPU programming is so closely tied to deep learning.
Triton’s programming structure is based on ANSI C (more precisely, CUDA C). Modifications were made to the CUDA C structure to simplify code execution and to accommodate the semantics of Triton’s tile-based model.
The following changes are integrated in Triton (a rough sketch of how these ideas look in Triton’s Python frontend follows this list):
Tile Declaration: A new special syntax enables the declaration of multidimensional arrays, e.g. int tile[16, 16]. The special syntax highlights the semantic differences from nested arrays. Tile shapes must be constant when the code is written, but they can be made parametric via the tunable keyword, and one-dimensional integer tiles can be initialized using ellipses.
Built-in Functions: dot, trans and get_global_range are some of the built-in functions added to support tile semantics. Commonly used element-wise array operations (+, -, *, %) are also part of Triton’s infrastructure.
Broadcasting: An N-dimensional tile can be broadcast using the newaxis keyword and slicing syntax, e.g. int broadcast[9, 9] = range[:, newaxis];
Predication: The @ prefix is used to predicate tile operations, providing control flow within them.
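In Triton 1.0's Python frontend these ideas surface slightly differently: tiles are built with triton.language helpers, broadcasting uses newaxis-style indexing, and predication is expressed through masks on loads and stores. The snippet below is an illustrative sketch only; the name tile_demo, the 16x16 tile size and the output layout are arbitrary choices.

import torch
import triton
import triton.language as tl

@triton.jit
def tile_demo(out_ptr, n, BLOCK: tl.constexpr):
    r = tl.arange(0, BLOCK)                  # one-dimensional integer tile: 0, 1, ..., BLOCK-1
    tile = r[:, None] * BLOCK + r[None, :]   # newaxis-style broadcasting builds a [BLOCK, BLOCK] index tile
    mask = tile < n                          # masks play the role of the @ prefix described above
    tl.store(out_ptr + tile, tile, mask=mask)

out = torch.zeros(256, device='cuda', dtype=torch.int32)
tile_demo[(1,)](out, out.numel(), BLOCK=16)  # writes 0..255 through a single 16x16 tile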
(Related Blog: Matrix Functions in R)
(Please note: the following function descriptions are taken from the Triton documentation; a short kernel sketch that uses several of them follows the list)
1. triton.language.zeros
Returns a block filled with the scalar value 0 for the given shape and dtype.
Shape: shape of the new array, e.g. (8, 16) or (8, )
dtype: Data type of the new array
2. triton.language.broadcast_to
Tries to broadcast the given block to a new shape.
Input: the input block
Shape: the desired shape
3. triton.language.reshape
Tries to reshape the given block to a new shape
Input: the input block
Shape: the desired shape
4. triton.language.ravel
Returns a contiguous flattened view of X.
X (Block)- the input block
5. triton.language.program_id
Returns the id of the current program instance along the given axis.
axis: values (0,1,2)
builder: IR builder for code generation
6. triton.language.num_programs
Returns the number of program instances launched along the given axis.
axis: values (0,1,2)
builder: IR builder for code generation
7. triton.language.arange
Returns contiguous values within the half-open interval [start, end)
Start: start of the interval, must be a power of two
stop: end of the interval, must be a power of two
builder: IR builder for code generation
8. triton.language.broadcast_to
Input: the input block
Shape: the desired shape
Builder: IR builder for code generation
9. triton.language.exp
X (Block): the input values
builder: IR builder to generate code
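As a sketch of how several of these helpers fit together (program_id, arange and exp, together with masked loads and stores), here is a row-wise softmax kernel in the style of Triton's tutorials. The kernel name and shapes are illustrative, BLOCK must be a power of two at least as large as the number of columns, and API details can differ slightly between Triton versions.

import triton
import triton.language as tl

@triton.jit
def softmax_row_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(axis=0)                   # one program instance per row
    offs = tl.arange(0, BLOCK)                    # contiguous column offsets
    mask = offs < n_cols                          # mask the tail when BLOCK > n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=-float('inf'))
    num = tl.exp(x - tl.max(x, axis=0))           # element-wise exponential, shifted for numerical stability
    tl.store(out_ptr + row * n_cols + offs, num / tl.sum(num, axis=0), mask=mask)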
(Must Check: DATA TYPES in Python)
CUDA's GPU execution model follows the SPMD (single program, multiple data) paradigm, in which each kernel instance is associated with an identifiable thread block in the launch grid. Triton follows a similar programming model, but kernels in Triton are single-threaded: each instance processes one block of data rather than coordinating many threads.
Kernels in Triton are automatically parallelized and associated with varied global ranges. Automatic parallelization leads to simpler kernels, free of CUDA-like concurrency management. The get_global_range function is used to query the global ranges associated with a kernel.
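Concretely, in the Python frontend a launch grid is just a tuple giving the number of program instances along each axis, and Triton schedules those instances onto the GPU by itself. Reusing the hypothetical softmax_row_kernel sketched above, launching one instance per row could look like this (the tensor shape and block size are arbitrary choices):

import torch
x = torch.randn(128, 512, device='cuda')
out = torch.empty_like(x)
# one single-threaded program instance per row; Triton parallelizes the instances across the GPU
softmax_row_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK=512)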
(Suggested Blog: OpenAI’s GPT-2 (Generative Pre-Trained Transformer-2) : "AI that is too Dangerous to Handle.")
Triton IR enables the compiler to perform basic optimizations automatically. For example, data can be automatically stashed to shared memory by inspecting the operands of computationally intensive block-level operations, and then allocated and synchronized using standard analysis techniques.
Triton programs are automatically parallelized both across SMs, through the concurrent execution of different kernel instances, and within SMs, by analyzing the iteration space of block-level operations and partitioning it across SIMD units (single instruction, multiple data).
(Automatic parallelization in Triton)
(Related Blog: Working With Python JSON Objects)
Triton and CUDA share some similarities but also differ in a variety of ways. Let’s look at the basic differences between the two.
(Please note that the points of difference have been taken from OpenAI)
Modern GPUs comprise three main components: DRAM (Dynamic Random Access Memory), SRAM (Static Random Access Memory) and ALUs (Arithmetic Logic Units). All three components must be accounted for when optimizing CUDA code.
Balancing all these factors is difficult; even programmers with extensive CUDA experience struggle with these optimizations. Triton automates the basic performance optimizations to reduce developer workload, so developers can focus on the high-level logic of their parallel code.
(Suggested Blog: OpenAI Jukebox: AI That Generates Complete Songs)
CUDA Programming Model (Scalar Program, Blocked Threads):

#pragma parallel
for(int i = 0; i < M; i++)
#pragma parallel
for(int j = 0; j < N; j++){
  float acc = 0;
  for(int k = 0; k < K; k++)
    acc += A[i, k] * B[k, j];
  C[i, j] = acc;
}

Triton Programming Model (Blocked Program, Scalar Threads):

#pragma parallel
for(int m = 0; m < M; m += MB)
#pragma parallel
for(int n = 0; n < N; n += NB){
  float acc[MB, NB] = 0;
  for(int k = 0; k < K; k += KB)
    acc += A[m:m+MB, k:k+KB] @ B[k:k+KB, n:n+NB];
  C[m:m+MB, n:n+NB] = acc;
}

(Source: Triton)
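A hedged Python-frontend version of the blocked pseudo-code above might look as follows. It assumes row-major FP16 inputs whose dimensions are exact multiples of the block sizes, and the kernel name and block sizes are illustrative choices rather than OpenAI's reference implementation.

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # one program instance per [BLOCK_M, BLOCK_N] tile of C
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * K + rk[None, :]   # first A tile (row-major)
    b_ptrs = b_ptr + rk[:, None] * N + rn[None, :]   # first B tile (row-major)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))   # the '@' of the blocked pseudo-code
        a_ptrs += BLOCK_K                                  # slide the A tile along K
        b_ptrs += BLOCK_K * N                              # slide the B tile along K
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)

M, N, K = 256, 256, 256
a = torch.randn(M, K, device='cuda', dtype=torch.float16)
b = torch.randn(K, N, device='cuda', dtype=torch.float16)
c = torch.empty(M, N, device='cuda', dtype=torch.float32)
matmul_kernel[(M // 64, N // 64)](a, b, c, M, N, K, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)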
|                          | CUDA   | Triton    |
| Memory Coalescing        | Manual | Automatic |
| Shared Memory Management | Manual | Automatic |
| Scheduling (Within SMs)  | Manual | Automatic |
| Scheduling (Across SMs)  | Manual | Manual    |
(Source: OpenAI)
(Must Catch: About OpenAI GPT-3 Language Model)
Random number generation
Small suggestions for Code Syntax
Dot Product Computes Wrong Values
Invalid PTX code in matrix multiplication example
Non-defined constant expressions
Poor matmul-square-nn bench performance
“Truth can only be found in one place: the code.”
― Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship
C++, Java, HTML, Python and Lua are some of today's popular languages. As coding for the Graphics Processing Unit gains popularity, software developers are rushing to find a platform that can handle code optimization more simply.
In the manual coding world, Triton instills a fresh ray of hope among programmers. With major optimizations performed automatically, this newly devised Python-like programming language has the calibre to tackle it all. With smaller code and better optimizations, this platform may well be the future of GPU programming.