Posts

SPO600 - Project Stage 3

I used the Python tool to modify existing functions for different architectures; it calls the GCC compiler and passes the flags needed to enable auto vectorization for that specific code. In the final version, I implemented solutions to all the limitations listed in my previous blogs. However, I was unable to overcome the permissible constraints. To summarize, auto vectorization is a useful technique for improving the performance of programs on ARM CPUs that support the SVE, SVE2, and ASIMD instructions. Auto vectorization of C code can be enabled by using the GCC compiler with the appropriate flags. This helps to optimize programs and take advantage of the SIMD capabilities of modern ARM processors. I made changes to the tool's final version. The tool can be downloaded from https://github.com/puja1102/SPO600Project . Improvements: The previous implementation's first limitation was that it did not support exception handling. The tool's final

SPO600 - Project Stage 2

Introduction: This tool generates three versions of the function specified as the second argument. Then, while building the main output file from these functions, this tool employs the ifunc capability to select the best method.  The tool was written in Python, while the ifunc resolver function is written in C. In this blog, I will explain the tool's working mechanism and limitations, as well as the procedure for testing the tool. Accessing the tool: The tool is hosted on GitHub. The tool is available for download at https://github.com/puja1102/SPO600Project . The GPL version 2 licence governs the use of this tool. The tool's main files are as follows: tool.py: the main tool written in Python that builds the functions and generates the output binary files using the best SIMD implementation. template.txt: This is a text file containing the template for the ifunc function, as well as the resolver function for the ifunc. The ifunc.c file is created using this template. This tool
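To make the ifunc idea above more concrete, here is a minimal sketch in C. It is not the tool's actual template.txt or resolver: the names sum_baseline, sum_sve, resolve_sum and sum_array are hypothetical, and both variants share the same body as stand-ins for builds made with different -march settings. It assumes GCC on Linux with glibc, and that on AArch64 the loader passes the hardware capability bits as the resolver's first argument.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <asm/hwcap.h>   /* HWCAP_SVE, from the Linux kernel headers */
#endif

/* Hypothetical variants. In a real build each would be compiled with
 * different -march flags; here the bodies are identical plain C. */
int64_t sum_baseline(const int32_t *v, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

int64_t sum_sve(const int32_t *v, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

typedef int64_t sum_fn(const int32_t *, size_t);

/* The resolver runs once, when the dynamic loader binds the symbol,
 * so it is kept minimal and only returns a function address. */
#if defined(__aarch64__)
static sum_fn *resolve_sum(uint64_t hwcap)
{
    /* Assumption: glibc passes AT_HWCAP as the first argument on AArch64. */
    return (hwcap & HWCAP_SVE) ? sum_sve : sum_baseline;
}
#else
static sum_fn *resolve_sum(void)
{
    /* On other architectures this sketch always picks the baseline. */
    return sum_baseline;
}
#endif

/* Every call to sum_array() goes to whichever variant the resolver chose. */
int64_t sum_array(const int32_t *v, size_t n)
    __attribute__((ifunc("resolve_sum")));
```

Callers simply use sum_array(values, count); the selection happens once at load time, not on every call.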

Auto Vectorization

I'm working on a blog post about the concept of automatic vectorization in parallel computing, which can help reduce the number of cycles and chains in loops. It means that instead of a scalar implementation, code can be converted to perform vector operations, so a single operation can be performed on multiple operands. The GCC compiler is sophisticated enough to use vectorization to optimise code and improve performance. It is possible to accomplish this by using optimization flags such as -O3 and -ftree-vectorize. AArch64 supports three extensions: SIMD (Single Instruction, Multiple Data), SVE, and SVE2. SVE and SVE2: Scalable Vector Extension 2 (SVE2) is an Armv9 extension that provides variable-width SIMD capability. The main difference between SVE2 and SVE is the functional coverage of the instruction set. SVE and SVE2 both allow for large amounts of data to be collected and processed. For an aarch64 system, run the following command to enable SVE: gcc -g -O3 -c -march=arm
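As an illustration of the kind of loop the vectorizer can handle, here is a small sketch in C. The function add_arrays is a hypothetical example, not code from the post. Each iteration is independent of the others, and restrict tells the compiler the arrays do not overlap, which are the conditions that usually let GCC turn the scalar loop into vector instructions at -O3.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: add two arrays element by element.
 * There is no dependency between iterations, and `restrict` promises
 * that dst, a and b do not alias, so the loop is a good candidate
 * for auto vectorization.
 */
void add_arrays(int32_t *restrict dst, const int32_t *restrict a,
                const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Compiling with gcc -O3 -fopt-info-vec -c should report whether the loop was vectorized; the exact message depends on the GCC version and target.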

SPO600 - Project Stage 1

In this blog, I'll go over the first stage of my SPO600 project. The first stage is dedicated to project planning. In this project, we must develop a proof-of-concept tool for creating functions using automatic vectorization. The primary goal of this project is to eliminate all of the setup performed by software developers and to automate the process through the creation of a tool. This tool allows developers to create three versions of a function and build them into a single output file, with one version selected at runtime. The goal of this project is to create a proof-of-concept tool that will take code that meets certain criteria and automatically build it with the ifunc capability to choose between multiple, autovectorized versions of a function, allowing the code to take advantage of the best SIMD implementation available on the CPU on which it is running. The limitations of this project are: The tool is only compatible with the aarch64 system. There are only three S

Lab 5 - Algorithm Selection Lab

In this lab, I wrote a set of programs that generate sound samples in a data array. Each program generates 5000000 samples as specified in the provided "vol.h" file, then scales the samples by a factor of 0.75 and stores them back in the array. Finally, the program computes and displays a sum of the output array. To test the run time for each program, we'll use the command "time ./programName", which will run the program and output the results as well as the time it took to run. The output reports the real (elapsed) time along with the "user time" and "system time" spent by the CPU. Task 1: I obtained the base run time data we will work with when I ran the initial program, "vol1." This program runs with the volume factor unchanged. This allows us to test the other programs and observe any significant changes in run time data. Running the same program will not always produce the same results; there may be slight differences. This could be due to initial r
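The scale-then-sum step described above can be sketched in C roughly as follows. This is a simplified assumption of what a naive version looks like, not the lab's actual code: the names scale_sample and scale_and_sum are hypothetical, and the samples are assumed to be signed 16-bit values.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale one 16-bit sample by a volume factor using floating point. */
static inline int16_t scale_sample(int16_t sample, float volume)
{
    return (int16_t)(sample * volume);
}

/* Scale every sample in place, then return the sum of the scaled
 * values so the work cannot be optimized away. */
int64_t scale_and_sum(int16_t *data, size_t n, float volume)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        data[i] = scale_sample(data[i], volume);
        sum += data[i];
    }
    return sum;
}
```

A call such as scale_and_sum(samples, 5000000, 0.75f) would match the sample count and factor mentioned above.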

Lab 4 - Part 2 (x86_64)

In this section, I must write the same programme for the x86_64 platform. This was a little difficult for me because I found aarch64 syntax to be simpler and easier to understand than x86_64 syntax. While writing the code for x86_64, I followed a similar procedure. I first printed the loop text 10 times without printing the loop counter. I tried it, and it worked on the first try. However, implementing the loop counter proved to be somewhat difficult. I was able to implement it, but when I was storing the count in location #, instead of storing a single byte, it was storing a qword, which overwrote the newline character (\n), and the output was in a single line. To solve the problem, I suffixed the instruction and the register with b. Here's my code:
.text
.globl _start
min = 0 /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10 /* loop exits when the index hits this number (loop condition is i<max) */
_start:

Lab 4 - Part 1 (AArch64)

This lab is an introduction to 64-bit assembly language for aarch64 and x86_64. In this lab, I'll be working with assembly language on both the x86_64 and aarch64 platforms. To access these platforms, we must use SSH to connect to two servers provided by the professor. The first server is Israel, which runs on an aarch64 system, and the second server is Portugal, which runs on an x86_64 system. The first task is to compile and run aarch64 programs. I connected to the Israel server and configured some settings for it. After that, I was given some examples to work with. Then we must create a loop from 0 to 9 and print the word on each iteration along with the loop counter. We were given a loop template that did not include the loop body. In aarch64, the syscall number 64 is used to invoke the write system call. In the second step, I copied the value from r19 to a new register and added '0' to it to convert it to ASCII, then I moved that value to the locati