Beamspace channel estimation is a crucial part of modern wireless communication systems because it minimises the impact of channel losses and allows for the effective use of spectrum resources. This review paper presents a thorough examination and comparison of beamspace channel estimation techniques. These algorithms include subspace-based techniques, machine-learning techniques, compressive sensing techniques, semi-blind techniques, and conventional techniques. Each category of algorithms is assessed and contrasted based on several criteria. Least squares (LS), minimum mean square error (MMSE), and linear minimum mean square error (LMMSE) algorithms are examples of traditional techniques. Space alternating generalised expectation maximisation (SAGE) is a component of the semi-blind approach. Compressive sensing algorithms such as Compressive Sampling Matching Pursuit (CoSaMP), sparse Bayesian learning (SBL), and orthogonal matching pursuit (OMP) exploit the sparsity and statistical properties of signals to efficiently recover information from undersampled measurements. In addition, the article explores the emerging field of machine-learning-based estimation approaches, including convolutional neural network (CNN), recurrent neural network (RNN), deep neural network (DNN), approximate message passing (AMP), and orthogonal approximate message passing (OAMP) methods. Additionally, the subspace-based Multiple Signal Classification (MUSIC) and Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) algorithms are evaluated.
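As a concrete illustration of the compressive-sensing family surveyed above, the following is a minimal NumPy sketch of orthogonal matching pursuit (OMP); the sensing matrix A (e.g. a beamspace dictionary), measurement vector y, and sparsity level k are placeholders for illustration, not values from any of the surveyed works.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: recover an (approximately) k-sparse x
    from undersampled measurements y ≈ A @ x."""
    residual = y.copy()
    support = []
    x_hat = np.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(k):
        # Select the dictionary column most correlated with the residual.
        idx = int(np.argmax(np.abs(A.conj().T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit restricted to the chosen support.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    x_hat[support] = coeffs
    return x_hat
```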
Artificial Intelligence (AI) has significantly transformed a wide range of fields, revolutionizing diverse applications. In this AI-driven era, tasks across multiple domains are being optimized for enhanced efficiency and performance. For deployment in real-world scenarios, edge devices play a critical role due to several compelling advantages. However, edge devices are constrained in computational power, battery life, and available resources for implementing AI models. Model quantization, which reduces the precision of model parameters (weights and activations), is one such technique that yields smaller model sizes and faster inference times without significantly compromising performance. In this study, we present a detailed comparative analysis of several existing quantization methods, including post-training quantization (PTQ) and quantization-aware training (QAT), within the context of edge AI applications. We evaluate the effectiveness of these techniques using key performance metrics such as accuracy, model compression, inference latency, and energy efficiency. We explore the quantization workflows of two popular frameworks, TensorFlow and PyTorch, for generating models compatible with edge devices such as microcontrollers and GPUs, and conduct a feasibility analysis of FPGA-based deployment.
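For context, below is a minimal sketch of the TensorFlow Lite post-training quantization workflow referred to above; `model` and `representative_data_gen` are assumed to be defined elsewhere, and the full-integer settings are illustrative rather than those used in the study.

```python
import tensorflow as tf

# Post-training quantization (PTQ) with the TensorFlow Lite converter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full-integer quantization needs a representative dataset for calibration.
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting .tflite file can then be run with the TFLite interpreter on a microcontroller or single-board device.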
Feature extraction at remotely located edge nodes poses a challenge due to resource scarcity in terms of computation power, memory footprint, and battery life. Compared to deep neural network (DNN) algorithms for feature extraction, this work targets non-neural face recognition algorithms with comparable accuracy at a reduced memory footprint for small datasets. Mixed-precision (8-bit fixed-point and 32-bit floating-point) training data is considered in this work for higher accuracy. Additionally, energy consumption is reduced by deploying a Memory Oriented MAnual tIling (MOMAI)-based non-neural face recognition (FR) algorithm on the GAP8 cluster. We have customized the data transfer flow with MOMAI (similar to GAP8’s NNtool with Auto-tiler) and parallelized the mixed-precision training data of a non-neural Eigenfaces-based FR algorithm. Compared to non-DMA, non-quantized, manually tiled versions, the DMA-based full manual-tiling strategy reduced recognition time by 16.09× on 8 cores and by 134.32× on a single core. Furthermore, 8-bit quantization reduces the model size by 14.95× and the overall cycle count by 55.75%.
Object tracking on embedded systems requires efficient models that operate within limited computational resources. Edge deployment reduces latency and power consumption, ensuring optimal performance for real-time applications in constrained environments. In this work, we propose a lightweight custom object tracker based on a Siamese architecture, optimized for faster response and lower computation. We explore quantization-aware training (QAT) to reduce the precision of the parameters in the backbone classifier. The tracker, validated on the VOT 2015 benchmark, shows minimal deviation compared to the baseline SiamFC. The tracker was implemented and tested across various platforms, including CPU, NVIDIA Jetson GPU, and PYNQ-Z2 FPGA, demonstrating efficient performance on resource-limited devices while maintaining real-time capabilities. Our compressed model is 2.1× smaller, with minimal accuracy loss, making it suitable for applications such as UAVs, traffic monitoring, and real-time surveillance systems.
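A minimal sketch of a PyTorch quantization-aware training setup of the kind described above is shown below; `build_siamese_backbone` is a hypothetical helper standing in for the tracker's backbone (assumed to contain quant/dequant stubs), and the qconfig choice is illustrative.

```python
import torch
import torch.quantization as tq

model = build_siamese_backbone()   # hypothetical float backbone with quant stubs
model.train()

# Attach a QAT configuration and insert fake-quantization observers.
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
model_prepared = tq.prepare_qat(model)

# ... run the usual training loop on model_prepared so the weights
#     adapt to quantization noise ...

# Freeze and convert to a true int8 model for deployment.
model_prepared.eval()
model_int8 = tq.convert(model_prepared)
torch.save(model_int8.state_dict(), "backbone_int8.pth")
```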
Convolutional neural networks (CNNs) are the epitome of artificial intelligence (AI)-based applications. The computationally intensive convolution operation is the core of the entire architecture. Acceleration of CNN-based applications requires several algorithmic-level manipulations and optimizations for resource-constrained devices. In this work, we propose a template-based methodology for CNN acceleration on field-programmable gate array (FPGA) hardware by designing reusable cores for individual layers such as convolution, pooling, and dense layers. We explored various optimization techniques to arrive at the best hardware design strategy with data reuse and design space exploration. We verified our methodology on LeNet-5 with a 5×5 kernel and on a custom CNN with a 3×3 kernel for classification. The hardware system design was validated on a Xilinx XC7Z020 FPGA. Our proposed methodology achieves 2.9 GOPS, outperforming an existing implementation by 1.28×.
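For reference, the loop nest below is a behavioural NumPy model of a single convolution core (one output channel), the kind of unit the template-based methodology maps to reusable hardware; the shapes and stride are illustrative assumptions.

```python
import numpy as np

def conv2d_core(ifmap, kernel, stride=1):
    """Behavioural model of one convolution core: ifmap is (C, H, W),
    kernel is (C, K, K); returns one (Ho, Wo) output feature map."""
    C, H, W = ifmap.shape
    _, K, _ = kernel.shape
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    ofmap = np.zeros((Ho, Wo))
    for oh in range(Ho):
        for ow in range(Wo):
            acc = 0.0                       # the MAC accumulator
            for c in range(C):
                for kh in range(K):
                    for kw in range(K):
                        acc += ifmap[c, oh * stride + kh, ow * stride + kw] \
                               * kernel[c, kh, kw]
            ofmap[oh, ow] = acc
    return ofmap
```

Unrolling and reusing these inner loops is what the hardware template exploits for data reuse.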
With the high volume of vehicle traffic, tracing and capturing vehicular information through traffic surveillance on roads, in parking areas, or for safety purposes is difficult. In the proposed method, a deep-learning-based object detection model, EfficientDet-D0, has been trained on a custom dataset for license plate detection, together with the Tesseract optical character recognition model. The proposed method uses an improved license plate extraction algorithm, which reduces false localizations, followed by character recognition in a pipelined manner. We have also explored model quantization to compress the model with reduced precision for efficient edge-based deployment in an end application. We have dedicated our study to Indian vehicles, evaluated the performance on standard datasets such as CCPD and UFPR, and achieved 97.9% in license plate localization and 95.15% in end-to-end detection and recognition. We have implemented the method on Raspberry Pi 3 and NVIDIA Jetson Nano devices with improved performance. Compared with the state-of-the-art, we have achieved improvements of 2×, 3.8×, and 2.5× on CPU, GPU, and edge platforms, respectively.
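The snippet below sketches the detect-then-recognize pipeline in Python with pytesseract; `detector` is a stand-in for the trained EfficientDet-D0 plate detector (not reproduced here), and the box format is an assumption for illustration.

```python
import cv2
import pytesseract

def recognize_plates(frame, detector):
    """Detect plates with `detector` and read each crop with Tesseract."""
    boxes = detector(frame)                 # assumed: list of (x, y, w, h)
    plates = []
    for (x, y, w, h) in boxes:
        crop = frame[y:y + h, x:x + w]
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        # Single-line page segmentation mode suits short plate strings.
        text = pytesseract.image_to_string(gray, config="--psm 7")
        plates.append(text.strip())
    return plates
```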
Fabric defects can significantly impact the quality of a textile product. By analyzing the types and frequencies of defects, manufacturers can identify process inefficiencies, equipment malfunctions, or operator errors. Although deep learning networks are accurate in classification applications, some defects may be subtle and difficult to detect, while others may have complex patterns or occlusions. CNNs may struggle to capture a wide range of defect variations and to generalize well to unseen defects. Discriminating between genuine defects and benign variations requires sophisticated feature extraction and modeling techniques. This paper proposes a residual network-based CNN model to enhance the classification of fabric defects. A pretrained residual network, ResNet50, is fine-tuned to classify fabric defects into four categories: holes, objects, oil spots, and thread errors on the fabric surface. The fine-tuned network is further optimized via cuckoo search optimization using classification error as a fitness function. The network is systematically analyzed at different layers, and the classification results are reported using a confusion matrix and per-class classification accuracy. The experimental results confirm that the proposed model achieved superior performance, with 95.36% accuracy and a 95.35% F1 score for multiclass classification. In addition, the proposed model achieved higher accuracy with similar or fewer trainable parameters than traditional deep CNN networks.
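As a rough sketch of the fine-tuning step only (the cuckoo search optimization stage is not shown), the snippet below replaces the ResNet50 head with a four-class classifier in PyTorch; the optimizer and learning rate are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet50 and replace the 1000-way head with
# a 4-way head for holes, objects, oil spots, and thread errors.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 4)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ... standard training loop over the fabric-defect dataset ...
```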
Designing high-speed and energy-efficient blocks for image and digital signal processing (DSP) architectures is an evolving research field. This work designs a high-speed and energy-efficient multiply-accumulate (MAC) unit to augment the performance of field-programmable gate array (FPGA)-based accelerators and softcore processors. Three discrete 32-bit fixed-point signed MAC architectures were designed in Verilog and synthesized for the Zynq 7000 ZedBoard to obtain an efficient MAC architecture. The ultimate goal of this work is to design a fast and energy-efficient MAC unit that can achieve speed up to that of the DSP48 block and reduce the latency of IoT edge computing. Energy efficiency was achieved in partial product generation (PPG) and partial product addition (PPA) for the proposed Booth radix-4 Dadda (BR4D)-based MAC. At the PPG stage, the width of the partial product (PP) terms was optimized with Bewick’s sign extension to reduce power consumption. At the PPA stage, the Dadda-based reduction of the number of PP rows lowers the critical path delay (CPD). The proposed BR4D MAC unit offers reductions in dynamic power, CPD, power-delay product (PDP), and energy-delay product (EDP) of 22%, 9%, 29%, and 36%, respectively, compared to a standard Booth radix-4 Wallace tree (BR4WT)-based MAC. Furthermore, the hybrid MACs (BR4WT and BR4D) were compared with current state-of-the-art (SoA) designs, and the proposed BR4D MAC was found to be 47% faster than the comparable SoA design. The proposed BR4D was tested with a frequency-scaling technique, reducing the frequency in steps of 10 MHz from the maximum usable frequency (MUF) of 64 MHz down to 10 MHz, to evaluate its performance for low-power applications. Reducing the clock frequency by 84% reduces power consumption in the same proportion and speed by 38%. Additionally, the proposed design helps improve the battery life of IoT end nodes, with reductions in energy consumption and EDP of 76% and 61%, respectively.
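A small Python behavioural model of Booth radix-4 recoding, the partial-product-generation scheme used by the BR4D MAC, is given below; it checks that the sum of digit-weighted partial products equals the true product (the Dadda reduction tree and Bewick sign extension are hardware details not modelled here).

```python
def booth_radix4_digits(value, width):
    """Recode a width-bit two's-complement integer into Booth radix-4
    digits in {-2, -1, 0, 1, 2}; width is assumed even."""
    u = value & ((1 << width) - 1)              # raw two's-complement bits
    bit = lambda i: 0 if i < 0 else (u >> i) & 1
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(width // 2)]

def booth_radix4_multiply(a, b, width=32):
    """Product as the sum of one shifted partial product per Booth digit."""
    return sum(d * a * (4 ** i)
               for i, d in enumerate(booth_radix4_digits(b, width)))

assert booth_radix4_multiply(-1234, 5678) == -1234 * 5678
```

Radix-4 recoding roughly halves the number of partial-product rows compared with a simple array multiplier, which is what shortens the reduction stage.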
Among various platforms for computer vision algorithms, the FPGA has gained popularity as a low-power solution. These algorithms involve convolution operations, which are extensively performed using signed multipliers. Hence, this work proposes high-speed and energy-efficient signed fixed-point multipliers for digital signal processing (DSP) applications. The work focuses on reducing the combinational path delay (CPD) using LUT-based Booth radix-4 partial product (PP) generation with Bewick’s sign extension and Dadda-based concurrent PP reduction with a carry-save adder (CSA) on Xilinx (now AMD) FPGAs. The proposed design eliminates the requirement of a long carry chain for PP reduction. The proposed multiplier reduces CPD by 3%, 4%, and 16% compared to the state-of-the-art (SoA) multiplier for the 8×8, 16×16, and 32×32 sizes, respectively. We have also analyzed a pipelined version of our proposed 32×32 multiplier, which offers CPD and energy-delay product (EDP) reductions of 12.28% and 19.47% at the cost of a 3% and 80% increase in LUTs and flip-flops, respectively, compared to the combinatorial multiplier.
Automatically adopting new, more productive approaches based on previous effort is a traditional way of tackling upcoming challenges. Machine learning (ML) is popular for the same reason: it learns from past experience using past data. The key mechanism is automation across all procedures. On the basis of historical data, machine learning can predict related outcomes through various mathematical models in a data-driven approach. There are many types of algorithms that support the ML framework, and model efficiency plays an important role in obtaining better outcomes. Leave-one-out cross-validation is one of the techniques for checking this efficiency: the process is executed N times, once per sample, to derive a more reliable estimate.
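The following scikit-learn sketch shows leave-one-out cross-validation in practice; the dataset and classifier are illustrative stand-ins, not from any particular study.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Leave-one-out CV: with N samples, the model is trained N times,
# each time holding out exactly one sample for testing.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(f"LOOCV accuracy over {len(scores)} folds: {scores.mean():.3f}")
```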
Modern computer vision applications depend heavily on convolutional neural networks (CNNs) to perform feature extraction, dimensionality reduction, and related tasks. A single convolutional layer involves a series of arithmetic calculations dominated by multiply-accumulate (MAC) operations, whose count depends directly on the kernel size. In this work, we propose a design method for a run-time reconfigurable convolution IP core that can be implemented for any layer of a CNN with variable kernel sizes, verified for different dimensions: 5×5 for LeNet-5 and 11×11 for AlexNet. The proposed accelerator is 6.25× and 12.1× faster than the basic MAC operation for LeNet-5 and AlexNet, respectively. The design has been compared with an existing architecture and improves performance by 2.46× in G(FL)OPS per MAC module. Our proposed design supports run-time reconfiguration with a scheduler logic, which enables the design to be realized for any specific layer, kernel dimension, and channel count with efficient resource utilization.
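To make the kernel-size dependence concrete, the helper below counts the MACs in one convolutional layer; the layer shapes in the example are typical textbook figures for the first layers of LeNet-5 and AlexNet and are assumptions for illustration only.

```python
def conv_layer_macs(out_h, out_w, out_ch, in_ch, k):
    """MACs in one convolutional layer: each output element costs
    in_ch * k * k multiply-accumulates."""
    return out_h * out_w * out_ch * in_ch * k * k

# LeNet-5 conv1 (assumed): 28x28x6 outputs, 1 input channel, 5x5 kernel
print(conv_layer_macs(28, 28, 6, 1, 5))       # 117,600 MACs
# AlexNet conv1 (assumed): 55x55x96 outputs, 3 input channels, 11x11 kernel
print(conv_layer_macs(55, 55, 96, 3, 11))     # 105,415,200 MACs
```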
Over the last few years, we have witnessed relentless improvement in the field of computer vision and deep neural networks. In a deep neural network, the convolution operation is the load bearer, as it performs feature extraction and dimensionality reduction on a large scale. As models continue to grow deeper and bulkier for better efficiency and accuracy, storage requirements increase rapidly as well. The problem arises when performing computation with efficient numerical representations on embedded devices. Transitioning from floating-point to fixed-point representation can reduce computation time, storage requirements, and latency at the cost of some accuracy loss. In this paper, we analyze the effect of quantizing the first convolutional layer on accuracy and memory storage requirements for varying bit-widths of fixed-point representations of the network parameters. The approach adopted is post-training quantization with a mixed-precision format, which avoids model re-training and minimizes accuracy loss, using root-mean-square error (RMSE) as the performance metric. Various combinations have been analyzed and compared to find the optimal precision for implementation on a resource-constrained device. Based on the analysis, the suggested bit-widths are <10,5> for I/O data and <20,10> for intermediate data, instead of <16,8> and <32,16>, respectively. This combination of bit-widths reduces resource consumption, lowering BRAM usage by 10%, DSPs by 98.6%, and FFs by 40.27%, with some accuracy loss.
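A minimal sketch of this kind of fixed-point analysis is given below, interpreting a format <m,n> as m total bits (including sign) with n fractional bits; the random weight data is a stand-in for the first-layer parameters.

```python
import numpy as np

def to_fixed_point(x, total_bits, frac_bits):
    """Quantize to a signed <total_bits, frac_bits> format and return the
    dequantized values, so the error against the float original is visible."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

w = np.random.uniform(-1.0, 1.0, size=1000)   # stand-in layer weights
for fmt in [(32, 16), (16, 8), (20, 10), (10, 5)]:
    print(fmt, rmse(w, to_fixed_point(w, *fmt)))
```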
The multi-core parallel ultra-low-power (PULP) cluster architecture allows the IoT edge node to shift toward near-sensor computing. In this paper, non-neural Eigenfaces-based face recognition (FR) is examined on an octa-core PULP cluster. The Eigenfaces-based algorithm can achieve high accuracy without using a large data model. It is observed that the Eigenfaces-based face recognition algorithm achieves 93% accuracy on the PULP platform with a 4.55× smaller model size compared to the state-of-the-art SqueezeNet1.1-based FR algorithm on the GAP8 platform. The Eigenfaces-based face recognition is parallelized to achieve maximum speed-up on the multi-core cluster, reducing recognition time. Furthermore, DMA-based communication between the fabric controller and the multi-core cluster reduces recognition time by 50× at the cost of a slight degradation in multi-core speed-up. By adopting this technique, 165 faces per second are recognized with 93% accuracy on the octa-core PULP cluster, which is 7.85× faster than a single-core RISC-V with DMA. Compared to the ARM Cortex-M7 architecture, the multi-core PULP cluster reduces recognition time by 89.89%. These results make the multi-core PULP cluster an efficient choice for Eigenfaces-based face recognition at the edge.
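For reference, a floating-point NumPy model of the Eigenfaces pipeline that is parallelized on the PULP cluster is sketched below; the data shapes and the number of retained components k are assumptions.

```python
import numpy as np

def train_eigenfaces(faces, k):
    """faces: (N, D) flattened training images; keep k principal components."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Rows of vt are the eigenfaces (principal directions).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:k]                       # (k, D)
    weights = centered @ eigenfaces.T         # (N, k) training projections
    return mean, eigenfaces, weights

def recognize(probe, mean, eigenfaces, weights):
    """Project a probe image and return the nearest training face index."""
    w = (probe - mean) @ eigenfaces.T
    return int(np.argmin(np.linalg.norm(weights - w, axis=1)))
```

On the target, the projection (a matrix-vector product) is the part that is tiled and split across the eight cores.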
Smart farming applies information and communication technologies to various fields of agriculture; it refers to the use of information and data management technologies in farming. Smart farming leads toward highly productive and sustainable agricultural production and gives the farmer many advantages in decision making for better management. Smart farming technologies collect precise measurements of the factors that determine farming outcomes, making agriculture more reliable, predictable, and sustainable. It also improves crop health, reduces the ecological footprint of farming, helps feed the increasing global population, provides food security under climate change scenarios, and achieves higher yields while reducing operating costs. It is also needed to meet the needs of the growing population. Many technological tools, such as IoT devices, software support, connectivity, data analytics, robots, drones, and GPS, help enhance the quantity and quality of agricultural production while minimizing labor.
Modern-day applications like the Internet of Things (IoT), machine learning, and artificial intelligence collect a massive amount of data to process. Next-generation processors need to be more efficient in processing enormous amounts of data to extract features. IoT is an emerging field of technology, and edge computing is a promising research field nowadays. Customized architectures with digital signal processing (DSP) operations reduce the data transmitted to a higher node for processing. The multiply-and-accumulate (MAC) unit is an essential part of many modern-day processors. The multiplier design needs to be optimized to reduce partial product terms and area. Addition is a frequently used operation in multiplication as well as in the MAC operation. In this paper, novel approaches are taken into account to optimize the MAC unit to improve performance and latency while extending a soft processor architecture for edge computing devices.
The proposed quasi-cyclic low-density parity-check (QC-LDPC) code efficiently communicates images at a comparable peak signal-to-noise ratio (PSNR) while requiring a lower channel signal-to-noise ratio (SNR). The image is encoded using the Gauss–Jordan elimination method, while decoding uses the min-sum iterative message-passing algorithm with a code length of 1152. The decoder hardware design is based on a fully parallel architecture to achieve higher throughput.
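As a reference for the decoding step, the function below is a software model of the min-sum check-node update used by the iterative message-passing decoder; it operates on log-likelihood ratios (LLRs) and assumes at least two incoming messages.

```python
import numpy as np

def min_sum_check_node(msgs):
    """Min-sum check-node update: each outgoing message takes the product of
    the signs of the *other* inputs and their minimum magnitude."""
    msgs = np.asarray(msgs, dtype=float)
    signs = np.where(msgs < 0, -1.0, 1.0)
    total_sign = np.prod(signs)
    mags = np.abs(msgs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]   # two smallest magnitudes
    out = np.empty_like(msgs)
    for i in range(len(msgs)):
        mag = min2 if i == order[0] else min1     # exclude own magnitude
        out[i] = total_sign * signs[i] * mag
    return out
```

In fully parallel hardware, one such unit is instantiated per check node so all rows of the parity-check matrix are processed simultaneously.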
Particle filter algorithms have been successfully used in various visual object tracking applications. They handle non-linear models and non-Gaussian noise, but are computationally demanding. In this paper, we propose a scalable implementation of the particle filter algorithm for visual object tracking, using a scalable interconnect such as a network-on-chip on an FPGA platform. Here, several processing elements execute in parallel to handle a large number of particles. We propose two designs and implementations, one optimized for speed and the other optimized for area. These implementations can easily support different image sizes, object sizes, and numbers of particles without modifying the complete architecture. Multi-target tracking is also demonstrated for four objects. We validated the particle filter-based visual tracking with a video feed from a Petalinux-based system. With an image size of 320×240, frame rates of 348 fps and 310 fps were achieved for single-object tracking of objects of size 17×17 and 33×33 pixels, respectively, with a reasonably low power consumption of 1.7 mW/fps on a Zynq XC7Z020 (ZedBoard) at an operating frequency of 69 MHz. This makes our implementation a good candidate for low-power visual object tracking on FPGAs, especially in low-power smart camera applications.
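A compact software model of one bootstrap particle filter iteration (predict, weight, resample) is sketched below; the random-walk motion model and the `measure_likelihood` scoring function are illustrative assumptions rather than the exact models used in the FPGA design.

```python
import numpy as np

def particle_filter_step(particles, measure_likelihood, motion_std=2.0):
    """One predict-update-resample iteration for 2-D position tracking.
    particles: (N, 2) array; measure_likelihood(p) scores a particle
    against the current frame (e.g. a colour-histogram similarity)."""
    n = len(particles)
    # Predict: random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: weight each particle by its measurement likelihood.
    weights = np.array([measure_likelihood(p) for p in particles])
    weights = weights / weights.sum()
    # Systematic resampling.
    positions = (np.arange(n) + np.random.uniform()) / n
    indices = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    particles = particles[indices]
    estimate = particles.mean(axis=0)          # tracked object position
    return particles, estimate
```

In the FPGA implementation, the weight computation for different particles is what gets distributed across the processing elements on the network-on-chip.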
This paper presents the implementation of particle filter-based object tracking using ReconOS on a reconfigurable computing system. Multithreading can be used to improve the performance of complex image processing algorithms, but their sequential execution is a barrier that can be tackled by the effective use of FPGAs. To accomplish the desired performance, the ReconOS operating system is used on an ARM-based CPU-FPGA hybrid platform. ReconOS extends operating system communication and synchronization primitives such as mutexes, semaphores, condition variables, and message boxes to reconfigurable hardware. ReconOS provides the advantage of mapping the particle filter algorithm onto reconfigurable hardware and accessing the data from software threads, thus providing improved performance, portability, a unified appearance, and transparency to the object tracking application.
Low-density parity-check (LDPC) codes are widely used error-correcting codes in various applications such as Wi-Fi, WiMAX, and Digital Video Broadcasting - Satellite - Second Generation (DVB-S2). The proposed work focuses on LDPC decoder design using soft-decision iterative message passing for code lengths of 222, 546, 642, 648, and 1152 bits, achieving a BER of 10−5 with a coding gain of approximately 4 dB. The work presents a fully parallel design based on isolated shifted identity matrices from the structured construction of quasi-cyclic LDPC (QC-LDPC) codes, giving a throughput of 2 Gbps.
Real-time particle filter-based object tracking in videos on embedded platforms (FPGAs) is challenging because of resource usage and computational complexity. Furthermore, minor changes to the algorithm require changes in the hardware. To address these issues, we propose a parametrizable FPGA framework for the particle filter-based object tracking algorithm. This parametrizable implementation can be used for various image sequences, object sizes, and numbers of particles. By changing a few parameters, the parametrization leads to appropriate changes in hardware resources, resulting in efficient real-time operation of the algorithm. Experimental results show good tracking performance, and the proposed architecture can run the particle filter algorithm on a color video sequence at 650 fps on average.
Today's algorithm-to-hardware high-level synthesis (HLS) tools are purported to produce hardware comparable in quality to handcrafted designs, particularly with user-directive-driven or domain-specific HLS. However, HLS tools are not readily equipped for cases where an application or algorithm needs to scale. We present a (work-in-progress) semi-automated framework to map applications over a packet-switched network of modules (single FPGA) and then to seamlessly partition such a network over multiple FPGAs connected by quasi-serial links. We illustrate the framework through three application case studies: LDPC decoding, particle filter-based object tracking, and matrix-vector multiplication over GF(2). Starting with high-level representations of each case application, we first express them in an intermediate message-passing formulation, a model of communicating processing elements. Once the processing elements are identified, they are either handcrafted or realized using HLS. The rest of the flow is automated: the processing elements are plugged onto a configurable network-on-chip (CONNECT) topology of choice, followed by partitioning the 'on-chip' links to work seamlessly across chips/FPGAs.
There is a continuous requirement to enhance computation speed with minimal resources in order to improve the performance of signal processing algorithms. This paper proposes the architecture and implementation of a modified color-histogram-based particle filter for object tracking in video. The architecture implements weight calculation and histogram calculation in a highly parallel form. The proposed architecture achieves resource savings through effective memory utilization. The performance of the algorithm is demonstrated in a single-object scenario.
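The colour-histogram weight computation that the architecture parallelizes can be modelled in a few lines; the bin count and the Gaussian weighting of the Bhattacharyya distance are common choices assumed here for illustration.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized per-channel colour histogram of an image patch (H, W, 3)."""
    hist = np.concatenate([
        np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(float)
    return hist / hist.sum()

def particle_weight(patch, ref_hist, sigma=0.1):
    """Weight a candidate patch by its similarity to the reference histogram."""
    cand = color_histogram(patch)
    bc = np.sum(np.sqrt(cand * ref_hist))      # Bhattacharyya coefficient
    dist = np.sqrt(max(1.0 - bc, 0.0))         # Bhattacharyya distance
    return float(np.exp(-dist ** 2 / (2.0 * sigma ** 2)))
```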
In this paper, we introduce a modified hardware architecture whose key features are reduced FPGA resource usage and an increased face detection frame rate. The system is based on the well-known Viola-Jones framework, which combines the AdaBoost algorithm with Haar features. We also describe the modifications to hardware design techniques used to achieve more parallel processing and higher detection speed. The system, implemented on a Xilinx Virtex-5 FPGA development board, delivers a high face detection rate (91.3%) at 60 frames/second for a VGA (640 × 480) video source. The power consumption of the implementation is 2.1 W.
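A software reference of the Viola-Jones detector (Haar cascade with AdaBoost), useful for checking the FPGA output frame by frame, can be written with OpenCV as below; the test image path and detection parameters are illustrative assumptions.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("vga_test_frame.png")       # assumed 640x480 test frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.png", frame)
```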
The objective of this paper is to present the architecture design and implementation of a software-defined hardware module called the Control Signal Generator (CSG) for pulsed RADAR (Radio Detection and Ranging) applications. It is a digital, programmable, application-specific control timing signal generator for the Disaster Management Synthetic Aperture Radar (DM-SAR) [1]. The module is a slave controller that receives commands through an asynchronous serial interface and generates programmable timings. The architecture was evolved and the module developed using the Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL), and it was successfully implemented on a Xilinx Field Programmable Gate Array (FPGA) XCV600-6HQ240.
The ultimate aim of ASIC verification is to obtain the highest possible level of confidence in the correctness of a design, to attempt to find design errors, and to show that the design implements the specification. The complexity of ASICs is growing exponentially, and the market is pressuring design cycle times to decrease. Traditional methods of verification have proven to be insufficient for digital image processing applications. We develop a new verification method based on SystemVerilog coupled with MATLAB to accelerate verification. The co-simulation is accomplished using MATLAB and SystemVerilog coupled through the Direct Programming Interface (DPI). The verification of an Image Resize design is used as a case study for the co-simulation method between SystemVerilog and MATLAB. The golden reference is built using MATLAB built-in functions, while the rest of the verification environment is in SystemVerilog. The goal is to find more bugs in the design compared to the traditional method of verification, reduce the time to verify a video processing ASIC, reduce debugging time, and reduce code length.