Fabric defects can significantly impact the quality of a textile product. By analyzing the types and frequencies of defects, manufacturers can identify process inefficiencies, equipment malfunctions, or operator errors. Although deep learning networks are accurate in classification applications, some defects may be subtle and difficult to detect, while others may have complex patterns or occlusions. CNNs may struggle to capture a wide range of defect variations and to generalize well to unseen defects. Discriminating between genuine defects and benign variations requires sophisticated feature extraction and modeling techniques. This paper proposes a residual network-based CNN model to enhance the classification of fabric defects. A pretrained residual network, ResNet50, is fine-tuned to classify fabric defects into four categories: holes, objects, oil spots, and thread errors on the fabric surface. The fine-tuned network is further optimized via cuckoo search optimization using classification error as the fitness function. The network is systematically analyzed at different layers, and the classification results are reported using a confusion matrix and per-class classification accuracy. The experimental results confirm that the proposed model achieves superior performance, with 95.36% accuracy and a 95.35% F1 score for multiclass classification. In addition, the proposed model achieves higher accuracy with similar or fewer trainable parameters than traditional deep CNN networks.
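As a rough illustration of the fine-tuning step described above, the sketch below replaces the ImageNet head of a pretrained ResNet50 with a 4-way classifier for the hole, object, oil spot, and thread error classes. It assumes PyTorch/torchvision; the hyperparameters are placeholders and the cuckoo search optimization stage is not shown.

```python
# Minimal fine-tuning sketch (assumed PyTorch/torchvision API; hyperparameters
# are illustrative, and the cuckoo-search stage is omitted).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # holes, objects, oil spots, thread errors

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace 1000-way head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """One fine-tuning step on a batch of labeled fabric images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```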
Designing high-speed and energy-efficient blocks for image and digital signal processing (DSP) architectures is an evolving research field. This work designs a high-speed and energy-efficient multiply-accumulate (MAC) unit to augment the performance of field-programmable gate array (FPGA)-based accelerators and softcore processors. Three discrete 32-bit fixed-point signed MAC architectures were designed in Verilog and synthesized for the Zynq 7000 ZedBoard to obtain an efficient MAC architecture. The ultimate goal of this work is to design a fast and energy-efficient MAC unit that can achieve speed on par with the DSP48 block, to reduce the latency of IoT edge computing. Energy efficiency was achieved in partial product generation (PPG) and partial product addition (PPA) for the proposed Booth radix-4 Dadda (BR4D)-based MAC. At the PPG stage, the width of the partial product (PP) terms was optimized with Bewick's sign extension to reduce the power consumption. At the PPA stage, Dadda-based reduction of the number of PP rows shortens the critical path delay (CPD). The proposed BR4D MAC unit offers reductions in dynamic power, CPD, power-delay product (PDP) and energy-delay product (EDP) of 22%, 9%, 29% and 36%, respectively, compared to a standard Booth radix-4 Wallace tree (BR4WT)-based MAC. Furthermore, the hybrid MACs (BR4WT and BR4D) were compared with current state-of-the-art (SoA) designs, and the proposed BR4D MAC was found to be 47% faster than the comparable SoA design. The proposed BR4D MAC was tested with a frequency scaling technique, reducing the frequency in steps of 10 MHz from the maximum usable frequency (MUF) of 64 MHz down to 10 MHz, to evaluate its performance for low-power applications. Reducing the clock frequency by 84% reduces the power consumption in the same proportion and the speed by 38%. Additionally, the proposed design helps to improve the battery life of IoT end nodes, with reductions in energy consumption and EDP of 76% and 61%, respectively.
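For readers unfamiliar with the metrics quoted above, the small sketch below shows how the power-delay product (PDP) and energy-delay product (EDP) are computed and how dynamic power scales roughly linearly with clock frequency; the numeric values are placeholders, not measurements from this work.

```python
# PDP, EDP, and first-order dynamic-power scaling with clock frequency
# (illustrative values only).
def pdp(power_w, delay_s):
    return power_w * delay_s                     # energy per operation (J)

def edp(power_w, delay_s):
    return pdp(power_w, delay_s) * delay_s       # joule-seconds

def scaled_dynamic_power(p_dyn_w, f_old_hz, f_new_hz):
    # P_dyn = alpha * C * V^2 * f, so it scales linearly with frequency.
    return p_dyn_w * (f_new_hz / f_old_hz)

# An ~84% frequency reduction (64 MHz -> 10 MHz) cuts dynamic power
# by roughly the same proportion.
print(scaled_dynamic_power(0.1, 64e6, 10e6))
```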
Among various platforms for computer vision algorithms, the FPGA has gained popularity as a low-power solution. These algorithms involve convolution operations, which are extensively performed using signed multipliers. Hence, this work proposes high-speed and energy-efficient signed fixed-point multipliers for digital signal processing (DSP) applications. This work focuses on reducing the combinational path delay (CPD) using LUT-based Booth radix-4 partial product (PP) generation with Bewick's sign extension and Dadda-based concurrent PP reduction with a carry save adder (CSA) for Xilinx (now AMD) FPGAs. The proposed design eliminates the requirement of a long carry chain for PP reduction. The proposed multiplier reduces the CPD by 3%, 4%, and 16% compared to the state-of-the-art (SoA) multiplier for 8×8, 16×16, and 32×32 sizes, respectively. We have also analyzed a pipelined version of our proposed 32×32 multiplier, which offers CPD and energy-delay product (EDP) reductions of 12.28% and 19.47% at the cost of a 3% and 80% increase in LUTs and flip-flops, respectively, compared to the combinational multiplier.
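The following behavioral sketch illustrates the radix-4 Booth recoding that underlies the PP generation stage: the multiplier is scanned in overlapping 3-bit windows, each producing a digit in {-2, -1, 0, 1, 2} that selects a shifted partial product. It is a software model for intuition only; the LUT mapping, Bewick's sign extension, and the Dadda/CSA reduction tree of the proposed hardware are not modeled.

```python
def booth_radix4_digits(b, n_bits):
    """Recode an n-bit two's-complement multiplier into radix-4 Booth digits."""
    b &= (1 << n_bits) - 1                 # keep the two's-complement bit pattern
    bits = [0] + [(b >> i) & 1 for i in range(n_bits)]  # implicit b[-1] = 0
    while (len(bits) - 1) % 2:             # sign-extend to an even bit count
        bits.append(bits[-1])
    # digit_i = -2*b[2i+1] + b[2i] + b[2i-1]
    return [-2 * bits[i + 2] + bits[i + 1] + bits[i]
            for i in range(0, len(bits) - 1, 2)]

def booth_multiply(a, b, n_bits):
    """Sum the partial products 0, ±a, ±2a, each shifted by 2 positions per digit."""
    return sum((d * a) << (2 * i)
               for i, d in enumerate(booth_radix4_digits(b, n_bits)))

assert booth_multiply(5, -3, 8) == -15
```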
Automatically adapting productive new approaches from previous efforts is a traditional way of addressing upcoming challenges, and machine learning (ML) is popular for exactly this reason: it learns from past experience using historical data, with automation as the key mechanism throughout the procedure. On the basis of historical data, machine learning can predict related outcomes through various mathematical models in a data-driven approach. Many types of algorithms support the ML framework, and model efficiency plays an important role in achieving better outcomes. Leave-one-out cross-validation is one technique for checking this efficiency: the training and evaluation process is executed N times, leaving out one sample each time, to derive a more reliable estimate.
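As a concrete illustration of leave-one-out cross-validation, the sketch below uses scikit-learn with a toy dataset and classifier; the data and model are placeholders chosen only to show the N-fold, one-sample-held-out evaluation loop.

```python
# Leave-one-out cross-validation: one fold per sample, trained N times.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy over {len(scores)} folds: {scores.mean():.3f}")
```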
Over the last few years, we have witnessed relentless improvement in the fields of computer vision and deep neural networks. In a deep neural network, the convolution operation is the load bearer, as it performs feature extraction and dimensionality reduction on a large scale. As models continue to grow deeper and bulkier for better efficiency and accuracy, storage requirements also increase rapidly. The problem arises when performing computation with efficient numerical representations on embedded devices. Transitioning from floating-point to fixed-point representation can potentially reduce computation time, storage requirements, and latency with some accuracy loss. In this paper, an analysis of the effects of quantizing the first convolutional layer on accuracy and memory storage requirements is carried out, with varying bit-widths for the fixed-point integer values of the network parameters. The approach adopted is post-training quantization with a mixed-precision format, which avoids model re-training and minimizes accuracy loss, using root-mean-square error (RMSE) as a performance metric. Various combinations have been analyzed and compared to find the optimal precision to implement on a resource-constrained device. Based on the analysis, the suggested bit-width of the I/O data for this implementation is <10,5> and that of the mid-data is <20,10>, instead of <16,8> and <32,16>, respectively. This combination of bit-widths reduces resource consumption, BRAM by 10%, DSPs by 98.6%, and FFs by 40.27%, with some accuracy loss.
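A minimal sketch of the post-training quantization and RMSE comparison is shown below, interpreting a format <W,F> as W total bits with F fractional bits (an assumption about the notation); the random stand-in weights are illustrative only.

```python
import numpy as np

def quantize_fixed_point(x, total_bits, frac_bits):
    """Round to the nearest representable fixed-point value and saturate."""
    scale = 1 << frac_bits
    qmin, qmax = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), qmin, qmax) / scale

def rmse(reference, quantized):
    return float(np.sqrt(np.mean((reference - quantized) ** 2)))

weights = np.random.randn(1000).astype(np.float32)   # stand-in layer weights
for fmt in [(10, 5), (16, 8), (20, 10), (32, 16)]:
    print(fmt, rmse(weights, quantize_fixed_point(weights, *fmt)))
```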
The multi-core parallel ultra-low power (PULP) cluster architecture allows the IoT edge node to shift toward near-sensor computing. In this paper, non-neural Eigenfaces-based face recognition (FR) is examined on an octa-core PULP cluster. High accuracy can be achieved with the Eigenfaces-based algorithm without using a large data model. The Eigenfaces-based face recognition algorithm achieves 93% accuracy on the PULP platform with a 4.55× smaller model size compared to the state-of-the-art SqueezeNet1.1-based FR algorithm on the GAP8 platform. Eigenfaces-based face recognition is parallelized to achieve maximum speed-up on the multi-core cluster, reducing recognition time. Furthermore, DMA-based communication between the fabric controller and the multi-core cluster reduces the recognition time by 50× at the cost of a slight degradation in speed-up on the multi-core cluster. With this technique, 165 faces per second are recognized with 93% accuracy on the octa-core PULP cluster, which is 7.85× faster than a single-core RISC-V with DMA. Compared to the ARM Cortex-M7 architecture, the multi-core PULP cluster reduces recognition time by 89.89%. These results make the multi-core PULP cluster an efficient choice for Eigenfaces-based face recognition at the edge.
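For context, the core of an Eigenfaces pipeline is a PCA projection followed by nearest-neighbour matching, as in the sketch below; the number of components and the data layout are illustrative, and the PULP-specific parallelization and DMA transfers are not modeled.

```python
import numpy as np

def train_eigenfaces(faces, n_components=16):
    """faces: (num_images, num_pixels) flattened grayscale training faces."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:n_components]            # principal components ("eigenfaces")
    weights = centered @ eigenfaces.T         # projection of each training face
    return mean, eigenfaces, weights

def recognize(probe, mean, eigenfaces, weights, labels):
    """Return the label of the training face closest in eigenface space."""
    w = (probe - mean) @ eigenfaces.T
    return labels[int(np.argmin(np.linalg.norm(weights - w, axis=1)))]
```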
Smart farming uses information and communication technologies in various fields of agriculture; it refers to the use of information and data management technologies in agriculture. Smart farming leads to highly productive and sustainable agricultural production and gives the farmer many advantages in decision making for better management. Smart farming technologies collect precise measurements of the factors that determine farming outcomes, making agriculture more reliable, predictable, and sustainable. It also improves crop health, reduces the ecological footprint of farming, helps feed the increasing global population, provides food security under climate change scenarios, and achieves higher yields while reducing operating costs. Many technological tools, such as IoT, software support, connectivity, data analytics, robots, drones, and GPS, are useful for enhancing the quantity and quality of agricultural production while minimizing labor.
Modern-day applications like the Internet of Things (IoT), machine learning, and artificial intelligence collect a massive amount of data to process. Next-generation processors need to be more efficient at processing enormous amounts of data to extract features. IoT is an emerging field of technology, and edge computing is a potential research field nowadays. Customized architectures with digital signal processing (DSP) operations reduce the data transmitted to a higher node for processing. The multiply and accumulate (MAC) unit is an essential part of many modern-day processors. The multiplier design needs to be optimized to reduce partial product terms and area. Addition is a frequently used operation in multiplication as well as in the MAC operation. In this paper, novel approaches are taken to optimize the MAC unit, improving performance and latency, while extending a soft processor architecture for edge computing devices.
The proposed quasi-cyclic low-density parity-check (QC-LDPC) code efficiently communicates images at a comparable peak signal-to-noise ratio (PSNR) with a lower signal-to-noise ratio (SNR). The image is encoded using the Gauss–Jordan elimination method, while decoding uses the min-sum iterative message-passing algorithm with a code length of 1152. The hardware design of the decoder is based on a fully parallel architecture to achieve higher throughput.
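The check-node half of the min-sum update mentioned above follows the textbook rule sketched below: each outgoing message takes the minimum magnitude and the sign product of the other incoming messages. The code structure, scheduling, and parallel hardware mapping of the actual decoder are not reproduced here.

```python
import numpy as np

def check_node_update(llrs_in):
    """Min-sum update for one check node; llrs_in are incoming edge LLRs."""
    llrs_in = np.asarray(llrs_in, dtype=float)
    out = np.empty_like(llrs_in)
    for i in range(len(llrs_in)):
        others = np.delete(llrs_in, i)
        out[i] = np.prod(np.sign(others)) * np.min(np.abs(others))
    return out

print(check_node_update([2.0, -0.5, 1.5]))   # -> [-0.5, 1.5, -0.5]
```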
Particle filter algorithms have been successfully used in various visual object tracking applications. They handle non-linear models and non-Gaussian noise but are computationally demanding. In this paper, we propose a scalable implementation of the particle filter algorithm for visual object tracking, using a scalable interconnect such as a network-on-chip on an FPGA platform. Here, several processing elements execute in parallel to handle a large number of particles. We propose two designs and implementations, one optimized for speed and the other for area. These implementations can easily support different image sizes, object sizes, and numbers of particles without modifying the complete architecture. Multi-target tracking is also demonstrated for four objects. We validated the particle filter-based visual tracking with a video feed from a Petalinux-based system. With an image size of 320×240, frame rates of 348 fps and 310 fps were achieved for single-object tracking of sizes 17×17 and 33×33 pixels, respectively, with a reasonably low power consumption of 1.7 mW/fps on the Zynq XC7Z020 (ZedBoard) at an operating frequency of 69 MHz. This makes our implementation a good candidate for low-power visual object tracking using FPGAs, especially in low-power, smart camera applications.
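The per-frame work distributed over the processing elements is the classic predict-weight-resample loop, sketched below in condensed form; the motion model, likelihood function, and particle count are placeholders rather than the paper's FPGA design.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_std=2.0):
    """particles: (N, 2) candidate object positions; weights: (N,)."""
    n = len(particles)
    # Predict: propagate particles with a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: weight each particle by how well it matches the observation.
    weights = weights * np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    # Resample: draw N particles in proportion to their weights.
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```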
This paper presents the implementation of particle filter-based object tracking using ReconOS on a reconfigurable computing system. Multithreading can be used to improve the performance of complex image processing algorithms, but their sequential execution is a barrier that can be tackled by the effective use of FPGAs. To accomplish the desired performance, the operating system ReconOS is used on an ARM-based CPU-FPGA hybrid platform. ReconOS extends communication and synchronization primitives of operating systems, such as mutexes, semaphores, condition variables, and message boxes, to reconfigurable hardware. ReconOS provides the advantage of mapping the particle filter algorithm into reconfigurable hardware and accessing the data from software threads, thus providing improved performance, portability, a unified appearance, and transparency to the object tracking application.
The low-density parity-check (LDPC) code is a widely used error-correcting code in various applications such as Wi-Fi, WiMAX, and Digital Video Broadcasting-Satellite-Second Generation (DVB-S2). The proposed work focuses on an LDPC decoder design using soft-decision iterative message passing for code lengths of 222, 546, 642, 648, and 1152 bits, which gives a BER of 10⁻⁵ with a coding gain of approximately 4 dB. The proposed work presents a fully parallel design based on the structured construction of quasi-cyclic LDPC (QC-LDPC) codes from isolated shifted identity matrices, giving a throughput of 2 Gbps.
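For intuition about the structured construction, the sketch below expands a small base matrix of cyclic shifts into a QC-LDPC parity-check matrix built from shifted Z×Z identity blocks (with -1 denoting an all-zero block); the base matrix and block size are toy values, not the codes designed in this work.

```python
import numpy as np

def expand_qc_ldpc(base_matrix, z):
    """Expand a base matrix of shift values into a binary parity-check matrix."""
    rows, cols = len(base_matrix), len(base_matrix[0])
    h = np.zeros((rows * z, cols * z), dtype=np.uint8)
    for r in range(rows):
        for c in range(cols):
            shift = base_matrix[r][c]
            if shift >= 0:                      # -1 means an all-zero block
                h[r * z:(r + 1) * z, c * z:(c + 1) * z] = \
                    np.roll(np.eye(z, dtype=np.uint8), shift, axis=1)
    return h

H = expand_qc_ldpc([[0, 1, -1, 2],
                    [2, -1, 0, 1]], z=4)        # toy 8 x 16 parity-check matrix
```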
Real-time particle filter-based object tracking in videos on embedded platforms (FPGAs) is challenging because of its resource usage and computational complexity. Furthermore, minor changes to the algorithm require changes in the hardware. To address these issues, we propose a parametrizable FPGA framework for the particle filter-based object tracking algorithm. This parametrizable implementation can be used for various image sequences, object sizes, and numbers of particles. By changing a few parameters, the parametrization leads to appropriate changes in hardware resources, resulting in efficient real-time operation of the algorithm. Experimental results show good tracking from the implementation, and the proposed architecture can run the particle filter algorithm on a color video sequence at 650 fps on average.
Today's algorithm-to-hardware high-level synthesis (HLS) tools are purported to produce hardware comparable in quality to handcrafted designs, particularly with user-directive-driven or domain-specific HLS. However, HLS tools are not readily equipped for cases where an application or algorithm needs to scale. We present a (work-in-progress) semi-automated framework to map applications onto a packet-switched network of modules (on a single FPGA) and then to seamlessly partition such a network over multiple FPGAs via quasi-serial links. We illustrate the framework through three application case studies: LDPC decoding, particle filter-based object tracking, and matrix-vector multiplication over GF(2). Starting with high-level representations of each case application, we first express them in an intermediate message-passing formulation, a model of communicating processing elements. Once the processing elements are identified, they are either handcrafted or realized using HLS. The rest of the flow is automated: the processing elements are plugged onto a configurable network-on-chip (CONNECT) topology of choice, followed by partitioning the 'on-chip' links to work seamlessly across chips/FPGAs.
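As a small reference for the GF(2) matrix-vector multiplication case study, the sketch below uses the fact that over GF(2) multiplication is a bitwise AND and addition is XOR, so each output bit is the parity of (row AND vector); the bit-packed representation is an implementation choice made here for brevity.

```python
def gf2_matvec(rows, vector):
    """rows: bit-packed matrix rows (ints); vector: bit-packed int."""
    result = 0
    for i, row in enumerate(rows):
        parity = bin(row & vector).count("1") & 1   # XOR-reduce of the AND
        result |= parity << i
    return result

# Example: a 3x4 matrix times a 4-bit vector.
print(bin(gf2_matvec([0b1010, 0b0111, 0b1100], 0b0110)))
```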
There is a continuous requirement to enhance computation speed with minimum resources to improve the performance of signal processing algorithms. This paper proposes the architecture and implementation of a modified color-histogram-based particle filter for object tracking in video. This architecture implements the weight calculation and histogram calculation in a highly parallel form. The proposed architecture occupies fewer resources through effective memory utilization. The performance of the algorithm is demonstrated using a single-object scenario.
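One common way to realize the weight calculation in a color-histogram particle filter is to compare a candidate region's histogram to a reference histogram with the Bhattacharyya coefficient, as sketched below; the bin count, lambda value, and the choice of similarity measure itself are assumptions for illustration, not necessarily those of the proposed architecture.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """patch: (H, W, 3) uint8 region; returns a normalized joint RGB histogram."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / hist.sum()

def particle_weight(patch, ref_hist, lam=20.0):
    """Weight grows with histogram similarity (Bhattacharyya coefficient rho)."""
    rho = np.sum(np.sqrt(color_histogram(patch) * ref_hist))
    return np.exp(-lam * (1.0 - rho))
```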
In this paper, we introduce a modified hardware architecture whose key features are reduced FPGA resource usage and a higher face detection frame rate. The system is based on the well-known Viola-Jones framework, which consists of the AdaBoost algorithm integrated with Haar features. We also describe the modifications in hardware design techniques that achieve more parallel processing and a higher detection speed. The system, implemented on a Xilinx Virtex-5 FPGA development board, delivers a high face detection rate (91.3%) at 60 frames/second for a VGA (640 × 480) video source. The power consumption of the implementation is 2.1 W.
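The parallelism exploited by Viola-Jones hardware comes largely from the integral image, which lets any Haar rectangle sum be computed from four lookups; the sketch below shows the idea with a simple two-rectangle feature and is illustrative rather than a description of the implemented pipeline.

```python
import numpy as np

def integral_image(img):
    """Cumulative row/column sums; ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Pixel sum of the w x h rectangle whose top-left corner is (x, y)."""
    total = ii[y + h - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += ii[y - 1, x - 1]
    return total

def haar_two_rect(ii, x, y, w, h):
    """Left-minus-right two-rectangle Haar feature of total size 2w x h."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```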
The objective of this paper is to present the architecture design and implementation of a software-defined hardware module called the Control Signal Generator (CSG) for a pulsed RADAR (Radio Detection and Ranging) application. It is a digital, programmable, application-specific control timing signal generator for the Disaster Management Synthetic Aperture Radar (DM-SAR) [1]. This module is a slave controller that receives commands through an asynchronous serial interface and generates programmable timings. The architecture was evolved, and the module was developed using the Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL) and successfully implemented on a Xilinx Field Programmable Gate Array (FPGA), the XCV600-6HQ240.
The ultimate aim of ASIC verification is to obtain the highest possible level of confidence in the correctness of a design, attempting to find design errors and showing that the design implements the specification. The complexity of ASICs is growing exponentially, and the market is pressuring design cycle times to decrease. Traditional methods of verification have proven to be insufficient for digital image processing applications. We develop a new verification method based on SystemVerilog coupled with MATLAB to accelerate verification. The co-simulation is accomplished using MATLAB and SystemVerilog coupled through the DPI. The verification of an image resize design is used as a case study for this co-simulation method between SystemVerilog and MATLAB. The golden reference is built using MATLAB built-in functions, while the rest of the verification environment is in SystemVerilog. The goal is to find more bugs in the design compared to traditional verification methods, reduce the time to verify a video processing ASIC, reduce debugging time, and reduce coding length.
The ultimate aim of ASIC verification is to obtain the highest possible level of confidence in the correctness of a design, attempting to find design errors and showing that the design implements the specification. The complexity of ASICs is growing exponentially, and the market is pressuring design cycle times to decrease. Traditional methods of verification have proven to be insufficient for digital image processing applications. We develop a new verification method based on SystemVerilog coupled with MATLAB to accelerate verification. The co-simulation is accomplished using MATLAB and SystemVerilog coupled through the DPI. The image resize design is used as a case study for the co-simulation method between SystemVerilog and MATLAB. The golden reference is built using MATLAB built-in functions, while the rest of the verification blocks are in SystemVerilog. The goal is to find more bugs in the image resizing design compared to traditional verification methods, reduce the time to verify a video processing ASIC, reduce debugging time, and reduce coding length.