Human Object Detection for Real-Time Camera using MobileNet-SSD

Technology is developing rapidly, and every field is expected to adopt it to increase the effectiveness and efficiency of work. One focus is image processing technology, whose benefits have led to its implementation in various fields such as security, health, and education. A current obstacle in the safety domain is the search for missing people, which is still done manually. Search teams often struggle because of the large search area, low light conditions, and complex terrain, so a tool capable of detecting humans is needed to assist in finding people. The authors therefore investigate human object detection using a simple device. The detector used is MobileNetV2-SSD, an algorithm with high detection speed and accuracy. Using the MobileNetV2-SSD method for human object recognition, a detection rate of 100% is obtained at an FPS value of 5. This is an open access article under the CC BY-SA license.


INTRODUCTION
In this digital era, technology is increasingly sophisticated and diverse, and it can make human work easier. Object detection technology in particular is advancing rapidly. Object detection is a computer vision technique for finding instances of objects in images or videos; object detection algorithms usually use machine learning or deep learning to produce meaningful results. The goal of object detection is to replicate, on a computer, the intelligence humans use to see objects: a detector locates an object's presence in the image and draws a bounding box around it. Image classification and object detection look similar, but classification only assigns an image to a specific category. Object detection is one of the fundamental problems of computer vision, and it has been used in various fields to improve the provision of services in multiple places. Applications include pedestrian detection, people counting, face detection, text detection, pose detection, and license plate recognition [1,2,3,4].
A digital image is a matrix in which the row and column indices represent a point in the picture, and the matrix elements (referred to as image elements or pixels) represent the gray level at that point. Each pixel has an integer value, the gray level, which indicates the amplitude or intensity of the pixel. An image is thus a two-dimensional function of two integer coordinates that yields an amplitude value [5,6,7]. This project will design and implement a human detection system that captures image input and finds the positions and number of people in a room. There are various object detection algorithms, such as YOLO, R-CNN, Mask R-CNN, MobileNet, and SqueezeDet [8,9,10,11,12]. This project uses the SSD (Single Shot Detector) method. SSD is a popular detector that can predict several classes. It uses a single deep neural network to detect objects in an image by discretizing the output space of bounding boxes into a set of default boxes with different aspect ratios and scales for each location on the feature map [13].
The detector generates a score for the presence of each object category in each default box and adjusts the box to match the object's shape. The network also combines predictions from multiple feature maps at different resolutions to handle objects of various sizes. SSD detectors are easy to train and can be integrated into software systems that require an object detection component, and SSD is much more accurate than other one-stage methods, even with smaller input image sizes [14][15]. This research builds a device used to detect objects. The method used is SSD (Single Shot Detector), which runs a convolutional network on the input image only once to compute the feature map. SSD is a good choice for video because its speed-accuracy trade-off is simple to manage. The research requires a camera, an ARM device (NanoPi M4V2), and a monitor that can display the object detection results, accuracy values, and FPS (frames per second) values from the camera.
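The FPS value reported on the monitor can be estimated from the timestamps of recently processed frames. The following is a minimal sketch; the class name and the sliding-window size are our own choices, not part of the original system.

```python
from collections import deque

class FPSCounter:
    """Estimates frames per second from a sliding window of frame timestamps."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self, t):
        """Record the timestamp (in seconds) of a newly processed frame."""
        self.stamps.append(t)

    def fps(self):
        """Average FPS over the recorded window; 0 if fewer than two frames."""
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0
```

In a live loop one would call `counter.tick(time.monotonic())` once per frame and overlay `counter.fps()` on the displayed image.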

METHOD

SSD (Single Shot Detector)
The SSD approach is based on a feed-forward convolutional network that generates a fixed-size set of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections [16]. The initial network layers are based on a standard architecture used for high-quality image classification (truncated before any classification layer), which we will refer to as the base network. Additional structures are then added to the network to produce detections, as shown in Figure 1, with the following main features:
• Detector: the network is a detector that also classifies the detected objects.
• Single-shot: detection is done in a single pass, which is faster while remaining accurate.
• SSD predicts category scores and box offsets for a fixed number of default bounding boxes using convolution filters applied to the feature map.
• Predictions of different scales are generated from feature maps of various dimensions to achieve high accuracy, and the predictions are then separated by aspect ratio.
• These features result in high accuracy, even on low-resolution input images.
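The non-maximum suppression step mentioned above can be sketched in pure Python. The box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are illustrative assumptions, not values taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box, dropping overlapping lower-scoring ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

Production detectors typically use an optimized built-in (e.g. OpenCV's NMS) rather than this loop, but the logic is the same.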

MobileNetV2
MobileNet is a convolutional neural network (CNN) architecture designed to address excessive computational resource requirements. Researchers at Google created this CNN architecture so that it can be used on mobile phones [17], hence the name MobileNet. The network extracts features that are then classified.
MobileNetV2 uses depthwise and pointwise convolutions and adds two new features, namely linear bottlenecks between the layers and shortcut connections between the bottlenecks. In the bottleneck section shown in Figure 2, there are inputs and outputs between models, and the inner layers encapsulate the model's ability to transform information from lower-level concepts (pixels) into higher-level descriptors.
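The saving from replacing a standard convolution with a depthwise convolution followed by a pointwise (1×1) convolution can be illustrated by comparing parameter counts. The layer sizes below are arbitrary examples, not taken from the MobileNetV2 architecture.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out
```

For a 3×3 layer with 32 input and 64 output channels, the standard convolution needs 18,432 weights while the depthwise-separable version needs only 2,336, roughly an 8× reduction.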

OpenCV
OpenCV (Open Source Computer Vision Library) is an open-source library that provides functions for real-time computer vision and image processing. In this work it is used to acquire frames from the camera, run the detection pipeline on them, and draw and display the detection results. Operations performed to transform an image into another image can be grouped based on the purpose of the transformation or the scope of the operations applied to the image [18].

Image Processing
A form of signal processing whose input is an image and whose output is another image produced with a specific technique. Digital image processing is carried out to correct image signal errors that occur during transmission and acquisition, and to improve the appearance of the image so that it is easier for the human visual system to interpret, by manipulating and analyzing the image. Operations that transform one image into another can be grouped by the purpose of the transformation or by the scope of the operations applied to the image [19][20].
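A minimal example of such an image-to-image transformation is grayscale conversion. The sketch below represents an RGB image as nested Python lists and uses the common luminance weights 0.299, 0.587, 0.114; real code would use OpenCV's `cvtColor` instead.

```python
def to_grayscale(image):
    """Map an RGB image (rows of (r, g, b) tuples) to one gray level per pixel,
    using the standard luminance weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in image]
```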

Hardware Design and Analysis
First, the program imports the libraries and loads the necessary dataset. We can verify that the code is valid by running it in Python. A `while True` loop keeps the program from exiting by itself; inside the loop we put the core functionality, which is as follows: (1) capture the video frame by frame, (2) detect at multiple scales, (3) draw a rectangle around each human, (4) display the resulting frame.
The device is connected to the network, the USB camera (C922), and the SSD through a GPIO pin. The native display is an LG monitor connected through the HDMI port. The program design is shown in Figure 3 and Figure 4.
Input: this section contains an input device in the form of a camera that takes pictures in real time. A video can also be fed into the process section to be analyzed. Lastly, the datasheet section contains the calculation results from the object detection being investigated, used as the ground-truth score of the video, which will later be compared with the readings from the process section.
Process: this section uses a laptop containing software to run the Python, OpenCV, and YOLOv4-Tiny programs. The program analyzes input from a video file or in real time from the camera.
Output: in this section, a monitor displays the object detection results produced by the algorithm embedded in the laptop. The framework consists of the following functions:
• Initial setting handler. Creates the initial settings or applies saved ones, and handles the system calls for the framework's initial start. Optionally it also contains the processes needed for analysis, such as a time recorder.
• Processor core and hardware identification. Identifies the specification of the processor and creates the hardware configuration file the framework will utilize. It also prepares the system calls and hardware utilization functions needed by the other entities.
• Image retrieval by camera. Retrieves image data inside the framework, and settles the hardware communication and system calls the framework needs.
• Core utilization analysis. Monitors the processor's utilization data and determines the available cores. A policy can be set for each core's utilization; for example, a core with high utilization will not be given a task. This function also provides data for the workload assigner.
• Workload division mapping and assigner. Receives data from the core utilization analysis and determines which task will be executed by which core, managing the assignment at the OS layer so each task is pinned to its core.
• Execute image processing task. Runs the image processing function on the assigned core.
• Retrieve and arrange the raw results. Because of the parallelization, each core's results are retrieved and combined into the data needed for further processing.
• Further result processing. Finalizes the results and shapes the data as intended, checks the finalized data and the running time of one framework cycle, and handles any functions needed for analysis.
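The workload division mapping described above can be sketched as a simple round-robin assigner over the cores whose measured utilization is below a threshold. The data structures and the 0.8 threshold are our own simplification, not the authors' implementation.

```python
def assign_tasks(tasks, core_utilization, max_load=0.8):
    """Map each task to a core, round-robin over cores whose measured
    utilization is below max_load; heavily loaded cores receive no work."""
    available = [core for core, load in sorted(core_utilization.items())
                 if load < max_load]
    if not available:
        raise RuntimeError("no core available below the utilization threshold")
    return {task: available[i % len(available)] for i, task in enumerate(tasks)}
```

On Linux, the resulting mapping could then be enforced at the OS layer, e.g. via CPU affinity when spawning each worker process.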

RESULTS AND DISCUSSION
In testing, the accuracy level is used to determine the accuracy of the device that has been made, so that the device can be said to be usable with a good level of accuracy. The calculations yield True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). From these, Precision (P) and Recall (R) can be computed to determine an algorithm's accuracy level. The definitions are as follows:

P = TP / (TP + FP)  (1)
R = TP / (TP + FN)  (2)

Mean Average Precision (mAP) is used to evaluate the model; it is the mean of the average precision (AP) obtained for each of the N classes:

mAP = (1/N) Σ AP_i  (3)

Figure 5 and Figure 6 discuss the video that was run for 6 seconds at 30 FPS (180 frames in total). A human object can be detected in the video, and this can also be done with a real-time camera without reducing the accuracy level.
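These definitions translate directly into code. The counts in the example below are illustrative, not the paper's measured values.

```python
def precision(tp, fp):
    """P = TP / (TP + FP): the fraction of detections that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = TP / (TP + FN): the fraction of ground-truth objects that are found."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP: the mean of the average precision obtained for each class."""
    return sum(ap_per_class) / len(ap_per_class)
```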
The MobileNetV2-SSD algorithm obtains satisfactory results because it benefits from the big-core/little-core arrangement, so the processing is more focused and the processor performance improves enough to beat YOLOv4.

Figure 5. The precision of several algorithms
Figure 6. The recall of several algorithms

Figure 7 explains the evaluation of the human object model run with the several algorithms used. MobileNetV2-SSD is still superior to the others because of the processor improvisation and its different architecture, so its image processing is much better compared to the YOLOv4 version.
When the image is processed, as in Figure 8, the video is processed by the program, and the results are displayed in the form of the video, the accuracy level, and the detected object's name. Table 1 describes the mAP of each algorithm used and the average time needed to read the 180 frames (30 FPS) contained in a video with a duration of 6 seconds. The authors also discuss the performance of the RAM (Random Access Memory) and CPU (Central Processing Unit) while the digital image processing is executed, shown in Figure 9.
The measured data show that CPU usage on the device is not too high (it does not reach 100%), and only 25% of the RAM is used (1.02 GB of the available 4 GB); aside from that, the device also does not overheat. Overall, the program and device run well, and there are no problems with their usage.

CONCLUSION
The experimental results show that the device and system that have been made work well for detecting human objects. The NanoPi M4V2 reaches an FPS (frames per second) rate that is reasonable for digital image processing of video with the MobileNetV2-SSD algorithm. The algorithm achieves high accuracy on the NanoPi M4V2 with video input under brighter lighting. While the device runs, there is no problem with CPU usage on the NanoPi M4V2 when using the MobileNetV2-SSD algorithm to detect human objects.

Figure 7. PR Curve of several Algorithms

Table 1. Evaluation Result