A. P. Kusumah et al., Counting Various Vehicles using YOLOv4 and DeepSORT

The Ministry of Public Works and Public Housing (PUPR) conducted a traffic survey to determine the total number of vehicles and classify them according to the Bina Marga vehicle categorisation. The survey has thus far been carried out manually, so it takes a lot of time and money to perform, and the number of surveyors required grows with the survey scope. Therefore, an alternative that can execute the survey procedure automatically and with tolerable accuracy is required. One solution is to use deep learning to detect and categorise vehicles within an application. The program is designed as a web application that receives video data from traffic recordings and provides a summary of vehicle counts. The deep learning model used is YOLOv4, which is trained to recognise vehicle classes following the Bina Marga vehicle types. The model was trained and tested using the Python programming language and the Darknet framework on the Google Colab platform. The YOLOv4 and DeepSORT method with a custom dataset reached a decent accuracy of 67.94%, considering the limited 1000 images used for training the model.


INTRODUCTION
At the Ministry of Public Works and Public Housing (PUPR), a traffic survey was conducted by calculating the total number of vehicles and their classification based on the Bina Marga vehicle class. The Ministry of PUPR uses the results of traffic surveys in policy planning and report generation. So far, the process of calculating and classifying vehicles in traffic surveys has been done manually by surveyors, who observe one vehicle lane on-site or via closed-circuit television (CCTV) camera recordings [1] [2]. Given the time and money invested in each survey for calculation and classification and for creating a traffic survey report document, the manual traffic survey process is costly. In addition, there is a high possibility of errors in the calculation and classification of vehicles made by the surveyors because multiple vehicle classes need to be identified and counted simultaneously. Table 1 shows the vehicle classes [3] that are used as detection classes in this study [4,5,6].
This study aims to propose an alternative method of traffic survey that could potentially reduce the dependency on human surveyors using deep learning concepts such as object detection, classification, and tracking. A website-based application that performs object identification, classification, and calculation was proposed to help analyse the results of the YOLOv4 and DeepSORT [7][8] method on a dataset taken from CCTV placed on the highway. YOLOv4 is used despite the existence of YOLOv5 because of better performance characteristics and relatively lower training time, whereas DeepSORT is chosen because it has an average Multiple Object Tracking Accuracy (MOTA) of 68.7% and provides a better detection and tracking speed [9,10,11,12].
The web application will be able to accept video files and upload them through an API to be processed by the server in Google Colab. The server then runs detection, classification, tracking and counting on the video using YOLOv4. After training, the YOLOv4 model must first be converted for use in the TensorFlow framework so that the DeepSORT algorithm can work with it. A survey document with road data and vehicle calculation results will be generated from the calculation results as a specifically formatted Excel file [10].

METHOD
The research is conducted by building an application that counts the number of vehicles by type within predetermined intervals. To do that, a custom-trained deep learning model is required to detect the custom vehicle classes. Therefore, the YOLOv4 model will be trained to detect 12 vehicle classes based on Bina Marga vehicle types and then implemented in an application together with an object tracking method, DeepSORT. The application will be able to accept video input and output the counting result as a file, which can be viewed once the object counting process is completed.
Following that, separate evaluations will be conducted for object detection and object counting. Object detection will be evaluated using the Mean Average Precision (mAP) computed at every 1000th iteration during training. Vehicle counting, on the other hand, will be tested by comparing a manual count of vehicles with the results from the implemented application. The following are the steps taken in this research.
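The counting step described above can be sketched as follows. This is a minimal illustration, not the application's actual code: it assumes the tracker yields (track_id, class_name, timestamp) observations, that each tracked vehicle is counted once in the interval where it first appears, and that intervals are 15 minutes long. The function name, the interval length, and the class labels are all hypothetical.

```python
from collections import Counter, defaultdict


def count_by_interval(observations, interval_seconds=900):
    """Count unique tracked vehicles per class in fixed time intervals.

    observations: iterable of (track_id, class_name, timestamp_seconds),
    for example one entry per frame in which the tracker reports a vehicle.
    Each track_id is counted once, in the interval of its earliest timestamp.
    The 900-second default is an assumed interval length.
    """
    first_seen = {}
    for track_id, class_name, ts in observations:
        # Keep only the earliest observation of every track.
        if track_id not in first_seen or ts < first_seen[track_id][1]:
            first_seen[track_id] = (class_name, ts)

    counts = defaultdict(Counter)
    for class_name, ts in first_seen.values():
        counts[int(ts // interval_seconds)][class_name] += 1
    return {interval: dict(c) for interval, c in counts.items()}


# Hypothetical observations; the class labels are placeholders only.
observations = [
    (1, "sedan", 10.0),
    (1, "sedan", 10.5),   # same vehicle seen again, not counted twice
    (2, "bus", 20.0),
    (3, "sedan", 910.0),  # falls into the second interval
]
summary = count_by_interval(observations)
```

Counting by track identity rather than by per-frame detections is what prevents a vehicle that stays in view for many frames from being counted repeatedly.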

Dataset Collection
Creating a deep learning model that uses custom classes requires gathering data to be used as training data. The data take the form of images of roads, the same kind of input the model will later analyse. These images are either captured manually or extracted as frames from video recordings of vehicles.
We obtained CCTV footage from various parts of Indonesia via local traffic survey websites. Many frames from the video streams were saved as jpg files and then prepared as the dataset for the YOLOv4 deep learning model.

Dataset Annotation
To prepare the dataset for YOLOv4, each jpg image is manually inspected for vehicles, and every vehicle found is labelled using the labelImg annotation tool, which already has a setting for YOLO labels and bounding boxes. An example of an annotated dataset image is shown in Figure 1. The gathered dataset is separated into three subsets: training, validation, and testing. According to Dobbin and Simon [13], deep learning generally uses a train-test split on its datasets to reduce the issue of overfitting and to help the learning algorithm generalise to cases it has not seen. The train-test split ratio that is typically applied is 90-10. However, because the dataset here is divided into three parts (training, validation, and testing), it is split in an 80-10-10 ratio.
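An 80-10-10 split of this kind could be produced as in the sketch below. It assumes a random shuffle with a fixed seed and works on a list of image file names; the function name, the seed, and the file names are hypothetical, and the count of 1000 merely echoes the dataset size mentioned in the abstract.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items and split them into train, validation and test lists.

    The shuffle uses its own random generator so that the split is
    reproducible for a given seed. Any remainder left by rounding
    ends up in the test list.
    """
    items = list(items)
    random.Random(seed).shuffle(items)

    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Placeholder file names standing in for the annotated frames.
images = [f"frame_{i:04d}.jpg" for i in range(1000)]
train, val, test = split_dataset(images)
```

Shuffling before splitting matters for frames cut from video, because consecutive frames are nearly identical and would otherwise cluster in a single subset.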

Model Training
Model training uses the Darknet framework and a YOLOv4 pre-trained model, which can be used with a custom dataset and custom classes. The model is trained to detect every class properly using the annotated images in the dataset, which makes the training a form of supervised learning. A set of hyperparameters is therefore required to configure the training process and help the model learn parameters that fit the dataset. The hyperparameters used in this study are shown in Table 2.
Width and height define the resolution to which input images are downsampled. Max_batches is the maximum number of training iterations; its value is calculated as classes x 2000, with a minimum of 6000. Filters is the number of kernels used in a convolutional layer; its value is calculated as (number of classes + 5) x 3.
Finally, steps are the batch numbers at which the learning rate is adjusted. For example, if the steps are 500 and 1000, the learning rate is adjusted when training reaches batch 500 and batch 1000. The steps are set to 80% and 90% of max_batches [14].
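The rules for max_batches, filters, and steps can be worked through for the 12 vehicle classes of this study. The sketch below only applies the formulas as stated in the text; the function name is hypothetical, and the resulting values are not claimed to match Table 2.

```python
def yolo_hyperparameters(num_classes):
    """Derive Darknet training settings from the number of classes.

    Applies the rules quoted in the text:
      max_batches = classes x 2000, with a minimum of 6000
      filters     = (classes + 5) x 3
      steps       = 80% and 90% of max_batches
    """
    max_batches = max(num_classes * 2000, 6000)
    filters = (num_classes + 5) * 3
    # Integer arithmetic avoids floating-point rounding in the step values.
    steps = (max_batches * 8 // 10, max_batches * 9 // 10)
    return {"max_batches": max_batches, "filters": filters, "steps": steps}


settings = yolo_hyperparameters(12)
# settings == {"max_batches": 24000, "filters": 51, "steps": (19200, 21600)}
```

For a small number of classes the minimum applies: with two classes, classes x 2000 is only 4000, so max_batches becomes 6000.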
To find the accuracy of the counting result, the percentage error is subtracted from 100%, which gives the percentage of correctness. For the best results, this accuracy measure should be computed on a video that has been officially counted by the PUPR [15].
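The accuracy measure can be sketched as below. The text does not spell out the percentage error formula, so this assumes the common definition, the absolute difference between the application's count and the manual count divided by the manual count; the function name and the example counts are hypothetical.

```python
def counting_accuracy(manual_count, predicted_count):
    """Return counting accuracy in percent as 100 minus the percentage error.

    Assumed definition of percentage error:
      |predicted_count - manual_count| / manual_count x 100
    The manual count is treated as the ground truth.
    """
    if manual_count <= 0:
        raise ValueError("manual_count must be positive")
    error = abs(predicted_count - manual_count) * 100 / manual_count
    return 100 - error


# Hypothetical counts for one vehicle class.
accuracy = counting_accuracy(manual_count=100, predicted_count=80)  # 80.0
```

Under this definition, undercounting and overcounting by the same amount give the same accuracy, and a count more than twice the manual count yields a negative value.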

RESULTS AND DISCUSSION

Training Model
Model training is carried out using the Darknet framework. Training is not stopped while the mAP continues rising [14]. The result can be seen in Figure 2. During training, the mAP fluctuates between 50% and 70% but stagnates around 60%. Therefore, the training process was stopped before the mAP decreased further. As a result, the weights from the 4000th iteration are used for the detection model because they have the highest mAP (66.17%).

Determining Confidence Threshold
The confidence threshold is a parameter that sets how confident the model must be before a detection is accepted. The best threshold for the available model can be determined by using the Darknet feature that calculates mAP for a given threshold and comparing True Positives with False Positives in the result. The result is shown in Table 3. Dividing True Positives by False Positives, threshold 0.9 is chosen as the detection threshold because it has the highest ratio of True Positives to False Positives (7.189).
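The selection rule, picking the threshold with the highest ratio of True Positives to False Positives, can be sketched as follows. The counts in the example are invented purely for illustration and are not the values from Table 3; the function name is also hypothetical.

```python
def best_threshold(results):
    """Pick the threshold with the highest true-positive to false-positive ratio.

    results: dict mapping a confidence threshold to a
    (true_positives, false_positives) pair.
    A threshold with zero false positives is treated as an infinite ratio.
    """
    def ratio(item):
        true_positives, false_positives = item[1]
        if false_positives == 0:
            return float("inf")
        return true_positives / false_positives

    threshold, _ = max(results.items(), key=ratio)
    return threshold


# Invented counts for illustration only, NOT the values from Table 3.
example_results = {
    0.25: (900, 300),  # ratio 3.0
    0.50: (850, 200),  # ratio 4.25
    0.90: (700, 100),  # ratio 7.0
}
chosen = best_threshold(example_results)  # 0.9
```

This ratio rewards thresholds that suppress false positives, even though a higher threshold also discards some true positives, as the falling true-positive counts in the example show.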

Error Estimation Result for Counting Objects
After training the model and choosing the threshold, the next step is to measure the actual detection and counting results. For this, the YOLO weights are first converted from the Darknet framework for use in the TensorFlow framework. A one-hour video taken on the Cipali Highway between 17:00 and 18:00 was used for testing; it features both daytime and nighttime conditions. Table 4 displays the accuracy obtained by comparing the model's counting and classification results with a manual count of the vehicles in the video. Accuracy is calculated as 100% minus the estimation error [15].

CONCLUSION
This study proposed a non-real-time vehicle detection system using the YOLOv4 model and TensorFlow. YOLOv4 is used to localise and classify vehicles, and the DeepSORT algorithm is used to track objects. The dataset for training and testing is taken from local CCTV. Model training results indicate that the model reached an mAP of 66.17% on the testing dataset, and counting results indicate that it reached an average accuracy of 67.94%. Given the experiment's results, it is thought that this method can be applied to classify and count vehicles within a limited scope. However, this study has a few limitations. Since the model is based on an Indonesian vehicle dataset, there is a significant chance that it will be inaccurate when counting and classifying vehicles on a highway in another country.
Additionally, testing at night yields significantly lower accuracy due to headlight glare reaching the camera. Further study will expand the range and number of images in the dataset and address problems such as weather, lighting, and object blur that may affect the detection accuracy. A faster network connection and more computing capabilities could also improve the application's upload time and detection speed, respectively.
Lieharyani from Bandung State Polytechnic who provided insight and expertise that greatly assisted in verifying the research, although they may not agree with all of the interpretations/conclusions of this paper.