CYBER CRIME, THE CONFUSION MATRIX, AND ITS TWO TYPES OF ERRORS

Malay Kaushik
7 min read · Jun 6, 2021

Hello everyone, I am back with another blog, written for Task 5: create a blog about cyber crime cases that involve the confusion matrix or its two types of errors.

DEFINING CYBER CRIME

New technologies create new criminal opportunities but few new types of crime. What distinguishes cybercrime from traditional criminal activity? Obviously, one difference is the use of the digital computer, but technology alone is insufficient for any distinction that might exist between different realms of criminal activity. Criminals do not need a computer to commit fraud, traffic in child pornography and intellectual property, steal an identity, or violate someone’s privacy. All those activities existed before the “cyber” prefix became ubiquitous. Cybercrime, especially involving the Internet, represents an extension of existing criminal behaviour alongside some novel illegal activities.

Most cybercrime is an attack on information about individuals, corporations, or governments. Although the attacks do not take place on a physical body, they do take place on the personal or corporate virtual body, which is the set of informational attributes that define people and institutions on the Internet. In other words, in the digital age our virtual identities are essential elements of everyday life: we are a bundle of numbers and identifiers in multiple computer databases owned by governments and corporations. Cybercrime highlights the centrality of networked computers in our lives, as well as the fragility of such seemingly solid facts as individual identity.

Thus, detecting cyber-attacks in a network is essential, and this is where machine learning models for building an effective Intrusion Detection System (IDS) come into play. A binary classification model can be used to identify what is happening in the network, i.e., whether there is an attack or not.

Understanding the raw security data is the first step in building an intelligent security model that makes predictions about future incidents. The two categories are normal and anomaly. Taking the selected security features into account and performing all preprocessing steps, we train a model that can detect whether a test case is normal or an anomaly. One of the metrics used to evaluate such a model is the confusion matrix.
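The idea above can be sketched as a toy "normal vs. anomaly" classifier over network connection records. The feature names (`bytes_sent`, `failed_logins`) and the thresholds are illustrative assumptions, not a real IDS feature set; a real model would be learned from data rather than hand-written rules:

```python
# Minimal sketch of a binary "normal vs. anomaly" decision over network
# records. Feature names and thresholds are made-up assumptions for
# illustration only.

def classify_connection(record, byte_threshold=100_000, fail_threshold=3):
    """Flag a connection as 'anomaly' if it moves an unusually large
    number of bytes or shows repeated failed login attempts."""
    if (record["bytes_sent"] > byte_threshold
            or record["failed_logins"] >= fail_threshold):
        return "anomaly"
    return "normal"

connections = [
    {"bytes_sent": 1_200, "failed_logins": 0},    # typical traffic
    {"bytes_sent": 250_000, "failed_logins": 0},  # unusually large transfer
    {"bytes_sent": 900, "failed_logins": 5},      # brute-force pattern
]
predictions = [classify_connection(c) for c in connections]
print(predictions)  # ['normal', 'anomaly', 'anomaly']
```

Comparing such predictions against the true labels of a test set is exactly what the confusion matrix, discussed next, summarizes.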

CONFUSION MATRIX

So now let’s talk about the confusion matrix and its two types of errors.

In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa; both variants are found in the literature. The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e., commonly mislabeling one as another).

It is a special kind of contingency table, with two dimensions (“actual” and “predicted”), and identical sets of “classes” in both dimensions (each combination of dimension and class is a variable in the contingency table).

Understanding Confusion Matrix:

The following four basic terms will help us determine the metrics we are looking for.

  • True Positives (TP): the actual value is Positive and the prediction is also Positive.
  • True Negatives (TN): the actual value is Negative and the prediction is also Negative.
  • False Positives (FP): the actual value is Negative but the prediction is Positive. Also known as a Type 1 error.
  • False Negatives (FN): the actual value is Positive but the prediction is Negative. Also known as a Type 2 error.

For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

Confusion Matrix for the Binary Classification

  • The target variable has two values: Positive or Negative
  • The columns represent the actual values of the target variable
  • The rows represent the predicted values of the target variable.
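The four cells can be tallied directly from paired actual/predicted labels. A hedged sketch, treating "attack" as the positive class and using invented label lists:

```python
# Tally the four confusion-matrix cells for a binary problem, treating
# "attack" as the Positive class. The label lists below are made-up
# examples for illustration.

def confusion_counts(actual, predicted, positive="attack"):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted attack, really an attack
            else:
                fp += 1  # predicted attack, actually normal (Type 1 error)
        else:
            if a == positive:
                fn += 1  # predicted normal, actually an attack (Type 2 error)
            else:
                tn += 1  # predicted normal, really normal
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

actual    = ["attack", "normal", "attack", "normal", "normal"]
predicted = ["attack", "attack", "normal", "normal", "normal"]
print(confusion_counts(actual, predicted))
# {'TP': 1, 'FP': 1, 'TN': 2, 'FN': 1}
```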

Type 1 Error

A Type 1 error (False Positive) occurs when the system raises an alert even though no attack is actually happening. The team gets notified, checks for malicious activity, and usually finds none. This kind of error causes extra work but no real harm; it can be termed a False Alarm.

Type 2 Error

A Type 2 error (False Negative) is far more dangerous: an attack is actually happening, but the system predicts that everything is normal. Because no alert is raised, nobody investigates, and the attack goes unnoticed. This type of error can therefore prove very costly for us.

Classification measures extend the confusion matrix: several metrics derived from it can help us achieve a better understanding and analysis of the model and its performance.

a. Accuracy

b. Precision

c. Recall (TPR, Sensitivity)

d. F1-Score

e. FPR (Type I Error)

f. FNR (Type II Error)

a. Accuracy:

Accuracy simply measures how often the classifier makes the correct prediction. It is the ratio of the number of correct predictions to the total number of predictions. The accuracy metric is not well suited to imbalanced classes.

Accuracy has its own disadvantages: with imbalanced data, a model that predicts the majority class for every point will still score a high accuracy, yet it is not a useful model.
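This pitfall is easy to demonstrate numerically. Accuracy = (TP + TN) / (TP + TN + FP + FN); the patient counts below are illustrative:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN).
# The dataset split below (5 positives, 95 negatives) is an illustrative
# assumption showing why accuracy misleads on imbalanced data.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# 100 patients: 5 diabetic (Positive), 95 healthy (Negative).
# A "model" that always predicts healthy gets TP=0, TN=95, FP=0, FN=5.
acc = accuracy(tp=0, tn=95, fp=0, fn=5)
print(acc)  # 0.95 — looks great, yet every diabetic patient is missed
```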

b. Precision:

Precision is defined as the ratio of correctly classified positive predictions to the total number of predicted positive classes. In other words: of all the classes predicted as positive, how many did we predict correctly? Precision should be high (ideally 1).

Precision is a useful metric in cases where a False Positive is a higher concern than a False Negative.

Ex 1: In spam detection, we need to focus on precision. Suppose a mail is not spam but the model predicts it as spam: that is a False Positive (FP), and we always try to reduce FPs, because a legitimate mail lost to the spam folder hurts the user.

Ex 2:- Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.
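For the spam example, precision = TP / (TP + FP). The counts below are invented for illustration:

```python
# Precision = TP / (TP + FP): of everything the filter flagged as spam,
# how much really was spam. Counts are illustrative assumptions.

def precision(tp, fp):
    return tp / (tp + fp)

# A spam filter flags 50 mails: 45 are truly spam (TP) and 5 are
# legitimate mails wrongly flagged (FP) — the costly errors here.
print(precision(tp=45, fp=5))  # 0.9
```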

c. Recall:

Recall is defined as the ratio of correctly classified positive classes divided by the total number of actual positive classes. In other words: of all the actual positives, how many did we predict correctly? Recall should be high (ideally 1).

Recall is a useful metric in cases where a False Negative is a higher concern than a False Positive.

Ex 1: Does a person have cancer or not? The person is suffering from cancer, but the model predicts that they are not: that is a False Negative, and it is the error we most need to avoid.

Ex 2:- Recall is important in medical cases where it doesn’t matter whether we raise a false alarm but the actual positive cases should not go undetected!

Recall would be the better metric here because we don’t want to accidentally discharge an infected person and let them mix with the healthy population, spreading a contagious disease. Now you can see why accuracy alone would be a bad metric for such a model.
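For the medical example, recall = TP / (TP + FN). Again, the patient counts are illustrative:

```python
# Recall = TP / (TP + FN): of all patients who actually have the disease,
# how many the model caught. Counts are illustrative assumptions.

def recall(tp, fn):
    return tp / (tp + fn)

# 20 patients truly have cancer; the model detects 18 (TP) and misses 2 (FN).
print(recall(tp=18, fn=2))  # 0.9
```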

Trick to remember: Precision has Predicted results in the denominator.

d. F-measure / F1-Score

There will be cases where there is no clear distinction between whether Precision is more important or Recall. We combine them!

In practice, when we try to increase the precision of our model, the recall goes down and vice-versa. The F1-score captures both the trends in a single value.

The F1-score is the harmonic mean of precision and recall. Compared to the arithmetic mean, the harmonic mean punishes extreme values more. The F1-score should be high (ideally 1).

Is it necessary to check recall or precision if you already have a high accuracy?

We cannot rely on a single accuracy value in classification when the classes are imbalanced. For example, take a dataset of 100 patients in which 5 have diabetes and 95 are healthy. A model that simply predicts the majority class would label all 100 people as healthy and still achieve a classification accuracy of 95%, while missing every diabetic patient.

When to use Accuracy / Precision / Recall / F1-Score?

a. Accuracy is used when the True Positives and True Negatives are more important. Accuracy is a better metric for Balanced Data.

b. Whenever False Positive is much more important use Precision.

c. Whenever False Negative is much more important use Recall.

d. F1-Score is used when both False Negatives and False Positives are important. F1-Score is a better metric for imbalanced data.

Thank you for Reading

I have taken reference from many different blogs and other information available on the internet; I cannot mention them all, but thank you to everyone.

SUCCESSFULLY COMPLETED MY TASK-5 OF SUMMER INTERNSHIP.
