Repository: https://github.com/x0prc/UBMAD

User Behavior Modeling and Anomaly Detection

A comprehensive implementation of insider threat detection based on user behavior modeling and anomaly detection algorithms.


Overview

UBMAD implements insider threat detection using machine learning and statistical methods. The system models user behavior patterns and identifies anomalies that may indicate malicious insider activities.

Problem Statement

Traditional insider threat detection faces challenges:

  • Class Imbalance: Only a handful of malicious examples exist among thousands of normal activities
  • Rule-based Limitations: Expert-defined rules are rigid and miss novel attack patterns
  • Detection Difficulty: Insiders are already familiar with the system, so their activity blends in with legitimate use

Solution Approach

The framework uses one-class classification and user behavior modeling to detect anomalies without requiring extensive labeled malicious examples.
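
The idea can be sketched with scikit-learn's OneClassSVM: train only on normal activity, then flag points the model places outside the learned boundary. Synthetic data stands in here for real behavior features; the real pipeline derives them from CERT logs.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for preprocessed user behavior features
rng = np.random.default_rng(42)
normal_activity = rng.normal(loc=0.0, scale=1.0, size=(200, 4))

# Train on normal behavior only -- no labeled malicious examples needed
model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05)
model.fit(normal_activity)

# Points far outside the learned boundary are flagged as -1 (anomaly)
outlier = np.full((1, 4), 8.0)
print(model.predict(outlier))            # [-1]
print(model.predict(np.zeros((1, 4))))   # [1]
```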


Features

1. Data Preprocessing

  • Standardization of user behavior data
  • Feature extraction from raw logs
  • Email content preprocessing

2. User Behavior Modeling

  • K-means clustering for behavior grouping
  • Silhouette score evaluation
  • Weekly activity summaries
  • Email topic distribution

3. Anomaly Detection Algorithms

  • Isolation Forest: Tree-based anomaly detection
  • Local Outlier Factor (LOF): Density-based detection
  • One-Class SVM: Boundary-based detection
  • Gaussian Mixture Models: Density estimation
  • Kernel Density Estimation: Non-parametric density estimation
  • PCA-based Detection: Dimensionality reduction approach

4. Topic Modeling

  • Latent Dirichlet Allocation (LDA) for email content analysis
  • Topic distribution anomaly detection
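
As an illustration of topic-distribution anomaly detection (a sketch on a toy corpus, not the project's exact pipeline), one can fit LDA to a document-term matrix and flag emails whose topic mix deviates from the corpus average:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for preprocessed email bodies
emails = [
    "meeting schedule project deadline report",
    "project report meeting status update",
    "lunch schedule meeting room booking",
    "password credentials database export download",
]

# Build a document-term matrix, then fit LDA on it
dtm = CountVectorizer().fit_transform(emails)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
topic_dist = lda.fit_transform(dtm)  # each row sums to 1

# Flag emails whose topic mix is far (L1 distance) from the corpus mean
mean_dist = topic_dist.mean(axis=0)
deviation = np.abs(topic_dist - mean_dist).sum(axis=1)
print(deviation)
```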

Installation

Prerequisites

# Core dependencies
pandas
numpy
scikit-learn
gensim

Install Dependencies

pip install pandas numpy scikit-learn gensim

Import Libraries

# Data handling
import pandas as pd
import numpy as np
 
# Scikit-Learn modules
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.neighbors import LocalOutlierFactor, KernelDensity
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.svm import OneClassSVM
from sklearn.metrics import silhouette_score
 
# Topic Modeling
import gensim

Dataset

CERT Insider Threat Dataset

The project uses the CERT (Computer Emergency Response Team) Insider Threat Tools dataset, specifically release R6.2, the latest and largest version.

Dataset Characteristics

Attribute             Value
Total Users           4,000
Malicious Users       5
Data Types            logon, device, http, file, email
Organizational Info   Department, roles

Loading the Data

# Load CERT dataset R6.2
data = pd.read_csv('https://kilthub.cmu.edu/ndownloader/files/24844280/r6.2-1.csv')
 
# Preview the data
print(data.head())
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")

Architecture

System Flow

┌─────────────────────────────────────────────────────────────────┐
│                        Raw Log Data                             │
│  (logon, device, http, file, email)                             │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Data Preprocessing                          │
│  - Extract user-day instances                                   │
│  - Feature engineering                                          │
│  - Standardization                                              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  User Behavior Modeling                         │
│  - K-means clustering                                           │
│  - Topic modeling (LDA)                                         │
│  - Weekly email history                                         │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Anomaly Detection                              │
│  - Isolation Forest                                             │
│  - Local Outlier Factor                                         │
│  - One-Class SVM                                                │
│  - Gaussian Mixture Models                                      │
│  - Kernel Density Estimation                                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Threat Alerts                                │
│  - Anomaly scores                                               │
│  - Investigation recommendations                                │
└─────────────────────────────────────────────────────────────────┘

Usage

Complete Workflow Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import LatentDirichletAllocation
 
# 1. Load Data
data = pd.read_csv('https://kilthub.cmu.edu/ndownloader/files/24844280/r6.2-1.csv')
 
# 2. Data Preprocessing
def preprocess_user_logs(user_logs, emails):
    """
    Preprocess user behavior data.
    
    Args:
        user_logs: DataFrame with user activity logs
        emails: DataFrame with email data (may be None; unused here)
    
    Returns:
        scaled_data: Standardized feature matrix
    """
    df = pd.DataFrame(user_logs)
    user_behavior_data = df[["user", "email", "date", "personal_computer", "activity"]]
    
    # StandardScaler requires numeric input, so encode categorical
    # columns (user IDs, activity types, dates) as integer codes first
    encoded = user_behavior_data.apply(
        lambda col: col.astype("category").cat.codes
        if col.dtype == "object" else col
    )
    
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(encoded)
    
    return scaled_data
 
# 3. User Behavior Modeling
def model_user_behavior(scaled_data, n_clusters=3):
    """
    Cluster user behaviors using K-means.
    
    Args:
        scaled_data: Standardized user behavior data
        n_clusters: Number of behavior clusters
    
    Returns:
        user_clusters: Cluster labels for each user
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(scaled_data)
    user_clusters = kmeans.labels_
    
    # Evaluate clustering quality
    silhouette_avg = silhouette_score(scaled_data, user_clusters)
    print(f"Silhouette Score: {silhouette_avg:.4f}")
    
    return user_clusters
 
# 4. Anomaly Detection with Isolation Forest
def detect_anomalies(scaled_data):
    """
    Detect anomalies using Isolation Forest.
    
    Args:
        scaled_data: Standardized user behavior data
    
    Returns:
        anomaly_scores: Anomaly scores for each data point
    """
    isolation_forest = IsolationForest(random_state=42)
    isolation_forest.fit(scaled_data)
    anomaly_scores = isolation_forest.decision_function(scaled_data)
    
    return anomaly_scores
 
# 5. Topic Modeling for Email Content
def analyze_email_topics(emails, n_topics=3):
    """
    Analyze email content using LDA.
    
    Args:
        emails: Document-term matrix of email content (e.g., from CountVectorizer)
        n_topics: Number of topics to extract
    
    Returns:
        topic_distribution: Topic distribution for each email
    """
    lda_model = LatentDirichletAllocation(
        n_components=n_topics, 
        random_state=42
    )
    lda_model.fit(emails)
    topic_distribution = lda_model.transform(emails)
    
    return topic_distribution
 
# Execute the workflow
scaled_data = preprocess_user_logs(data, None)
user_clusters = model_user_behavior(scaled_data)
anomaly_scores = detect_anomalies(scaled_data)
 
# 6. Predict New Data
def predict_anomaly(model, scaler, new_data):
    """
    Predict anomalies for new user data.
    
    Args:
        model: Trained anomaly detection model
        scaler: Fitted StandardScaler
        new_data: New user behavior data
    
    Returns:
        prediction: -1 for anomaly, 1 for normal
    """
    scaled_new_data = scaler.transform(new_data)
    prediction = model.predict(scaled_new_data)
    
    if prediction[0] == -1:
        print("⚠️ Anomaly detected! Further investigation recommended.")
    else:
        print("✓ User behavior appears normal.")
    
    return prediction

API Reference

Data Preprocessing

preprocess_user_logs(user_logs, emails)

Standardizes user behavior data for modeling.

Parameter    Type         Description
user_logs    DataFrame    User activity logs
emails       DataFrame    Email data (optional)

Returns: Standardized numpy array


User Behavior Modeling

model_user_behavior(scaled_data, n_clusters=3)

Clusters user behaviors using K-means.

# Example
user_clusters = model_user_behavior(scaled_data, n_clusters=3)
 
# Access cluster assignments
print(f"Cluster distribution: {np.bincount(user_clusters)}")

evaluate_clustering(scaled_data, labels)

Evaluates clustering quality using Silhouette Score.

silhouette_avg = silhouette_score(scaled_data, user_clusters)
print(f"Silhouette Score: {silhouette_avg:.4f}")
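
The evaluate_clustering helper named above is not shown in full; a minimal implementation consistent with this snippet might look like the following (the synthetic data is for demonstration only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_clustering(scaled_data, labels):
    """Return the silhouette score (range [-1, 1], higher is better)."""
    return silhouette_score(scaled_data, labels)

# Demo on synthetic data with two well-separated groups
rng = np.random.default_rng(0)
scaled_data = np.vstack([
    rng.normal(0.0, 0.2, size=(50, 2)),
    rng.normal(5.0, 0.2, size=(50, 2)),
])
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(scaled_data)
print(f"Silhouette Score: {evaluate_clustering(scaled_data, labels):.4f}")
```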

Anomaly Detection

Isolation Forest

from sklearn.ensemble import IsolationForest
 
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,
    random_state=42
)
iso_forest.fit(scaled_data)
 
# Get anomaly scores (lower = more anomalous)
anomaly_scores = iso_forest.decision_function(scaled_data)
 
# Predict (-1 = anomaly, 1 = normal)
predictions = iso_forest.predict(scaled_data)

Local Outlier Factor

from sklearn.neighbors import LocalOutlierFactor
 
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
# fit_predict returns labels directly (-1 = anomaly, 1 = normal);
# LOF in default mode has no separate fit/predict steps
predictions = lof.fit_predict(scaled_data)
 
# Get negative outlier factor
negative_outlier_factor = lof.negative_outlier_factor_

One-Class SVM

from sklearn.svm import OneClassSVM
 
ocsvm = OneClassSVM(
    kernel='rbf',
    gamma='auto',
    nu=0.1
)
ocsvm.fit(scaled_data)
 
# Predict (-1 = anomaly, 1 = normal)
predictions = ocsvm.predict(scaled_data)

Gaussian Mixture Model

from sklearn.mixture import GaussianMixture
 
gmm = GaussianMixture(
    n_components=1,
    covariance_type='full',
    random_state=42
)
gmm.fit(scaled_data)
 
# Get log-likelihood (lower = more anomalous)
densities = gmm.score_samples(scaled_data)

Kernel Density Estimation

from sklearn.neighbors import KernelDensity
 
kde = KernelDensity(
    bandwidth=1.0,
    kernel='gaussian'
)
kde.fit(scaled_data)
 
# Get density estimates
densities = np.exp(kde.score_samples(scaled_data))

PCA-based Detection

from sklearn.decomposition import PCA
 
pca = PCA(n_components=0.95)
transformed_data = pca.fit_transform(scaled_data)
 
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
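
The snippet above only reduces dimensionality; one common way to turn PCA into a detector (a sketch, not necessarily this project's exact method) is to score each point by its reconstruction error:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: points near a line in 2-D plus one off-line outlier
rng = np.random.default_rng(42)
t = rng.normal(size=(100, 1))
scaled_data = np.hstack([t, 2 * t]) + rng.normal(scale=0.05, size=(100, 2))
scaled_data = np.vstack([scaled_data, [[3.0, -3.0]]])  # index 100: outlier

# Project onto the principal subspace, then map back
pca = PCA(n_components=1)
reconstructed = pca.inverse_transform(pca.fit_transform(scaled_data))

# Points the subspace cannot explain get large reconstruction error
errors = np.linalg.norm(scaled_data - reconstructed, axis=1)
print(f"Most anomalous index: {np.argmax(errors)}")  # prints 100
```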

Anomaly Detection Algorithms

Comparison Table

Algorithm               Type            Best For                Time Complexity
Isolation Forest        Tree-based      Large datasets          O(n log n)
Local Outlier Factor    Density-based   Clustered data          O(n²)
One-Class SVM           Boundary        High-dimensional        O(n²) to O(n³)
GMM                     Density         Gaussian data           O(n × k × iter)
KDE                     Density         Complex distributions   O(n²)
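
To compare these detectors on equal footing, a small harness can fit several of them on the same data and count how many points each flags (synthetic data; the parameter choices are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Shared synthetic dataset: 200 normal points plus 10 obvious outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 3)),
               rng.normal(8, 1, size=(10, 3))])

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "One-Class SVM": OneClassSVM(kernel="rbf", gamma="auto", nu=0.05),
}

flags = {}
for name, detector in detectors.items():
    labels = detector.fit_predict(X)   # -1 = anomaly, 1 = normal
    flags[name] = int((labels == -1).sum())
    print(f"{name}: {flags[name]} points flagged")
```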

Choosing the Right Algorithm

# Algorithm selection guide
def select_algorithm(data_size, data_dimensions, known_distribution=None):
    """
    Select appropriate anomaly detection algorithm.
    """
    if data_size > 100000:
        return "Isolation Forest"  # Scales well
    elif data_dimensions > 50:
        return "One-Class SVM"  # Works in high dimensions
    elif known_distribution == "gaussian":
        return "Gaussian Mixture Model"  # Assumes Gaussian
    else:
        return "Kernel Density Estimation"  # Non-parametric

Example: Complete Insider Threat Detection Pipeline

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
 
class InsiderThreatDetector:
    """
    Complete insider threat detection system.
    """
    
    def __init__(self, n_clusters=3, contamination=0.1):
        self.scaler = StandardScaler()
        self.kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        self.iso_forest = IsolationForest(
            contamination=contamination, 
            random_state=42
        )
        self.is_fitted = False
    
    def fit(self, user_logs):
        """Train the detection model."""
        # Preprocess
        scaled_data = self.scaler.fit_transform(user_logs)
        
        # Model user behavior
        self.kmeans.fit(scaled_data)
        
        # Train anomaly detector
        self.iso_forest.fit(scaled_data)
        self.is_fitted = True
        
        return self
    
    def predict(self, user_logs):
        """Predict insider threats."""
        if not self.is_fitted:
            raise ValueError("Model not fitted. Call fit() first.")
        
        scaled_data = self.scaler.transform(user_logs)
        predictions = self.iso_forest.predict(scaled_data)
        anomaly_scores = self.iso_forest.decision_function(scaled_data)
        
        return predictions, anomaly_scores
    
    def get_threat_level(self, anomaly_score):
        """Convert anomaly score to threat level."""
        if anomaly_score < -0.5:
            return "HIGH"
        elif anomaly_score < 0:
            return "MEDIUM"
        else:
            return "LOW"
 
# Usage (user_logs and new_user_logs stand for numeric feature
# matrices produced by the preprocessing step)
detector = InsiderThreatDetector(n_clusters=3, contamination=0.05)
detector.fit(user_logs)
 
predictions, scores = detector.predict(new_user_logs)
 
for i, (pred, score) in enumerate(zip(predictions, scores)):
    threat_level = detector.get_threat_level(score)
    status = "THREAT" if pred == -1 else "NORMAL"
    print(f"User {i}: {status} (Score: {score:.4f}, Level: {threat_level})")

Performance Metrics

Evaluating Detection Performance

from sklearn.metrics import (
    precision_score, 
    recall_score, 
    f1_score, 
    confusion_matrix,
    roc_auc_score
)
 
# Assuming you have ground truth labels
y_true = [0, 1, 0, 0, 1, 0, 0, 1]  # 1 = insider threat
y_pred = [0, 1, 0, 1, 1, 0, 0, 0]  # Model predictions
 
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
 
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
 
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{cm}")
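
roc_auc_score is imported above but not used; since the detectors produce continuous anomaly scores, AUC is often more informative than thresholded labels. A sketch on synthetic data (note that decision_function must be negated so that higher scores mean "more anomalous", as roc_auc_score expects):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Synthetic ground truth: 95 normal users and 5 far-away "insiders"
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(95, 4)),
               rng.normal(6, 1, size=(5, 4))])
y_true = np.array([0] * 95 + [1] * 5)  # 1 = insider threat

iso = IsolationForest(random_state=42).fit(X)

# decision_function: lower = more anomalous, so negate it
threat_scores = -iso.decision_function(X)
print(f"ROC AUC: {roc_auc_score(y_true, threat_scores):.4f}")
```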

References

  1. Kim, J., Park, M., Kim, H., Cho, S., & Kang, P. (2019). Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms. Applied Sciences, 9(19), 4018. https://doi.org/10.3390/app9194018

  2. CERT Insider Threat Tools Dataset. Carnegie Mellon University. https://kilthub.cmu.edu/collections/Insider_Threat_Test_Dataset