Repository: https://github.com/x0prc/UBMAD

User Behavior Modeling and Anomaly Detection

A comprehensive implementation of insider threat detection based on user behavior modeling and anomaly detection algorithms.


Overview

UBMAD implements insider threat detection using machine learning and statistical methods. The system models user behavior patterns and identifies anomalies that may indicate malicious insider activities.

Problem Statement

Traditional insider threat detection faces challenges:

  • Class Imbalance: Only a handful of malicious examples exist among thousands of normal activities
  • Rule-based Limitations: Expert-defined rules are rigid and miss novel attack patterns
  • Detection Difficulty: Insiders are already familiar with the system, so their activity blends in with legitimate use

Solution Approach

The framework uses one-class classification and user behavior modeling to detect anomalies without requiring extensive labeled malicious examples.
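
The idea can be sketched with scikit-learn's OneClassSVM: train only on normal activity, then flag points the model places outside the learned boundary. Synthetic data stands in here for real behavior features; the real pipeline derives them from CERT logs.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for preprocessed user behavior features
rng = np.random.default_rng(42)
normal_activity = rng.normal(loc=0.0, scale=1.0, size=(200, 4))

# Train on normal behavior only -- no labeled malicious examples needed
model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05)
model.fit(normal_activity)

# Points far outside the learned boundary are flagged as -1 (anomaly)
outlier = np.full((1, 4), 8.0)
print(model.predict(outlier))            # [-1]
print(model.predict(np.zeros((1, 4))))   # [1]
```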


Features

1. Data Preprocessing

  • Standardization of user behavior data
  • Feature extraction from raw logs
  • Email content preprocessing

2. User Behavior Modeling

  • K-means clustering for behavior grouping
  • Silhouette score evaluation
  • Weekly activity summaries
  • Email topic distribution

3. Anomaly Detection Algorithms

  • Isolation Forest: Tree-based anomaly detection
  • Local Outlier Factor (LOF): Density-based detection
  • One-Class SVM: Boundary-based detection
  • Gaussian Mixture Models: Density estimation
  • Kernel Density Estimation: Non-parametric density estimation
  • PCA-based Detection: Dimensionality reduction approach

4. Topic Modeling

  • Latent Dirichlet Allocation (LDA) for email content analysis
  • Topic distribution anomaly detection
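
As an illustration of topic-distribution anomaly detection (a sketch on a toy corpus, not the project's exact pipeline), one can fit LDA to a document-term matrix and flag emails whose topic mix deviates from the corpus average:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for preprocessed email bodies
emails = [
    "meeting schedule project deadline report",
    "project report meeting status update",
    "lunch schedule meeting room booking",
    "password credentials database export download",
]

# Build a document-term matrix, then fit LDA on it
dtm = CountVectorizer().fit_transform(emails)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
topic_dist = lda.fit_transform(dtm)  # each row sums to 1

# Flag emails whose topic mix is far (L1 distance) from the corpus mean
mean_dist = topic_dist.mean(axis=0)
deviation = np.abs(topic_dist - mean_dist).sum(axis=1)
print(deviation)
```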

Installation

Prerequisites

# Core dependencies
pandas
numpy
scikit-learn
gensim

Install Dependencies

pip install pandas numpy scikit-learn gensim

Import Libraries

# Data handling
import pandas as pd
import numpy as np
 
# Scikit-Learn modules
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.neighbors import LocalOutlierFactor, KernelDensity
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.svm import OneClassSVM
from sklearn.metrics import silhouette_score
 
# Topic Modeling
import gensim

Dataset

CERT Insider Threat Dataset

The project uses the CERT (Computer Emergency Response Team) Insider Threat Tools dataset, specifically release R6.2, the latest and largest version.

Dataset Characteristics

Attribute             Value
Total Users           4,000
Malicious Users       5
Data Types            logon, device, http, file, email
Organizational Info   Department, roles

Loading the Data

# Load CERT dataset R6.2
data = pd.read_csv('https://kilthub.cmu.edu/ndownloader/files/24844280/r6.2-1.csv')
 
# Preview the data
print(data.head())
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")

Architecture

System Flow

┌─────────────────────────────────────────────────────────────────┐
│                        Raw Log Data                             │
│  (logon, device, http, file, email)                             │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Data Preprocessing                          │
│  - Extract user-day instances                                   │
│  - Feature engineering                                          │
│  - Standardization                                              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  User Behavior Modeling                         │
│  - K-means clustering                                           │
│  - Topic modeling (LDA)                                         │
│  - Weekly email history                                         │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Anomaly Detection                              │
│  - Isolation Forest                                             │
│  - Local Outlier Factor                                         │
│  - One-Class SVM                                                │
│  - Gaussian Mixture Models                                      │
│  - Kernel Density Estimation                                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Threat Alerts                                │
│  - Anomaly scores                                               │
│  - Investigation recommendations                                │
└─────────────────────────────────────────────────────────────────┘

Usage

Complete Workflow Example

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import LatentDirichletAllocation
 
# 1. Load Data
data = pd.read_csv('https://kilthub.cmu.edu/ndownloader/files/24844280/r6.2-1.csv')
 
# 2. Data Preprocessing
def preprocess_user_logs(user_logs, emails):
    """
    Preprocess user behavior data.
    
    Args:
        user_logs: DataFrame with user activity logs
        emails: DataFrame with email data (may be None; unused here)
    
    Returns:
        scaled_data: Standardized feature matrix
    """
    df = pd.DataFrame(user_logs)
    user_behavior_data = df[["user", "email", "date", "personal_computer", "activity"]]
    
    # StandardScaler requires numeric input, so encode categorical
    # columns (user IDs, activity types, dates) as integer codes first
    encoded = user_behavior_data.apply(
        lambda col: col.astype("category").cat.codes
        if col.dtype == "object" else col
    )
    
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(encoded)
    
    return scaled_data
 
# 3. User Behavior Modeling
def model_user_behavior(scaled_data, n_clusters=3):
    """
    Cluster user behaviors using K-means.
    
    Args:
        scaled_data: Standardized user behavior data
        n_clusters: Number of behavior clusters
    
    Returns:
        user_clusters: Cluster labels for each user
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(scaled_data)
    user_clusters = kmeans.labels_
    
    # Evaluate clustering quality
    silhouette_avg = silhouette_score(scaled_data, user_clusters)
    print(f"Silhouette Score: {silhouette_avg:.4f}")
    
    return user_clusters
 
# 4. Anomaly Detection with Isolation Forest
def detect_anomalies(scaled_data):
    """
    Detect anomalies using Isolation Forest.
    
    Args:
        scaled_data: Standardized user behavior data
    
    Returns:
        anomaly_scores: Anomaly scores for each data point
    """
    isolation_forest = IsolationForest(random_state=42)
    isolation_forest.fit(scaled_data)
    anomaly_scores = isolation_forest.decision_function(scaled_data)
    
    return anomaly_scores
 
# 5. Topic Modeling for Email Content
def analyze_email_topics(emails, n_topics=3):
    """
    Analyze email content using LDA.
    
    Args:
        emails: Document-term matrix of email content (e.g., from CountVectorizer)
        n_topics: Number of topics to extract
    
    Returns:
        topic_distribution: Topic distribution for each email
    """
    lda_model = LatentDirichletAllocation(
        n_components=n_topics, 
        random_state=42
    )
    lda_model.fit(emails)
    topic_distribution = lda_model.transform(emails)
    
    return topic_distribution
 
# Execute the workflow
scaled_data = preprocess_user_logs(data, None)
user_clusters = model_user_behavior(scaled_data)
anomaly_scores = detect_anomalies(scaled_data)
 
# 6. Predict New Data
def predict_anomaly(model, scaler, new_data):
    """
    Predict anomalies for new user data.
    
    Args:
        model: Trained anomaly detection model
        scaler: Fitted StandardScaler
        new_data: New user behavior data
    
    Returns:
        prediction: -1 for anomaly, 1 for normal
    """
    scaled_new_data = scaler.transform(new_data)
    prediction = model.predict(scaled_new_data)
    
    if prediction[0] == -1:
        print("⚠️ Anomaly detected! Further investigation recommended.")
    else:
        print("✓ User behavior appears normal.")
    
    return prediction

API Reference

Data Preprocessing

preprocess_user_logs(user_logs, emails)

Standardizes user behavior data for modeling.

Parameter    Type         Description
user_logs    DataFrame    User activity logs
emails       DataFrame    Email data (optional)

Returns: Standardized numpy array


User Behavior Modeling

model_user_behavior(scaled_data, n_clusters=3)

Clusters user behaviors using K-means.

# Example
user_clusters = model_user_behavior(scaled_data, n_clusters=3)
 
# Access cluster assignments
print(f"Cluster distribution: {np.bincount(user_clusters)}")

evaluate_clustering(scaled_data, labels)

Evaluates clustering quality using Silhouette Score.

silhouette_avg = silhouette_score(scaled_data, user_clusters)
print(f"Silhouette Score: {silhouette_avg:.4f}")
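
The evaluate_clustering helper named above is not shown in full; a minimal implementation consistent with this snippet might look like the following (the synthetic data is for demonstration only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_clustering(scaled_data, labels):
    """Return the silhouette score (range [-1, 1], higher is better)."""
    return silhouette_score(scaled_data, labels)

# Demo on synthetic data with two well-separated groups
rng = np.random.default_rng(0)
scaled_data = np.vstack([
    rng.normal(0.0, 0.2, size=(50, 2)),
    rng.normal(5.0, 0.2, size=(50, 2)),
])
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(scaled_data)
print(f"Silhouette Score: {evaluate_clustering(scaled_data, labels):.4f}")
```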

Anomaly Detection

Isolation Forest

from sklearn.ensemble import IsolationForest
 
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,
    random_state=42
)
iso_forest.fit(scaled_data)
 
# Get anomaly scores (lower = more anomalous)
anomaly_scores = iso_forest.decision_function(scaled_data)
 
# Predict (-1 = anomaly, 1 = normal)
predictions = iso_forest.predict(scaled_data)

Local Outlier Factor

from sklearn.neighbors import LocalOutlierFactor
 
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
# fit_predict returns labels directly (-1 = anomaly, 1 = normal);
# LOF in default mode has no separate fit/predict steps
predictions = lof.fit_predict(scaled_data)
 
# Get negative outlier factor
negative_outlier_factor = lof.negative_outlier_factor_

One-Class SVM

from sklearn.svm import OneClassSVM
 
ocsvm = OneClassSVM(
    kernel='rbf',
    gamma='auto',
    nu=0.1
)
ocsvm.fit(scaled_data)
 
# Predict (-1 = anomaly, 1 = normal)
predictions = ocsvm.predict(scaled_data)

Gaussian Mixture Model

from sklearn.mixture import GaussianMixture
 
gmm = GaussianMixture(
    n_components=1,
    covariance_type='full',
    random_state=42
)
gmm.fit(scaled_data)
 
# Get log-likelihood (lower = more anomalous)
densities = gmm.score_samples(scaled_data)

Kernel Density Estimation

from sklearn.neighbors import KernelDensity
 
kde = KernelDensity(
    bandwidth=1.0,
    kernel='gaussian'
)
kde.fit(scaled_data)
 
# Get density estimates
densities = np.exp(kde.score_samples(scaled_data))

PCA-based Detection

from sklearn.decomposition import PCA
 
pca = PCA(n_components=0.95)
transformed_data = pca.fit_transform(scaled_data)
 
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
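
The snippet above only reduces dimensionality; one common way to turn PCA into a detector (a sketch, not necessarily this project's exact method) is to score each point by its reconstruction error:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: points near a line in 2-D plus one off-line outlier
rng = np.random.default_rng(42)
t = rng.normal(size=(100, 1))
scaled_data = np.hstack([t, 2 * t]) + rng.normal(scale=0.05, size=(100, 2))
scaled_data = np.vstack([scaled_data, [[3.0, -3.0]]])  # index 100: outlier

# Project onto the principal subspace, then map back
pca = PCA(n_components=1)
reconstructed = pca.inverse_transform(pca.fit_transform(scaled_data))

# Points the subspace cannot explain get large reconstruction error
errors = np.linalg.norm(scaled_data - reconstructed, axis=1)
print(f"Most anomalous index: {np.argmax(errors)}")  # prints 100
```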

Anomaly Detection Algorithms

Comparison Table

Algorithm               Type            Best For                Time Complexity
Isolation Forest        Tree-based      Large datasets          O(n log n)
Local Outlier Factor    Density-based   Clustered data          O(n²)
One-Class SVM           Boundary        High-dimensional        O(n²) to O(n³)
GMM                     Density         Gaussian data           O(n × k × iter)
KDE                     Density         Complex distributions   O(n²)
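
To compare these detectors on equal footing, a small harness can fit several of them on the same data and count how many points each flags (synthetic data; the parameter choices are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Shared synthetic dataset: 200 normal points plus 10 obvious outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 3)),
               rng.normal(8, 1, size=(10, 3))])

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=42),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "One-Class SVM": OneClassSVM(kernel="rbf", gamma="auto", nu=0.05),
}

flags = {}
for name, detector in detectors.items():
    labels = detector.fit_predict(X)   # -1 = anomaly, 1 = normal
    flags[name] = int((labels == -1).sum())
    print(f"{name}: {flags[name]} points flagged")
```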

Choosing the Right Algorithm

# Algorithm selection guide
def select_algorithm(data_size, data_dimensions, known_distribution=None):
    """
    Select appropriate anomaly detection algorithm.
    """
    if data_size > 100000:
        return "Isolation Forest"  # Scales well
    elif data_dimensions > 50:
        return "One-Class SVM"  # Works in high dimensions
    elif known_distribution == "gaussian":
        return "Gaussian Mixture Model"  # Assumes Gaussian
    else:
        return "Kernel Density Estimation"  # Non-parametric

Example: Complete Insider Threat Detection Pipeline

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
 
class InsiderThreatDetector:
    """
    Complete insider threat detection system.
    """
    
    def __init__(self, n_clusters=3, contamination=0.1):
        self.scaler = StandardScaler()
        self.kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        self.iso_forest = IsolationForest(
            contamination=contamination, 
            random_state=42
        )
        self.is_fitted = False
    
    def fit(self, user_logs):
        """Train the detection model."""
        # Preprocess
        scaled_data = self.scaler.fit_transform(user_logs)
        
        # Model user behavior
        self.kmeans.fit(scaled_data)
        
        # Train anomaly detector
        self.iso_forest.fit(scaled_data)
        self.is_fitted = True
        
        return self
    
    def predict(self, user_logs):
        """Predict insider threats."""
        if not self.is_fitted:
            raise ValueError("Model not fitted. Call fit() first.")
        
        scaled_data = self.scaler.transform(user_logs)
        predictions = self.iso_forest.predict(scaled_data)
        anomaly_scores = self.iso_forest.decision_function(scaled_data)
        
        return predictions, anomaly_scores
    
    def get_threat_level(self, anomaly_score):
        """Convert anomaly score to threat level."""
        if anomaly_score < -0.5:
            return "HIGH"
        elif anomaly_score < 0:
            return "MEDIUM"
        else:
            return "LOW"
 
# Usage (user_logs and new_user_logs stand for numeric feature
# matrices produced by the preprocessing step)
detector = InsiderThreatDetector(n_clusters=3, contamination=0.05)
detector.fit(user_logs)
 
predictions, scores = detector.predict(new_user_logs)
 
for i, (pred, score) in enumerate(zip(predictions, scores)):
    threat_level = detector.get_threat_level(score)
    status = "THREAT" if pred == -1 else "NORMAL"
    print(f"User {i}: {status} (Score: {score:.4f}, Level: {threat_level})")

Performance Metrics

Evaluating Detection Performance

from sklearn.metrics import (
    precision_score, 
    recall_score, 
    f1_score, 
    confusion_matrix,
    roc_auc_score
)
 
# Assuming you have ground truth labels
y_true = [0, 1, 0, 0, 1, 0, 0, 1]  # 1 = insider threat
y_pred = [0, 1, 0, 1, 1, 0, 0, 0]  # Model predictions
 
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
 
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
 
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{cm}")
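
roc_auc_score is imported above but not used; since the detectors produce continuous anomaly scores, AUC is often more informative than thresholded labels. A sketch on synthetic data (note that decision_function must be negated so that higher scores mean "more anomalous", as roc_auc_score expects):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Synthetic ground truth: 95 normal users and 5 far-away "insiders"
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(95, 4)),
               rng.normal(6, 1, size=(5, 4))])
y_true = np.array([0] * 95 + [1] * 5)  # 1 = insider threat

iso = IsolationForest(random_state=42).fit(X)

# decision_function: lower = more anomalous, so negate it
threat_scores = -iso.decision_function(X)
print(f"ROC AUC: {roc_auc_score(y_true, threat_scores):.4f}")
```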

References

  1. Kim, J., Park, M., Kim, H., Cho, S., & Kang, P. (2019). Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms. Applied Sciences, 9(19), 4018. https://doi.org/10.3390/app9194018

  2. CERT Insider Threat Tools Dataset. Carnegie Mellon University. https://kilthub.cmu.edu/collections/Insider_Threat_Test_Dataset