Repository: https://github.com/x0prc/UBMAD
User Behavior Modeling and Anomaly Detection
A comprehensive implementation for insider threat detection based on user behavior modeling and anomaly detection algorithms.
Overview
UBMAD implements insider threat detection using machine learning and statistical methods. The system models user behavior patterns and identifies anomalies that may indicate malicious insider activities.
Problem Statement
Traditional insider threat detection faces challenges:
- Class Imbalance: Only a few abnormal examples exist among thousands of normal activities
- Rule-based Limitations: Expert-defined rules lack flexibility
- Detection Difficulty: Insiders are familiar with the system, making detection challenging
Solution Approach
The framework uses one-class classification and user behavior modeling to detect anomalies without requiring extensive labeled malicious examples.
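The one-class idea can be sketched in a few lines: train only on normal behavior, then score unseen points. This is an illustrative toy on synthetic data (not the repository's pipeline), using scikit-learn's `OneClassSVM`:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
normal = rng.normal(0, 1, size=(500, 2))   # "normal" user-day feature vectors
outliers = rng.normal(6, 1, size=(5, 2))   # a handful of anomalous days

# Train on normal data only -- no labeled malicious examples needed
ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(normal)

print(ocsvm.predict(outliers))   # -1 = anomaly, 1 = normal
```

The `nu` parameter bounds the fraction of training points treated as outliers, which is how the model tolerates noise in the "normal" data without any malicious labels.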
Features
1. Data Preprocessing
- Standardization of user behavior data
- Feature extraction from raw logs
- Email content preprocessing
2. User Behavior Modeling
- K-means clustering for behavior grouping
- Silhouette score evaluation
- Weekly activity summaries
- Email topic distribution
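Weekly activity summaries can be built with a pandas groupby: one row per user-week, one column per activity type. A minimal sketch on a toy log (the column names are illustrative; the CERT logs carry similar fields):

```python
import pandas as pd

# Toy activity log
logs = pd.DataFrame({
    "user": ["A", "A", "B", "A", "B"],
    "date": pd.to_datetime(
        ["2010-01-04", "2010-01-05", "2010-01-04", "2010-01-12", "2010-01-13"]),
    "activity": ["logon", "file", "logon", "http", "device"],
})

# One row per user-week, one column per activity type
weekly = (
    logs.groupby(["user", pd.Grouper(key="date", freq="W")])["activity"]
        .value_counts()
        .unstack(fill_value=0)
)
print(weekly)
```

Each row of `weekly` is a feature vector that can be standardized and fed to the clustering and anomaly-detection steps below.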
3. Anomaly Detection Algorithms
- Isolation Forest: Tree-based anomaly detection
- Local Outlier Factor (LOF): Density-based detection
- One-Class SVM: Boundary-based detection
- Gaussian Mixture Models: Density estimation
- Kernel Density Estimation: Non-parametric density estimation
- PCA-based Detection: Dimensionality reduction approach
4. Topic Modeling
- Latent Dirichlet Allocation (LDA) for email content analysis
- Topic distribution anomaly detection
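One simple way to score topic-distribution drift is to compare a user's current topic mix against their historical baseline with a divergence measure. This sketch uses Jensen-Shannon divergence; it is illustrative, not necessarily the repository's method:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

baseline = np.array([0.7, 0.2, 0.1])   # user's usual topic mix
this_week = np.array([0.1, 0.1, 0.8])  # sudden shift toward one topic

score = js_divergence(baseline, this_week)
print(round(score, 3))
```

A score near 0 means the week looks like the baseline; values approaching 1 indicate a large shift worth flagging.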
Installation
Prerequisites
# Core dependencies
pandas
numpy
scikit-learn
gensim
Install Dependencies
pip install pandas numpy scikit-learn gensim
Import Libraries
# Data handling
import pandas as pd
import numpy as np
# Scikit-Learn modules
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.neighbors import LocalOutlierFactor, KernelDensity
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.svm import OneClassSVM
from sklearn.metrics import silhouette_score
# Topic Modeling
import gensim
Dataset
CERT Insider Threat Dataset
The project uses the CERT (Computer Emergency Response Team) Insider Threat Tools dataset, specifically R6.2 - the latest and largest version.
Dataset Characteristics
| Attribute | Value |
|---|---|
| Total Users | 4,000 |
| Malicious Users | 5 |
| Data Types | logon, device, http, file, email |
| Organizational Info | Department, roles |
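The class imbalance in this table is worth quantifying before setting `contamination` priors for the detectors used later: the dataset's base rate is far below the 0.1 that appears in several examples, so that value may need tuning.

```python
# Base anomaly rate implied by the dataset: 5 malicious users out of 4,000
contamination = 5 / 4000
print(f"{contamination:.4%}")
```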
Loading the Data
# Load CERT dataset R6.2
data = pd.read_csv('https://kilthub.cmu.edu/ndownloader/files/24844280/r6.2-1.csv')
# Preview the data
print(data.head())
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
Architecture
System Flow
┌─────────────────────────────────────────────────────────────────┐
│ Raw Log Data │
│ (logon, device, http, file, email) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Preprocessing │
│ - Extract user-day instances │
│ - Feature engineering │
│ - Standardization │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ User Behavior Modeling │
│ - K-means clustering │
│ - Topic modeling (LDA) │
│ - Weekly email history │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Anomaly Detection │
│ - Isolation Forest │
│ - Local Outlier Factor │
│ - One-Class SVM │
│ - Gaussian Mixture Models │
│ - Kernel Density Estimation │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Threat Alerts │
│ - Anomaly scores │
│ - Investigation recommendations │
└─────────────────────────────────────────────────────────────────┘
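The final "Threat Alerts" stage amounts to ranking user-days by anomaly score and surfacing the worst few for investigation. A minimal sketch on synthetic data (the injected outliers are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 normal user-days plus 3 injected anomalies
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(5, 1, (3, 4))])

iso = IsolationForest(random_state=42).fit(X)
scores = iso.decision_function(X)   # lower = more anomalous

# Surface the k lowest-scoring user-days as alerts
k = 3
worst = np.argsort(scores)[:k]
print(worst)
```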
Usage
Complete Workflow Example
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import LatentDirichletAllocation
# 1. Load Data
data = pd.read_csv('https://kilthub.cmu.edu/ndownloader/files/24844280/r6.2-1.csv')
# 2. Data Preprocessing
def preprocess_user_logs(user_logs, emails):
    """
    Preprocess user behavior data.
    Args:
        user_logs: DataFrame with user activity logs
        emails: DataFrame with email data (optional)
    Returns:
        scaled_data: Standardized feature matrix
    """
    df = pd.DataFrame(user_logs)
    user_behavior_data = df[["user", "email", "date", "personal_computer", "activity"]]
    # StandardScaler requires numeric input, so encode categorical columns first
    user_behavior_data = pd.get_dummies(user_behavior_data)
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(user_behavior_data)
    return scaled_data
# 3. User Behavior Modeling
def model_user_behavior(scaled_data, n_clusters=3):
    """
    Cluster user behaviors using K-means.
    Args:
        scaled_data: Standardized user behavior data
        n_clusters: Number of behavior clusters
    Returns:
        user_clusters: Cluster labels for each user
    """
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(scaled_data)
    user_clusters = kmeans.labels_
    # Evaluate clustering quality
    silhouette_avg = silhouette_score(scaled_data, user_clusters)
    print(f"Silhouette Score: {silhouette_avg:.4f}")
    return user_clusters
# 4. Anomaly Detection with Isolation Forest
def detect_anomalies(scaled_data):
    """
    Detect anomalies using Isolation Forest.
    Args:
        scaled_data: Standardized user behavior data
    Returns:
        anomaly_scores: Anomaly scores for each data point
    """
    isolation_forest = IsolationForest(random_state=42)
    isolation_forest.fit(scaled_data)
    anomaly_scores = isolation_forest.decision_function(scaled_data)
    return anomaly_scores
# 5. Topic Modeling for Email Content
def analyze_email_topics(emails, n_topics=3):
    """
    Analyze email content using LDA.
    Args:
        emails: Preprocessed email data
        n_topics: Number of topics to extract
    Returns:
        topic_distribution: Topic distribution for each email
    """
    lda_model = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42
    )
    lda_model.fit(emails)
    topic_distribution = lda_model.transform(emails)
    return topic_distribution
# Execute the workflow
scaled_data = preprocess_user_logs(data, None)
user_clusters = model_user_behavior(scaled_data)
anomaly_scores = detect_anomalies(scaled_data)
# 6. Predict New Data
def predict_anomaly(model, scaler, new_data):
    """
    Predict anomalies for new user data.
    Args:
        model: Trained anomaly detection model
        scaler: Fitted StandardScaler
        new_data: New user behavior data
    Returns:
        prediction: -1 for anomaly, 1 for normal
    """
    scaled_new_data = scaler.transform(new_data)
    prediction = model.predict(scaled_new_data)
    if prediction[0] == -1:
        print("⚠️ Anomaly detected! Further investigation recommended.")
    else:
        print("✓ User behavior appears normal.")
    return prediction
API Reference
Data Preprocessing
preprocess_user_logs(user_logs, emails)
Standardizes user behavior data for modeling.
| Parameter | Type | Description |
|---|---|---|
| user_logs | DataFrame | User activity logs |
| emails | DataFrame | Email data (optional) |
Returns: Standardized numpy array
User Behavior Modeling
model_user_behavior(scaled_data, n_clusters=3)
Clusters user behaviors using K-means.
# Example
user_clusters = model_user_behavior(scaled_data, n_clusters=3)
# Access cluster assignments
print(f"Cluster distribution: {np.bincount(user_clusters)}")
evaluate_clustering(scaled_data, labels)
Evaluates clustering quality using Silhouette Score.
silhouette_avg = silhouette_score(scaled_data, user_clusters)
print(f"Silhouette Score: {silhouette_avg:.4f}")
Anomaly Detection
Isolation Forest
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(
n_estimators=100,
contamination=0.1,
random_state=42
)
iso_forest.fit(scaled_data)
# Get anomaly scores (lower = more anomalous)
anomaly_scores = iso_forest.decision_function(scaled_data)
# Predict (-1 = anomaly, 1 = normal)
predictions = iso_forest.predict(scaled_data)
Local Outlier Factor
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
n_neighbors=20,
contamination=0.1
)
# Predict (-1 = anomaly, 1 = normal)
predictions = lof.fit_predict(scaled_data)
# Get negative outlier factor (lower = more anomalous)
negative_outlier_factor = lof.negative_outlier_factor_
One-Class SVM
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(
kernel='rbf',
gamma='auto',
nu=0.1
)
ocsvm.fit(scaled_data)
# Predict (-1 = anomaly, 1 = normal)
predictions = ocsvm.predict(scaled_data)
Gaussian Mixture Model
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(
n_components=1,
covariance_type='full',
random_state=42
)
gmm.fit(scaled_data)
# Get log-likelihood (lower = more anomalous)
densities = gmm.score_samples(scaled_data)
Kernel Density Estimation
from sklearn.neighbors import KernelDensity
kde = KernelDensity(
bandwidth=1.0,
kernel='gaussian'
)
kde.fit(scaled_data)
# Get density estimates
densities = np.exp(kde.score_samples(scaled_data))
PCA-based Detection
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
transformed_data = pca.fit_transform(scaled_data)
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
# Use reconstruction error as an anomaly score (higher = more anomalous)
reconstructed = pca.inverse_transform(transformed_data)
reconstruction_error = ((scaled_data - reconstructed) ** 2).sum(axis=1)
Anomaly Detection Algorithms
Comparison Table
| Algorithm | Type | Best For | Time Complexity |
|---|---|---|---|
| Isolation Forest | Tree-based | Large datasets | O(n log n) |
| Local Outlier Factor | Density-based | Clustered data | O(n²) |
| One-Class SVM | Boundary | High-dimensional | O(n²) to O(n³) |
| GMM | Density | Gaussian data | O(n × k × iter) |
| KDE | Density | Complex distributions | O(n²) |
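In practice it can help to run several detectors and compare which points they agree on; points flagged by every algorithm are the strongest candidates for investigation. A sketch on synthetic data (parameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(6, 1, (5, 3))])

detectors = {
    "IsolationForest": IsolationForest(contamination=0.05, random_state=42),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "OneClassSVM": OneClassSVM(nu=0.05),
}

flags = {}
for name, det in detectors.items():
    pred = det.fit_predict(X)   # -1 = anomaly for all three APIs
    flags[name] = set(np.flatnonzero(pred == -1))
    print(name, len(flags[name]), "flagged")

# Consensus: points every detector considers anomalous
consensus = set.intersection(*flags.values())
```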
Choosing the Right Algorithm
# Algorithm selection guide
def select_algorithm(data_size, data_dimensions, known_distribution=None):
    """
    Select an appropriate anomaly detection algorithm.
    """
    if data_size > 100000:
        return "Isolation Forest"           # Scales well
    elif data_dimensions > 50:
        return "One-Class SVM"              # Works in high dimensions
    elif known_distribution == "gaussian":
        return "Gaussian Mixture Model"     # Assumes Gaussian data
    else:
        return "Kernel Density Estimation"  # Non-parametric
Example: Complete Insider Threat Detection Pipeline
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
class InsiderThreatDetector:
    """
    Complete insider threat detection system.
    """
    def __init__(self, n_clusters=3, contamination=0.1):
        self.scaler = StandardScaler()
        self.kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        self.iso_forest = IsolationForest(
            contamination=contamination,
            random_state=42
        )
        self.is_fitted = False

    def fit(self, user_logs):
        """Train the detection model."""
        # Preprocess
        scaled_data = self.scaler.fit_transform(user_logs)
        # Model user behavior
        self.kmeans.fit(scaled_data)
        # Train anomaly detector
        self.iso_forest.fit(scaled_data)
        self.is_fitted = True
        return self

    def predict(self, user_logs):
        """Predict insider threats."""
        if not self.is_fitted:
            raise ValueError("Model not fitted. Call fit() first.")
        scaled_data = self.scaler.transform(user_logs)
        predictions = self.iso_forest.predict(scaled_data)
        anomaly_scores = self.iso_forest.decision_function(scaled_data)
        return predictions, anomaly_scores

    def get_threat_level(self, anomaly_score):
        """Convert anomaly score to threat level."""
        if anomaly_score < -0.5:
            return "HIGH"
        elif anomaly_score < 0:
            return "MEDIUM"
        else:
            return "LOW"

# Usage
detector = InsiderThreatDetector(n_clusters=3, contamination=0.05)
detector.fit(user_logs)
predictions, scores = detector.predict(new_user_logs)
for i, (pred, score) in enumerate(zip(predictions, scores)):
    threat_level = detector.get_threat_level(score)
    status = "THREAT" if pred == -1 else "NORMAL"
    print(f"User {i}: {status} (Score: {score:.4f}, Level: {threat_level})")
Performance Metrics
Evaluating Detection Performance
from sklearn.metrics import (
precision_score,
recall_score,
f1_score,
confusion_matrix,
roc_auc_score
)
# Assuming you have ground truth labels
y_true = [0, 1, 0, 0, 1, 0, 0, 1] # 1 = insider threat
y_pred = [0, 1, 0, 1, 1, 0, 0, 0] # Model predictions
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{cm}")
References
- Kim, J., Park, M., Kim, H., Cho, S., & Kang, P. (2019). Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms. Applied Sciences, 9(19), 4018. https://doi.org/10.3390/app9194018
- CERT Insider Threat Tools Dataset. Carnegie Mellon University. https://kilthub.cmu.edu/collections/Insider_Threat_Test_Dataset