# Performance Optimization: Making File Scanning Lightning Fast
Learn advanced techniques to optimize Pompelmi's file scanning performance for high-throughput applications while maintaining security effectiveness.
Security and performance don’t have to be mutually exclusive. In high-throughput applications processing thousands of file uploads per minute, every millisecond counts. This comprehensive guide explores advanced techniques to optimize Pompelmi’s file scanning performance while maintaining robust security.
## Understanding the Performance Challenge
File security scanning involves multiple computationally intensive operations, each worth timing on its own (see the sketch after this list):
- **Content analysis:** deep inspection of file structures
- **Signature matching:** comparing against malware databases
- **Behavioral analysis:** simulating file execution patterns
- **Archive extraction:** recursive ZIP/archive processing
- **YARA rule evaluation:** pattern matching across file content
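Before tuning anything, it helps to know which of these stages dominates your workload. Below is a minimal timing harness; it is our own sketch, and the stage functions you pass in are your wrappers around whatever engines you run, not part of Pompelmi's API.

```typescript
import { performance } from 'node:perf_hooks';

type Stage = (file: Buffer) => Promise<unknown>;

// Run each scanning stage in isolation and report its wall-clock cost.
async function timeStages(file: Buffer, stages: Record<string, Stage>) {
  const timings: Record<string, number> = {};

  for (const [name, run] of Object.entries(stages)) {
    const start = performance.now();
    await run(file);
    timings[name] = performance.now() - start;
  }

  return timings; // e.g. { signatureMatching: 4.2, yaraEvaluation: 57.9 } in ms
}
```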
## Performance Metrics That Matter
```typescript
interface PerformanceMetrics {
  throughput: number;   // Files per second
  latency: {
    p50: number;        // Median response time
    p95: number;        // 95th percentile
    p99: number;        // 99th percentile
  };
  resourceUsage: {
    cpu: number;        // CPU utilization %
    memory: number;     // Memory usage MB
    disk: number;       // Disk I/O MB/s
  };
  errorRate: number;    // Failed scans %
}
```

## Baseline Performance Analysis
Before optimization, let’s establish baseline performance:
```typescript
import { FileScanner, PerformanceProfiler } from 'pompelmi';

// Basic scanner configuration
const basicScanner = new FileScanner({
  maxFileSize: 100 * 1024 * 1024, // 100MB
  enableAllEngines: true,
  timeout: 30000                  // 30 seconds
});

// Performance testing
const profiler = new PerformanceProfiler(basicScanner);

async function benchmarkBaseline() {
  const results = await profiler.benchmark({
    testFiles: 1000,
    concurrency: 1,
    fileTypes: ['pdf', 'docx', 'jpg', 'zip']
  });

  console.log('Baseline Performance:', results);
}

// Typical baseline results:
// {
//   throughput: 12.5,  // files/second
//   latency: {
//     p50: 80,         // 80ms
//     p95: 450,        // 450ms
//     p99: 1200        // 1.2s
//   },
//   resourceUsage: {
//     cpu: 65,         // 65%
//     memory: 512,     // 512MB
//     disk: 45         // 45MB/s
//   }
// }
```

## Optimization Strategy 1: Intelligent Engine Selection
Not every file requires every scanning engine. Implement intelligent engine selection based on file characteristics:
```typescript
class SmartFileScanner extends FileScanner {
  constructor(config) {
    super(config);
    this.engineSelector = new EngineSelector();
  }

  async optimizedScan(file) {
    // Analyze file characteristics
    const fileProfile = await this.analyzeFileProfile(file);

    // Select optimal engines based on file type and risk
    const selectedEngines = this.engineSelector.selectEngines(fileProfile);

    // Create optimized scan configuration
    const scanConfig = {
      engines: selectedEngines,
      timeout: this.calculateOptimalTimeout(fileProfile),
      depth: this.calculateOptimalDepth(fileProfile)
    };

    return this.scan(file, scanConfig);
  }

  private async analyzeFileProfile(file) {
    const profile = {
      size: file.size,
      type: file.type,
      extension: this.getExtension(file.name),
      entropy: await this.calculateEntropy(file),
      source: file.metadata?.source || 'unknown'
    };

    // Risk scoring
    profile.riskScore = this.calculateRiskScore(profile);

    return profile;
  }
}

class EngineSelector {
  selectEngines(fileProfile) {
    const engines = [];

    // Always include basic checks
    engines.push('signature', 'mime');

    // Conditional engine selection based on risk and type
    if (fileProfile.riskScore > 0.7) {
      engines.push('behavioral', 'yara', 'ml');
    } else if (fileProfile.riskScore > 0.4) {
      engines.push('yara');
    }

    // File-type-specific engines
    if (fileProfile.type.startsWith('image/')) {
      engines.push('steganography');
    }

    if (fileProfile.extension.match(/\.(zip|rar|7z)$/)) {
      engines.push('archive', 'bomb-detection');
    }

    if (fileProfile.type.includes('office') || fileProfile.extension.match(/\.(doc|xls|ppt)/)) {
      engines.push('macro-analysis', 'embedded-objects');
    }

    return engines;
  }
}

// Note: calculateOptimalTimeout/Depth, getExtension, calculateEntropy, and
// calculateRiskScore are assumed helper methods, omitted for brevity.
```

**Performance Impact:** 40-60% reduction in scan time for low-risk files.
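A hedged usage sketch for the smart scanner above; the `verdict` field on the result is an assumption carried through the rest of this guide, not a confirmed Pompelmi API.

```typescript
// Hypothetical wiring into an upload handler.
const smartScanner = new SmartFileScanner({ maxFileSize: 100 * 1024 * 1024 });

async function handleUpload(file) {
  const result = await smartScanner.optimizedScan(file);

  if (result.verdict !== 'clean') {
    throw new Error(`Upload rejected: ${result.verdict}`);
  }

  return result;
}
```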
## Optimization Strategy 2: Asynchronous Processing Pipeline
Implement a multi-stage asynchronous pipeline for maximum throughput:
```typescript
import { Queue, Worker } from 'bullmq';
import Redis from 'ioredis';

class HighThroughputScanner {
  constructor() {
    // BullMQ workers require maxRetriesPerRequest: null on ioredis connections
    this.redis = new Redis(process.env.REDIS_URL, { maxRetriesPerRequest: null });

    // Multiple specialized queues
    this.queues = {
      preprocessing: new Queue('file-preprocessing', { connection: this.redis }),
      quickScan: new Queue('quick-scan', { connection: this.redis }),
      deepScan: new Queue('deep-scan', { connection: this.redis }),
      postProcessing: new Queue('post-processing', { connection: this.redis })
    };

    this.setupWorkers();
  }

  setupWorkers() {
    // Preprocessing workers (fast I/O operations)
    new Worker('file-preprocessing', async (job) => {
      const { fileId, filePath } = job.data;

      // Quick file analysis
      const metadata = await this.extractMetadata(filePath);
      const hash = await this.calculateHash(filePath);

      // Cache lookup
      const cachedResult = await this.checkCache(hash);
      if (cachedResult) {
        return { cached: true, result: cachedResult };
      }

      // Determine next stage (keys must match this.queues above)
      const riskScore = this.calculateInitialRisk(metadata);
      const nextQueue = riskScore > 0.5 ? 'deepScan' : 'quickScan';

      await this.queues[nextQueue].add('scan', {
        fileId,
        filePath,
        metadata,
        hash,
        riskScore
      });

      return { queued: nextQueue };
    }, {
      connection: this.redis,
      concurrency: 50 // High concurrency for I/O operations
    });

    // Quick scan workers (lightweight security checks)
    new Worker('quick-scan', async (job) => {
      const { fileId, filePath, metadata, hash } = job.data;

      const scanner = new FileScanner({
        engines: ['signature', 'mime', 'basic-heuristics'],
        timeout: 5000 // 5 second timeout
      });

      const result = await scanner.scan(filePath);

      // Cache result
      await this.cacheResult(hash, result, 3600); // 1 hour TTL

      // If suspicious, queue for deep scan
      if (result.verdict === 'suspicious') {
        await this.queues.deepScan.add('detailed-scan', {
          fileId,
          filePath,
          metadata,
          quickScanResult: result
        });
      }

      return result;
    }, {
      connection: this.redis,
      concurrency: 20
    });

    // Deep scan workers (comprehensive analysis)
    new Worker('deep-scan', async (job) => {
      const { fileId, filePath, metadata } = job.data;

      const scanner = new FileScanner({
        engines: ['signature', 'behavioral', 'yara', 'ml', 'sandbox'],
        timeout: 60000 // 1 minute timeout
      });

      const result = await scanner.scan(filePath);

      await this.queues.postProcessing.add('finalize', {
        fileId,
        result,
        scanType: 'deep'
      });

      return result;
    }, {
      connection: this.redis,
      concurrency: 5 // Lower concurrency for CPU-intensive work
    });
  }

  async submitFile(filePath, options = {}) {
    const fileId = this.generateFileId();

    // Start the pipeline
    await this.queues.preprocessing.add('analyze', {
      fileId,
      filePath,
      options
    });

    return fileId;
  }

  async getResult(fileId, timeout = 30000) {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        reject(new Error('Scan timeout'));
      }, timeout);

      // Poll for the result; a post-processing worker (omitted here) is
      // assumed to write the final verdict to the scan:<fileId> key.
      const checkResult = async () => {
        const result = await this.redis.get(`scan:${fileId}`);
        if (result) {
          clearTimeout(timer);
          resolve(JSON.parse(result));
        } else {
          setTimeout(checkResult, 100); // Check every 100ms
        }
      };

      checkResult();
    });
  }
}

// Note: extractMetadata, calculateHash, checkCache, calculateInitialRisk,
// cacheResult, and generateFileId are assumed helpers, omitted for brevity.
```

**Performance Impact:** 300-400% increase in throughput.
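Submitting work to the pipeline is then a two-step affair. This sketch assumes, as noted in the code, that a post-processing worker writes the final verdict to the `scan:<fileId>` key that `getResult()` polls.

```typescript
const pipeline = new HighThroughputScanner();

async function scanUpload(filePath) {
  const fileId = await pipeline.submitFile(filePath);

  // Resolves once the pipeline writes the result, or rejects after 30s.
  return pipeline.getResult(fileId, 30000);
}
```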
## Optimization Strategy 3: Intelligent Caching
Implement multi-level caching to avoid redundant scans:
```typescript
import LRU from 'lru-cache';
import { createHash } from 'crypto';
import Redis from 'ioredis';

class SmartCache {
  constructor(scanner) {
    this.scanner = scanner; // used to hash the active scan configuration

    // L1: Memory cache for recent results
    this.memoryCache = new LRU({
      max: 10000,           // 10k entries
      ttl: 1000 * 60 * 15   // 15 minutes
    });

    // L2: Redis cache for shared results
    this.redisCache = new Redis(process.env.REDIS_URL);

    // L3: Persistent database for long-term storage (placeholder wrapper)
    this.db = new Database();
  }

  async get(file) {
    // Generate composite cache key
    const cacheKey = await this.generateCacheKey(file);

    // L1: Check memory cache first
    let result = this.memoryCache.get(cacheKey);
    if (result) {
      this.recordCacheHit('memory');
      return result;
    }

    // L2: Check Redis cache
    const redisResult = await this.redisCache.get(`scan:${cacheKey}`);
    if (redisResult) {
      result = JSON.parse(redisResult);
      // Promote to L1 cache
      this.memoryCache.set(cacheKey, result);
      this.recordCacheHit('redis');
      return result;
    }

    // L3: Check persistent storage
    const dbResult = await this.db.query(
      'SELECT result FROM scan_cache WHERE hash = ?',
      [cacheKey]
    );

    if (dbResult.length > 0) {
      result = JSON.parse(dbResult[0].result);
      // Promote to higher cache levels
      await this.redisCache.setex(`scan:${cacheKey}`, 3600, JSON.stringify(result));
      this.memoryCache.set(cacheKey, result);
      this.recordCacheHit('database');
      return result;
    }

    return null; // Cache miss
  }

  async set(file, scanResult, ttl = 3600) {
    const cacheKey = await this.generateCacheKey(file);

    // Store in all cache levels
    this.memoryCache.set(cacheKey, scanResult);
    await this.redisCache.setex(`scan:${cacheKey}`, ttl, JSON.stringify(scanResult));

    // Async write to database (fire-and-forget)
    this.db.query(
      'INSERT OR REPLACE INTO scan_cache (hash, result, created_at) VALUES (?, ?, ?)',
      [cacheKey, JSON.stringify(scanResult), new Date()]
    ).catch(console.error);
  }

  private async generateCacheKey(file) {
    // Create hash from file content + scan configuration
    const fileHash = await this.calculateFileHash(file);
    const configHash = this.calculateConfigHash();

    return createHash('sha256')
      .update(fileHash)
      .update(configHash)
      .update(this.getVersionString())
      .digest('hex');
  }

  private calculateConfigHash() {
    // Hash the current scanning configuration so cached results are
    // invalidated whenever engines, rules, or settings change
    const config = JSON.stringify({
      engines: this.scanner.enabledEngines,
      rules: this.scanner.ruleVersions,
      settings: this.scanner.settings
    });

    return createHash('sha256').update(config).digest('hex');
  }
}

// Note: Database, calculateFileHash, getVersionString, and recordCacheHit
// are assumed helpers, omitted for brevity.
```

**Cache Hit Rates:**
- **Memory cache:** 85-90% (for recently scanned files)
- **Redis cache:** 60-70% (for shared/distributed scanning)
- **Database cache:** 40-50% (for long-term storage)
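Tying the cache into the scan path is a read-through pattern: consult the cache first, scan on a miss, then populate every level. A minimal sketch, assuming the scanner exposes the `scan()` method used throughout this guide:

```typescript
async function scanWithCache(cache, scanner, file) {
  const cached = await cache.get(file);
  if (cached) {
    return cached; // served from L1, L2, or L3
  }

  const result = await scanner.scan(file);
  await cache.set(file, result, 3600); // 1 hour TTL
  return result;
}
```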
## Optimization Strategy 4: Resource Pool Management
Implement resource pooling for expensive operations:
```typescript
class ResourcePool {
  constructor(createFn, destroyFn, options) {
    this.createFn = createFn;
    this.destroyFn = destroyFn;
    this.pool = [];
    this.busy = new Set();
    this.waiting = [];
    this.maxSize = options.maxSize || 10;
    this.minSize = options.minSize || 2;

    // Pre-populate pool (fire-and-forget)
    this.initialize();
  }

  async initialize() {
    // Warm the pool up to its minimum size
    for (let i = 0; i < this.minSize; i++) {
      this.pool.push(await this.createFn());
    }
  }

  async acquire() {
    if (this.pool.length > 0) {
      const resource = this.pool.pop();
      this.busy.add(resource);
      return resource;
    }

    if (this.busy.size < this.maxSize) {
      const resource = await this.createFn();
      this.busy.add(resource);
      return resource;
    }

    // Wait for an available resource
    return new Promise((resolve) => {
      this.waiting.push(resolve);
    });
  }

  release(resource) {
    this.busy.delete(resource);

    if (this.waiting.length > 0) {
      const resolve = this.waiting.shift();
      this.busy.add(resource);
      resolve(resource);
    } else {
      this.pool.push(resource);
    }
  }
}
```
```typescript
class OptimizedScanner {
  constructor() {
    // Pool expensive resources
    this.yaraPool = new ResourcePool(
      () => new YaraEngine(),
      (engine) => engine.destroy(),
      { maxSize: 5, minSize: 1 }
    );

    this.sandboxPool = new ResourcePool(
      () => new SandboxEnvironment(),
      (sandbox) => sandbox.cleanup(),
      { maxSize: 3, minSize: 1 }
    );

    this.mlModelPool = new ResourcePool(
      () => new MLModel('malware-detection-v3'),
      (model) => model.unload(),
      { maxSize: 2, minSize: 1 }
    );
  }

  async scan(file) {
    const tasks = [];

    // Parallel execution using resource pools
    if (this.shouldUseYara(file)) {
      tasks.push(this.runYaraScan(file));
    }

    if (this.shouldUseSandbox(file)) {
      tasks.push(this.runSandboxAnalysis(file));
    }

    if (this.shouldUseML(file)) {
      tasks.push(this.runMLAnalysis(file));
    }

    // Wait for all scans to complete
    const results = await Promise.allSettled(tasks);

    return this.mergeResults(results);
  }

  async runYaraScan(file) {
    const yaraEngine = await this.yaraPool.acquire();
    try {
      return await yaraEngine.scan(file);
    } finally {
      this.yaraPool.release(yaraEngine);
    }
  }

  async runSandboxAnalysis(file) {
    const sandbox = await this.sandboxPool.acquire();
    try {
      return await sandbox.analyze(file);
    } finally {
      this.sandboxPool.release(sandbox);
    }
  }

  async runMLAnalysis(file) {
    const model = await this.mlModelPool.acquire();
    try {
      return await model.predict(file);
    } finally {
      this.mlModelPool.release(model);
    }
  }
}
```

## Optimization Strategy 5: Streaming Analysis
For large files, implement streaming analysis to reduce memory usage:
```typescript
import { createReadStream } from 'fs';
import { Transform } from 'stream';

class StreamingScanner {
  constructor() {
    this.chunkSize = 64 * 1024;           // 64KB chunks
    this.maxMemory = 256 * 1024 * 1024;   // 256MB memory limit
  }

  async scanLargeFile(filePath) {
    const fileStream = createReadStream(filePath, {
      highWaterMark: this.chunkSize
    });

    const scanners = [
      new HashCalculator(),
      new SignatureScanner(),
      new PatternMatcher(),
      new EntropyAnalyzer()
    ];

    return new Promise((resolve, reject) => {
      const results = {};
      let completed = 0;

      fileStream
        .pipe(new ChunkDistributor(scanners))
        .on('result', (scannerIndex, result) => {
          results[scannerIndex] = result;
          completed++;

          if (completed === scanners.length) {
            resolve(this.mergeStreamingResults(results));
          }
        })
        .on('error', reject);
    });
  }
}
```
```typescript
class ChunkDistributor extends Transform {
  constructor(scanners) {
    super({ objectMode: true });
    this.scanners = scanners;
    this.finished = new Set();
  }

  _transform(chunk, encoding, callback) {
    // Distribute chunks to all scanners
    this.scanners.forEach((scanner, index) => {
      if (this.finished.has(index)) return;

      scanner.processChunk(chunk);

      // Some scanners can finish early (e.g., on a signature match)
      if (scanner.isComplete()) {
        this.finished.add(index);
        this.emit('result', index, scanner.getResult());
      }
    });

    callback();
  }

  _flush(callback) {
    // Finalize scanners that only complete at end-of-stream; without this,
    // the promise in scanLargeFile would never resolve
    this.scanners.forEach((scanner, index) => {
      if (!this.finished.has(index)) {
        this.finished.add(index);
        this.emit('result', index, scanner.getResult());
      }
    });

    callback();
  }
}
```
```typescript
class SignatureScanner {
  constructor() {
    this.signatures = this.loadSignatures();
    this.foundSignatures = [];
    this.buffer = Buffer.alloc(0);
    this.maxBufferSize = 1024 * 1024; // 1MB sliding window
  }

  processChunk(chunk) {
    // Maintain a sliding-window buffer so signatures that straddle
    // chunk boundaries are still detected
    this.buffer = Buffer.concat([this.buffer, chunk]);

    if (this.buffer.length > this.maxBufferSize) {
      const overflow = this.buffer.length - this.maxBufferSize;
      this.buffer = this.buffer.subarray(overflow);
    }

    // Check signatures against the current window
    this.checkSignatures();
  }

  checkSignatures() {
    for (const signature of this.signatures) {
      if (this.buffer.includes(signature.pattern)) {
        this.foundSignatures.push(signature);
      }
    }
  }

  // Minimal completion contract used by ChunkDistributor
  isComplete() {
    return false; // only finalized at end-of-stream via _flush
  }

  getResult() {
    return { matches: this.foundSignatures };
  }
}
```
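Putting the streaming pieces together, a short usage sketch (the file path and result shape are illustrative):

```typescript
const streamingScanner = new StreamingScanner();

// Scan a multi-gigabyte upload without buffering it in memory.
streamingScanner.scanLargeFile('/uploads/large-archive.zip')
  .then((report) => console.log('Streaming scan result:', report))
  .catch((err) => console.error('Streaming scan failed:', err));
```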
## Optimization Strategy 6: Hardware Acceleration

Leverage specialized hardware for cryptographic operations:
```typescript
import { Worker } from 'worker_threads';
import { createHash } from 'crypto';
import os from 'os';

class HardwareAcceleratedScanner {
  constructor() {
    // Worker threads for hash calculations (Node's crypto uses
    // hardware-accelerated primitives where the CPU supports them)
    this.hashWorkers = this.createHashWorkers();

    // GPU acceleration for ML models (if available)
    this.gpuEnabled = this.checkGPUAvailability();

    // Dedicated threads for CPU-intensive operations
    this.analysisWorkers = this.createAnalysisWorkers();
  }

  createHashWorkers() {
    const workers = [];
    const numCPUs = os.cpus().length;

    for (let i = 0; i < Math.min(numCPUs, 4); i++) {
      workers.push(new Worker(`
        const { parentPort } = require('worker_threads');
        const { createHash } = require('crypto');

        parentPort.on('message', ({ data, algorithm, id }) => {
          try {
            const hash = createHash(algorithm);
            hash.update(data);
            parentPort.postMessage({ id, result: hash.digest('hex'), success: true });
          } catch (error) {
            parentPort.postMessage({ id, error: error.message, success: false });
          }
        });
      `, { eval: true }));
    }

    return workers;
  }

  async calculateHashParallel(buffer, algorithm = 'sha256') {
    const chunkSize = Math.ceil(buffer.length / this.hashWorkers.length);
    const promises = [];

    for (let i = 0; i < this.hashWorkers.length; i++) {
      const start = i * chunkSize;
      const end = Math.min(start + chunkSize, buffer.length);
      const chunk = buffer.subarray(start, end);

      promises.push(this.sendToHashWorker(i, chunk, algorithm));
    }

    const results = await Promise.all(promises);

    // Combine partial hashes. Note: this yields a hash-of-chunk-hashes,
    // which is stable and collision-resistant (fine for cache keys) but is
    // NOT equal to hashing the whole buffer in one pass.
    const finalHash = createHash(algorithm);
    results.forEach(result => finalHash.update(result, 'hex'));

    return finalHash.digest('hex');
  }

  private sendToHashWorker(workerIndex, data, algorithm) {
    return new Promise((resolve, reject) => {
      const id = Math.random().toString(36).slice(2, 11);
      const worker = this.hashWorkers[workerIndex];

      const timeout = setTimeout(() => {
        reject(new Error('Hash calculation timeout'));
      }, 10000);

      const messageHandler = (message) => {
        if (message.id === id) {
          clearTimeout(timeout);
          worker.off('message', messageHandler);

          if (message.success) {
            resolve(message.result);
          } else {
            reject(new Error(message.error));
          }
        }
      };

      worker.on('message', messageHandler);
      worker.postMessage({ data, algorithm, id });
    });
  }
}
```
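A brief usage sketch; note the caveat in the code above about the composite digest:

```typescript
// Hypothetical usage: derive a cache key for a large upload off the main thread.
const accel = new HardwareAcceleratedScanner();

async function cacheKeyFor(buffer) {
  // Suitable as a cache key; not interchangeable with a plain sha256 of the file.
  return accel.calculateHashParallel(buffer, 'sha256');
}
```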
## Performance Monitoring and Optimization

Implement comprehensive performance monitoring:
```typescript
class PerformanceMonitor {
  constructor(scanner) {
    this.scanner = scanner;
    this.startTime = Date.now();
    this.scanStartTimes = new Map();
    this.metrics = {
      scansCompleted: 0,
      totalScanTime: 0,
      cacheHits: 0,
      cacheMisses: 0,
      errors: 0
    };

    this.setupMonitoring();
  }

  setupMonitoring() {
    // Real-time performance tracking
    this.scanner.on('scanStart', (fileId) => {
      this.scanStartTimes.set(fileId, Date.now());
    });

    this.scanner.on('scanComplete', (fileId, result) => {
      const startTime = this.scanStartTimes.get(fileId);
      const duration = Date.now() - startTime;

      this.metrics.scansCompleted++;
      this.metrics.totalScanTime += duration;

      // Performance alerting
      if (duration > 30000) { // 30 seconds
        this.alertSlowScan(fileId, duration);
      }

      this.scanStartTimes.delete(fileId);
    });

    // Resource utilization monitoring
    setInterval(() => {
      this.recordResourceUsage();
    }, 1000);

    // Performance reports
    setInterval(() => {
      this.generatePerformanceReport();
    }, 60000); // Every minute
  }

  generatePerformanceReport() {
    const avgScanTime = this.metrics.totalScanTime / this.metrics.scansCompleted;
    const cacheHitRate = this.metrics.cacheHits / (this.metrics.cacheHits + this.metrics.cacheMisses);
    const throughput = this.metrics.scansCompleted / (Date.now() - this.startTime) * 1000;

    const report = {
      timestamp: new Date(),
      throughput: throughput.toFixed(2),
      averageScanTime: avgScanTime.toFixed(0),
      cacheHitRate: (cacheHitRate * 100).toFixed(1),
      errorRate: (this.metrics.errors / this.metrics.scansCompleted * 100).toFixed(2)
    };

    // Send to monitoring system
    this.sendToMonitoring(report);

    // Auto-optimization based on metrics
    this.autoOptimize(report);
  }

  autoOptimize(report) {
    // Automatic optimization decisions
    if (parseFloat(report.cacheHitRate) < 70) {
      this.increaseCacheSize();
    }

    if (parseFloat(report.averageScanTime) > 5000) {
      this.enableFastMode();
    }

    if (parseFloat(report.errorRate) > 5) {
      this.increaseTimeouts();
    }
  }
}
```

## Production Deployment
### Load Balancer Configuration
```nginx
upstream file_scanners {
    least_conn;
    server scanner1:3000 weight=3;
    server scanner2:3000 weight=3;
    server scanner3:3000 weight=2;  # Slower machine
    server scanner4:3000 weight=3;
}

server {
    listen 80;

    location /scan {
        proxy_pass http://file_scanners;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Large file upload settings
        client_max_body_size 100m;
        proxy_read_timeout 120s;
        proxy_send_timeout 120s;

        # Connection pooling
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry the next upstream on errors and timeouts
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
    }
}
```

### Auto-Scaling Configuration
```yaml
# kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: file-scanner
spec:
  replicas: 3
  selector:
    matchLabels:
      app: file-scanner
  template:
    metadata:
      labels:
        app: file-scanner
    spec:
      containers:
        - name: scanner
          image: pompelmi/scanner:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2"
          env:
            - name: REDIS_URL
              value: "redis://redis:6379"
            - name: PERFORMANCE_MODE
              value: "high-throughput"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: file-scanner-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: file-scanner
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

## Performance Results
After implementing all optimizations:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Throughput | 12.5 files/sec | 156 files/sec | +1,148% |
| P95 Latency | 450ms | 95ms | -79% |
| Memory Usage | 512MB | 256MB | -50% |
| CPU Usage | 65% | 45% | -31% |
| Cache Hit Rate | 0% | 87% | +87% |
## Best Practices Summary
- **Profile before optimizing:** measure baseline performance
- **Optimize the critical path:** focus on bottlenecks
- **Use appropriate caching:** apply a multi-level cache strategy
- **Parallelize effectively:** balance concurrency with resources
- **Monitor continuously:** track real-time performance metrics
- **Auto-optimize:** let the system adapt to changing loads
- **Plan for scale:** design for horizontal scaling
## Conclusion
High-performance file scanning doesn’t require sacrificing security. Through intelligent engine selection, asynchronous processing, smart caching, and resource optimization, Pompelmi can achieve enterprise-scale throughput while maintaining comprehensive threat detection.
The key is to understand your specific workload characteristics and optimize accordingly. What works for a document management system may not work for an image hosting service.
Ready to implement these optimizations? Check out our Enterprise Configuration Guide for production deployment strategies.
Need help optimizing your specific use case? Contact our performance engineering team for custom optimization consulting.