Article October 15, 2024

Privacy-First Upload Security vs Cloud Scanning APIs: An Honest Comparison

Cloud AV APIs are convenient but ship your users' files to third parties. In-process scanning with Pompelmi keeps data where it belongs. Here's when each approach makes sense.

Tags: privacy, security, comparison, cloud, architecture

When you need to scan uploaded files for malware, you have a fundamental architectural choice: send files to a cloud AV API, or scan in your own process. Both work. Both have real tradeoffs. This post lays them out honestly so you can choose the right approach for your context.

What Cloud Scanning APIs Do

Cloud scanning services (VirusTotal, MetaDefender, and similar) accept a file upload, run it through one or more AV engines, and return a verdict. The process:

Your server receives a file.
Your server (or your user’s browser) uploads that file to the cloud API.
The cloud service scans with its engine pool.
The cloud service returns a result.
Your server uses the result to decide what to do.

What this means in practice:

Files leave your infrastructure.
Files are processed by a third-party service.
That service typically retains files for analysis, research, or indexing.
Network latency is added to every upload flow.
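The cloud flow above can be sketched in a few lines to show exactly where the file crosses the network boundary. This is a vendor-neutral illustration, not any real service's API: the endpoint URL, the `CloudVerdict` shape, and the injected `Transport` function are all assumptions made for the example.

```typescript
// Hypothetical sketch of the cloud flow; the endpoint and response shape are
// assumptions, not any real vendor's API.
interface CloudVerdict {
  malicious: boolean;
}

// The transport is injected so the sketch stays vendor-neutral and testable.
type Transport = (endpoint: string, body: Uint8Array) => Promise<CloudVerdict>;

async function scanViaCloud(
  fileBytes: Uint8Array,
  send: Transport,
  endpoint: string = "https://scanner.example/v1/scan",
): Promise<"accept" | "reject"> {
  // The complete file body leaves your infrastructure in this call.
  const verdict = await send(endpoint, fileBytes);
  // Your server then acts on the third party's verdict.
  return verdict.malicious ? "reject" : "accept";
}
```

Whatever the real transport looks like, the first argument to it is the entire file. That is the point at which retention and sharing policies of the third party start to apply.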

What In-Process Scanning Does

In-process scanning (Pompelmi, ClamAV running locally, custom scanners) runs inside your application’s process or on your infrastructure:

Your server receives a file.
Your server’s in-process scanner inspects the bytes.
Your server gets a verdict immediately.
No file leaves your infrastructure at any point.
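To make "inspects the bytes" concrete, here is a minimal sketch of an in-process structural check. It is not Pompelmi's implementation: `sniffUpload`, `SniffResult`, and the reason strings are hypothetical names, and the two checks (the `MZ` header of Windows PE executables, and a `/JavaScript` token inside a file that starts with `%PDF-`) are deliberately simple examples.

```typescript
// Hypothetical illustration of in-process byte inspection.
// Not Pompelmi's API: all names and verdict values here are assumptions.
interface SniffResult {
  verdict: "clean" | "suspicious";
  reason?: string;
}

// Returns true when `bytes` begins with the given signature.
function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false;
  return signature.every((value, index) => bytes[index] === value);
}

// Returns true when the ASCII string `needle` occurs anywhere in `bytes`.
function includesAscii(bytes: Uint8Array, needle: string): boolean {
  outer: for (let i = 0; i + needle.length <= bytes.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (bytes[i + j] !== needle.charCodeAt(j)) continue outer;
    }
    return true;
  }
  return false;
}

function sniffUpload(bytes: Uint8Array): SniffResult {
  // Windows PE executables begin with the ASCII bytes "MZ".
  if (startsWith(bytes, [0x4d, 0x5a])) {
    return { verdict: "suspicious", reason: "pe-executable" };
  }
  // PDFs begin with "%PDF-"; a /JavaScript token indicates embedded script.
  const isPdf = startsWith(bytes, [0x25, 0x50, 0x44, 0x46, 0x2d]);
  if (isPdf && includesAscii(bytes, "/JavaScript")) {
    return { verdict: "suspicious", reason: "pdf-javascript" };
  }
  return { verdict: "clean" };
}
```

Everything here is a function call over a buffer that is already in memory, which is why no network hop is involved. A production scanner would cover far more formats and would need to handle false positives, such as a text file that happens to start with "MZ".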

The Privacy Question

The privacy concern is not hypothetical. VirusTotal’s terms of service explicitly state that files submitted to the public API may be shared with partners and used to improve detection. If a user uploads a confidential document and your system sends it to a cloud AV API, that document has left your infrastructure, potentially permanently.

For many applications, this is a non-issue. For others, it is:

Healthcare applications: Patient records, diagnostic images, clinical notes. HIPAA’s minimum-necessary principle applies; unnecessary disclosure to a third party creates real exposure.
Legal platforms: Privileged attorney-client documents. Disclosing them to a third-party scanner may put privilege at risk.
Financial services: KYC documents, account statements, contracts. GLBA and similar regulations require controlling information flows.
HR platforms: Resumes, ID documents, personal data. GDPR’s data minimization principle applies.
Government or defense: Classified or controlled unclassified information (CUI) cannot leave authorized systems.

Privacy-first scanning is not primarily a performance optimization — it’s a data governance decision. Files that don’t leave your infrastructure cannot be exposed, retained, or misused by third parties.

Performance and Latency

Approach	Typical scan latency (1 MB file)	Network overhead
Cloud API (VirusTotal, MetaDefender)	500 ms – 5000 ms	Full file upload + result fetch
Cloud API with pre-hashing	100 ms – 500 ms (hash lookup only)	~32 bytes (SHA256)
In-process heuristics (Pompelmi)	1 ms – 20 ms	Zero
Local ClamAV daemon	10 ms – 200 ms	Local socket only
YARA engine (in-process)	5 ms – 100 ms	Zero

For synchronous upload flows, where the user is waiting for confirmation, in-process scanning adds the least latency of the approaches above: there is no network round trip, and the scan itself completes in milliseconds.
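The "pre-hashing" row in the table works by sending a digest instead of the file. A minimal sketch of computing that digest in-process with Node's built-in `node:crypto` module follows; `sha256Digest` is a hypothetical helper name, and how a given cloud service accepts hashes depends on its API and is not shown.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: computes the SHA-256 digest of an upload in-process.
function sha256Digest(bytes: Uint8Array): { raw: Uint8Array; hex: string } {
  const digest = createHash("sha256").update(bytes).digest();
  return { raw: digest, hex: digest.toString("hex") };
}

const upload = new Uint8Array(1024 * 1024); // stand-in for a 1 MB upload
const { raw, hex } = sha256Digest(upload);
console.log(raw.length); // 32 bytes, regardless of file size
console.log(hex.length); // 64 characters when hex-encoded
```

A hash lookup only yields a verdict for files the service has already seen. For an unknown hash you still have to decide whether to upload the full file or rely on local checks.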

Detection Coverage

Approach	Signature-based detection	Heuristic detection	Zero-day coverage
Cloud multi-engine (VirusTotal)	High (70+ engines)	Varies	Better (collective intelligence)
Local ClamAV	Good (community sigs)	Limited	Lags cloud by days/weeks
Pompelmi heuristics	No database	Yes (structural)	Limited to structural patterns
Pompelmi + YARA	No database (unless custom rules)	Yes	Custom rules tailored to your threats

The honest tradeoff: Cloud scanning has broader signature coverage because it aggregates dozens of AV engines. In-process heuristics catch structural threats (ZIP bombs, macro-enabled documents, PE executables, risky PDF actions) that don’t require signature databases.

For most web application upload security, structural threats are the primary risk. A ZIP bomb, a macro-enabled Word document uploaded to bypass execution controls, or an SVG with embedded XSS — these are the realistic threats. Cloud scanning adds marginal value for commodity malware samples while adding meaningful privacy risk.
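Structural checks of this kind need no signature database. As an illustration, here is a minimal sketch of a ZIP bomb check that works on entry metadata already parsed from an archive. It is not Pompelmi's implementation: `ZipEntryInfo` and `checkArchive` are hypothetical names, and the option names simply mirror the `createZipBombGuard` call that appears later in this article.

```typescript
// Hypothetical sketch of a ZIP bomb check; not Pompelmi's implementation.
interface ZipEntryInfo {
  name: string;
  compressedSize: number; // bytes as stored in the archive
  uncompressedSize: number; // bytes after extraction, as declared in the header
}

interface GuardOptions {
  maxEntries: number;
  maxCompressionRatio: number;
}

// Returns a list of human-readable findings; an empty list means no issue found.
function checkArchive(entries: ZipEntryInfo[], options: GuardOptions): string[] {
  const findings: string[] = [];
  if (entries.length > options.maxEntries) {
    findings.push(`too many entries: ${entries.length}`);
  }
  for (const entry of entries) {
    // An empty entry has ratio 0; a zero-byte entry that claims to expand
    // to something is treated as an infinite ratio.
    const ratio =
      entry.compressedSize === 0
        ? entry.uncompressedSize === 0
          ? 0
          : Infinity
        : entry.uncompressedSize / entry.compressedSize;
    if (ratio > options.maxCompressionRatio) {
      findings.push(`suspicious compression ratio in ${entry.name}`);
    }
  }
  return findings;
}
```

Declared sizes in archive headers can be forged, so a robust guard would also enforce byte limits while actually decompressing rather than trusting the metadata alone.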

Cost Model

Cloud scanning APIs charge per scan, per file, or per GB. At scale:

VirusTotal public API: Free tier is rate-limited; commercial pricing starts at hundreds of dollars per month.
MetaDefender: Similar pricing model.

In-process scanning has no per-scan cost. The compute cost is minimal (heuristics are fast; YARA rules scale with rule complexity and file size).

Hybrid Approach

Some architectures use both:

In-process heuristics as a fast gate: run immediately on upload and block obvious threats before any persistence.
Async cloud lookup for deeper analysis: after saving the file to a quarantine bucket (not live storage), submit an async job that sends the file's hash, not its contents, to a cloud service. Promote the file to live storage only after that lookup clears it.

This gives you low upload latency, privacy for the majority of content, and deeper inspection for files that pass structural checks. The tradeoff is complexity and the async promotion workflow.
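The promotion workflow can be reduced to a small state decision. The sketch below is one possible shape, under stated assumptions: the state names, the `HybridInputs` fields, and the verdict values are hypothetical, and the storage and job-queue plumbing around it is omitted.

```typescript
// Hypothetical sketch of the hybrid promotion logic; names are assumptions.
type FileState = "rejected" | "quarantined" | "live";

interface HybridInputs {
  // Result of the synchronous in-process gate.
  gateVerdict: "clean" | "malicious";
  // Result of the async cloud hash lookup; undefined while the job is pending.
  hashLookup?: "flagged" | "clear";
}

function nextState(inputs: HybridInputs): FileState {
  // The fast gate runs first and can reject before anything is persisted.
  if (inputs.gateVerdict === "malicious") return "rejected";
  // Files that pass the gate wait in quarantine until the lookup returns.
  if (inputs.hashLookup === undefined) return "quarantined";
  // Only an explicit clearance promotes the file to live storage.
  return inputs.hashLookup === "flagged" ? "rejected" : "live";
}
```

Keeping the decision in one pure function makes the default explicit: a file with no async result yet stays quarantined and is never served from live storage.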

When Cloud APIs Make Sense

You handle low-sensitivity files (e.g., public domain content, open datasets).
You need maximum signature coverage and accept the latency and privacy tradeoff.
You’re building a security research tool where sharing with threat intelligence networks is desirable.
Regulatory constraints don’t prohibit third-party data processing.

When In-Process Scanning Makes Sense

You handle PII, health data, legal or financial documents.
You need synchronous upload with no added latency.
You want zero external dependencies in your upload path.
You’re in a regulated environment where third-party data processing requires explicit consent or contracting.
You want cost predictability at scale.
You want to run custom YARA rules tailored to your specific threat model.

Pompelmi’s Position

Pompelmi is explicitly in the in-process camp. It has no cloud API calls, no telemetry, and no external dependencies in its scan path. It works well for privacy-sensitive and regulated environments, not because of compliance certifications, but because of architecture: data never leaves your process.

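import { composeScanners, CommonHeuristicsScanner, createZipBombGuard } from 'pompelmi';

// Compose two in-process scanners: a ZIP bomb guard and the common heuristics.
const scanner = composeScanners(
  [
    ['zipGuard', createZipBombGuard({ maxEntries: 1000, maxCompressionRatio: 100 })],
    ['heuristics', CommonHeuristicsScanner],
  ],
  // Run the scanners one after another and stop at the first malicious verdict.
  { parallel: false, stopOn: 'malicious' }
);

// uploadedBytes is assumed to hold the raw bytes of the upload,
// already read into memory by your request handler.
// This function call reads bytes in your process.
// Nothing is sent anywhere.
const matches = await scanner(uploadedBytes);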

Summary

Cloud scanning APIs offer broad signature coverage at the cost of data leaving your infrastructure and network latency on every upload. In-process scanning with Pompelmi offers low-latency structural threat detection with no data leaving your process. Hybrid architectures use both where the tradeoff is justified. Choose based on your data sensitivity, regulatory context, and latency requirements, not just detection breadth.
