Detection Models

DFPN ships with four pre-configured detection models that cover the most common deepfake and synthetic-media threats. Each model is maintained as a standalone module under the models/ directory and can be run by any worker node.

Pre-configured Models

face-forensics -- Face Manipulation Detection

Detects face-swap and reenactment forgeries in still images using the Self-Blended Images (SBI) training strategy on an EfficientNet-B4 backbone.

Benchmark	Accuracy
FaceForensics++ (c23)	97.2%
Celeb-DF v2	91.4%

Processing speed: ~50 ms on GPU / ~500 ms on CPU

Best for

Profile photos, identity documents, social-media headshots -- any image where a face is the primary subject.

universal-fake-detect -- AI-Generated Image Detection

A CLIP-based classifier (CLIP-ViT-L/14) fine-tuned to separate real photographs from outputs of modern generative models.

Generator	Accuracy
ProGAN	99.8%
Stable Diffusion	89.4%
DALL-E	85.7%

Processing speed: ~100 ms on GPU / ~800 ms on CPU

Best for

Identifying fully synthetic images produced by diffusion models, GANs, or other generative pipelines.

video-ftcn -- Video Authenticity Detection

Combines an Xception frame-level feature extractor with a Temporal CNN to capture inter-frame inconsistencies that reveal video-level manipulation.

Benchmark	Accuracy
FaceForensics++	96.4%
Celeb-DF v2	88.9%

Processing speed: ~2 s on GPU / ~30 s on CPU

CPU note

Video analysis is roughly 15x slower on CPU-only nodes (~30 s versus ~2 s on GPU), so the default CPU configuration disables this modality. See CPU-only configuration for details.

ssl-antispoofing -- Voice Cloning Detection

Leverages wav2vec 2.0 / XLSR-53 self-supervised speech representations to detect synthetic and cloned voices.

Benchmark	Accuracy
ASVspoof 2021 (LA)	99.2%

Processing speed: ~200 ms on GPU / ~2 s on CPU

Best for

Voice messages, phone-call recordings, podcast clips -- any audio where speaker authenticity matters.

Performance Comparison

Accuracy by model

Model	Primary Benchmark	Primary Accuracy	Secondary Benchmark	Secondary Accuracy
face-forensics	FF++ (c23)	97.2%	Celeb-DF v2	91.4%
universal-fake-detect	ProGAN	99.8%	Stable Diffusion	89.4%
video-ftcn	FF++	96.4%	Celeb-DF v2	88.9%
ssl-antispoofing	ASVspoof 2021 (LA)	99.2%	--	--

Processing speed

Model	GPU Latency	CPU Latency	GPU Recommendation
face-forensics	~50 ms	~500 ms	Recommended
universal-fake-detect	~100 ms	~800 ms	Recommended
video-ftcn	~2 s	~30 s	Strongly recommended
ssl-antispoofing	~200 ms	~2 s	Recommended

Supported Modalities

Every model, worker, and analysis request declares which modalities it supports using a bitfield. This allows efficient on-chain filtering without string comparison.

Modality	Description	Bit Value
`ImageAuthenticity`	Real vs. AI-generated image classification	`1` (bit 0)
`VideoAuthenticity`	Temporal forgery detection in video	`2` (bit 1)
`AudioAuthenticity`	General audio manipulation detection	`4` (bit 2)
`FaceManipulation`	Face-swap and reenactment detection	`8` (bit 3)
`VoiceCloning`	Synthetic / cloned voice detection	`16` (bit 4)
`GeneratedContent`	Fully AI-generated media detection	`32` (bit 5)

Combine values with bitwise OR. For example, a worker that handles face manipulation (`FaceManipulation`, 8) and AI-generated images (`ImageAuthenticity`, 1, and `GeneratedContent`, 32) advertises modalities = 8 | 1 | 32 = 41.
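
The bitfield arithmetic can be sketched in Python as follows. This is a minimal illustration, not DFPN's actual client code: the `Modality` class, its member names, and the `supports` helper are hypothetical; only the bit values come from the table above.

```python
from enum import IntFlag


class Modality(IntFlag):
    """Hypothetical enum; bit values copied from the modality table."""
    IMAGE_AUTHENTICITY = 1   # bit 0
    VIDEO_AUTHENTICITY = 2   # bit 1
    AUDIO_AUTHENTICITY = 4   # bit 2
    FACE_MANIPULATION = 8    # bit 3
    VOICE_CLONING = 16       # bit 4
    GENERATED_CONTENT = 32   # bit 5


# A worker that handles face manipulation and AI-generated images.
worker_modalities = (
    Modality.FACE_MANIPULATION
    | Modality.IMAGE_AUTHENTICITY
    | Modality.GENERATED_CONTENT
)


def supports(advertised: int, modality: Modality) -> bool:
    """Return True if the modality's bit is set in the advertised bitfield."""
    return (advertised & modality) == modality


print(int(worker_modalities))                             # 41
print(supports(41, Modality.FACE_MANIPULATION))           # True
print(supports(41, Modality.VIDEO_AUTHENTICITY))          # False
```

Testing a single bit needs only one AND and one comparison, which is the property that makes bitfields cheaper to filter than string lists.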

Model-to-modality mapping

Model	Modalities
face-forensics	`FaceManipulation` (8)
universal-fake-detect	`ImageAuthenticity` (1) + `GeneratedContent` (32) = 33
video-ftcn	`VideoAuthenticity` (2)
ssl-antispoofing	`VoiceCloning` (16)
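
As a rough sketch of how this mapping might be used, the Python snippet below derives a worker's advertised bitfield from the models it runs and selects models for a request. The names `MODEL_MODALITIES`, `worker_bitfield`, and `models_for_request` are hypothetical, and the matching rule (any shared bit counts as a match) is an assumption; the source does not specify how DFPN matches requests to models.

```python
# Modality bitfields per model, copied from the mapping table above.
MODEL_MODALITIES = {
    "face-forensics": 8,               # FaceManipulation
    "universal-fake-detect": 1 | 32,   # ImageAuthenticity + GeneratedContent = 33
    "video-ftcn": 2,                   # VideoAuthenticity
    "ssl-antispoofing": 16,            # VoiceCloning
}


def worker_bitfield(model_names: list[str]) -> int:
    """OR together the bitfields of every model a worker runs."""
    bits = 0
    for name in model_names:
        bits |= MODEL_MODALITIES[name]
    return bits


def models_for_request(requested: int) -> list[str]:
    """Return models sharing at least one modality bit with the request.

    Assumption: any overlap is treated as a match.
    """
    return [name for name, bits in MODEL_MODALITIES.items() if bits & requested]


print(worker_bitfield(["face-forensics", "universal-fake-detect"]))  # 41
print(models_for_request(16))  # ['ssl-antispoofing']
```

Note that no pre-configured model in the table maps to `AudioAuthenticity` (4), so under this matching rule `models_for_request(4)` returns an empty list.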

Standardized Output Format

All models produce a JSON result with the same envelope so that workers and on-chain aggregation logic can treat them uniformly.

{
  "verdict": "manipulated",
  "confidence": 0.973,
  "detections": [
    {
      "type": "face_swap",
      "confidence": 0.973,
      "region": {
        "x": 120,
        "y": 80,
        "width": 256,
        "height": 256
      },
      "metadata": {
        "model": "face-forensics-sbi",
        "version": "1.0.0"
      }
    }
  ]
}

Field	Type	Description
`verdict`	string	One of `authentic`, `manipulated`, or `inconclusive`
`confidence`	float	Overall confidence score between 0.0 and 1.0
`detections`	array	Individual findings, each with its own confidence and optional spatial region
`detections[].type`	string	Detection category (e.g. `face_swap`, `generated_image`, `voice_clone`)
`detections[].region`	object	Bounding box for spatial detections (images/video); omitted for audio
`detections[].metadata`	object	Model identifier and version that produced this detection
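
A consumer of these results might validate the envelope before acting on it. The following is a minimal sketch that uses only the Python standard library; `validate_result`, `_check_confidence`, and `VALID_VERDICTS` are hypothetical names, and the bounding-box keys (`x`, `y`, `width`, `height`) are assumed from the example above, not from a published schema.

```python
import json

VALID_VERDICTS = {"authentic", "manipulated", "inconclusive"}
REGION_KEYS = {"x", "y", "width", "height"}  # assumed from the example result


def _check_confidence(value, field: str) -> None:
    """Raise ValueError unless value is a number in [0.0, 1.0]."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number or not 0.0 <= value <= 1.0:
        raise ValueError(f"{field} must be a number between 0.0 and 1.0")


def validate_result(raw: str) -> dict:
    """Parse a result JSON string and check it against the field table."""
    result = json.loads(raw)
    if result.get("verdict") not in VALID_VERDICTS:
        raise ValueError(f"unexpected verdict: {result.get('verdict')!r}")
    _check_confidence(result.get("confidence"), "confidence")

    detections = result.get("detections")
    if not isinstance(detections, list):
        raise ValueError("detections must be an array")

    for i, det in enumerate(detections):
        if not isinstance(det, dict) or not isinstance(det.get("type"), str):
            raise ValueError(f"detections[{i}].type must be a string")
        _check_confidence(det.get("confidence"), f"detections[{i}].confidence")
        region = det.get("region")  # optional: omitted for audio detections
        if region is not None and not REGION_KEYS <= set(region):
            raise ValueError(f"detections[{i}].region is missing keys")
    return result
```

Because `region` is treated as optional, the same function accepts audio results such as a `voice_clone` detection with no bounding box.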