NSFW Classifier#

The NSFW (Not Safe For Work) Classifier detects the likelihood that an image contains explicit or unsafe content. It outputs a probability score from 0 (safe) to 1 (NSFW), helping you filter or flag images in your datasets.

Model Details#

Architecture: MLP trained on OpenAI CLIP ViT-L/14 image embeddings
Source: CLIP-based NSFW Detector
Output Field: nsfw_score
Score Range: 0–1 (higher is more likely NSFW)
Embedding Requirement: CLIP ViT-L/14 (see Image Embedding)

How It Works#

The classifier takes normalized image embeddings and predicts the probability of NSFW content. It is lightweight and can be run on the GPU alongside embedding computation for efficient batch processing.

Usage#

Python

from nemo_curator import get_client
from nemo_curator.datasets import ImageTextPairDataset
from nemo_curator.image.embedders import TimmImageEmbedder
from nemo_curator.image.classifiers import NsfwClassifier

client = get_client(cluster_type="gpu")
dataset = ImageTextPairDataset.from_webdataset(path="/path/to/dataset", id_col="key")

embedding_model = TimmImageEmbedder(
    "vit_large_patch14_clip_quickgelu_224.openai",
    pretrained=True,
    batch_size=1024,
    num_threads_per_worker=16,
    normalize_embeddings=True,
)
safety_classifier = NsfwClassifier()

dataset_with_embeddings = embedding_model(dataset)
dataset_with_nsfw_scores = safety_classifier(dataset_with_embeddings)

dataset_with_nsfw_scores.save_metadata()

Key Parameters#

Parameter	Default	Description
`embedding_column`	`image_embedding`	Name of the column with image embeddings
`pred_column`	`nsfw_score`	Name of the output column for scores
`batch_size`	`-1`	Batch size for inference; `-1` processes all at once
`model_path`	auto	Path to model weights; downloads if not provided

Performance Notes#

The model is small and can be loaded onto the GPU with the embedding model for fast, in-place scoring.
Batch size can be increased for faster throughput if memory allows.

Best Practices#

Use normalized CLIP ViT-L/14 embeddings for best results.
Run the classifier immediately after embedding to avoid extra I/O.
Review a sample of scores to calibrate thresholds for your use case.