realutils.metrics.clip

Overview:

Utilities for working with CLIP (Contrastive Language-Image Pre-training) models.

This module provides functions for working with CLIP models, including image and text embedding generation and classification. It supports loading ONNX-converted CLIP models from the Hugging Face Hub and running inference on both image and text inputs.

All models and preprocessors are hosted in the Hugging Face repository deepghs/clip_onnx.
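
For orientation, here is a minimal sketch that ties the three public functions of this module together: texts and images are embedded once, and the pre-computed embeddings are then fed to classify_with_clip (which, as documented below, accepts embeddings as well as raw inputs). The image paths are the same sample files used in the examples below.

>>> from realutils.metrics.clip import (
...     get_clip_text_embedding, get_clip_image_embedding, classify_with_clip)
>>>
>>> # embed texts and images once, then reuse the embeddings for classification
>>> text_emb = get_clip_text_embedding([
...     'a photo of a cat',
...     'a photo of a dog',
...     'a photo of a human',
... ])
>>> img_emb = get_clip_image_embedding(['xlip/1.jpg', 'xlip/2.jpg'])
>>> probs = classify_with_clip(images=img_emb, texts=text_emb)
>>> probs.shape
(2, 3)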

(Figure: CLIP demo plot)

This is an overall benchmark of all the CLIP models:

(Figures: CLIP image embedding benchmark; CLIP text embedding benchmark)

get_clip_text_embedding

realutils.metrics.clip.get_clip_text_embedding(texts: str | List[str], model_name: str = 'openai/clip-vit-base-patch32', fmt='embeddings')[source]

Generate CLIP embeddings for input texts.

Parameters:
  • texts (Union[str, List[str]]) – Input text or list of texts to encode

  • model_name (str) – Name of the CLIP model to use

  • fmt – Output format (‘embeddings’ or ‘encodings’)

Returns:

Text embeddings or encodings, depending on the fmt parameter

Example:
>>> from realutils.metrics.clip import get_clip_text_embedding
>>>
>>> # one text
>>> emb = get_clip_text_embedding('a photo of a cat')
>>> emb.shape, emb.dtype
((1, 512), dtype('float32'))
>>>
>>> # multiple texts
>>> emb = get_clip_text_embedding([
...     'a photo of a cat',
...     'a photo of a dog',
...     'a photo of a human',
... ])
>>> emb.shape, emb.dtype
((3, 512), dtype('float32'))
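
The text embeddings can also be compared directly, e.g. to rank candidate sentences against a query by cosine similarity. The following is a minimal sketch built on the example above; it only assumes numpy, and it normalizes the vectors explicitly rather than relying on the model output being unit-length.

>>> import numpy as np
>>>
>>> query = get_clip_text_embedding('a kitten sleeping on a sofa')
>>> # L2-normalize both sides, then cosine similarity is a plain dot product
>>> emb_n = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
>>> query_n = query / np.linalg.norm(query, axis=-1, keepdims=True)
>>> sims = (query_n @ emb_n.T)[0]   # one similarity score per candidate text
>>> best = int(sims.argmax())       # index of the closest candidate text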

get_clip_image_embedding

realutils.metrics.clip.get_clip_image_embedding(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...], model_name: str = 'openai/clip-vit-base-patch32', fmt='embeddings')[source]

Generate CLIP embeddings for input images.

Parameters:
  • images (MultiImagesTyping) – Input images to encode

  • model_name (str) – Name of the CLIP model to use

  • fmt – Output format (‘embeddings’ or ‘encodings’)

Returns:

Image embeddings or encodings, depending on the fmt parameter

Example:
>>> from realutils.metrics.clip import get_clip_image_embedding
>>>
>>> # one image
>>> emb = get_clip_image_embedding('xlip/1.jpg')
>>> emb.shape, emb.dtype
((1, 512), dtype('float32'))
>>>
>>> # multiple images
>>> emb = get_clip_image_embedding(['xlip/1.jpg', 'xlip/2.jpg'])
>>> emb.shape, emb.dtype
((2, 512), dtype('float32'))
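
As the signature above indicates, inputs are not restricted to file paths: already-loaded PIL images, raw bytes, and binary file objects are accepted as well. A small sketch, assuming Pillow is available and reusing the sample images from the example above:

>>> from PIL import Image
>>>
>>> img = Image.open('xlip/1.jpg')
>>> emb = get_clip_image_embedding([img, 'xlip/2.jpg'])  # PIL objects and paths can be mixed
>>> emb.shape, emb.dtype
((2, 512), dtype('float32'))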

classify_with_clip

realutils.metrics.clip.classify_with_clip(images: str | PathLike | bytes | bytearray | BinaryIO | Image | List[str | PathLike | bytes | bytearray | BinaryIO | Image] | Tuple[str | PathLike | bytes | bytearray | BinaryIO | Image, ...] | ndarray, texts: List[str] | str | ndarray, model_name: str = 'openai/clip-vit-base-patch32', fmt='predictions')[source]

Perform classification using CLIP model by comparing image and text embeddings.

Parameters:
  • images (Union[MultiImagesTyping, numpy.ndarray]) – Input images or pre-computed image embeddings

  • texts (Union[List[str], str, numpy.ndarray]) – Input texts or pre-computed text embeddings

  • model_name (str) – Name of the CLIP model to use

  • fmt – Output format (‘predictions’, ‘similarities’, or ‘logits’)

Returns:

Classification results, depending on the fmt parameter

Example:
>>> from realutils.metrics.clip import classify_with_clip
>>>
>>> classify_with_clip(
...     images=[
...         'xlip/1.jpg',
...         'xlip/2.jpg'
...     ],
...     texts=[
...         'a photo of a cat',
...         'a photo of a dog',
...         'a photo of a human',
...     ],
... )
array([[0.98039913, 0.00506729, 0.01453355],
       [0.05586662, 0.02006196, 0.92407143]], dtype=float32)
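
With the default fmt='predictions', the returned array has one row per input image and one column per candidate text, so the best-matching label for each image is a plain argmax over the text axis. A short follow-up to the example above (fmt also accepts 'similarities' and 'logits' for other output forms):

>>> labels = ['a photo of a cat', 'a photo of a dog', 'a photo of a human']
>>> preds = classify_with_clip(images=['xlip/1.jpg', 'xlip/2.jpg'], texts=labels)
>>> [labels[i] for i in preds.argmax(axis=1)]
['a photo of a cat', 'a photo of a human']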