core.datasets.retro.config.bert_embedders#

Container dataclass for holding both in-memory and on-disk Bert embedders.

Module Contents#

Classes#

Embedder

Base class for all Bert embedders.

RetroBertEmbedders

Container dataclass for in-memory and on-disk Bert embedders.

API#

class core.datasets.retro.config.bert_embedders.Embedder#

Bases: abc.ABC

Base class for all Bert embedders.

Every embedder must be able to embed either an entire text dataset (into a 2D numpy array) or a single text string (into a 1D numpy array).

abstractmethod embed_text_dataset(text_dataset: torch.utils.data.Dataset) → numpy.ndarray#

Embed a text dataset.

Parameters:

text_dataset (torch.utils.data.Dataset) – Text dataset to embed. Each sample of the dataset should be a dict with a 'text' key whose value is a string.

Returns:

A 2D ndarray with shape (len(text_dataset), dimension(embedder)).

abstractmethod embed_text(text: str) → numpy.ndarray#

Embed a single string of text.

Parameters:

text (str) – A single text sample.

Returns:

A 1D ndarray with shape (dimension(embedder),).
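The two abstract methods above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Embedder` class is a local stand-in that mirrors the documented interface rather than the library class, `HashEmbedder` is a hypothetical toy that derives pseudo-embeddings from a hash instead of running a Bert model, and `ListTextDataset` is a duck-typed map-style dataset used in place of a `torch.utils.data.Dataset` subclass so the example needs only numpy.

```python
from abc import ABC, abstractmethod
import hashlib

import numpy as np


class Embedder(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def embed_text_dataset(self, text_dataset) -> np.ndarray:
        """Return a 2D array of shape (len(text_dataset), dimension)."""

    @abstractmethod
    def embed_text(self, text: str) -> np.ndarray:
        """Return a 1D array of shape (dimension,)."""


class HashEmbedder(Embedder):
    """Hypothetical toy embedder: deterministic vectors from a SHA-256 digest."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:  # a SHA-256 digest provides 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed_text(self, text: str) -> np.ndarray:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dim` bytes of the digest into [0, 1].
        raw = np.frombuffer(digest[: self.dim], dtype=np.uint8)
        return raw.astype(np.float32) / 255.0

    def embed_text_dataset(self, text_dataset) -> np.ndarray:
        # Each sample is expected to be a dict with a 'text' key.
        rows = [
            self.embed_text(text_dataset[i]["text"])
            for i in range(len(text_dataset))
        ]
        if not rows:
            return np.empty((0, self.dim), dtype=np.float32)
        return np.stack(rows)


class ListTextDataset:
    """Hypothetical map-style dataset whose samples are {'text': str} dicts."""

    def __init__(self, texts):
        self.texts = list(texts)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {"text": self.texts[idx]}


embedder = HashEmbedder(dim=8)
dataset = ListTextDataset(["hello world", "retro", "bert"])
matrix = embedder.embed_text_dataset(dataset)  # 2D: one row per sample
vector = embedder.embed_text("retro")          # 1D: a single embedding
```

In this sketch the dataset path simply reuses `embed_text` row by row, so the second row of `matrix` equals `vector`. A real embedder would more likely batch the dataset through the model.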

class core.datasets.retro.config.bert_embedders.RetroBertEmbedders#

Container dataclass for in-memory and on-disk Bert embedders.

disk: core.datasets.retro.config.bert_embedders.Embedder#

None

mem: core.datasets.retro.config.bert_embedders.Embedder#

None
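As a rough illustration of how such a container might be used, the sketch below defines a local stand-in dataclass with the same two fields. `ZeroEmbedder` is a hypothetical stub, and which embedder suits which workload (here `mem` for a single query and `disk` for a whole corpus) is an assumption of this example, not something this page specifies.

```python
from dataclasses import dataclass

import numpy as np


class ZeroEmbedder:
    """Hypothetical stub exposing the two documented Embedder methods."""

    def __init__(self, dim: int, label: str):
        self.dim = dim
        self.label = label

    def embed_text(self, text: str) -> np.ndarray:
        return np.zeros(self.dim, dtype=np.float32)

    def embed_text_dataset(self, text_dataset) -> np.ndarray:
        return np.zeros((len(text_dataset), self.dim), dtype=np.float32)


@dataclass
class RetroBertEmbedders:
    """Local stand-in mirroring the documented fields (not the library class)."""

    disk: ZeroEmbedder  # documented type: Embedder
    mem: ZeroEmbedder   # documented type: Embedder


embedders = RetroBertEmbedders(
    disk=ZeroEmbedder(4, "disk"),
    mem=ZeroEmbedder(4, "mem"),
)

# Assumed usage: the in-memory embedder for a single query string,
# the on-disk embedder for a full dataset of {'text': str} samples.
query_vec = embedders.mem.embed_text("query")
corpus = [{"text": "a"}, {"text": "b"}]
corpus_mat = embedders.disk.embed_text_dataset(corpus)
```

Grouping both embedders in one dataclass lets calling code accept a single object and pick the appropriate embedder by attribute name.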