Fastai transforms

We directly copied and pasted part of the transforms.py module from an old version of the fastai library, because pytorch_widedeep only needs the Tokenizer and Vocab classes defined there. This way we avoid an extra dependency. Credit for all the code in the fastai_transforms module of this pytorch-widedeep package goes to Jeremy Howard and the fastai team. I include the documentation here only for completeness, but I strongly advise the user to read the fastai documentation.

Tokenizer

Class to combine a series of rules and a tokenizer function to tokenize text with multiprocessing.

Setting some of the parameters of this class may require some familiarity with the source code.

Parameters:

tok_func: Callable, default = SpacyTokenizer
    Tokenizer object. See pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer
lang: str, default = 'en'
    Text's language
pre_rules: Optional[ListRules], default = None
    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the text (str) directly, as rule(t), before it is tokenized.
post_rules: Optional[ListRules], default = None
    Custom type: Collection[Callable[[str], str]]. These are Callable objects that will be applied to the tokens, as rule(tokens), after the text has been tokenized.
special_cases: Optional[Collection[str]], default = None
    Special cases to be added to the tokenizer via Spacy's add_special_case method
n_cpus: Optional[int], default = None
    Number of CPUs to use during the tokenization process
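For example, a custom pre-processing rule can be passed to the constructor. The snippet below is a minimal sketch: remove_urls is a hypothetical rule written for illustration, and note that passing pre_rules replaces the default rules rather than extending them.

import re

from pytorch_widedeep.utils import Tokenizer

def remove_urls(t: str) -> str:
    # hypothetical pre-rule: strip URLs from the raw text before it is tokenized
    return re.sub(r"https?://\S+", "", t)

# pre_rules replaces the default pre-processing rules; n_cpus=1 keeps the
# tokenization in a single process
tok = Tokenizer(pre_rules=[remove_urls], n_cpus=1)
tokens = tok.process_all(["Check https://example.com for details"])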
Source code in pytorch_widedeep/utils/fastai_transforms.py
class Tokenizer:
    r"""Class to combine a series of rules and a tokenizer function to tokenize
    text with multiprocessing.

    Setting some of the parameters of this class require perhaps some
    familiarity with the source code.

    Parameters
    ----------
    tok_func: Callable, default = ``SpacyTokenizer``
        Tokenizer Object. See `pytorch_widedeep.utils.fastai_transforms.SpacyTokenizer`
    lang: str, default = "en"
        Text's Language
    pre_rules: ListRules, Optional, default = None
        Custom type: ``Collection[Callable[[str], str]]``. These are
        `Callable` objects that will be applied to the text (str) directly as
        `rule(tok)` before being tokenized.
    post_rules: ListRules, Optional, default = None
        Custom type: ``Collection[Callable[[str], str]]``. These are
        `Callable` objects that will be applied to the tokens as
        `rule(tokens)` after the text has been tokenized.
    special_cases: Collection, Optional, default= None
        special cases to be added to the tokenizer via ``Spacy``'s
        ``add_special_case`` method
    n_cpus: int, Optional, default = None
        number of CPUs to use during the tokenization process
    """

    def __init__(
        self,
        tok_func: Callable = SpacyTokenizer,
        lang: str = "en",
        pre_rules: Optional[ListRules] = None,
        post_rules: Optional[ListRules] = None,
        special_cases: Optional[Collection[str]] = None,
        n_cpus: Optional[int] = None,
    ):
        self.tok_func, self.lang, self.special_cases = tok_func, lang, special_cases
        self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules)
        self.post_rules = ifnone(post_rules, defaults.text_post_rules)
        self.special_cases = (
            special_cases if special_cases is not None else defaults.text_spec_tok
        )
        self.n_cpus = ifnone(n_cpus, defaults.cpus)

    def __repr__(self) -> str:
        res = f"Tokenizer {self.tok_func.__name__} in {self.lang} with the following rules:\n"
        for rule in self.pre_rules:
            res += f" - {rule.__name__}\n"
        for rule in self.post_rules:
            res += f" - {rule.__name__}\n"
        return res

    def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
        r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

        Parameters
        ----------
        t: str
            text to be processed and tokenized
        tok: ``BaseTokenizer``
            Instance of `BaseTokenizer`. See
            `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

        Returns
        -------
        List[str]
            List of tokens
        """
        for rule in self.pre_rules:
            t = rule(t)
        toks = tok.tokenizer(t)
        for rule in self.post_rules:
            toks = rule(toks)
        return toks

    def _process_all_1(self, texts: Collection[str]) -> List[List[str]]:
        """Process a list of ``texts`` in one process."""

        tok = self.tok_func(self.lang)
        if self.special_cases:
            tok.add_special_cases(self.special_cases)
        return [self.process_text(str(t), tok) for t in texts]

    def process_all(self, texts: Collection[str]) -> List[List[str]]:
        r"""Process a list of texts. Parallel execution of ``process_text``.

        Examples
        --------
        >>> from pytorch_widedeep.utils import Tokenizer
        >>> texts = ['Machine learning is great', 'but building stuff is even better']
        >>> tok = Tokenizer()
        >>> tok.process_all(texts)
        [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

        :information_source: **NOTE**:
        Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
        next word begins with a capital in the original text. For more
        details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

        Returns
        -------
        List[List[str]]
            List containing lists of tokens. One list per "_document_"

        """

        if self.n_cpus <= 1:
            return self._process_all_1(texts)

        with Pool(self.n_cpus) as p:
            partitioned_texts = partition_by_cores(texts, self.n_cpus)
            results = p.map(self._process_all_1, partitioned_texts)
            res = sum(results, [])
        return res

process_text

process_text(t, tok)

Process and tokenize one text t with tokenizer tok.

Parameters:

t: str
    text to be processed and tokenized
tok: BaseTokenizer
    Instance of BaseTokenizer. See pytorch_widedeep.utils.fastai_transforms.BaseTokenizer

Returns:

List[str]
    List of tokens
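process_text can also be called directly on a single document by passing a tokenizer instance explicitly. A minimal sketch, assuming SpacyTokenizer can be instantiated with just a language code (as process_all does internally):

from pytorch_widedeep.utils import Tokenizer
from pytorch_widedeep.utils.fastai_transforms import SpacyTokenizer

tok = Tokenizer()
spacy_tok = SpacyTokenizer("en")  # a BaseTokenizer instance; process_all builds one internally per process
tokens = tok.process_text("Machine learning is great", spacy_tok)
# tokens is a list of str, e.g. ['xxmaj', 'machine', 'learning', 'is', 'great']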

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_text(self, t: str, tok: BaseTokenizer) -> List[str]:
    r"""Process and tokenize one text ``t`` with tokenizer ``tok``.

    Parameters
    ----------
    t: str
        text to be processed and tokenized
    tok: ``BaseTokenizer``
        Instance of `BaseTokenizer`. See
        `pytorch_widedeep.utils.fastai_transforms.BaseTokenizer`

    Returns
    -------
    List[str]
        List of tokens
    """
    for rule in self.pre_rules:
        t = rule(t)
    toks = tok.tokenizer(t)
    for rule in self.post_rules:
        toks = rule(toks)
    return toks

process_all

process_all(texts)

Process a list of texts. Parallel execution of process_text.

Examples:

>>> from pytorch_widedeep.utils import Tokenizer
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tok = Tokenizer()
>>> tok.process_all(texts)
[['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

ℹ️ NOTE: note the token TK_MAJ (xxmaj), used to indicate that the next word begins with a capital in the original text. For more details on special tokens, please see the fastai docs: https://docs.fast.ai/text.core.html#Tokenizing

Returns:

List[List[str]]
    List containing lists of tokens. One list per "document"
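When n_cpus is 1 (or lower) the texts are processed in a single process, which can be convenient where multiprocessing is costly or problematic (e.g. inside some notebooks). A minimal sketch:

from pytorch_widedeep.utils import Tokenizer

texts = ['Machine learning is great', 'but building stuff is even better']

# n_cpus=1 skips the multiprocessing Pool and tokenizes everything in-process
tokens = Tokenizer(n_cpus=1).process_all(texts)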

Source code in pytorch_widedeep/utils/fastai_transforms.py
def process_all(self, texts: Collection[str]) -> List[List[str]]:
    r"""Process a list of texts. Parallel execution of ``process_text``.

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tok = Tokenizer()
    >>> tok.process_all(texts)
    [['xxmaj', 'machine', 'learning', 'is', 'great'], ['but', 'building', 'stuff', 'is', 'even', 'better']]

    :information_source: **NOTE**:
    Note the token ``TK_MAJ`` (`xxmaj`), used to indicate the
    next word begins with a capital in the original text. For more
    details of special tokens please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    List[List[str]]
        List containing lists of tokens. One list per "_document_"

    """

    if self.n_cpus <= 1:
        return self._process_all_1(texts)

    with Pool(self.n_cpus) as p:
        partitioned_texts = partition_by_cores(texts, self.n_cpus)
        results = p.map(self._process_all_1, partitioned_texts)
        res = sum(results, [])
    return res

Vocab

Contains the correspondence between numbers and tokens.

Parameters:

max_vocab: int
    maximum vocabulary size
min_freq: int
    minimum frequency for a token to be considered
pad_idx: Optional[int], default = None
    padding index. If None, Fastai's Tokenizer leaves the 0 index for the unknown token ('xxunk') and defaults to 1 for the padding token ('xxpad').
special_cases: Optional[Collection[str]], default = None
    special tokens added at the front of the vocabulary (defaults to fastai's special tokens)

Attributes:

itos: Collection
    index to str. Collection of strings that are the tokens of the vocabulary
stoi: defaultdict
    str to index. Dictionary containing the tokens of the vocabulary and their corresponding index
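Because Vocab also exposes fit, transform and inverse_transform aliases (documented below), it can be used in an sklearn-like fashion. A minimal sketch:

from pytorch_widedeep.utils import Tokenizer, Vocab

texts = ['Machine learning is great', 'but building stuff is even better']
tokens = Tokenizer().process_all(texts)

vocab = Vocab(max_vocab=100, min_freq=1).fit(tokens)  # alias of create
ids = vocab.transform(tokens[0])                      # alias of numericalize
text = vocab.inverse_transform(ids)                   # alias of textify

# itos maps indices to tokens and stoi maps tokens to indices
assert vocab.itos[vocab.stoi['xxunk']] == 'xxunk'
assert len(vocab.itos) % 8 == 0  # padded with 'xxfake' up to a multiple of 8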

Source code in pytorch_widedeep/utils/fastai_transforms.py
class Vocab:
    r"""Contains the correspondence between numbers and tokens.

    Parameters
    ----------
    max_vocab: int
        maximum vocabulary size
    min_freq: int
        minimum frequency for a token to be considered
    pad_idx: int, Optional, default = None
        padding index. If `None`, Fastai's Tokenizer leaves the 0 index
        for the unknown token (_'xxunk'_) and defaults to 1 for the padding
        token (_'xxpad'_).

    Attributes
    ----------
    itos: Collection
        `index to str`. Collection of strings that are the tokens of the
        vocabulary
    stoi: defaultdict
        `str to index`. Dictionary containing the tokens of the vocabulary and
        their corresponding index
    """

    def __init__(
        self,
        max_vocab: int,
        min_freq: int,
        pad_idx: Optional[int] = None,
        special_cases: Optional[Collection[str]] = None,
    ):
        self.max_vocab = max_vocab
        self.min_freq = min_freq
        self.pad_idx = pad_idx
        self.special_cases = (
            special_cases if special_cases is not None else defaults.text_spec_tok
        )

    def create(
        self,
        tokens: Tokens,
    ) -> "Vocab":
        r"""Create a vocabulary object from a set of tokens.

        Parameters
        ----------
        tokens: Tokens
            Custom type: ``Collection[Collection[str]]``  see
            `pytorch_widedeep.wdtypes`. Collection of collection of
            strings (e.g. list of tokenized sentences)

        Examples
        --------
        >>> from pytorch_widedeep.utils import Tokenizer, Vocab
        >>> texts = ['Machine learning is great', 'but building stuff is even better']
        >>> tokens = Tokenizer().process_all(texts)
        >>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
        >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
        [10, 11, 9, 12]
        >>> vocab.textify([10, 11, 9, 12])
        'machine learning is great'

        :information_source: **NOTE**:
        Note the many special tokens that ``fastai``'s' tokenizer adds. These
        are particularly useful when building Language models and/or in
        classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

        Returns
        -------
        Vocab
            An instance of a `Vocab` object
        """

        freq = Counter(p for o in tokens for p in o)
        itos = [o for o, c in freq.most_common(self.max_vocab) if c >= self.min_freq]
        for o in reversed(self.special_cases):  # type: ignore[arg-type]
            if o in itos:
                itos.remove(o)
            itos.insert(0, o)

        if self.pad_idx is not None and self.pad_idx != 1:
            itos.remove(PAD)
            itos.insert(self.pad_idx, PAD)
            # get the new 'xxunk' index
            xxunk_idx = np.where([el == "xxunk" for el in itos])[0][0]
        else:
            xxunk_idx = 0

        itos = itos[: self.max_vocab]
        if (
            len(itos) < self.max_vocab
        ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
            while len(itos) % 8 != 0:
                itos.append("xxfake")

        self.itos = itos
        self.stoi = defaultdict(
            lambda: xxunk_idx, {v: k for k, v in enumerate(self.itos)}
        )

        return self

    def fit(
        self,
        tokens: Tokens,
    ) -> "Vocab":
        """
        Calls the `create` method. I simply want to honor fast ai naming, but
        for consistency with the rest of the library I am including a fit method
        """
        return self.create(tokens)

    def numericalize(self, t: Collection[str]) -> List[int]:
        """Convert a list of tokens ``t`` to their ids.

        Returns
        -------
        List[int]
            List of '_numericalised_' tokens
        """
        return [self.stoi[w] for w in t]

    def transform(self, t: Collection[str]) -> List[int]:
        """
        Calls the `numericalize` method. I simply want to honor fast ai naming,
        but for consistency with the rest of the library I am including a
        transform method
        """
        return self.numericalize(t)

    def textify(self, nums: Collection[int], sep=" ") -> Union[str, List[str]]:
        """Convert a list of ``nums`` (or indexes) to their tokens.

        Returns
        -------
        List[str]
            List of tokens
        """
        return (
            sep.join([self.itos[i] for i in nums])
            if sep is not None
            else [self.itos[i] for i in nums]
        )

    def inverse_transform(
        self, nums: Collection[int], sep=" "
    ) -> Union[str, List[str]]:
        """
        Calls the `textify` method. I simply want to honor fast ai naming, but
        for consistency with the rest of the library I am including an
        inverse_transform method
        """
        # I simply want to honor fast ai naming, but for consistency with the
        # rest of the library I am including an inverse_transform method
        return self.textify(nums, sep)

    def __getstate__(self):
        return {"itos": self.itos}

    def __setstate__(self, state: dict):
        self.itos = state["itos"]
        self.stoi = defaultdict(int, {v: k for k, v in enumerate(self.itos)})

create

create(tokens)

Create a vocabulary object from a set of tokens.

Parameters:

tokens: Tokens
    Custom type: Collection[Collection[str]], see pytorch_widedeep.wdtypes. Collection of collections of strings (e.g. a list of tokenized sentences)

Examples:

>>> from pytorch_widedeep.utils import Tokenizer, Vocab
>>> texts = ['Machine learning is great', 'but building stuff is even better']
>>> tokens = Tokenizer().process_all(texts)
>>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
>>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
[10, 11, 9, 12]
>>> vocab.textify([10, 11, 9, 12])
'machine learning is great'

ℹ️ NOTE: note the many special tokens that fastai's tokenizer adds. These are particularly useful when building language models and/or in classification/regression tasks. For more details, please see the fastai docs: https://docs.fast.ai/text.core.html#Tokenizing

Returns:

Vocab
    An instance of a Vocab object
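The pad_idx argument controls where the padding token ends up in the resulting vocabulary. A minimal sketch (it relies on fastai's default special tokens, where 'xxpad' sits at index 1):

from pytorch_widedeep.utils import Tokenizer, Vocab

tokens = Tokenizer().process_all(['Machine learning is great'])

default_vocab = Vocab(max_vocab=100, min_freq=1).create(tokens)
moved_vocab = Vocab(max_vocab=100, min_freq=1, pad_idx=0).create(tokens)

print(default_vocab.stoi['xxpad'])  # 1, the default position
print(moved_vocab.stoi['xxpad'])    # 0, 'xxpad' was moved to the requested index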

Source code in pytorch_widedeep/utils/fastai_transforms.py
def create(
    self,
    tokens: Tokens,
) -> "Vocab":
    r"""Create a vocabulary object from a set of tokens.

    Parameters
    ----------
    tokens: Tokens
        Custom type: ``Collection[Collection[str]]``  see
        `pytorch_widedeep.wdtypes`. Collection of collection of
        strings (e.g. list of tokenized sentences)

    Examples
    --------
    >>> from pytorch_widedeep.utils import Tokenizer, Vocab
    >>> texts = ['Machine learning is great', 'but building stuff is even better']
    >>> tokens = Tokenizer().process_all(texts)
    >>> vocab = Vocab(max_vocab=18, min_freq=1).create(tokens)
    >>> vocab.numericalize(['machine', 'learning', 'is', 'great'])
    [10, 11, 9, 12]
    >>> vocab.textify([10, 11, 9, 12])
    'machine learning is great'

    :information_source: **NOTE**:
    Note the many special tokens that ``fastai``'s' tokenizer adds. These
    are particularly useful when building Language models and/or in
    classification/Regression tasks. Please see the [``fastai`` docs](https://docs.fast.ai/text.core.html#Tokenizing).

    Returns
    -------
    Vocab
        An instance of a `Vocab` object
    """

    freq = Counter(p for o in tokens for p in o)
    itos = [o for o, c in freq.most_common(self.max_vocab) if c >= self.min_freq]
    for o in reversed(self.special_cases):  # type: ignore[arg-type]
        if o in itos:
            itos.remove(o)
        itos.insert(0, o)

    if self.pad_idx is not None and self.pad_idx != 1:
        itos.remove(PAD)
        itos.insert(self.pad_idx, PAD)
        # get the new 'xxunk' index
        xxunk_idx = np.where([el == "xxunk" for el in itos])[0][0]
    else:
        xxunk_idx = 0

    itos = itos[: self.max_vocab]
    if (
        len(itos) < self.max_vocab
    ):  # Make sure vocab size is a multiple of 8 for fast mixed precision training
        while len(itos) % 8 != 0:
            itos.append("xxfake")

    self.itos = itos
    self.stoi = defaultdict(
        lambda: xxunk_idx, {v: k for k, v in enumerate(self.itos)}
    )

    return self

fit

fit(tokens)

Calls the create method. I simply want to honor the fastai naming but, for consistency with the rest of the library, I also include a fit method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def fit(
    self,
    tokens: Tokens,
) -> "Vocab":
    """
    Calls the `create` method. I simply want to honor fast ai naming, but
    for consistency with the rest of the library I am including a fit method
    """
    return self.create(tokens)

numericalize

numericalize(t)

Convert a list of tokens t to their ids.

Returns:

List[int]
    List of 'numericalised' tokens

Source code in pytorch_widedeep/utils/fastai_transforms.py
def numericalize(self, t: Collection[str]) -> List[int]:
    """Convert a list of tokens ``t`` to their ids.

    Returns
    -------
    List[int]
        List of '_numericalised_' tokens
    """
    return [self.stoi[w] for w in t]

transform

transform(t)

Calls the numericalize method. I simply want to honor the fastai naming but, for consistency with the rest of the library, I also include a transform method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def transform(self, t: Collection[str]) -> List[int]:
    """
    Calls the `numericalize` method. I simply want to honor fast ai naming,
    but for consistency with the rest of the library I am including a
    transform method
    """
    return self.numericalize(t)

textify

textify(nums, sep=' ')

Convert a list of nums (or indexes) to their tokens.

Returns:

Union[str, List[str]]
    The tokens joined by sep into a single string, or a list of tokens if sep is None
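The sep argument therefore controls the return type. A minimal, self-contained sketch (the vocabulary matches the one built in the create example above):

from pytorch_widedeep.utils import Tokenizer, Vocab

texts = ['Machine learning is great', 'but building stuff is even better']
vocab = Vocab(max_vocab=18, min_freq=1).create(Tokenizer().process_all(texts))

ids = vocab.numericalize(['machine', 'learning', 'is', 'great'])
vocab.textify(ids)            # 'machine learning is great' (a single string)
vocab.textify(ids, sep=None)  # ['machine', 'learning', 'is', 'great']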

Source code in pytorch_widedeep/utils/fastai_transforms.py
def textify(self, nums: Collection[int], sep=" ") -> Union[str, List[str]]:
    """Convert a list of ``nums`` (or indexes) to their tokens.

    Returns
    -------
    List[str]
        List of tokens
    """
    return (
        sep.join([self.itos[i] for i in nums])
        if sep is not None
        else [self.itos[i] for i in nums]
    )

inverse_transform

inverse_transform(nums, sep=' ')

Calls the textify method. I simply want to honor the fastai naming but, for consistency with the rest of the library, I also include an inverse_transform method.

Source code in pytorch_widedeep/utils/fastai_transforms.py
def inverse_transform(
    self, nums: Collection[int], sep=" "
) -> Union[str, List[str]]:
    """
    Calls the `textify` method. I simply want to honor fast ai naming, but
    for consistency with the rest of the library I am including an
    inverse_transform method
    """
    # I simply want to honor fast ai naming, but for consistency with the
    # rest of the library I am including an inverse_transform method
    return self.textify(nums, sep)