
The load_from_folder module

The load_from_folder module contains the classes needed to load data from disk; they are inspired by the ImageFolder class in the torchvision library. The module is designed with one specific case in mind: a multi-modal dataset with tabular data, images and text, in which the images do not fit in memory and therefore have to be loaded from disk. However, like any other functionality in this library, it offers some flexibility, and some additional cases can also be addressed with it.

For this module to be used, the datasets must be prepared in a certain way:

  1. The tabular data must contain a column with the image names as stored on disk, including the extension (.jpg, .png, etc.).

  2. For the text dataset, the tabular data can contain a column with either the texts themselves or the names of the files containing the texts as stored on disk.

The tabular data itself might or might not fit in memory. If it does not, please see the ChunkPreprocessor utilities in the [preprocessing](preprocessing.md) module and the examples folder in the repo, which illustrate that case. Finally, note that only the csv format is currently supported in that case (more formats coming soon).
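
As a rough illustration of the layout described above, the following sketch writes a small csv file whose rows carry an image file name (with extension) and a text entry. The column names (`price`, `img_name`, `description`, `target`) and the file names are made up for the example, and only the standard library is used.

```python
import csv
import os
import tempfile

# Hypothetical columns: 'img_name' holds the image file names as stored on
# disk (extension included) and 'description' holds the texts themselves.
rows = [
    {"price": 10.5, "img_name": "img_0001.jpg", "description": "a small red chair", "target": 0},
    {"price": 99.0, "img_name": "img_0002.png", "description": "a large oak table", "target": 1},
]

data_dir = tempfile.mkdtemp()
csv_path = os.path.join(data_dir, "train.csv")

with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()  # the csv file is expected to contain headers
    writer.writerows(rows)

# Read the file back to check the layout
with open(csv_path, newline="") as f:
    loaded = list(csv.DictReader(f))
```

If the texts were stored as files instead, the `description` column would hold file names (for example `desc_0001.txt`) rather than the texts.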

TabFromFolder

TabFromFolder(fname, directory=None, target_col=None, preprocessor=None, text_col=None, img_col=None, ignore_target=False, reference=None, verbose=1)

This class is used to load tabular data from disk. The current constraints are:

  1. The only file format supported right now is csv
  2. The csv file must contain headers

For examples, please see the examples folder in the repo.

Parameters:

  • fname (str) –

    the name of the csv file

  • directory (Optional[str], default: None ) –

    the path to the directory where the csv file is located. If None, a TabFromFolder reference object must be provided

  • target_col (Optional[str], default: None ) –

    the name of the target column. If None, a TabFromFolder reference object must be provided

  • preprocessor (Optional[TabularPreprocessor], default: None ) –

    a fitted TabularPreprocessor object. If None, a TabFromFolder reference object must be provided

  • text_col (Optional[str], default: None ) –

    the name of the column with the texts themselves or the names of the files that contain the text dataset. If None, either there is no text column or a TabFromFolder reference object must be provided

  • img_col (Optional[str], default: None ) –

    the name of the column with the names of the images. If None, either there is no image column or a TabFromFolder reference object must be provided

  • ignore_target (bool, default: False ) –

    whether to ignore the target column. This is normally set to True when this class is used for a test dataset.

  • reference (Optional[Any], default: None ) –

    a reference TabFromFolder object. If provided, the TabFromFolder object will be created using the attributes of the reference object. This is useful to instantiate a TabFromFolder object for evaluation or test purposes

  • verbose (Optional[int], default: 1 ) –

    verbosity. If 0, no output will be printed during the process.
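
To make the idea of loading tabular data from disk more concrete, here is a minimal sketch that reads a single row from a csv file with headers without loading the whole file. It is an illustration using only the standard library, not the library's implementation; `read_row` and the demo columns are hypothetical names.

```python
import csv
import os
import tempfile
from itertools import islice
from typing import Dict


def read_row(csv_path: str, idx: int) -> Dict[str, str]:
    """Return the idx-th data row (0-based, header excluded) as a dict."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)  # the first line is consumed as the header
        # Skip 'idx' rows and take the next one, if there is one
        row = next(islice(reader, idx, idx + 1), None)
    if row is None:
        raise IndexError(f"row {idx} is out of range")
    return row


# Tiny demo file with made-up columns
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv")
with open(demo_path, "w", newline="") as f:
    f.write("age,img_name,target\n31,a.jpg,0\n47,b.jpg,1\n")

second_row = read_row(demo_path, 1)
```

All values come back as strings here; in practice a fitted preprocessor would be responsible for turning a raw row into model inputs.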

Source code in pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py
def __init__(
    self,
    fname: str,
    directory: Optional[str] = None,
    target_col: Optional[str] = None,
    preprocessor: Optional[TabularPreprocessor] = None,
    text_col: Optional[str] = None,
    img_col: Optional[str] = None,
    ignore_target: bool = False,
    reference: Optional[Any] = None,  # is Type["TabFromFolder"],
    verbose: Optional[int] = 1,
):
    self.fname = fname
    self.ignore_target = ignore_target
    self.verbose = verbose

    if reference is not None:
        (
            self.directory,
            self.target_col,
            self.preprocessor,
            self.text_col,
            self.img_col,
        ) = self._set_from_reference(reference, preprocessor)
    else:
        assert (
            directory is not None
            and (target_col is not None and not ignore_target)
            and preprocessor is not None
        ), (
            "if no reference is provided, 'directory', 'target_col' and 'preprocessor' "
            "must be provided"
        )

        self.directory = directory
        self.target_col = target_col
        self.preprocessor = preprocessor
        self.text_col = text_col
        self.img_col = img_col

    assert (
        self.preprocessor.is_fitted
    ), "The preprocessor must be fitted before passing it to this class"
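
The reference mechanism in the constructor above can be sketched with a small stand-in class. `FolderLoaderSketch` is hypothetical and much simpler than TabFromFolder: it only shows how settings are either passed explicitly or copied from a previously built object.

```python
from typing import Optional


class FolderLoaderSketch:
    """Hypothetical stand-in used to illustrate the 'reference' pattern."""

    def __init__(
        self,
        fname: str,
        directory: Optional[str] = None,
        target_col: Optional[str] = None,
        reference: Optional["FolderLoaderSketch"] = None,
    ):
        self.fname = fname
        if reference is not None:
            # Shared settings are copied from the reference object
            self.directory = reference.directory
            self.target_col = reference.target_col
        else:
            if directory is None or target_col is None:
                raise ValueError(
                    "if no reference is provided, 'directory' and "
                    "'target_col' must be provided"
                )
            self.directory = directory
            self.target_col = target_col


# The train object is fully specified; the test object only names its file
train = FolderLoaderSketch("train.csv", directory="data", target_col="label")
test = FolderLoaderSketch("test.csv", reference=train)
```

The benefit of this design is that evaluation and test objects cannot drift from the training configuration, since only the file name differs.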

WideFromFolder

WideFromFolder(fname, directory=None, target_col=None, preprocessor=None, text_col=None, img_col=None, ignore_target=False, reference=None, verbose=1)

Bases: TabFromFolder

This class is mostly identical to TabFromFolder. It exists so that the wide and the deep tabular components can be treated separately.

Parameters:

  • fname (str) –

    the name of the csv file

  • directory (Optional[str], default: None ) –

    the path to the directory where the csv file is located. If None, a WideFromFolder reference object must be provided

  • target_col (Optional[str], default: None ) –

    the name of the target column. If None, a WideFromFolder reference object must be provided

  • preprocessor (Optional[TabularPreprocessor], default: None ) –

    a fitted TabularPreprocessor object. If None, a WideFromFolder reference object must be provided

  • text_col (Optional[str], default: None ) –

    the name of the column with the texts themselves or the names of the files that contain the text dataset. If None, either there is no text column or a WideFromFolder reference object must be provided

  • img_col (Optional[str], default: None ) –

    the name of the column with the names of the images. If None, either there is no image column or a WideFromFolder reference object must be provided

  • ignore_target (bool, default: False ) –

    whether to ignore the target column. This is normally set to True when this class is used for a test dataset.

  • reference (Optional[Any], default: None ) –

    a reference WideFromFolder object. If provided, the WideFromFolder object will be created using the attributes of the reference object. This is useful to instantiate a WideFromFolder object for evaluation or test purposes

  • verbose (int, default: 1 ) –

    verbosity. If 0, no output will be printed during the process.

Source code in pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py
def __init__(
    self,
    fname: str,
    directory: Optional[str] = None,
    target_col: Optional[str] = None,
    preprocessor: Optional[TabularPreprocessor] = None,
    text_col: Optional[str] = None,
    img_col: Optional[str] = None,
    ignore_target: bool = False,
    reference: Optional[Any] = None,  # is Type["WideFromFolder"],
    verbose: int = 1,
):
    super(WideFromFolder, self).__init__(
        fname=fname,
        directory=directory,
        target_col=target_col,
        preprocessor=preprocessor,
        text_col=text_col,
        img_col=img_col,
        reference=reference,
        ignore_target=ignore_target,
        verbose=verbose,
    )

TextFromFolder

TextFromFolder(preprocessor)

This class is used to load the text dataset (i.e. the text files) from a folder, or to retrieve the text given a texts column specified within the preprocessor object.

For examples, please see the examples folder in the repo.
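
The two cases mentioned above (the column holds the text itself, or the name of a file that contains it) can be sketched as follows. `resolve_text` is a hypothetical helper written for this illustration; it is not part of the library and does not tokenize or pad anything.

```python
import os
import tempfile


def resolve_text(entry: str, directory: str) -> str:
    """Return the text for one value of the text column.

    Assumption of this sketch: if 'entry' names an existing file inside
    'directory', the file content is returned; otherwise 'entry' is
    treated as the text itself.
    """
    candidate = os.path.join(directory, entry)
    if os.path.isfile(candidate):
        with open(candidate, encoding="utf-8") as f:
            return f.read()
    return entry


# Demo: one text stored as a file, one stored inline in the column
text_dir = tempfile.mkdtemp()
with open(os.path.join(text_dir, "review_1.txt"), "w", encoding="utf-8") as f:
    f.write("great product, fast delivery")

from_file = resolve_text("review_1.txt", text_dir)
inline = resolve_text("arrived broken", text_dir)
```

In the class documented here, the fitted preprocessor passed to the constructor is what carries the information needed to process the retrieved text.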

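Parameters:

  • preprocessor (Union[TextPreprocessor, ChunkTextPreprocessor]) –

    a fitted TextPreprocessor or ChunkTextPreprocessor object.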

Source code in pytorch_widedeep/load_from_folder/text/text_from_folder.py
def __init__(
    self,
    preprocessor: Union[TextPreprocessor, ChunkTextPreprocessor],
):
    assert (
        preprocessor.is_fitted
    ), "The preprocessor must be fitted before using this class"

    self.preprocessor = preprocessor

ImageFromFolder

ImageFromFolder(directory=None, preprocessor=None, loader=default_loader, extensions=None, transforms=None)

This class is used to load the image dataset from disk. It is inspired by the ImageFolder class in the torchvision library, which has simply been adapted here to work within the context of a Wide and Deep multi-modal model.

For examples, please see the examples folder in the repo.

Parameters:

  • directory (Optional[str], default: None ) –

    the path to the directory where the images are located. If None, a preprocessor must be provided.

  • preprocessor (Optional[ImagePreprocessor], default: None ) –

    a fitted ImagePreprocessor object.

  • loader (Callable[[str], Any], default: default_loader ) –

    a function to load a sample given its path.

  • extensions (Optional[Tuple[str, ...]], default: None ) –

    a tuple with the allowed extensions. If None, IMG_EXTENSIONS will be used, where IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp")

  • transforms (Optional[Any], default: None ) –

    a torchvision.transforms object. If None, this class will simply return an array representation of the PIL Image
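
As a small illustration of how a tuple of allowed extensions can be applied, the sketch below checks a file name against it. `has_allowed_extension` is a hypothetical helper, and the case-insensitive comparison is an assumption of this sketch rather than documented behaviour of this class.

```python
from typing import Tuple

# Default extensions as listed in the parameter description above
IMG_EXTENSIONS: Tuple[str, ...] = (
    ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp",
)


def has_allowed_extension(
    filename: str, extensions: Tuple[str, ...] = IMG_EXTENSIONS
) -> bool:
    """Return True if 'filename' ends with one of the allowed extensions."""
    # str.endswith accepts a tuple of suffixes; lower() makes the check
    # case-insensitive (an assumption of this sketch)
    return filename.lower().endswith(extensions)


jpg_ok = has_allowed_extension("Photo_01.JPG")
txt_ok = has_allowed_extension("notes.txt")
```

Passing a narrower tuple, for example `(".png",)`, restricts the check to that format only.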

Source code in pytorch_widedeep/load_from_folder/image/image_from_folder.py
def __init__(
    self,
    directory: Optional[str] = None,
    preprocessor: Optional[ImagePreprocessor] = None,
    loader: Callable[[str], Any] = default_loader,
    extensions: Optional[Tuple[str, ...]] = None,
    transforms: Optional[Any] = None,
) -> None:
    assert (
        directory is not None or preprocessor is not None
    ), "Either a directory or an instance of ImagePreprocessor must be provided"

    if directory is not None and preprocessor is not None:  # pragma: no cover
        assert directory == preprocessor.img_path, (
            "If both 'directory' and 'preprocessor' are provided, the 'img_path' "
            "attribute of the 'preprocessor' must be the same as the 'directory'"
        )

    if directory is not None:
        self.directory = directory
    else:
        assert (
            preprocessor is not None
        ), "Either a directory or an instance of ImagePreprocessor must be provided"
        self.directory = preprocessor.img_path

    self.preprocessor = preprocessor
    self.loader = loader
    self.extensions = extensions if extensions is not None else IMG_EXTENSIONS
    self.transforms = transforms
    if self.transforms:
        self.transforms_names = [
            tr.__class__.__name__ for tr in self.transforms.transforms
        ]
    else:
        self.transforms_names = []

        self.transpose = True

WideDeepDatasetFromFolder

WideDeepDatasetFromFolder(n_samples, tab_from_folder=None, wide_from_folder=None, text_from_folder=None, img_from_folder=None, reference=None)

Bases: Dataset

This class is the Dataset counterpart of the WideDeepDataset class.

Given a reference tabular dataset, with columns that indicate the path to the images and to the text files or the texts themselves, it will use the [...]FromFolder classes to load the data consistently from disk per batch.

For examples, please see the examples folder in the repo.

Parameters:

  • n_samples (int) –

    Number of samples in the dataset

  • tab_from_folder (Optional[TabFromFolder], default: None ) –

    Instance of the TabFromFolder class

  • wide_from_folder (Optional[WideFromFolder], default: None ) –

    Instance of the WideFromFolder class

  • text_from_folder (Optional[TextFromFolder], default: None ) –

    Instance of the TextFromFolder class

  • img_from_folder (Optional[ImageFromFolder], default: None ) –

    Instance of the ImageFromFolder class

  • reference (Optional[Any], default: None ) –

    If not None, the 'text_from_folder' and 'img_from_folder' objects will be retrieved from the reference class. This is useful when we want to use a WideDeepDatasetFromFolder class used for a train dataset as a reference for the validation and test datasets. In this case, the text_from_folder and img_from_folder objects will be the same for all three datasets, so there is no need to create a new instance for each dataset.
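
The way such a dataset can combine the per-modality loaders for one index, and share the text and image loaders through a reference, is sketched below. `DatasetFromFolderSketch` and its loader callables are hypothetical stand-ins; in particular, the shape of the tuple returned by the tabular loader is an assumption of this sketch, not the library's documented interface.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class DatasetFromFolderSketch:
    """Hypothetical stand-in: one index in, one multi-modal sample out."""

    def __init__(
        self,
        n_samples: int,
        tab_loader: Callable[[int], Tuple[Any, str, str, Any]],
        text_loader: Optional[Callable[[str], Any]] = None,
        img_loader: Optional[Callable[[str], Any]] = None,
        reference: Optional["DatasetFromFolderSketch"] = None,
    ):
        if reference is not None:
            # Reuse the text and image loaders of, e.g., the train dataset
            text_loader = reference.text_loader
            img_loader = reference.img_loader
        self.n_samples = n_samples
        self.tab_loader = tab_loader
        self.text_loader = text_loader
        self.img_loader = img_loader

    def __len__(self) -> int:
        return self.n_samples

    def __getitem__(self, idx: int) -> Tuple[Dict[str, Any], Any]:
        # The tabular row tells us which text and image belong to this sample
        features, text_ref, img_name, target = self.tab_loader(idx)
        sample: Dict[str, Any] = {"tab": features}
        if self.text_loader is not None:
            sample["text"] = self.text_loader(text_ref)
        if self.img_loader is not None:
            sample["img"] = self.img_loader(img_name)
        return sample, target


# Made-up rows: (features, text, image name, target)
rows = [([0.1, 0.2], "nice chair", "a.jpg", 0), ([0.3, 0.4], "old table", "b.jpg", 1)]

train_ds = DatasetFromFolderSketch(
    n_samples=2,
    tab_loader=lambda idx: rows[idx],
    text_loader=str.upper,  # placeholder for real text processing
    img_loader=lambda name: f"<pixels of {name}>",  # placeholder for image loading
)
eval_ds = DatasetFromFolderSketch(
    n_samples=1, tab_loader=lambda idx: rows[idx], reference=train_ds
)

sample, target = train_ds[1]
```

Because the images and texts are only touched inside `__getitem__`, nothing beyond the requested sample has to be held in memory.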

Source code in pytorch_widedeep/load_from_folder/wd_dataset_from_folder.py
def __init__(
    self,
    n_samples: int,
    tab_from_folder: Optional[TabFromFolder] = None,
    wide_from_folder: Optional[WideFromFolder] = None,
    text_from_folder: Optional[TextFromFolder] = None,
    img_from_folder: Optional[ImageFromFolder] = None,
    reference: Optional[Any] = None,  # is Type["WideDeepDatasetFromFolder"],
):
    super(WideDeepDatasetFromFolder, self).__init__()

    if tab_from_folder is None and wide_from_folder is None:
        raise ValueError(
            "Either 'tab_from_folder' or 'wide_from_folder' must be not None"
        )

    if reference is not None:
        assert (
            img_from_folder is None and text_from_folder is None
        ), "If reference is not None, 'img_from_folder' and 'text_from_folder' left as None"
        self.text_from_folder, self.img_from_folder = self._get_from_reference(
            reference
        )
    else:
        assert (
            text_from_folder is not None and img_from_folder is not None
        ), "If reference is None, 'img_from_folder' and 'text_from_folder' must be not None"
        self.text_from_folder = text_from_folder
        self.img_from_folder = img_from_folder

    self.n_samples = n_samples
    self.tab_from_folder = tab_from_folder
    self.wide_from_folder = wide_from_folder