The `load_from_folder` module
The `load_from_folder` module contains the classes needed to load data from
disk. They are inspired by the `ImageFolder` class in the `torchvision`
library. The module was designed with one specific case in mind: a
multi-modal dataset with tabular data, images and text, where the images do
not fit in memory and therefore have to be loaded from disk. As with the rest
of this library, there is some flexibility, and some additional cases can
also be addressed with this module.
For this module to be used, the datasets must be prepared in a certain way:
- The tabular data must contain a column with the image names as stored on disk, including the extension (`.jpg`, `.png`, etc.).
- For the text dataset, the tabular data can contain a column with either the texts themselves or the names of the files containing the texts as stored on disk.
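As a rough illustration of the layout described above, the following sketch writes a small csv file whose rows reference images by file name (extension included) and carry the text inline. The column names (`image_name`, `review`, `target`) and the helper `missing_extension` are made up for this example and are not part of the library.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows: the column names are illustrative only
rows = [
    {"age": 34, "image_name": "img_0001.jpg", "review": "great place to stay", "target": 1},
    {"age": 51, "image_name": "img_0002.png", "review": "too noisy at night", "target": 0},
]


def missing_extension(rows, img_col):
    """Return the image names that lack a file extension."""
    return [r[img_col] for r in rows if Path(r[img_col]).suffix == ""]


# Every image name should include its extension before the file is written
assert missing_extension(rows, "image_name") == []

data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "train.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()  # write the header row with the column names
    writer.writerows(rows)

# Read the header back to confirm the layout
with open(csv_path, newline="") as f:
    header = next(csv.reader(f))
print(header)
```

If the texts lived in separate files instead, the `review` column would hold file names in the same way the `image_name` column does.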
The tabular data itself might or might not fit in memory. If it does not,
please see the `ChunkPreprocessor` utilities in the
[preprocessing](preprocessing.md) module and the examples folder in the repo,
which illustrate such a case. Finally, note that only the csv format is
currently supported in that case (more formats coming soon).
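When the tabular data itself is too large for memory, the general idea is to process the csv file in chunks. The sketch below shows that idea using only the standard library. It is not how `ChunkPreprocessor` is implemented, and `iter_csv_chunks` is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(file_obj, chunksize):
    """Yield lists of row dicts, at most `chunksize` rows at a time."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        yield chunk


# In-memory stand-in for a large csv file on disk
fake_file = io.StringIO("id,price\n1,10\n2,20\n3,30\n4,40\n5,50\n")
chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(fake_file, chunksize=2)]
print(chunk_sizes)
```

Only one chunk is held in memory at a time, so statistics needed by a preprocessor would have to be accumulated incrementally across chunks.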
TabFromFolder
TabFromFolder(fname, directory=None, target_col=None, preprocessor=None, text_col=None, img_col=None, ignore_target=False, reference=None, verbose=1)
This class is used to load tabular data from disk. The current constraints are:
- The only file format supported right now is csv
- The csv file must contain headers
For examples, please see the examples folder in the repo.
Parameters:
- `fname` (`str`) – the name of the csv file
- `directory` (`Optional[str]`, default: `None`) – the path to the directory where the csv file is located. If None, a `TabFromFolder` reference object must be provided
- `target_col` (`Optional[str]`, default: `None`) – the name of the target column. If None, a `TabFromFolder` reference object must be provided
- `preprocessor` (`Optional[TabularPreprocessor]`, default: `None`) – a fitted `TabularPreprocessor` object. If None, a `TabFromFolder` reference object must be provided
- `text_col` (`Optional[str]`, default: `None`) – the name of the column with the texts themselves or the names of the files that contain the text dataset. If None, either there is no text column or a `TabFromFolder` reference object must be provided
- `img_col` (`Optional[str]`, default: `None`) – the name of the column with the names of the images. If None, either there is no image column or a `TabFromFolder` reference object must be provided
- `ignore_target` (`bool`, default: `False`) – whether to ignore the target column. This is normally set to True when this class is used for a test dataset.
- `reference` (`Optional[Any]`, default: `None`) – a reference `TabFromFolder` object. If provided, the `TabFromFolder` object will be created using the attributes of the reference object. This is useful to instantiate a `TabFromFolder` object for evaluation or test purposes
- `verbose` (`Optional[int]`, default: `1`) – verbosity. If 0, no output will be printed during the process.
Source code in pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py
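The role of the `reference` argument can be pictured with a small stand-in class. This is only a sketch of the pattern described in the parameter list: `CsvSource` is hypothetical, it does not read any file, and it does not reproduce the library's actual checks.

```python
class CsvSource:
    """Hypothetical stand-in that mimics the `reference` argument."""

    def __init__(self, fname, directory=None, target_col=None,
                 ignore_target=False, reference=None):
        self.fname = fname
        self.ignore_target = ignore_target
        if reference is not None:
            # Reuse the settings of the reference (e.g. training) object
            self.directory = reference.directory
            self.target_col = reference.target_col
        else:
            if directory is None or target_col is None:
                raise ValueError(
                    "directory and target_col are required without a reference"
                )
            self.directory = directory
            self.target_col = target_col


# The training object is fully specified; the test object borrows from it
train_src = CsvSource("train.csv", directory="data", target_col="price")
test_src = CsvSource("test.csv", ignore_target=True, reference=train_src)
print(test_src.directory, test_src.target_col, test_src.ignore_target)
```

With this pattern, only the file name and `ignore_target` change between the training object and the evaluation or test objects.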
WideFromFolder
WideFromFolder(fname, directory=None, target_col=None, preprocessor=None, text_col=None, img_col=None, ignore_target=False, reference=None, verbose=1)
Bases: TabFromFolder
This class is mostly identical to `TabFromFolder`, but it exists because we
want to separate the treatment of the wide and the deep tabular components.
Parameters:
- `fname` (`str`) – the name of the csv file
- `directory` (`Optional[str]`, default: `None`) – the path to the directory where the csv file is located. If None, a `WideFromFolder` reference object must be provided
- `target_col` (`Optional[str]`, default: `None`) – the name of the target column. If None, a `WideFromFolder` reference object must be provided
- `preprocessor` (`Optional[TabularPreprocessor]`, default: `None`) – a fitted `TabularPreprocessor` object. If None, a `WideFromFolder` reference object must be provided
- `text_col` (`Optional[str]`, default: `None`) – the name of the column with the texts themselves or the names of the files that contain the text dataset. If None, either there is no text column or a `WideFromFolder` reference object must be provided
- `img_col` (`Optional[str]`, default: `None`) – the name of the column with the names of the images. If None, either there is no image column or a `WideFromFolder` reference object must be provided
- `ignore_target` (`bool`, default: `False`) – whether to ignore the target column. This is normally set to True when this class is used for a test dataset.
- `reference` (`Optional[Any]`, default: `None`) – a reference `WideFromFolder` object. If provided, the `WideFromFolder` object will be created using the attributes of the reference object. This is useful to instantiate a `WideFromFolder` object for evaluation or test purposes
- `verbose` (`int`, default: `1`) – verbosity. If 0, no output will be printed during the process.
Source code in pytorch_widedeep/load_from_folder/tabular/tabular_from_folder.py
TextFromFolder
TextFromFolder(preprocessor)
This class is used to load the text dataset (i.e. the text files) from a folder, or to retrieve the text given a texts column specified within the preprocessor object.
For examples, please see the examples folder in the repo.
Parameters:
- `preprocessor` (`Union[TextPreprocessor, ChunkTextPreprocessor]`) – the preprocessor used to process the text. It must be fitted before using this class
Source code in pytorch_widedeep/load_from_folder/text/text_from_folder.py
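The two ways of supplying text (the texts themselves, or file names pointing at text files) can be sketched as follows. The function `resolve_text` and its `root_dir` argument are assumptions made for this illustration. In the library, the fitted preprocessor would then process the resolved text, and that step is left out here.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_text(cell: str, root_dir: Optional[Path] = None) -> str:
    """Return the text for one row.

    Assumption for this sketch: when `root_dir` is given, `cell` is a file
    name inside it; otherwise `cell` already holds the text itself.
    """
    if root_dir is None:
        return cell
    return (root_dir / cell).read_text(encoding="utf-8").strip()


# Create a throwaway text file to stand in for the text dataset on disk
text_dir = Path(tempfile.mkdtemp())
(text_dir / "review_1.txt").write_text("clean and quiet\n", encoding="utf-8")

from_file = resolve_text("review_1.txt", root_dir=text_dir)
inline = resolve_text("clean and quiet")
print(from_file == inline)
```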
ImageFromFolder
ImageFromFolder(directory=None, preprocessor=None, loader=default_loader, extensions=None, transforms=None)
This class is used to load the image dataset from disk. It is inspired by
the `ImageFolder` class in the `torchvision` library. Here, we have simply
adapted it to work within the context of a Wide and Deep multi-modal model.
For examples, please see the examples folder in the repo.
Parameters:
- `directory` (`Optional[str]`, default: `None`) – the path to the directory where the images are located. If None, a preprocessor must be provided.
- `preprocessor` (`Optional[ImagePreprocessor]`, default: `None`) – a fitted `ImagePreprocessor` object.
- `loader` (`Callable[[str], Any]`, default: `default_loader`) – a function to load a sample given its path.
- `extensions` (`Optional[Tuple[str, ...]]`, default: `None`) – a tuple with the allowed extensions. If None, `IMG_EXTENSIONS` will be used, where `IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp")`
- `transforms` (`Optional[Any]`, default: `None`) – a `torchvision.transforms` object. If None, this class will simply return an array representation of the PIL Image
Source code in pytorch_widedeep/load_from_folder/image/image_from_folder.py
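To make the `extensions` parameter concrete, here is a minimal sketch of filtering file names against a tuple of allowed extensions. The helper `has_allowed_extension` is hypothetical, and the case-insensitive comparison is an assumption of this sketch rather than a documented behaviour of the class.

```python
# Default extensions as listed in the parameter description above
IMG_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp",
)


def has_allowed_extension(filename, extensions=None):
    """Return True if `filename` ends with one of the allowed extensions."""
    allowed = IMG_EXTENSIONS if extensions is None else extensions
    # Assumption: compare in lower case so that "cat.JPG" is accepted
    return filename.lower().endswith(tuple(allowed))


names = ["cat.JPG", "dog.png", "notes.txt"]
valid = [n for n in names if has_allowed_extension(n)]
only_png = [n for n in names if has_allowed_extension(n, extensions=(".png",))]
print(valid, only_png)
```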
WideDeepDatasetFromFolder
WideDeepDatasetFromFolder(n_samples, tab_from_folder=None, wide_from_folder=None, text_from_folder=None, img_from_folder=None, reference=None)
Bases: Dataset
This class is the Dataset counterpart of the `WideDeepDataset` class.
Given a reference tabular dataset, with columns that indicate the path to
the images and to the text files or the texts themselves, it will use the
`[...]FromFolder` classes to load the data consistently from disk per batch.
For examples, please see the examples folder in the repo.
Parameters:
- `n_samples` (`int`) – number of samples in the dataset
- `tab_from_folder` (`Optional[TabFromFolder]`, default: `None`) – instance of the `TabFromFolder` class
- `wide_from_folder` (`Optional[WideFromFolder]`, default: `None`) – instance of the `WideFromFolder` class
- `text_from_folder` (`Optional[TextFromFolder]`, default: `None`) – instance of the `TextFromFolder` class
- `img_from_folder` (`Optional[ImageFromFolder]`, default: `None`) – instance of the `ImageFromFolder` class
- `reference` (`Optional[Any]`, default: `None`) – if not None, the `text_from_folder` and `img_from_folder` objects will be retrieved from the reference class. This is useful when we want to use a `WideDeepDatasetFromFolder` class used for a train dataset as a reference for the validation and test datasets. In this case, the `text_from_folder` and `img_from_folder` objects will be the same for all three datasets, so there is no need to create a new instance for each dataset.
Source code in pytorch_widedeep/load_from_folder/wd_dataset_from_folder.py
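The way such a dataset ties the loaders together per sample can be sketched with plain Python objects. Everything below is hypothetical (`RowSource`, `UpperCaseText`, `FakeImageLoader`, `DatasetFromDiskSketch`): the stand-ins do no real disk access or preprocessing and only show how a tabular row drives the text and image lookups, and how a reference dataset shares its text and image loaders.

```python
class RowSource:
    """Hypothetical tabular source: returns features plus text/image keys."""

    def __init__(self, rows):
        self.rows = rows

    def get_item(self, idx):
        row = self.rows[idx]
        return row["features"], row["text"], row["image_name"], row["target"]


class UpperCaseText:
    """Hypothetical text loader; a real one would tokenize instead."""

    def get_item(self, text):
        return text.upper()


class FakeImageLoader:
    """Hypothetical image loader; a real one would read the file from disk."""

    def get_item(self, image_name):
        return f"<pixels of {image_name}>"


class DatasetFromDiskSketch:
    def __init__(self, n_samples, tab_source, text_loader=None,
                 img_loader=None, reference=None):
        self.n_samples = n_samples
        self.tab_source = tab_source
        if reference is not None:
            # Share the text and image loaders of the reference dataset
            self.text_loader = reference.text_loader
            self.img_loader = reference.img_loader
        else:
            self.text_loader = text_loader
            self.img_loader = img_loader

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # The tabular row tells us which text and which image to load
        features, text, image_name, target = self.tab_source.get_item(idx)
        sample = {"tab": features}
        if self.text_loader is not None:
            sample["text"] = self.text_loader.get_item(text)
        if self.img_loader is not None:
            sample["image"] = self.img_loader.get_item(image_name)
        return sample, target


train_rows = [{"features": [0.1, 0.2], "text": "nice view",
               "image_name": "a.jpg", "target": 1}]
valid_rows = [{"features": [0.3, 0.4], "text": "cold room",
               "image_name": "b.jpg", "target": 0}]

train_ds = DatasetFromDiskSketch(1, RowSource(train_rows),
                                 UpperCaseText(), FakeImageLoader())
valid_ds = DatasetFromDiskSketch(1, RowSource(valid_rows), reference=train_ds)

sample, target = valid_ds[0]
print(sample, target)
```

Because each sample is assembled on demand from its index, only the rows, texts and images of the current batch need to be in memory at any time.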