This is the first of a series of posts introducing pytorch-widedeep, which is intended to be a flexible package for using Deep Learning (hereafter DL) with tabular data and for combining it with text and images via wide and deep models. pytorch-widedeep is partially based on the 2016 paper by Heng-Tze Cheng et al. [1].

In this post I describe the data preprocessing functionalities of the library, the main components of the model, and the basic use of the library. In a separate post I will show a more advanced use of pytorch-widedeep.

Before I move any further I just want to emphasize that there are a number of libraries that implement functionalities to use DL on tabular data. To cite a few: the ubiquitous and fantastic FastAI (and its tabular API), NVIDIA's NVTabular, the powerful pytorch-tabnet, based on the work of Sercan O. Arik and Tomas Pfister [2], which is starting to take victories in Kaggle competitions, and, perhaps my favourite, AutoGluon Tabular [3].

It is not my intention to "compete" against these libraries. pytorch-widedeep started as an attempt to package and automate an algorithm I had to use a couple of times at work, and ended up turning into the entertaining process that is building a library. Needless to say, if you want to apply DL to tabular data you should go and check all the libraries I mentioned before, as well as this one πŸ™‚ (you can find the source code here).

1. Installation

To install the package simply use pip:

pip install pytorch-widedeep

or directly from GitHub:

pip install git+https://github.com/jrzaurin/pytorch-widedeep.git

Important note for Mac Users

Note that the following comments are not directly related to the package, but to the interplay between pytorch and OSX (more precisely, pytorch's dependency on OpenMP, I believe) and, in general, to parallel processing on Macs.

In the first place, at the time of writing the latest pytorch version is 1.7. This version is known to have some issues when running on Mac, and the data-loaders might not run in parallel.

On the other hand, since Python 3.8 the default start method of the multiprocessing library on macOS changed from 'fork' to 'spawn'. This also affects the data-loaders (for any torch version) and they will not run in parallel.
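
If you want to check what your interpreter is using, the standard library makes this easy (just a sanity check, unrelated to pytorch-widedeep itself):

import multiprocessing

# on macOS with Python >= 3.8 this prints 'spawn'; on Linux (and on older
# Python versions on macOS) it prints 'fork'
print(multiprocessing.get_start_method())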

Therefore, for Mac users I suggest using Python 3.7 and torch <= 1.6 (with its corresponding torchvision version, i.e. <= 0.7.0). I could have enforced this versioning via the setup.py file. However, there are a number of unknowns and I preferred to leave it as it is. For example, I developed the package using macOS Catalina, and maybe some of these issues are not present in the new release, Big Sur. Also, I hope that a patch for pytorch 1.7 is released soon and some, if not all, of these problems disappear.

Installing pytorch-widedeep via pip will install the latest version. Therefore, if these problems are present and the dataloaders do not run in parallel, one can easily downgrade manually:

pip install torch==1.6.0 torchvision==0.7.0

None of these issues affect Linux users.

2. pytorch-widedeep architectures

In general terms, pytorch-widedeep is a package to use deep learning with tabular data. In particular, it is intended to facilitate the combination of text and images with corresponding tabular data using wide and deep models. With that in mind, there are a number of architectures that can be implemented with just a few lines of code. The main components of those architectures are shown in the Figure below:

The dashed boxes in the figure represent optional, overall components, and the dashed lines/arrows indicate the corresponding connections, depending on whether or not certain components are present. For example, the dashed blue arrows indicate that the deeptabular, deeptext and deepimage components are connected directly to the output neuron or neurons (depending on whether we are performing a binary classification or regression, or a multi-class classification) if the optional deephead is not present. Finally, the components within the faded-pink rectangle are concatenated.

Note that it is not possible to illustrate every architecture and component available in pytorch-widedeep in one Figure. This is why I wrote "overall components" above: within the components represented by the boxes there are a number of options as well. Therefore, for more details on the possible architectures (and more) please see the documentation, or the Examples folder and the notebooks in the repo.

In math terms, and following the notation in the paper, the expression for the architecture without a deephead component can be formulated as:

$$ preds = \sigma(W^{T}_{wide}[x, \phi(x)] + W^{T}_{deeptabular}a^{(l_f)}_{tab} + W^{T}_{deeptext}a^{(l_f)}_{text} + W^{T}_{deepimage}a^{(l_f)}_{image} + b) $$

where $W$ are the weight matrices applied to the wide model and to the final activations of the deep models, $a$ are these final activations, and $\phi(x)$ are the cross product transformations of the original features $x$. In case you are wondering what "cross product transformations" are, here is a quote taken directly from the paper: "For binary features, a cross-product transformation (e.g., β€œAND(gender=female, language=en)”) is 1 if and only if the constituent features (β€œgender=female” and β€œlanguage=en”) are all 1, and 0 otherwise".
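
To make that definition concrete, here is a tiny sketch of the paper's example in plain pandas. This is just an illustration of the idea, not the library's internal implementation (pytorch-widedeep handles the crossing for you via the crossed_cols argument shown later):

import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "language": ["en", "en", "es"],
})

# AND(gender=female, language=en) is 1 only if both constituent features are 1
df["gender_female_AND_language_en"] = (
    (df["gender"] == "female") & (df["language"] == "en")
).astype(int)
print(df)
# only the first row gets a 1 in the crossed column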

While if there is a deephead component, the previous expression turns into:

$$ preds = \sigma(W^{T}_{wide}[x, \phi(x)] + W^{T}_{deephead}a^{(l_f)}_{deephead} + b) $$

It is important to emphasize that each individual component, wide, deeptabular, deeptext and deepimage, can be used independently and in isolation. For example, one could use only wide, which is simply a linear model. In fact, one of the most interesting offerings in pytorch-widedeep is the deeptabular component, and I intend to write a dedicated post focused on that component alone.

Finally, while I recommend using the wide and deeptabular models in pytorch-widedeep, it is very likely that users will want to use their own models for the deeptext and deepimage components. That is perfectly possible, as long as the custom models have an attribute called output_dim with the size of the last layer of activations, so that WideDeep can be constructed. Again, examples on how to use custom components can be found in the Examples folder in the repo. Just in case, pytorch-widedeep includes standard text (a stack of LSTMs) and image (pre-trained ResNets or a stack of CNNs) models.

3. Quick start (TL;DR)

Maybe I should have started with this section, but I thought that knowing at least the architectures one can build with pytorch-widedeep was kind of necessary. In any case, before diving into the details of the library, let's say that you just want to quickly run one example and get a feel for how pytorch-widedeep works. Let's do so using the adult census dataset.

In this example we will be fitting a model composed of two components: wide and deeptabular.

#collapse-hide
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#collapse-hide
adult = pd.read_csv("data/adult/adult.csv.zip")
adult.columns = [c.replace("-", "_") for c in adult.columns]
adult["income_label"] = (adult["income"].apply(lambda x: ">50K" in x)).astype(int)
adult.drop("income", axis=1, inplace=True)

for c in adult.columns:
    if adult[c].dtype == 'O':
        adult[c] = adult[c].apply(lambda x: "unknown" if x == "?" else x)
        adult[c] = adult[c].str.lower()
adult_train, adult_test = train_test_split(adult, test_size=0.2, stratify=adult.income_label)

adult.head()
age workclass fnlwgt education educational_num marital_status occupation relationship race gender capital_gain capital_loss hours_per_week native_country income_label
0 25 private 226802 11th 7 never-married machine-op-inspct own-child black male 0 0 40 united-states 0
1 38 private 89814 hs-grad 9 married-civ-spouse farming-fishing husband white male 0 0 50 united-states 0
2 28 local-gov 336951 assoc-acdm 12 married-civ-spouse protective-serv husband white male 0 0 40 united-states 1
3 44 private 160323 some-college 10 married-civ-spouse machine-op-inspct husband black male 7688 0 40 united-states 1
4 18 unknown 103497 some-college 10 never-married unknown own-child white female 0 0 30 united-states 0

The following lines are all you need:

from pytorch_widedeep import Trainer
from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy

# define wide, crossed, embedding and continuous columns, and target
wide_cols = ["education", "relationship", "workclass", "occupation", "native_country", "gender"]
cross_cols = [("education", "occupation"), ("native_country", "occupation")]
embed_cols = [("education", 32), ("workclass", 32), ("occupation", 32), ("native_country", 32)]
cont_cols = ["age", "hours_per_week"]
target = adult_train["income_label"].values

# prepare wide component
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preprocessor.fit_transform(adult_train)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

# prepare deeptabular component
tab_preprocessor = TabPreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_tab = tab_preprocessor.fit_transform(adult_train)
deeptabular = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    embed_input=tab_preprocessor.embeddings_input, 
    continuous_cols=cont_cols,
)
                   
# build, compile and fit
model = WideDeep(wide=wide, deeptabular=deeptabular)

# Train
trainer = Trainer(model, objective="binary", metrics=[Accuracy])
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=2, batch_size=256) 

# predict
X_wide_te = wide_preprocessor.transform(adult_test)
X_tab_te = tab_preprocessor.transform(adult_test)
preds = trainer.predict(X_wide=X_wide_te, X_tab=X_tab_te)
epoch 1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 153/153 [00:03<00:00, 43.06it/s, loss=0.428, metrics={'acc': 0.802}] 
epoch 2: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 153/153 [00:03<00:00, 44.41it/s, loss=0.389, metrics={'acc': 0.8217}]
predict: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 39/39 [00:00<00:00, 149.41it/s]
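
Since accuracy_score was imported at the beginning, we can quickly check how the model does on the 20% hold-out set (the exact figure will vary slightly between runs):

# `predict` returns the predicted classes here, so we can score them directly
print(accuracy_score(adult_test["income_label"], preds))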

4. Preprocessors

As you can see in Section 3, and as with any ML algorithm, the data need to be prepared/preprocessed before going through the model. This is handled by the pytorch-widedeep preprocessors. There is one preprocessor per WideDeep model component:

WidePreprocessor
TabPreprocessor
TextPreprocessor
ImagePreprocessor

"Behind the scenes", these preprocessors use a series of helper functions and classes that are in the utils module. Initially I did not intend to "expose" them to the user, but I believe they can be useful for all sorts of preprocessing tasks, even if they are not related to pytorch-widedeep, so I made them available. The utils tools are:

deep_utils.LabelEncoder
text_utils.simple_preprocess
text_utils.get_texts
text_utils.pad_sequences
text_utils.build_embeddings_matrix
fastai_transforms.Tokenizer
fastai_transforms.Vocab
image_utils.SimplePreprocessor
image_utils.AspectAwarePreprocessor

They are accessible directly from utils, e.g.:

from pytorch_widedeep.utils import LabelEncoder

Note that here I will concentrate on the preprocessors. If you want more details on the utils tools, have a look at the source code or read the documentation.

4.1. WidePreprocessor

The wide component of the model is a linear model that, in principle, could be implemented as a linear layer receiving the result of one-hot encoding the categorical columns. However, this is not memory efficient (at all). Therefore, we implement the linear layer as an Embedding layer plus a bias. I will explain it in a bit more detail later. For now, just know that WidePreprocessor simply encodes the categories numerically so that they become the indexes of the lookup table that is the Embedding layer.

from pytorch_widedeep.preprocessing import WidePreprocessor

wide_cols = ['education', 'relationship','workclass','occupation','native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]

wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(adult)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_wide = wide_preprocessor.transform(new_df)
X_wide
array([[  1,  17,  23, ...,  89,  91, 316],
       [  2,  18,  23, ...,  89,  92, 317],
       [  3,  18,  24, ...,  89,  93, 318],
       ...,
       [  2,  20,  23, ...,  90, 103, 323],
       [  2,  17,  23, ...,  89, 103, 323],
       [  2,  21,  29, ...,  90, 115, 324]])
X_wide[0]
array([  1,  17,  23,  32,  47,  89,  91, 316])

Note that the label encoding starts from 1. This is because it is convenient to leave 0 for padding, i.e. unknown categories. Let's take, for example, the first entry:

wide_preprocessor.inverse_transform(X_wide[:1])
education relationship workclass occupation native_country gender education_occupation native_country_occupation
0 11th own-child private machine-op-inspct united-states male 11th-machine-op-inspct united-states-machine-op-inspct

As we can see, wide_preprocessor numerically encodes the wide_cols and the crossed_cols, which can be recovered using the inverse_transform method.

4.2 TabPreprocessor

Simply put, TabPreprocessor label-encodes the categorical columns and normalizes the numerical ones (unless otherwise specified).

from pytorch_widedeep.preprocessing import TabPreprocessor

# cat_embed_cols = [(column_name, embed_dim), ...]
cat_embed_cols = [('education',10), ('relationship',8), ('workclass',10), ('occupation',10),('native_country',10)]
continuous_cols = ["age","hours_per_week"]

tab_preprocessor = TabPreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_tab = tab_preprocessor.fit_transform(adult)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_tab = tab_preprocessor.transform(new_df)
print(X_tab[:5])
[[ 1.          1.          1.          1.          1.         -0.99512893
  -0.03408696]
 [ 2.          2.          1.          2.          1.         -0.04694151
   0.77292975]
 [ 3.          2.          2.          3.          1.         -0.77631645
  -0.03408696]
 [ 4.          2.          1.          1.          1.          0.39068346
  -0.03408696]
 [ 4.          1.          3.          4.          1.         -1.50569139
  -0.84110367]]

Note that, again, the label encoding starts from 1, leaving 0 for padding, i.e. unknown categories.

Behind the scenes, TabPreprocessor uses LabelEncoder, which is simply a custom numerical encoder for categorical features, available via

from pytorch_widedeep.utils import LabelEncoder
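
Although I will not cover the utils in detail here, a minimal sketch of how LabelEncoder could be used on its own is shown below. I am assuming the columns_to_encode argument and the usual fit/transform interface; please check the documentation for the exact signature.

from pytorch_widedeep.utils import LabelEncoder

# stand-alone use: encode a couple of categorical columns of the adult dataset
label_encoder = LabelEncoder(columns_to_encode=["education", "workclass"])
adult_encoded = label_encoder.fit_transform(adult)
print(adult_encoded[["education", "workclass"]].head())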

4.3. TextPreprocessor

This preprocessor returns the tokenized, padded sequences that will be directly "fed" to the deeptext component.

To illustrate the text and image preprocessors I will use a small sample of the Airbnb listing dataset, which you can get here.

airbnb=pd.read_csv("data/airbnb/airbnb_sample.csv")
texts = airbnb.description.tolist()
texts[0]
"My bright double bedroom with a large window has a relaxed feeling! It comfortably fits one or two and is centrally located just two blocks from Finsbury Park. Enjoy great restaurants in the area and easy access to easy transport tubes, trains and buses. Babies and children of all ages are welcome. Hello Everyone, I'm offering my lovely double bedroom in Finsbury Park area (zone 2) for let in a shared apartment.  You will share the apartment with me and it is fully furnished with a self catering kitchen. Two people can easily sleep well as the room has a queen size bed. I also have a travel cot for a baby for guest with small children.  I will require a deposit up front as a security gesture on both our parts and will be given back to you when you return the keys.  I trust anyone who will be responding to this add would treat my home with care and respect .  Best Wishes  Alina Guest will have access to the self catering kitchen and bathroom. There is the flat is equipped wifi internet,"
from pytorch_widedeep.preprocessing import TextPreprocessor

text_preprocessor = TextPreprocessor(text_col='description')
X_text = text_preprocessor.fit_transform(airbnb)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_text = text_preprocessor.transform(new_df)
The vocabulary contains 2192 tokens
print(X_text[0])
[  29   48   37  367  818   17  910   17  177   15  122  349   53  879
 1174  126  393   40  911    0   23  228   71  819    9   53   55 1380
  225   11   18  308   18 1564   10  755    0  942  239   53   55    0
   11   36 1013  277 1974   70   62   15 1475    9  943    5  251    5
    0    5    0    5  177   53   37   75   11   10  294  726   32    9
   42    5   25   12   10   22   12  136  100  145]

TextPreprocessor uses the utilities within the text_utils and fastai_transforms modules. Again, all the utilities within those modules are directly accessible from utils, e.g.:

from pytorch_widedeep.utils import simple_preprocess, pad_sequences, build_embeddings_matrix, Tokenizer, Vocab

4.4 ImagePreprocessor

Finally, ImagePreprocessor simply resizes the images while being aware of the aspect ratio. By default they will be resized to (224, 224, ...). This is because the default deepimage component of the model is a pre-trained ResNet, which requires inputs with a height and width of 224.

Let's have a look

from pytorch_widedeep.preprocessing import ImagePreprocessor

image_preprocessor = ImagePreprocessor(img_col='id', img_path="data/airbnb/property_picture/")
X_images = image_preprocessor.fit_transform(airbnb)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_images = image_preprocessor.transform(new_df)
Reading Images from data/airbnb/property_picture/
  4%|▍         | 41/1001 [00:00<00:02, 396.72it/s]
Resizing
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1001/1001 [00:02<00:00, 354.70it/s]
Computing normalisation metrics
X_images[0].shape
(224, 224, 3)

ImagePreprocessor uses two helpers: SimplePreprocessor and AspectAwarePreprocessor, available from the utils module, e.g.:

from pytorch_widedeep.utils import SimplePreprocessor, AspectAwarePreprocessor

These two classes are directly taken from Adrian Rosebrock's fantastic book "Deep Learning for Computer Vision". Therefore, all credit to Adrian.
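
For intuition, this is roughly what "aspect aware" resizing means, sketched here with plain OpenCV (a conceptual illustration, not the library's actual implementation): resize along the shorter side and then center-crop, so the image is never distorted.

import cv2

def aspect_aware_resize(image, width=224, height=224):
    # resize along the smaller dimension to preserve the aspect ratio...
    h, w = image.shape[:2]
    if w < h:
        new_h = int(h * width / float(w))
        image = cv2.resize(image, (width, new_h))
    else:
        new_w = int(w * height / float(h))
        image = cv2.resize(image, (new_w, height))
    # ...and then center-crop to the target size
    h, w = image.shape[:2]
    top, left = (h - height) // 2, (w - width) // 2
    return image[top:top + height, left:left + width]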

5. Model Components

Let's now have a look at the components that can be used to build a wide and deep model. The 5 main components of WideDeep are:

wide
deeptabular
deeptext
deepimage
deephead

The first 4 will be collected and combined by the WideDeep class, while the 5th one can be optionally added to the WideDeep model through its corresponding parameters: deephead or alternatively head_layers, head_dropout and head_batchnorm.

5.1. wide

The wide component is a Linear layer "plugged" into the output neuron(s).

The only particularity of our implementation is that we have implemented the linear layer via an Embedding layer plus a bias. While the implementations are equivalent, the latter is faster and far more memory efficient, since we do not need to one hot encode the categorical features.

Let's have a look:

import torch
import pandas as pd
import numpy as np

from torch import nn
df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
df.head()
color size
0 r s
1 b n
2 g l

One-hot encoded, the first observation (color: r, size: s) would be:

obs_0_oh = (np.array([1., 0., 0., 1., 0., 0.])).astype('float32')

If we simply numerically encode (or label encode) the values:

obs_0_le = (np.array([0, 3])).astype('int64')

Note that in the implementation of the package we start from 1, saving 0 for padding, i.e. unseen values.

Now, let's see if the two implementations are equivalent

# we have 6 different values. Let's assume we are performing a regression, so pred_dim = 1
lin = nn.Linear(6, 1)
emb = nn.Embedding(6, 1) 
emb.weight = nn.Parameter(lin.weight.reshape_as(emb.weight))
lin(torch.tensor(obs_0_oh))
tensor([0.0656], grad_fn=<AddBackward0>)
emb(torch.tensor(obs_0_le)).sum() + lin.bias
tensor([0.0656], grad_fn=<AddBackward0>)

And this is precisely how the linear component, Wide, is implemented:

from pytorch_widedeep.models import Wide
wide = Wide(wide_dim=10, pred_dim=1)
wide
Wide(
  (wide_linear): Embedding(11, 1, padding_idx=0)
)

Again, let me emphasize that even though the input dim is 10, the Embedding layer has 11 weights. This is because we save 0 for padding, which is used for unseen values during the encoding process.
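
A quick check with plain PyTorch (not the library itself) of what that padding index does:

import torch
from torch import nn

emb = nn.Embedding(11, 1, padding_idx=0)
# index 0 (reserved for padding / unseen categories) maps to a zero weight,
# so it contributes nothing to the linear predictor
print(emb(torch.tensor([0])))
# any other index returns its (learnable) weight
print(emb(torch.tensor([1])))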

5.2. deeptabular

There are 3 alternatives for the so-called deeptabular component of the model: TabMlp, TabResnet and the TabTransformer:

  1. TabMlp: this is almost identical to the tabular model in the fantastic fastai library, and consists simply of embeddings representing the categorical features, concatenated with the continuous features, which are then passed through an MLP.

  2. TabResnet: this is similar to the previous model, but the embeddings are passed through a series of ResNet blocks built with dense layers.

  3. TabTransformer: Details on the TabTransformer can be found in: TabTransformer: Tabular Data Modeling Using Contextual Embeddings

For details on these 3 models and their options please see the examples in the Examples folder and the documentation.

Throughout the development of the package, the deeptabular component became one of its core values. The possibilities are numerous, and therefore I will describe this component in detail in a separate post.

For now, let's have a quick look.

Let's start with TabMlp:

from pytorch_widedeep.models import TabMlp

# fake dataset
X_tab = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
column_idx = {k:v for v,k in enumerate(colnames)}
continuous_cols = ['e']

# my advice would be not to use dropout in the last layer, but I add the option because you never
# know... there are crazy people everywhere.
tabmlp = TabMlp(
    mlp_hidden_dims=[16,8], 
    mlp_dropout=[0.5, 0.], 
    mlp_batchnorm=True, 
    mlp_activation="leaky_relu",
    column_idx=column_idx,
    embed_input=embed_input, 
    continuous_cols=continuous_cols)
tabmlp
TabMlp(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 8, padding_idx=0)
    (emb_layer_b): Embedding(5, 8, padding_idx=0)
    (emb_layer_c): Embedding(5, 8, padding_idx=0)
    (emb_layer_d): Embedding(5, 8, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (tab_mlp): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): BatchNorm1d(33, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.5, inplace=False)
        (2): Linear(in_features=33, out_features=16, bias=False)
        (3): LeakyReLU(negative_slope=0.01, inplace=True)
      )
      (dense_layer_1): Sequential(
        (0): Linear(in_features=16, out_features=8, bias=True)
        (1): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
  )
)
tabmlp(X_tab)
tensor([[-2.0658e-03,  5.0888e-01,  2.1883e-01, -3.1523e-03, -3.2836e-03,
          8.3450e-02, -3.4315e-03, -8.6029e-04],
        [-2.8116e-03,  2.1922e-01,  5.0364e-01, -1.3522e-03, -9.8741e-04,
         -1.2356e-03, -1.4323e-03,  2.7542e-03],
        [ 1.1020e-01,  4.0867e-01,  4.3776e-01,  3.1146e-03,  2.7392e-01,
         -1.2640e-02,  1.2793e-02,  5.7851e-01],
        [-4.4498e-03,  2.0174e-01,  1.1082e+00,  2.3353e-01, -1.9922e-05,
         -4.9581e-03,  6.1367e-01,  9.4608e-01],
        [-5.7167e-03,  2.7813e-01,  7.8706e-01, -3.6171e-03,  1.5563e-01,
         -1.1303e-02, -7.6483e-04,  5.0236e-01]], grad_fn=<LeakyReluBackward1>)

Let's now have a look at TabResnet:

from pytorch_widedeep.models import TabResnet

tabresnet = TabResnet(
    blocks_dims=[16, 8],
    blocks_dropout=0.1, 
    column_idx=column_idx,
    embed_input=embed_input, 
    continuous_cols=continuous_cols,
)
    

tabresnet
TabResnet(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 8, padding_idx=0)
    (emb_layer_b): Embedding(5, 8, padding_idx=0)
    (emb_layer_c): Embedding(5, 8, padding_idx=0)
    (emb_layer_d): Embedding(5, 8, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (tab_resnet): DenseResnet(
    (dense_resnet): Sequential(
      (lin1): Linear(in_features=33, out_features=16, bias=True)
      (bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (block_0): BasicBlock(
        (lin1): Linear(in_features=16, out_features=8, bias=True)
        (bn1): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (leaky_relu): LeakyReLU(negative_slope=0.01, inplace=True)
        (dp): Dropout(p=0.1, inplace=False)
        (lin2): Linear(in_features=8, out_features=8, bias=True)
        (bn2): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (resize): Sequential(
          (0): Linear(in_features=16, out_features=8, bias=True)
          (1): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
  )
)
tabresnet(X_tab)
tensor([[-1.7038e-02, -2.2898e-03,  6.7239e-01, -1.1374e-02, -1.4843e-03,
         -1.0570e-02,  5.0264e-01, -1.3277e-02],
        [ 2.2679e+00, -5.1538e-04, -2.6135e-02, -2.9038e-02, -2.2504e-02,
          5.5052e-01,  1.0497e+00,  1.3348e+00],
        [ 2.5005e-01,  7.7862e-01,  4.0052e-01,  7.6070e-01,  5.2203e-01,
          6.5057e-01, -2.3226e-02, -4.0509e-04],
        [-1.3928e-02, -6.9325e-03,  1.6976e-01,  1.3968e+00,  5.9813e-01,
         -9.4279e-03, -9.0917e-03,  7.7908e-01],
        [ 5.7862e-01,  1.9515e-01,  1.3709e+00,  1.8836e+00,  1.2787e+00,
          7.9873e-01,  1.6794e+00, -7.4565e-03]], grad_fn=<LeakyReluBackward1>)

And finally, the TabTransformer:

from pytorch_widedeep.models import TabTransformer
embed_input = [(u,i) for u,i in zip(colnames[:4], [4]*4)]
tabtransformer = TabTransformer(
    column_idx=column_idx, 
    embed_input=embed_input, 
    continuous_cols=continuous_cols
)
tabtransformer
TabTransformer(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(5, 32, padding_idx=0)
    (emb_layer_b): Embedding(5, 32, padding_idx=0)
    (emb_layer_c): Embedding(5, 32, padding_idx=0)
    (emb_layer_d): Embedding(5, 32, padding_idx=0)
  )
  (embedding_dropout): Dropout(p=0.1, inplace=False)
  (blks): Sequential(
    (block0): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    )
    (block1): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    )
    (block2): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    )
    (block3): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    )
    (block4): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    )
    (block5): TransformerEncoder(
      (self_attn): MultiHeadedAttention(
        (dropout): Dropout(p=0.1, inplace=False)
        (inp_proj): Linear(in_features=32, out_features=96, bias=True)
        (out_proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (feed_forward): PositionwiseFF(
        (w_1): Linear(in_features=32, out_features=128, bias=True)
        (w_2): Linear(in_features=128, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (activation): GELU()
      )
      (attn_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
      (ff_addnorm): AddNorm(
        (dropout): Dropout(p=0.1, inplace=False)
        (ln): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (tab_transformer_mlp): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): Linear(in_features=129, out_features=516, bias=True)
        (1): ReLU(inplace=True)
        (2): Dropout(p=0.1, inplace=False)
      )
      (dense_layer_1): Sequential(
        (0): Linear(in_features=516, out_features=258, bias=True)
        (1): ReLU(inplace=True)
        (2): Dropout(p=0.1, inplace=False)
      )
    )
  )
)
tabtransformer(X_tab)
tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0399, 0.2358, 0.3762],
        [0.1373, 0.0000, 0.0000,  ..., 0.0550, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0212, 0.0000],
        [0.3322, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.2914, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.6590]],
       grad_fn=<MulBackward0>)

5.3. deeptext

pytorch-widedeep offers one model that can be passed to WideDeep as the deeptext component, DeepText, which is a standard and simple stack of LSTMs on top of word embeddings. You could also add an FC-Head on top of the LSTMs. The word embeddings can be pre-trained. In the future I aim to include some simple pre-trained models so that the combination between text and images is fair.

On the other hand, while I recommend using the wide and deeptabular models within this package when building the corresponding wide and deep model components, it is very likely that the user will want to use custom text and image models. That is perfectly possible. Simply build them and pass them via the corresponding parameters. Note that the custom models MUST return a last layer of activations (i.e. not the final prediction), so that these activations are collected by WideDeep and combined accordingly. In addition, the models MUST also contain an attribute output_dim with the size of these last layers of activations.
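
As a minimal sketch, a custom deeptext component could look something like the following (names and architecture are purely illustrative; the two important details are that forward returns the last layer of activations, not a prediction, and that the module exposes output_dim):

import torch
from torch import nn

class MyDeepText(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # WideDeep needs to know the size of the last layer of activations
        self.output_dim = hidden_dim

    def forward(self, X):
        embeds = self.word_embed(X.long())
        _, h = self.rnn(embeds)
        # return the last hidden state (activations), NOT a final prediction
        return h[-1]

# hypothetically, one would then build the model as:
# model = WideDeep(wide=wide, deeptabular=deeptabular, deeptext=MyDeepText(vocab_size=2000))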

I will illustrate all of the above in more detail in the second post of this series.

Let's have a look at DeepText:

import torch
from pytorch_widedeep.models import DeepText
X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
deeptext = DeepText(vocab_size=4, hidden_dim=4, n_layers=1, padding_idx=0, embed_dim=4)
deeptext
/Users/javier/.pyenv/versions/3.7.9/envs/wdposts/lib/python3.7/site-packages/torch/nn/modules/rnn.py:60: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
DeepText(
  (word_embed): Embedding(4, 4, padding_idx=0)
  (rnn): LSTM(4, 4, batch_first=True, dropout=0.1)
)
deeptext(X_text)
tensor([[ 0.1727, -0.0800, -0.2599, -0.1245],
        [ 0.1530, -0.2874, -0.2385, -0.1379],
        [-0.0747, -0.1666, -0.0124, -0.1875],
        [-0.0382, -0.1085, -0.0167, -0.1702],
        [-0.0393, -0.0926, -0.0141, -0.1371]], grad_fn=<SelectBackward>)

You could, if you wanted, add a Fully Connected Head (FC-Head) on top of it

deeptext = DeepText(vocab_size=4, hidden_dim=8, n_layers=3, padding_idx=0, embed_dim=4, 
                    head_hidden_dims=[8,4], head_batchnorm=True, head_dropout=[0.5, 0.5])
deeptext
DeepText(
  (word_embed): Embedding(4, 4, padding_idx=0)
  (rnn): LSTM(4, 8, num_layers=3, batch_first=True, dropout=0.1)
  (texthead): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): Dropout(p=0.5, inplace=False)
        (1): Linear(in_features=8, out_features=4, bias=True)
        (2): ReLU(inplace=True)
      )
    )
  )
)
deeptext(X_text)
tensor([[0.4726, 0.0555, 0.0000, 0.1431],
        [0.4907, 0.1357, 0.0000, 0.2591],
        [0.4019, 0.0831, 0.0000, 0.1308],
        [0.3942, 0.1759, 0.0000, 0.2517],
        [0.3184, 0.0902, 0.0000, 0.1955]], grad_fn=<ReluBackward1>)

5.4. deepimage

Similarly to deeptext, pytorch-widedeep offers one model that can be passed to WideDeep as the deepimage component, DeepImage, which is either a pre-trained ResNet (18, 34, or 50; default is 18) or a stack of CNNs, to which one can add an FC-Head. If it is a pre-trained ResNet, you can choose how many layers you want to "defrost" deep into the network with the parameter freeze_n.

from pytorch_widedeep.models import DeepImage

X_img = torch.rand((2,3,224,224))
deepimage = DeepImage(head_hidden_dims=[512, 64, 8], head_activation="leaky_relu")

deepimage
DeepImage(
  (backbone): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (5): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (6): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (7): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (8): AdaptiveAvgPool2d(output_size=(1, 1))
  )
  (imagehead): MLP(
    (mlp): Sequential(
      (dense_layer_0): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=512, out_features=64, bias=True)
        (2): LeakyReLU(negative_slope=0.01, inplace=True)
      )
      (dense_layer_1): Sequential(
        (0): Dropout(p=0.1, inplace=False)
        (1): Linear(in_features=64, out_features=8, bias=True)
        (2): LeakyReLU(negative_slope=0.01, inplace=True)
      )
    )
  )
)
deepimage(X_img)
tensor([[ 0.0965,  0.0056,  0.1143, -0.0007,  0.3860, -0.0050, -0.0023, -0.0011],
        [ 0.2437, -0.0020, -0.0021,  0.2480,  0.6217, -0.0033, -0.0030,  0.0566]],
       grad_fn=<LeakyReluBackward1>)

5.5. deephead

There are two possibilities when defining the so-called deephead component.

  1. When defining the WideDeep model there is a parameter called head_hidden_dims (and the corresponding related parameters; see the package documentation) that defines the FC-head on top of the deeptabular, deeptext and deepimage components.

  2. Of course, you could also choose to define it yourself externally and pass it using the parameter deephead. Have a look at the documentation, and at the rough sketch below.
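
As a rough sketch of the first option, using the wide and tabmlp components defined in Sections 5.1 and 5.2 (the hidden sizes are purely illustrative):

from pytorch_widedeep.models import WideDeep

# let WideDeep build the FC-head on top of the deep component(s)
model = WideDeep(wide=wide, deeptabular=tabmlp, head_hidden_dims=[8, 4])

For the second option you would build your own module (e.g. an nn.Sequential) and pass it via the deephead parameter; please check the documentation for the exact requirements a custom head must meet.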

6. Conclusion

This is the first of a series of posts introducing the python library pytorch-widedeep. This library is intended to be a flexible framework to combine tabular data with text and images via wide and deep models. Of course, it can also be used directly on "traditional" tabular data, without text and/or images.

In this post I have shown how to quickly start using the library (Section 3), and explained the utilities available in the preprocessing module (Section 4) and the model component definitions (Section 5), available in the models module.

In the next post I will show more advanced uses that will hopefully illustrate pytorch-widedeep's flexibility to build wide and deep models.

References

[1] Wide & Deep Learning for Recommender Systems. Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, et al. 2016. arXiv:1606.07792

[2] TabNet: Attentive Interpretable Tabular Learning. Sercan O. Arik, Tomas Pfister, 2020. arXiv:1908.07442

[3] AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. Nick Erickson, Jonas Mueller, Alexander Shirkov, et al., 2020. arXiv:2003.06505

[4] Universal Language Model Fine-tuning for Text Classification. Jeremy Howard, Sebastian Ruder, 2018. arXiv:1801.06146

[5] Single Headed Attention RNN: Stop Thinking With Your Head. Stephen Merity, 2019. arXiv:1911.11423