Notes for – Building makemore Part 2: MLP
January 23rd, 2023

Attempting to scale the count-based approach (a table of normalized probability counts) grows exponentially as the context gets longer. A Multi Layer Perceptron (MLP) is a solution: we train it to maximize the log-likelihood of the training data. MLPs let you make predictions by embedding words close together in a space, so that knowledge transfers between interchangeable words with good confidence.
With a vocabulary of 17,000 words we use a lookup table of some dimension, say 30, so each lookup of a word provides a 30-dimensional embedding vector for that word.
To look up 3 context words we get 3 embedding vectors of 30 dimensions each, making 90 inputs in total.
Let C represent this lookup table.
Then we have a hidden layer of the NN whose size is a hyperparameter. The hyperparameter can take different values, which we will evaluate as part of this exercise. Most of the compute happens between this hidden layer and the output layer that feeds the softmax.
The output layer produces logits, and the softmax normalizes them into a probability distribution over the 17,000 words, reflecting the likelihood of each possible next word.
The parameters of the NN are the weights and biases of the output layer, the weights and biases of the hidden layer, and the lookup table C.
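As a rough sketch of the shapes involved (using the paper's illustrative numbers of a 17,000-word vocabulary, 30-dimensional embeddings, a context of 3 words, and, say, 100 hidden neurons; these exact values are assumptions for illustration only):

import torch
vocab_size, embed_dim, context, hidden = 17000, 30, 3, 100  # illustrative values only
C  = torch.randn((vocab_size, embed_dim))        # lookup table: one 30-d vector per word
W1 = torch.randn((context * embed_dim, hidden))  # hidden layer: 3*30 = 90 inputs -> 100 neurons
b1 = torch.randn(hidden)
W2 = torch.randn((hidden, vocab_size))           # output layer: 100 -> 17,000 logits
b2 = torch.randn(vocab_size)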
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
# read in all the words
words = open('names.txt', 'r').read().splitlines()
words[:8]
len(words)
# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)
We are using a block_size which, as the comment below suggests, is how many characters of context we take to predict the next one.
block_size = 3    # context length: how many characters do we take to predict the next one?
num_neurons = 16  # size of the hidden layer
num_features = 32 # dimensionality of the character embedding vectors
num_steps = 200000 # number of training steps
# build the dataset
X, Y = [], []
for w in words[:5]:
    #print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '--->', itos[ix])
        context = context[1:] + [ix] # crop and append
X = torch.tensor(X)
Y = torch.tensor(Y)
X.shape, X.dtype, Y.shape, Y.dtype
We are going to squeeze the 27 characters into a 2-dimensional embedding
C = torch.randn((27, 2))
fun with C
C[5]
C[[5,6,7,7,9,9]]
C[torch.tensor([5,6,7])]
X.shape
X[228140:]
F.one_hot(torch.tensor(5), num_classes=27).float() @ C
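A quick sanity check (using the C defined above) that the one-hot matrix multiply picks out the same row of C as direct indexing:

torch.allclose(F.one_hot(torch.tensor(5), num_classes=27).float() @ C, C[5]) # True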
We split the dataset into 3 splits: Xtr, Ytr (training split); Xdev, Ydev (dev split); Xte, Yte (test split).
# build the dataset
def build_dataset(words):
    X, Y = [], []
    for w in words:
        #print(w)
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            #print(''.join(itos[i] for i in context), '--->', itos[ix])
            context = context[1:] + [ix] # crop and append
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y
import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])
emb = C[X]
emb.shape
We initialize the hidden layer weights W1 randomly. The number of inputs is 3 * 2 = 6 (2-dimensional embeddings, 3 of them); the number of neurons is up to us, 100 as an example.
W1 = torch.randn((6, 100))
b1 = torch.randn(100)
The problem with taking the input emb @ W1 + b1 is that the dimensions are stacked up in the tensor: emb has shape [32, 3, 2] while W1 is [6, 100], so we need to concatenate the inputs first.
static way to concat
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1).shape
dynamic way to concat !! CAUTION: this creates new memory; it's better to use .view()
torch.cat(torch.unbind(emb, 1), 1).shape
a.view()
a = torch.arange(18)
a.shape
a.view(3,2,3)
a.storage()
emb.view(32, 6) == torch.cat(torch.unbind(emb, 1), 1)
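A small check of the memory point above: .view() reuses the same underlying storage, while torch.cat allocates a new tensor:

emb.view(32, 6).data_ptr() == emb.data_ptr()                     # True: same memory
torch.cat(torch.unbind(emb, 1), 1).data_ptr() == emb.data_ptr()  # False: new memory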
The -1 in emb.view(-1, 6) tells PyTorch to infer the first dimension automatically. We apply tanh to get our h, squashing the numbers to between -1 and 1.
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
h
h.shape
(emb.view(-1, 6) @ W1).shape
b1.shape
Broadcasting: the shapes are (32, 100) and (100,); the bias is treated as (1, 100), the 1 is automatically filled in, so the same bias vector is added to each row of the matrix.
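A quick shape check of that broadcast (b1 has shape (100,) here):

(torch.zeros(32, 100) + b1).shape # torch.Size([32, 100]): b1 is broadcast as (1, 100) across the rows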
W2 = torch.randn((100, 27))
b2 = torch.randn(27)
logits = h @ W2 + b2
logits.shape
The code below creates separate tensors for counts, probs, and loss. It's better to use F.cross_entropy, which is more efficient and numerically well behaved.
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
prob.shape
recall, Y is the next character in the sequence we'd like to predict
loss = -prob[torch.arange(32), Y].log().mean()
loss
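The same value via F.cross_entropy, as a check that it matches the manual computation above:

F.cross_entropy(logits, Y) # same loss as above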
# ------------ now made respectable :) ---------------
Xtr.shape, Ytr.shape # dataset
C is the embedding lookup table. We have 27 possible characters and we are going to embed them in a lower-dimensional space (e.g. 10) as a "feature vector". The feature vector represents different aspects of the character. The features are learned, but could be initialized using prior knowledge of semantic features. Similar letters will have a similar feature vector. A small change in the features will induce a small change in the probability, not only for that character but also for its combinatorially many "neighbours" in character space (as represented by sequences of feature vectors).
W1 is the hidden layer; its size is (block_size * num_features, num_neurons), where num_neurons is a number we can play with.
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, num_features), generator=g)
W1 = torch.randn((block_size * num_features, num_neurons), generator=g)
b1 = torch.randn(num_neurons, generator=g)
W2 = torch.randn((num_neurons, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
C[5]
sum(p.nelement() for p in parameters) # number of parameters in total
for p in parameters:
    p.requires_grad = True
lre = torch.linspace(-3, 0, 1000)
lrs = 10**lre
lri = []
lossi = []
stepi = []
With F.cross_entropy, PyTorch will cluster up the operations and very often create fused kernels that evaluate the expression, and the backward pass is also much more efficient analytically and mathematically. From https://stackoverflow.com/questions/56601075/what-is-a-fused-kernel-or-fused-layer-in-deep-learning: "Fusing" means commonalization of computation steps. Basically, it's an implementation trick to run code more efficiently by combining similar operations in a single hardware (GPU, CPU or TPU) operation. Therefore, a "fusedLayer" is a layer where operations benefit from a "fused" implementation.
F.cross_entropy is also better behaved when logits take on extreme values: the naive exp-and-normalize implementation can run out of range of the floating point number, whereas cross_entropy subtracts the max logit internally, which leaves the result unchanged but keeps everything in range.
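A small illustration of the overflow (the logit values here are made up just to show the effect):

logits_demo = torch.tensor([-100.0, -3.0, 0.0, 100.0])
counts = logits_demo.exp()             # exp(100) overflows float32 to inf
probs = counts / counts.sum()          # contains nan
# cross_entropy effectively subtracts the max logit first, which leaves the
# probabilities unchanged but keeps everything in range:
counts = (logits_demo - logits_demo.max()).exp()
probs = counts / counts.sum()          # well-defined probabilities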
Minibatching: it is much better to have an approximate gradient and take more steps than an exact gradient and fewer steps.
We want to find a reasonable learning rate. To do this, you can sweep over exponentially spaced learning rates, plot the learning rate (exponent) vs. loss, and look for the stable minimum region of that chart; that should give you a fairly good learning rate.
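One way to run that sweep, reusing the lre/lrs tensors defined above (a sketch only; it tries a different exponentially spaced learning rate on each of 1000 steps and records the exponent and the loss):

lri, lossi = [], []
for i in range(1000):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    # forward pass
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, block_size * num_features) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update with this step's candidate learning rate
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad
    # track stats
    lri.append(lre[i].item())
    lossi.append(loss.item())
plt.plot(lri, lossi) # look for the flat minimum region of this curve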
for i in range(200000):
    ix = torch.randint(0, Xtr.shape[0], (32,))
    break
Xtr.shape
Forwarding and backwarding all 182,441 training examples on every step is too much work, so we construct a randomly selected minibatch of the data.
This minibatch construct generates 32 random indices between 0 and Xtr.shape[0] (182441). We use these 32 indices to index into the X training set, and the same 32 indices to index into the Y training set, as in the example below.
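For example (a shape check only; ix here is a fresh batch of 32 random indices):

ix = torch.randint(0, Xtr.shape[0], (32,))
Xtr[ix].shape, Ytr[ix].shape # (torch.Size([32, 3]), torch.Size([32]))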
C.shape
for i in range(num_steps):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    # forward pass
    emb = C[Xtr[ix]] # (32, block_size, num_features)
    h = torch.tanh(emb.view(-1, block_size * num_features) @ W1 + b1) # (32, num_neurons)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    # update
    #lr = lrs[i]
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad
    # track stats
    #lri.append(lre[i])
    stepi.append(i)
    lossi.append(loss.log10().item())
print(loss.item())
logits.max(1)
plt.plot(stepi, lossi)
Models can get larger and larger as you add parameters. As the capacity of the NN grows, it becomes more and more capable of overfitting your training set; that is, it will have very low loss and will only give you back examples from your training set. Then, when you evaluate the loss on examples outside of the training set, the loss is very high.
The standard in the field is to split up the data into 3 splits: 1) training split (80%), 2) dev/validation split (10%), 3) test split (10%).
The dev/validation split is used to tune hyperparameters, for example the size of the hidden layer or the embedding.
The test split is evaluated very sparingly, because otherwise you risk overfitting to the test set as well.
emb = C[Xtr] # (N, block_size, num_features)
h = torch.tanh(emb.view(-1, block_size * num_features) @ W1 + b1) # (N, num_neurons)
logits = h @ W2 + b2 # (N, 27)
loss = F.cross_entropy(logits, Ytr)
loss
emb = C[Xdev] # (N, block_size, num_features)
h = torch.tanh(emb.view(-1, block_size * num_features) @ W1 + b1) # (N, num_neurons)
logits = h @ W2 + b2 # (N, 27)
loss = F.cross_entropy(logits, Ydev)
loss
# visualize dimensions 0 and 1 of the embedding matrix C for all characters
plt.figure(figsize=(8,8))
plt.scatter(C[:,0].data, C[:,1].data, s=200)
for i in range(C.shape[0]):
    plt.text(C[i,0].item(), C[i,1].item(), itos[i], ha="center", va="center", color='white')
plt.grid('minor')
# training split, dev/validation split, test split
# 80%, 10%, 10%
context = [0] * block_size
C[torch.tensor([context])].shape
# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
        emb = C[torch.tensor([context])] # (1, block_size, d)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    print(''.join(itos[i] for i in out))
Homework:
Tuning
| block_size | num_neurons | num_features | num_steps | batch_size | test loss |
|-----------:|------------:|-------------:|----------:|-----------:|----------:|
| 4 | 10  | 30 | 200000 | 16 | 2.3414 |
| 4 | 10  | 2  | 20000  | 16 | 2.5306 |
| 4 | 10  | 4  | 200000 | 2  | 2.3566 |
| 2 | 10  | 4  | 200000 | 16 | 2.3719 |
| 3 | 10  | 4  | 200000 | 16 | 2.3340 |
| 3 | 100 | 10 | 200000 | 32 | 2.1759 |
| 3 | 100 | 20 | 200000 | 8  | 2.1759 |
| 3 | 500 | 5  | 200000 | 32 | 2.1734 |
| 3 | 80  | 4  | 200000 | 32 | 2.2322 |
| 4 | 300 | 14 | 400000 | 40 | 2.1812 |
| 3 | 200 | 6  | 200000 | 32 | 2.1625 |
| 3 | 20  | 6  | 200000 | 32 | 2.2723 |
| 3 | 300 | 6  | 200000 | 32 | 2.1847 |
| 3 | 220 | 6  | 200000 | 32 | 2.1871 |
| 3 | 180 | 6  | 200000 | 32 | 2.1850 |
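For reference, a hypothetical helper along these lines could be used to run such a sweep (a sketch of one way to do it, not the code used to produce the table above; it assumes Xtr/Ytr and Xte/Yte have already been rebuilt with build_dataset() for the chosen block_size):

def train_and_eval(block_size, num_neurons, num_features, num_steps, batch_size):
    # note: the datasets must have been built with this same block_size
    g = torch.Generator().manual_seed(2147483647)
    C = torch.randn((27, num_features), generator=g)
    W1 = torch.randn((block_size * num_features, num_neurons), generator=g)
    b1 = torch.randn(num_neurons, generator=g)
    W2 = torch.randn((num_neurons, 27), generator=g)
    b2 = torch.randn(27, generator=g)
    parameters = [C, W1, b1, W2, b2]
    for p in parameters:
        p.requires_grad = True
    for i in range(num_steps):
        ix = torch.randint(0, Xtr.shape[0], (batch_size,))
        emb = C[Xtr[ix]]
        h = torch.tanh(emb.view(-1, block_size * num_features) @ W1 + b1)
        logits = h @ W2 + b2
        loss = F.cross_entropy(logits, Ytr[ix])
        for p in parameters:
            p.grad = None
        loss.backward()
        lr = 0.1 if i < num_steps // 2 else 0.01  # same decay schedule as the training loop above
        for p in parameters:
            p.data += -lr * p.grad
    # evaluate on the test split
    with torch.no_grad():
        emb = C[Xte]
        h = torch.tanh(emb.view(-1, block_size * num_features) @ W1 + b1)
        logits = h @ W2 + b2
        return F.cross_entropy(logits, Yte).item()

print(train_and_eval(block_size=3, num_neurons=200, num_features=6, num_steps=200000, batch_size=32))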