
Optimize Bigram Based on Yoshua Bengio et al.'s Paper

Paper Link

Brief Description

(Figure: model architecture from the paper)

Introducing word embeddings to solve the problem of the one-hot matrix being too sparse

  • Traditional NLP: one-hot encoding, e.g., dog -> [1, 0, 0, 0, 0]
  • Word embedding: take the corresponding word vector from a shared parameter matrix C
    • dog -> [1.324, 2.23, 4.234]
  • PS: in this section, the same logic applies to the bigram (character) mapping; see the sketch below
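A minimal sketch of the difference between the two representations (vocab_size, emb_dim, and dog_index are illustrative assumptions; in the real model the matrix C is learned during training):

import torch

vocab_size, emb_dim = 5, 3

# one-hot: a sparse vector with a single 1 at the word's index
dog_index = 0
one_hot = torch.zeros(vocab_size)
one_hot[dog_index] = 1.0                 # tensor([1., 0., 0., 0., 0.])

# word embedding: a row lookup in a shared, learnable parameter matrix C
C = torch.randn((vocab_size, emb_dim))   # learned in practice, random here
dog_vector = C[dog_index]                # dense 3-dim vector, e.g. [1.324, 2.23, 4.234]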

Note

  • Overview
    • The model is designed for predicting the probability of a word sequence.
    • Uses one-hot encoding for input word representation.
    • Applies a hidden layer with activation (e.g., tanh).
    • Uses softmax to normalize output probabilities (ensuring they sum to 1).
  • Encoding and Table Mapping
    • C table: a lookup table that maps input character indices to embedding vectors.
    • C[X] turns the 2D table C into a 3D representation (one embedding row per index in X).
    • Example (see the sketch after this list):
      • C[X][1][2] maps to a stored embedding row.
      • C[X][1][2] == C[X[1][2]], i.e., indexing the embedded result is equivalent to looking up the index stored in X.
  • Overfitting or Underfitting
    • If training accuracy is high but validation accuracy is low, overfitting is likely happening.
    • Parameter Tuning
      • Optimize the learning rate (plot loss against candidate rates; see the sweep sketch after the pre-setting code).
      • Adjust the batch size (larger batches stabilize training; smaller batches can improve generalization).
      • Tune hidden layer sizes to balance model complexity and performance.
  • Torch
    • F.cross_entropy is equivalent to the manual computation used previously:
      
      # what F.cross_entropy computes (counts, xs, ys, W are from the earlier bigram model)
      # the 0.01*(W**2).mean() term is L2 regularization on the weights
      probs = counts / counts.sum(1, keepdim=True)
      loss = -probs[torch.arange(len(xs)), ys].log().mean() + 0.01*(W**2).mean()
      loss
      
    • Scalar: a single number (a 0-dimensional tensor in PyTorch)
    • Tensor: an n-dimensional array
    • When to use scalars and when to use tensors?
      • Scalars: the loss value, the starting point of autograd (loss.backward() is called on a scalar), simple mathematical operations
      • Tensors: inputs to the neural network (batch data) and matrix operations
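A small sketch of the C[X] lookup and the scalar/tensor distinction described above; the shapes (27 characters, 10-dim embeddings, 32 contexts of 3) are chosen to match the code below:

import torch

C = torch.randn((27, 10))            # 2D lookup table: one row per character
X = torch.randint(0, 27, (32, 3))    # 32 contexts of 3 character indices each

emb = C[X]                           # 3D result: shape (32, 3, 10)
print(emb.shape)                     # torch.Size([32, 3, 10])

# indexing the embedded tensor is the same as looking up the index stored in X
assert torch.equal(C[X][1][2], C[X[1, 2]])

# scalar vs. tensor: a loss is a 0-dim tensor; .item() extracts the Python number
loss = emb.mean()
print(loss.item())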

Code

Pre-setting

import random
import torch
import torch.nn.functional as F

words = open('names.txt', 'r').read().splitlines()

chars=sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

# context length: how many characters do we take to predict the next one?
# sliding window
block_size = 3 

# Build a dataset split (X: context windows, Y: next-character targets)
def build_dataset(words):  
    X, Y = [], []
    for w in words:
        #print(w)
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix] 
        
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return X, Y
random.seed(1337)
random.shuffle(words)
n = len(words)
n1 = int(0.8 * n)
n2 = int(0.9 * n)

# Training data | validation data | test data
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])

g = torch.Generator().manual_seed(2147483647) 
# Embedding table: each of the 27 characters maps to a 10-dim vector, so C[X] has shape [len(X), block_size, 10]
C = torch.randn((27, 10), generator=g)

# Input layer to hidden layer (tanh): 30 = block_size * 10 embedding dims
W1 = torch.randn((30, 200), generator=g)
b1 = torch.randn(200, generator=g)
# Hidden layer to output layer
W2 = torch.randn((200, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

# in this case: 30*200(W1) + 200(b1) + 200*27(W2) + 27(b2) + 27*10(C)
sum(p.nelement() for p in parameters) 

for p in parameters:
    p.requires_grad = True

# Help us find the best learning rate: 1000 candidate rates spaced on a log scale
lre = torch.linspace(-3, 0, 1000)
lrs = 10**lre
lri = []
lossi = []
stepi = []
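
The lre / lrs / lri / lossi buffers above are meant for a learning-rate sweep before committing to a fixed rate. A minimal sketch of how they could be used over a 1000-step trial run (matching len(lrs)); the matplotlib import is an extra assumption, and the parameters would normally be re-initialized before the real training below:

import matplotlib.pyplot as plt

for i in range(1000):
    # same forward/backward pass as the training loop below, on a 32-example minibatch
    ix = torch.randint(0, Xtr.shape[0], (32,))
    emb = C[Xtr[ix]]
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr[ix])

    for p in parameters:
        p.grad = None
    loss.backward()

    # try a different candidate learning rate at every step
    lr = lrs[i]
    for p in parameters:
        p.data += -lr * p.grad

    lri.append(lre[i].item())   # exponent (log10) of the learning rate
    lossi.append(loss.item())

plt.plot(lri, lossi)            # pick the exponent where the loss is lowest

This is one way a rate like the 0.1 used in the training loop below could be chosen.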

Training procedure

for i in range(200000):
    # construct a minibatch to reduce training time per step
    # returns a tensor of 32 random indices
    # each index lies in the range [0, Xtr.shape[0])
    ix = torch.randint(0, Xtr.shape[0], (32,))
    
    # forward pass (training data only)
    emb = C[Xtr[ix]] # (32, 3, 10)
    
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1) # (32, 200)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    #print(loss.item())
    
    # backward pass
    for p in parameters:
        # avoid accumulating gradients
        p.grad = None
    loss.backward()
    
    # update (simple learning-rate decay: drop to 0.01 after 100k steps)
    lr = 0.1 if i < 100000 else 0.01
    for p in parameters:
        p.data += -lr * p.grad

    # track stats
    stepi.append(i)
    lossi.append(loss.log10().item())
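
Once training finishes, the overfitting check from the notes (training vs. validation performance) can be done by comparing the loss on each split. A minimal sketch, assuming the variables defined above; split_loss is a hypothetical helper name:

@torch.no_grad()  # no gradients needed for evaluation
def split_loss(X, Y):
    emb = C[X]                                  # (N, block_size, 10)
    h = torch.tanh(emb.view(-1, 30) @ W1 + b1)  # (N, 200)
    logits = h @ W2 + b2                        # (N, 27)
    return F.cross_entropy(logits, Y).item()

print('train', split_loss(Xtr, Ytr))
print('dev  ', split_loss(Xdev, Ydev))

If the dev loss is much higher than the training loss, the model is overfitting; if both are high, it is underfitting.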

Samples

# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize context with all '.' tokens (index 0)
    while True:
        emb = C[torch.tensor([context])] # (1,block_size,d)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)
        logits = h @ W2 + b2
        probs = F.softmax(logits, dim=1)
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        context = context[1:] + [ix]
        out.append(ix)
        if ix == 0:
            break
    
    print(''.join(itos[i] for i in out))

More details about word embeddings

(Figure: word-embedding illustration)

This post is licensed under CC BY 4.0 by the author.