Complete Self-Attention from Scratch

This vignette follows the same steps as Simple Self-Attention from Scratch, but it does not rely on any of the helper functions defined in the attention package. Instead, it implements everything in base R.

# encoder representations of four different words
word_1 = matrix(c(1,0,0), nrow=1)
word_2 = matrix(c(0,1,0), nrow=1)
word_3 = matrix(c(1,1,0), nrow=1)
word_4 = matrix(c(0,0,1), nrow=1)

Next, we stack the word embeddings into a single array (in this case a matrix) which we call words.

# stacking the word embeddings into a single array
words = rbind(word_1,
              word_2,
              word_3,
              word_4)

print(words)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0
#> [2,]    0    1    0
#> [3,]    1    1    0
#> [4,]    0    0    1

# initializing the weight matrices (with random values)
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)

Next, we generate the Queries (Q), Keys (K), and Values (V) by multiplying the word embeddings with the corresponding weight matrices. The %*% operator performs matrix multiplication; see the R help page with help('%*%') or the online manual An Introduction to R.

# generating the queries, keys and values
Q = words %*% W_Q
K = words %*% W_K
V = words %*% W_V
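
Because matrix multiplication is linear, each row of Q is the corresponding word embedding projected through W_Q. The following is a small sketch, repeating the setup so that it runs on its own, that checks the shape of Q and confirms that the query for word_3 is the sum of the queries for word_1 and word_2 (since word_3 = word_1 + word_2). The same reasoning applies to K and V.

```r
# same embeddings and query weights as above, repeated so the block is self-contained
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)

Q = words %*% W_Q

# one query vector (row) per word, with as many columns as W_Q
print(dim(Q))
#> [1] 4 3

# word_3 = word_1 + word_2, so its query is the sum of the first two queries
print(all.equal(Q[3,], Q[1,] + Q[2,]))
#> [1] TRUE
```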

Following this, we score the Query (Q) vectors against the Key (K) vectors. The keys are transposed for the multiplication using t(); see help('t') for more info.

# scoring the query vectors against all key vectors
scores = Q %*% t(K)
print(scores)
#>      [,1] [,2] [,3] [,4]
#> [1,]    6    4   10    5
#> [2,]    4    6   10    6
#> [3,]   10   10   20   11
#> [4,]    3    1    4    2
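
Each entry of scores is a dot product: scores[i, j] measures how well the query of word i matches the key of word j. As a hedged illustration, the sketch below rebuilds Q and K from the same seed and checks one entry by hand. The index names i and j are only illustrative and do not appear in the original code.

```r
# rebuild the queries and keys exactly as above so the block is self-contained
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
Q = words %*% W_Q
K = words %*% W_K
scores = Q %*% t(K)

# illustrative indices: query of word 1 against key of word 3
i = 1
j = 3
dot = sum(Q[i,] * K[j,])

# the hand-computed dot product equals the matrix entry
print(dot == scores[i, j])
#> [1] TRUE
```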

We now calculate the maximum value of each row while preserving the matrix structure: the result still has 4 rows, but only one column, which holds the maximum of the corresponding row.

# calculate the max for each row of the scores matrix
maxs = as.matrix(apply(scores, MARGIN=1, FUN=max))
print(maxs)
#>      [,1]
#> [1,]   10
#> [2,]   10
#> [3,]   20
#> [4,]    4

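As you can see, the value for each row in maxs is the maximum value of the corresponding row in scores. Subtracting this maximum before exponentiating keeps the softmax numerically stable without changing its result.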
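
To see why the row maximum is useful, consider the minimal sketch below. It compares a naive softmax with one that shifts by the maximum first. The function names softmax_naive and softmax_stable are hypothetical and not part of the attention package, and the scaling by the key dimension is left out to keep the comparison short.

```r
# naive softmax: exponentiate and normalise
softmax_naive = function(x) exp(x) / sum(exp(x))

# stable softmax: subtract the maximum first, so the largest exponent is 0
softmax_stable = function(x) {
  e = exp(x - max(x))
  e / sum(e)
}

# first row of scores: both versions agree
x = c(6, 4, 10, 5)
print(all.equal(softmax_naive(x), softmax_stable(x)))
#> [1] TRUE

# large inputs: exp(1000) overflows to Inf, so the naive version breaks
big = c(1000, 1001)
print(softmax_naive(big))
#> [1] NaN NaN
print(all(is.finite(softmax_stable(big))))
#> [1] TRUE
```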

# initialize weights matrix
weights = matrix(0, nrow=nrow(scores), ncol=ncol(scores))

# computing the weights by a softmax operation
for (i in seq_len(nrow(scores))) {
  # shift by the row maximum, then scale by the square root of the key dimension
  shifted = exp((scores[i,] - maxs[i,]) / sqrt(ncol(K)))
  weights[i,] = shifted / sum(shifted)
}

print(weights)
#>             [,1]        [,2]      [,3]        [,4]
#> [1,] 0.083717538 0.026383741 0.8429010 0.046997679
#> [2,] 0.025449248 0.080752324 0.8130461 0.080752324
#> [3,] 0.003072728 0.003072728 0.9883811 0.005473487
#> [4,] 0.273384789 0.086157735 0.4869837 0.153473823
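
The loop can also be written without explicit iteration. The following is one possible vectorized sketch, not the package's implementation. It starts from the scores matrix as printed above and assumes d_k = 3, the number of columns of K. The names shifted, expd and weights_vec are illustrative.

```r
# scores as printed above; d_k = 3 is the number of columns of K
scores = matrix(c( 6,  4, 10,  5,
                   4,  6, 10,  6,
                  10, 10, 20, 11,
                   3,  1,  4,  2), nrow=4, byrow=TRUE)
d_k = 3

# subtract each row's maximum, then scale and exponentiate
shifted = sweep(scores, MARGIN=1, STATS=apply(scores, 1, max), FUN="-")
expd = exp(shifted / sqrt(d_k))

# divide every row by its sum (the length-4 vector is recycled down the columns)
weights_vec = expd / rowSums(expd)

# every row of attention weights sums to one
print(rowSums(weights_vec))
#> [1] 1 1 1 1
```

Each row of the result is a probability distribution over the four words, which is what makes the next step a weighted average.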

Finally, we compute the attention as a weighted sum of the value vectors (which are combined in the matrix V).

# computing the attention by a weighted sum of the value vectors
attention = weights %*% V

print(attention)
#>          [,1]     [,2]        [,3]
#> [1,] 2.816517 1.900235 0.046997679
#> [2,] 2.732294 1.757743 0.080752324
#> [3,] 2.985308 1.988381 0.005473487
#> [4,] 2.400826 1.674211 0.153473823
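
For reuse, the steps above can be collected into a single function. The sketch below is one way to do this in base R. The name self_attention is hypothetical and is not claimed to match any helper in the attention package.

```r
# hypothetical helper wrapping the steps of this vignette
self_attention = function(words, W_Q, W_K, W_V) {
  Q = words %*% W_Q
  K = words %*% W_K
  V = words %*% W_V

  # score queries against keys and scale by the square root of the key dimension
  scaled = (Q %*% t(K)) / sqrt(ncol(K))

  # row-wise stable softmax; apply() returns its results as columns, hence t()
  weights = t(apply(scaled, 1, function(r) {
    e = exp(r - max(r))
    e / sum(e)
  }))

  # weighted sum of the value vectors
  weights %*% V
}

# same inputs as in the walkthrough
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)

attention = self_attention(words, W_Q, W_K, W_V)
print(dim(attention))
#> [1] 4 3
```

Dividing by sqrt(ncol(K)) before subtracting the row maximum gives the same weights as the order used in the loop, because the scaling factor is positive.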
