Each block contained an FFN that consisted of two linear transformations with a Gaussian Error Linear Unit (GELU) activation function in between:

\[ \mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1)\, W_2 + b_2, \]

where $W_1 \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ffn}}}$ and $W_2 \in \mathbb{R}^{d_{\mathrm{ffn}} \times d_{\mathrm{model}}}$ were the weight matrices of the linear transformations. Here, $d_{\mathrm{ffn}}$ was the FFN dimension, $d_{\mathrm{model}}$ was the model's hidden dimension, and $b_1$ and $b_2$ were the bias vectors of the linear transformations. The GELU function was defined as:

\[ \mathrm{GELU}(x) = x\,\Phi(x), \]

where $\Phi(x)$ represented the cumulative distribution function of the standard Gaussian distribution. Additionally, two dropout layers were applied in this network: the first after the GELU activation function and the second after the final linear transformation.
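As a concrete illustration, the sketch below implements an FFN with this structure in PyTorch. The framework, layer names, and the dropout probability of 0.1 are assumptions for illustration only and are not specified above; only the ordering (linear, GELU, dropout, linear, dropout) follows the description.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> GELU -> Dropout -> Linear -> Dropout.

    d_model and d_ffn correspond to the hidden and FFN dimensions in the
    text; the dropout probability p is illustrative, not taken from the text.
    """

    def __init__(self, d_model: int, d_ffn: int, p: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ffn)   # x W_1 + b_1
        self.linear2 = nn.Linear(d_ffn, d_model)   # (.) W_2 + b_2
        self.gelu = nn.GELU()                      # GELU(x) = x * Phi(x)
        self.dropout1 = nn.Dropout(p)              # first dropout, after the activation
        self.dropout2 = nn.Dropout(p)              # second dropout, after the final linear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.dropout1(self.gelu(self.linear1(x)))
        return self.dropout2(self.linear2(x))


# Usage example with hypothetical sizes: batch of 2 sequences of length 16,
# hidden dimension 512, FFN dimension 2048.
ffn = FeedForward(d_model=512, d_ffn=2048)
out = ffn(torch.randn(2, 16, 512))   # output shape: (2, 16, 512)
```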