Position-wise feed-forward networks

Each block contained an FFN that consisted of two linear transformations with a Gaussian Error Linear Unit (GELU) activation function in between:

$$\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)\,W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ffn}}}$ and $W_2 \in \mathbb{R}^{d_{\mathrm{ffn}} \times d_{\mathrm{model}}}$ were weight matrices of the linear transformations. Here, $d_{\mathrm{ffn}}$ was the FFN dimension and $d_{\mathrm{model}}$ was the model's hidden dimension. $b_1 \in \mathbb{R}^{d_{\mathrm{ffn}}}$ and $b_2 \in \mathbb{R}^{d_{\mathrm{model}}}$ were the bias vectors of the linear transformations. The GELU function was defined as:

$$\mathrm{GELU}(x) = x\,\Phi(x)$$

where $\Phi(x)$ represented the cumulative distribution function of the standard Gaussian distribution. Additionally, two dropout layers were applied in this network: the first after the GELU activation function and the second after the final linear transformation.
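To make the layer concrete, the following is a minimal PyTorch sketch of this FFN. The framework choice, the class name `PositionWiseFFN`, and the dropout probability `p` (defaulting to 0.1) are illustrative assumptions; the section does not specify implementation details or hyperparameter values.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Two linear transformations with GELU in between, plus the two
    dropout layers described above (after GELU and after the final
    linear transformation). Names and defaults are assumptions."""

    def __init__(self, d_model: int, d_ffn: int, p: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ffn)  # x @ W1 + b1
        self.gelu = nn.GELU()                     # GELU(x) = x * Phi(x) (exact, erf-based)
        self.dropout1 = nn.Dropout(p)             # first dropout: after GELU
        self.linear2 = nn.Linear(d_ffn, d_model)  # (.) @ W2 + b2
        self.dropout2 = nn.Dropout(p)             # second dropout: after the final linear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model); the output has the same
        # shape, and the FFN is applied independently at every position.
        return self.dropout2(self.linear2(self.dropout1(self.gelu(self.linear1(x)))))

# Sanity check of the GELU definition: GELU(x) = x * Phi(x), where Phi is
# the standard Gaussian CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
x = torch.randn(4)
phi = 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))
assert torch.allclose(nn.GELU()(x), x * phi, atol=1e-6)
```

For instance, `PositionWiseFFN(d_model=512, d_ffn=2048)` would instantiate the block with the dimensions used in the original Transformer; the actual values of $d_{\mathrm{model}}$ and $d_{\mathrm{ffn}}$ are whatever this protocol specifies elsewhere.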
