ELECTRA

ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS

Masked language modeling (MLM) methods generally require a large amount of compute. As an alternative, this paper proposes a more efficient pre-training task called replaced token detection. Instead of masking the input, some tokens are replaced with tokens generated by a small generator model. Then, instead of predicting the original identity of the corrupted tokens, the model classifies each token as generated or not.
As a result, with the same model size, data, and compute as BERT, ELECTRA performs better. It matches RoBERTa and XLNet with about 1/4 of their compute and outperforms them when given the same compute.

Intro

Many current language models can be viewed as denoising autoencoders. They typically select about 15% of the input and either mask those tokens (as in BERT) or mask the attention to them (as in XLNet). Later models such as BART also shuffle sentence order or replace whole spans. In all of these, the model is trained to recover the original tokens. As an alternative, replaced token detection trains the model to distinguish real input tokens from generated replacements. The replacements are not mask tokens. They are sampled from a proposal distribution given by a small MLM.
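The corruption step described above can be sketched in a few lines of Python. This is only an illustrative sketch: `corrupt`, `toy_generator`, and `VOCAB` are hypothetical names, and the toy generator draws uniformly from a tiny vocabulary, whereas the paper uses a small MLM. A position is labeled as replaced only when the sampled token differs from the original, so a lucky draw of the original token counts as real.

```python
import random

def corrupt(tokens, sample_replacement, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for replaced token detection.

    labels[i] is 1 if position i holds a token that differs from the
    original, else 0.
    """
    rng = rng or random.Random(0)
    # Pick roughly mask_prob of the positions (at least one).
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = sample_replacement(tokens, i, rng)
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels

# Hypothetical stand-in for the small generator: uniform over a toy vocabulary.
VOCAB = ["the", "chef", "cooked", "ate", "meal", "a"]

def toy_generator(tokens, position, rng):
    return rng.choice(VOCAB)

tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt(tokens, toy_generator)
print(corrupted, labels)
```

The discriminator would then be trained on `corrupted` as input and `labels` as per-token targets, so every position contributes to the loss rather than only the selected 15%.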

This setup may look similar to a GAN, but it is not adversarial: because GANs are hard to apply to text (Language GANs Falling Short), the generator is trained with maximum likelihood. The paper reports that ELECTRA-Large performs comparably to RoBERTa and XLNet with about 1/4 of the compute, and that with further training it outperforms ALBERT and achieved the state of the art on SQuAD 2.0 at the time.

Method

Two neural networks are trained: a generator G and a discriminator D. Each maps the input x to contextual vector representations h(x). Given token embeddings e and a position t, the generator outputs, through a softmax layer, the probability of producing the token x_t:
$$p_G(x_t \mid x) = \frac{\exp\left(e(x_t)^T h_G(x)_t\right)}{\sum_{x'} \exp\left(e(x')^T h_G(x)_t\right)}$$
The discriminator is defined as:
$$D(x, t) = \mathrm{sigmoid}\left(w^T h_D(x)_t\right)$$
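To make the two output layers concrete, here is a minimal pure-Python sketch that evaluates both formulas for a single position t. The vectors below are made-up toy values, and `generator_probs` and `discriminator_prob` are hypothetical helper names, not part of any ELECTRA implementation.

```python
import math

def generator_probs(embeddings, h_t):
    """Softmax over e(x')^T h_G(x)_t for every vocabulary item x'."""
    scores = [sum(e_i * h_i for e_i, h_i in zip(e, h_t)) for e in embeddings]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def discriminator_prob(w, h_t):
    """sigmoid(w^T h_D(x)_t)."""
    z = sum(w_i * h_i for w_i, h_i in zip(w, h_t))
    return 1.0 / (1.0 + math.exp(-z))

# Toy values (assumptions): 3-token vocabulary, 2-dimensional hidden states.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e(x') for each x'
h_G_t = [0.5, -0.5]   # generator representation at position t
h_D_t = [2.0, 1.0]    # discriminator representation at position t
w = [0.3, -0.1]       # discriminator output weights

p = generator_probs(embeddings, h_G_t)
d = discriminator_prob(w, h_D_t)
print(p, d)
```

The generator produces a full distribution over the vocabulary at each position, while the discriminator produces a single probability per position, which is why its output layer is much cheaper.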

In addition, the generator and discriminator share weights to make training more efficient. Specifically, the token and positional embeddings are shared. The discriminator should be larger than the generator so that it can learn to tell the tokens apart without difficulty. If the two models are the same size, training takes roughly twice as much compute as MLM alone.

Training Algorithms

Besides training the two networks jointly, the paper also describes a two-stage training procedure:

  1. Train only the generator with the MLM objective for n steps.

  2. Initialize the discriminator with the generator's weights, then freeze the generator's weights and train the discriminator for n steps.
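The two stages above can be sketched as a small training skeleton. This is a toy illustration under stated assumptions: the "models" are plain dicts of weights, and `mlm_step` and `disc_step` are hypothetical placeholders for real optimizer updates. Copying the generator's weights into the discriminator also assumes both networks have the same shape.

```python
import copy

def two_stage_train(generator, n, mlm_step, disc_step):
    # Stage 1: train only the generator with the MLM objective for n steps.
    for _ in range(n):
        mlm_step(generator)
    # Stage 2: initialize the discriminator from the generator's weights.
    discriminator = copy.deepcopy(generator)
    # The generator is frozen: it is passed in only to produce replacements
    # and is never updated here.
    for _ in range(n):
        disc_step(discriminator, generator)
    return generator, discriminator

# Hypothetical toy update rules standing in for real gradient steps.
def toy_mlm_step(gen):
    gen["w"] = [v + 1.0 for v in gen["w"]]

def toy_disc_step(disc, gen):
    disc["w"] = [v - 0.5 for v in disc["w"]]

gen, disc = two_stage_train({"w": [0.0, 0.0]}, 3, toy_mlm_step, toy_disc_step)
print(gen, disc)
```

In this toy run the generator's weights stop changing after stage 1, while the discriminator starts from the same values and moves away from them during stage 2.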