Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245. (Cited 291 times.) https://arxiv.org/abs/2305.13245