GPT-4: Key Details and Impressive Specifications

GPT-4 is over 10 times larger than GPT-3, boasting approximately 1.8 trillion parameters across 120 layers.
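As a quick sanity check on that ratio, using GPT-3's published 175-billion-parameter count:

```python
# Sanity check on the "over 10x larger than GPT-3" claim.
gpt3_params = 175e9    # GPT-3's published parameter count
gpt4_params = 1.8e12   # reported GPT-4 total parameter count
print(f"Size ratio: {gpt4_params / gpt3_params:.1f}x")  # ~10.3x
```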

Unlike GPT-3 and PaLM, GPT-4 is not a dense transformer; it reportedly uses a mixture-of-experts architecture. It also employs multi-query attention (MQA) instead of standard multi-head attention (MHA).
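To make the MQA point concrete, here is a minimal PyTorch sketch of multi-query attention: all query heads share a single key/value head, which is what shrinks the KV cache relative to MHA. The dimensions below are toy values, not GPT-4's actual ones, and this is a generic illustration rather than OpenAI's implementation.

```python
import math
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Minimal multi-query attention: many query heads share a single K/V head."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)             # one projection per query head
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)  # a single shared K and V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, h, t, hd)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)   # (b, t, hd) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)                  # broadcast the one K/V head over all query heads
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)                 # (b, h, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        out = scores.masked_fill(causal, float("-inf")).softmax(-1) @ v             # (b, h, t, hd)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))

# Toy usage with made-up sizes, nothing like GPT-4's real dimensions.
attn = MultiQueryAttention(d_model=512, n_heads=8)
print(attn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```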

Each forward pass (the generation of one token) uses only around 280 billion of those parameters and roughly 560 TFLOPs, a fraction of what a dense model with the full 1.8 trillion parameters would require.
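A rough way to read those numbers: per-token compute scales approximately linearly with the parameters that are actually touched, so routing through about 280B of the 1.8T parameters cuts the forward-pass cost by the same factor.

```python
# Per-token compute scales with the parameters actually used in the forward pass.
total_params = 1.8e12
active_params = 280e9
active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")                # ~15.6%
print(f"Compute saving vs a dense 1.8T model: {1 / active_fraction:.1f}x")  # ~6.4x cheaper
```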

OpenAI parallelizes the model across its A100 GPUs using 8-way tensor parallelism combined with 15-way pipeline parallelism.
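Those two figures imply 120 GPUs per model replica; combined with the roughly 25,000 A100s cited below, a back-of-the-envelope cluster layout (ignoring spare nodes) looks like this:

```python
# Rough cluster-layout arithmetic implied by the quoted parallelism figures.
tensor_parallel = 8        # 8-way tensor parallelism (reported)
pipeline_parallel = 15     # 15-way pipeline parallelism (reported)
gpus_per_replica = tensor_parallel * pipeline_parallel    # 120 GPUs hold one copy of the model
total_gpus = 25_000        # A100 count quoted for the training run
data_parallel_replicas = total_gpus // gpus_per_replica   # ~208 data-parallel replicas
print(gpus_per_replica, data_parallel_replicas)
```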

On top of this, DeepSpeed ZeRO Stage 1 or block-level FSDP is reportedly employed to shard optimizer state across the data-parallel replicas.
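For reference, ZeRO Stage 1 shards only the optimizer states, leaving parameters and gradients replicated. A minimal DeepSpeed configuration enabling it might look like the sketch below; the batch-size and precision values are illustrative placeholders, not anything OpenAI has disclosed.

```python
# Minimal DeepSpeed ZeRO Stage 1 configuration (illustrative values, not OpenAI's).
# Stage 1 shards only optimizer state across data-parallel ranks; parameters and
# gradients stay replicated, which keeps communication overhead low.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # hypothetical value
    "bf16": {"enabled": True},             # hypothetical precision choice
    "zero_optimization": {
        "stage": 1,                        # shard optimizer states only
    },
}

# Typically passed to deepspeed.initialize(...) when launching with the `deepspeed` CLI:
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
```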

GPT-4 incorporates a separate vision encoder with cross-attention, inspired by DeepMind's Flamingo architecture.
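A Flamingo-style block injects image information by letting the language stream's hidden states attend to the vision encoder's outputs via cross-attention. The sketch below is a generic illustration with made-up dimensions, not GPT-4's actual design.

```python
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    """Flamingo-style block: language features attend to vision-encoder outputs.
    Shapes and dimensions here are illustrative, not GPT-4's."""
    def __init__(self, d_text: int, d_vision: int, n_heads: int):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_text, num_heads=n_heads,
            kdim=d_vision, vdim=d_vision, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor) -> torch.Tensor:
        # Queries come from the text stream; keys/values come from image features.
        attended, _ = self.cross_attn(self.norm(text_h), vision_h, vision_h)
        return text_h + attended  # residual connection back into the language stream

# Toy usage with made-up sizes: 16 text tokens attend to 64 image patch embeddings.
block = VisionCrossAttentionBlock(d_text=512, d_vision=256, n_heads=8)
out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 16, 512])
```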

The vision encoder adds parameters on top of GPT-4's 1.8 trillion and is fine-tuned with roughly another 2 trillion tokens.

GPT-4 is trained on around 13 trillion tokens; this count includes multiple epochs over the data, so the tokens are not all unique.

The pre-training phase of GPT-4 uses an 8k context length; the 32k-context version is obtained by fine-tuning on top of the 8k model.

OpenAI's training compute for GPT-4 comes to around 2.15e25 FLOPs, run on approximately 25,000 A100s over 90 to 100 days, at an estimated cost of about $63 million for this run alone.
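Those headline numbers hang together on a back-of-the-envelope basis. The utilization figure and per-GPU-hour rate below are assumptions chosen to reproduce the quoted totals, not reported values.

```python
# Back-of-the-envelope check on the quoted compute and cost figures.
# The 33% utilization and ~$1.10/GPU-hour rate are assumptions, not reported numbers.
gpus = 25_000
a100_peak_flops = 312e12   # A100 peak BF16 tensor-core throughput, FLOPs/s
utilization = 0.33         # assumed hardware utilization during training
days = 95                  # midpoint of the 90-100 day range
seconds = days * 24 * 3600

total_flops = gpus * a100_peak_flops * utilization * seconds
print(f"Training compute: {total_flops:.2e} FLOPs")           # ~2.1e25, close to the quoted 2.15e25

gpu_hours = gpus * days * 24
cost = gpu_hours * 1.10    # assumed $/A100-hour
print(f"GPU-hours: {gpu_hours:,.0f} -> ~${cost / 1e6:.0f}M")  # ~57M GPU-hours, ~$63M
```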
