Reading time: 13 minutes

What is GPT-3?

The launch of OpenAI's third-generation pre-trained language model, GPT-3 (Generative Pre-trained Transformer 3), has the data science community buzzing with excitement!

The world of Language Models (LMs) is quite fascinating. To give a brief introduction, these models learn the probability of a sequence of words occurring in a language (say, English) and predict the next likely word in that sequence. They are essential for numerous NLP tasks like:

  • Language Translation
  • Text Classification
  • Sentiment Extraction
  • Reading Comprehension
  • Named Entity Recognition
  • Question Answering Systems
  • News Article Generation, etc.
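The core idea of next-word prediction can be illustrated with a toy bigram model. This is only a sketch on a made-up corpus; real LMs like GPT-3 use transformers over subword tokens, not word-pair counts.

```python
from collections import Counter, defaultdict

# Toy bigram language model: count which word follows which,
# then predict the most probable next word.
corpus = "the cat sat on the mat and the cat ran".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    counts = following[word]
    total = sum(counts.values())
    # probability of each candidate next word
    probs = {w: c / total for w, c in counts.items()}
    return max(probs, key=probs.get), probs

word, probs = predict_next("the")
print(word, probs)  # "cat" follows "the" twice, "mat" once
```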

They’ve become immensely popular since the release of BERT by Google, with a host of companies competing to build the next big thing in the NLP domain!

Did You Know?

OpenAI's GPT-3 is the largest language model to date, with 175 BN parameters, 10x more than Microsoft's Turing NLG

OpenAI has been in this race for a long time. The capabilities, features, and limitations of their latest model, GPT-3, are described in a detailed research paper. Its predecessor, GPT-2 (released in February 2019), was trained on 40 GB of text data and had 1.5 BN parameters. In comparison, GPT-3 has a whopping 175 BN parameters, 10 times more than the next largest LM, Microsoft's Turing NLG, which has 17 BN parameters!

Fig-1: Comparison of available language models (LMs) by parameter count
Source: TowardsDataScience

Like GPT-2, GPT-3 is based on the concepts of transformers and attention. It was trained on a large and varied corpus, including Common Crawl, WebText, books, and Wikipedia, with each dataset sampled according to its share of tokens. Prior to training, the average quality of the datasets was improved in three steps.

The following table shows the training corpus of GPT-3:

| Dataset | Quantity (tokens) | Weight in training mix | Epochs elapsed when training for 300 BN tokens |
|---|---|---|---|
| Common Crawl (filtered) | 410 BN | 60% | 0.44 |
| WebText2 | 19 BN | 22% | 2.90 |
| Books1 | 12 BN | 8% | 1.90 |
| Books2 | 55 BN | 8% | 0.43 |
| Wikipedia | 3 BN | 3% | 3.40 |
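The "epochs elapsed" column follows approximately from the other two: a dataset supplying a given weight of a 300 BN-token training run is seen about (300 × weight) / size times. Some rows differ slightly because the published weights are rounded.

```python
# Sanity check of the "epochs elapsed" column. Sizes are in BN tokens,
# taken from the table above.
datasets = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2":                (19, 0.22),
    "Books1":                  (12, 0.08),
    "Books2":                  (55, 0.08),
    "Wikipedia":               (3, 0.03),
}

for name, (size_bn, weight) in datasets.items():
    epochs = 300 * weight / size_bn  # passes over the dataset
    print(f"{name}: {epochs:.2f} epochs")
```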

GPT-3 comes in eight variants, ranging from 125 MN to 175 BN parameters, which differ in:

  • Size (number of parameters and layers)
  • Architecture
  • Learning hyper-parameters (batch size in tokens and learning rate)

Did You Know?

The largest version of GPT-3 has 175 BN parameters, 96 attention layers, and a batch size of 3.2 MN tokens

Here are the details of the different variants of GPT-3 model:

Fig-2: Details of variants of the GPT-3 model


What can it do?

Many of the NLP tasks discussed in this blog can be performed by GPT-3 without any gradient updates, parameter updates, or fine-tuning. This makes it a task-agnostic model: it can perform tasks given few or no prompts, examples, or demonstrations, called shots.

The following image compares zero-, one-, and few-shot task accuracy across model sizes (in terms of parameters) on a simple task, removing random symbols from a word, with the number of in-context examples ranging from 10 to 100.


Fig-3: Zero / One / Few-Shot based task accuracy comparison for models of different sizes
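The difference between the three settings is only how many solved examples precede the query in the prompt; no weights change. Here is a minimal sketch of prompt construction for the symbol-removal task, with made-up example words.

```python
# Building zero-, one-, and few-shot prompts for the symbol-removal task.
# The "learning" lives entirely in the prompt text.
task = "Remove the random symbols from the word."
examples = [("s.u!c@h", "such"), ("t#e%xt", "text"), ("w*o^rd", "word")]
query = "c!l$e(a)n"

def make_prompt(n_shots):
    lines = [task]
    for corrupted, clean in examples[:n_shots]:  # n_shots in-context demos
        lines.append(f"{corrupted} -> {clean}")
    lines.append(f"{query} ->")                  # the model completes this line
    return "\n".join(lines)

print(make_prompt(0))  # zero-shot: instruction only
print(make_prompt(1))  # one-shot: a single demonstration
print(make_prompt(3))  # few-shot: several demonstrations
```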

The “Fake News” Conundrum

Earlier, the release of the largest GPT-2 model was briefly stalled amid a controversial debate over its potential to generate fake news; it was later released, along with Colab notebooks. In recent times, however, fake news has become quite common, and even real news can be hard to believe!

Fake news generated by GPT-3 is so difficult to distinguish from the real thing that, in one of the experiments, human evaluators identified it only about 50% of the time, barely better than random guessing!

Fig-4: Accuracy comparison of manual fake news detection for models of different sizes

In a task to predict the last word of a sentence, GPT-3 outperformed the current SOTA (state-of-the-art) algorithm by 8%, with an accuracy of 76% in the zero-shot setting. In the few-shot setting, it achieved an accuracy of 86.4%!

In closed-book question answering tasks, GPT-3 outperformed a fine-tuned SOTA model that uses an Information Retrieval component, in both the one-shot and few-shot settings.

Fig-5: Performance of GPT-3 on Trivia QA for models of different sizes

Access to the GPT-3 API is gated by a waiting list, but the folks who got a chance to try it have shared interesting findings and impressive results from this powerful model. Here are a few observations from experimenting with the API's interface, called the Playground.

Summary of the OpenAI GPT-3 API Playground:

  • Settings and Presets:
    Clicking the settings icon lets you configure various parameters such as the response length, temperature (from low/boring to standard to chaotic/creative), start and stop sequences for the generated text, etc. There are also multiple presets to choose from and play around with, like Chat, Q&A, Parsing Unstructured Data, and Summarize for a 2nd Grader.
    • Chat:
      The Chat preset works like a chatbot. If you set the character of the AI to friendly, creative, clever, and helpful, it provides informative answers in a very polite manner; if you set the character to brutal, it responds exactly as that suggests!
    • Q&A:
      Question answering needs a few examples before it starts answering questions, and people had no complaints about the quality of the answers received.
    • Parsing Unstructured Data:
      An interesting preset that can comprehend and extract structured information from unstructured text.
    • Summarize for 2nd Grader:
      This preset shows another level of text compression, rephrasing difficult sentences and concepts into simpler words that a child can easily understand.
  • Multilingual text processing:
    GPT-3 handles languages other than English better than GPT-2. People have tried tasks in various languages, including German, Russian, and Japanese; it performed well and seems ready for multilingual text processing.
  • Text Generation:
    It can generate poems on demand, in a particular style if required, and can write stories and essays, even in other languages, with a little prompting.
  • Code Generation:
    People have reported that the API can generate code from minimal prompts.
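The Playground's temperature setting can be illustrated with temperature-scaled softmax sampling, the standard technique behind that slider. The logits below are made up for the sketch.

```python
import math

# Temperature-scaled softmax over next-token scores. Low temperature
# concentrates probability on the top token ("boring"); high temperature
# flattens the distribution toward uniform ("chaotic").
def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                   # hypothetical scores for three tokens
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 5.0)
print(cold[0], hot[0])  # cold[0] is near 1.0; hot[0] is much closer to 1/3
```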

Here is an article that showcases its capabilities, with excerpts from social media.

And this is what the Playground interface looks like (the image below shows the Q&A preset):

Fig-6: Preview of the AI Playground page for a Q&A preset


How can we use it?

Unlike many language models, GPT-3 does not need transfer learning, where the model is fine-tuned on task-specific datasets for each task. The authors of the GPT-3 paper mention the following advantages of a task-agnostic model:

  • Collecting task-specific data is difficult
  • Fine-tuning can exploit spurious correlations in the training data, hurting out-of-distribution performance
  • The need for an adaptable NLP system, similar to humans, that can understand natural language (e.g., English) and perform tasks from few or no prompts

GPT-3 is applied through in-context learning: the model is fed a task description, prompt, or a few examples (shots), and it responds based on the skills and pattern-recognition abilities learnt during training, adapting them to the task at hand.

Despite its tremendous usability, the model's huge size is the biggest factor hindering adoption for most people, except those with substantial resources. However, there are discussions in the community that distillation might come to the rescue!
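The idea behind distillation is to train a small student model to match the softened output distribution of a large teacher. A minimal sketch of the soft-target loss, with made-up logits (a real setup would use the models' actual next-token logits and also mix in the hard-label loss):

```python
import math

# Knowledge distillation in a nutshell: the student minimizes the
# cross-entropy against the teacher's temperature-softened probabilities,
# so a much smaller model can approximate a much larger one.
def softmax(logits, T):
    exps = [math.exp(x / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened prediction
    # cross-entropy H(p, q); minimized when the student matches the teacher
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]            # hypothetical next-token logits
student = [2.5, 1.2, 0.1]
print(distillation_loss(teacher, student))
```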

What are the limitations?

OpenAI's founder himself has said that “GPT-3 has weaknesses and it makes silly mistakes”. It is weak at sentence-comparison tasks, where it has to judge how a word is used in two different sentences.

As per the researchers, it still faces some problems in the following tasks: 

  • Repetitions
  • Coherence loss
  • Contradictions
  • Drawing real conclusions
  • Multi-digit addition and subtraction

Fig-7: Chart showing results of different arithmetic tasks in a few-shot setting for models of different sizes

Conclusion

It is great to have an NLP system that doesn't require large task-specific datasets or custom model architectures to solve specific NLP tasks. The experiments conducted show its power, its potential, and its likely impact on the future of NLP.

Though GPT-3 doesn't do well at everything, and its size makes it difficult for everyone to use, this is just the beginning of many new improvements to come in the field of NLP!

References

  1. GPT-3 paper: https://arxiv.org/pdf/2005.14165.pdf
    All figures in this post are taken from this paper.
  2. GPT-3 Github page: https://github.com/openai/gpt-3
  3. Blog on GPT-2: https://openai.com/blog/better-language-models/
  4. Paper about the concepts of transformers and attention: https://arxiv.org/abs/1706.03762
  5. Link to the GPT-3 API: https://beta.openai.com/
  6. Paper on Distillation in Neural Networks: https://arxiv.org/abs/1503.02531
  7. Article on GPT-3 in action: https://towardsdatascience.com/gpt-3-creative-potential-of-nlp-d5ccae16c1ab

About the Author

Bhaskar Ammu is a Senior Data Scientist at Sigmoid. He specializes in designing data science solutions for clients, building database architectures and managing projects and teams.
