Apart from ChatGPT, which made Gen AI accessible to the masses, OpenAI provides many powerful foundational models through its API endpoint, allowing users to build Gen AI into their own applications. The OpenAI API SDK is available in JavaScript (Node.js) and Python.
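With the Python SDK, creating a client takes a couple of lines. A minimal sketch, assuming the API key is set in the environment:

```python
# pip install openai
from openai import OpenAI

# By default the client reads the key from the OPENAI_API_KEY environment variable.
client = OpenAI()
```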
GPT-3.5 Turbo
GPT-3.5 Turbo models are for multi-turn conversations and are equally capable of single-turn text-completion tasks. They support a 16k context window by default.
Because the models have no memory of messages from previous requests, it is necessary to save the conversation and include it in subsequent requests. If the conversation exceeds the model's maximum token size, it must be shortened, or you can use the gpt-4-32k model, which can handle up to 32k tokens.
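A rough sketch of what this bookkeeping looks like with the Python SDK; the history list and chat helper are illustrative, not part of the API:

```python
from openai import OpenAI

client = OpenAI()

# The API is stateless, so the client keeps the running transcript
# and resends it with every request.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("What is a context window?"))
print(chat("And how large is yours?"))  # the model sees the previous turn
```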
GPT-4 Turbo
GPT-4 Turbo is a multimodal model that can accept text and images and output text. Under the hood, it encodes text and images into the same encoding space and processes the data from different sources through the same neural network. It has a 128k context window (as of 21/11/2023).
GPT-4 Turbo with Vision, code-named gpt-4-vision-preview, is the GPT-4 model that can accept one or more images as input and answer questions about them. It can analyse images in detail and read documents with figures.
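A sketch of an image question with the Python SDK; the image URL and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Send a text question and an image in the same message.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this figure show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```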
Although OpenAI did not disclose the details of gpt-4-vision-preview, common knowledge about multimodal LLMs suggests it uses CLIP.
Whisper
Whisper is a general-purpose speech recognition model capable of multilingual speech recognition, speech translation and language identification.
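A minimal transcription sketch with the Python SDK, assuming an audio file named meeting.mp3; the hosted model is exposed as whisper-1, and translation to English uses the analogous client.audio.translations endpoint:

```python
from openai import OpenAI

client = OpenAI()

# Transcribe an audio file with the hosted Whisper model.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```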
Code Interpreter
Code Interpreter is a GPT model that can execute Python code in a sandbox and generate charts, graphs, data files or PDFs. For example, you can ask GPT-4 to ‘write a Python function to analyse a data file and generate a chart to find the trend’, and the generated code can be fed to the Code Interpreter model to derive the desired results.
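Code Interpreter is exposed as a tool through the Assistants API (covered below). A minimal sketch of attaching it to an assistant with the Python SDK; the instructions and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# An assistant that can write and run Python in OpenAI's sandbox.
assistant = client.beta.assistants.create(
    model="gpt-4-turbo",
    instructions="Analyse uploaded data files and generate charts when asked.",
    tools=[{"type": "code_interpreter"}],
)

# Conversations happen in threads; a run executes the assistant on a thread.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Write a Python function to analyse a data file and chart the trend.",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
```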
DALL·E 3
DALL·E 3 can generate images based on user prompts. It is available in ChatGPT Plus, ChatGPT Enterprise and as dall-e-3 through the OpenAI API endpoint.
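A minimal generation sketch with the Python SDK; the prompt and size are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Generate one image from a text prompt with DALL-E 3.
result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolour painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # temporary URL of the generated image
```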
Sora
Sora is a text-to-video model recently announced by OpenAI. It can produce extraordinarily realistic high-definition videos up to one minute long. Although OpenAI has shared no details on how Sora was built, the model raises the bar for detail and realism in AI video generation. It is currently being tested by OpenAI red teamers to evaluate AI risks such as bias and harmful content. You can see examples of Sora-generated videos here.
Thanks to its deep understanding of language, Sora can interpret text prompts accurately and generate imaginative videos that realistically reflect real-world physics and carry compelling emotions.
The model proves that diffusion transformers work well for videos, indicating that video generation could get as good as text generation with current technologies. The model is believed to be capable of learning physics and understanding the world to an extent. It is the GPT-3 moment of video models.
Though presented as a video-generation model, Sora is another massive step towards AGI. It will likely propel AI adoption, raise deep concerns and disrupt industries.
GPT-4o
GPT-4o (“o” for “omni”) is the latest model from OpenAI (as of May 2024). Unlike previous OpenAI models that accept only a single input type, GPT-4o can interact with humans through any combination of text, audio, images and video, and respond in the same rich formats. For example, it enables users to converse with the AI model through voice chat instead of typing. OpenAI states that the model can ‘see, hear and speak’. Another improvement of GPT-4o over GPT-4 is its understanding of text in non-English languages.
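In the API, GPT-4o is called through the same chat completions endpoint as earlier GPT models. A sketch mixing text and an image in one request; the URL is a placeholder, and voice conversations are delivered through the ChatGPT apps rather than this endpoint:

```python
from openai import OpenAI

client = OpenAI()

# GPT-4o accepts mixed text and image content in a single message.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this photo in French."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```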
Text-to-speech (TTS)
OpenAI provides two TTS models: tts-1 for real-time voice generation and tts-1-hd for high-quality voice generation.
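A minimal sketch with the Python SDK; the voice and input text are illustrative, and stream_to_file follows the SDK's documented pattern for saving audio:

```python
from openai import OpenAI

client = OpenAI()

# tts-1 is tuned for latency; swap in tts-1-hd for higher quality.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="The quick brown fox jumped over the lazy dog.",
)
speech.stream_to_file("speech.mp3")
```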
OpenAI API
The OpenAI API allows developers to build LLM-powered custom applications, whether new native AI applications or existing business applications with added LLM capabilities.
Projects
Projects is an OpenAI API feature that lets enterprise customers scope permissions for model usage, internal file access and cost management. Customers can assign roles and dedicated API keys to specific projects to allow or deny access to models and to set rate limits.
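A sketch of how a project-scoped client might be constructed with the Python SDK; the key and project ID below are placeholders, and the project client option is an assumption based on the SDK's configuration parameters:

```python
from openai import OpenAI

# A project-scoped key (or explicit project ID) confines this client to
# the models, files and rate limits granted to that project.
client = OpenAI(
    api_key="sk-proj-...",   # placeholder project API key
    project="proj_abc123",   # hypothetical project ID
)
```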
Batch API
The Batch API allows users to save up to 50% on API costs for tasks that do not require a real-time response from the AI models. Users can submit many requests in a single batch, and OpenAI guarantees that responses will be returned within 24 hours. For most real-world use cases, the Batch API returns responses within 20-30 minutes.
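A rough sketch of the batch flow with the Python SDK, assuming requests.jsonl contains one chat-completion request per line:

```python
from openai import OpenAI

client = OpenAI()

# Upload a .jsonl file where each line is one chat-completion request.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# Create the batch job; OpenAI guarantees completion within 24 hours.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.status)  # e.g. "validating" -> "in_progress" -> "completed"
```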
Assistants API
Assistants API is OpenAI’s version of an AI agent.
Agents are not a new concept in the development of Generative AI applications. LangChain Agents and AgentGPT are well-known frameworks that have existed since the beginning of the Gen AI excitement. They use LLMs to orchestrate complex tasks by breaking them down into smaller, simpler steps and passing the sub-tasks to other tools (LangChain Tools, OpenAI Tools) that specialise in specific tasks, especially those that LLMs are known to be bad at, such as maths or analysing structured data. This allows LLMs to expand their abilities infinitely, at least in theory. Tools can be external expert models, purpose-built applications (built by the frameworks or by users) or knowledge retrieval. The outputs of each tool are then processed by the agent LLM to arrive at the final response to users.
The OpenAI Assistants API uses OpenAI models to execute three types of tools in parallel: Code Interpreter (code_interpreter), knowledge retrieval (file_search) and function calling (tools you build/host).
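A sketch of registering tools on an assistant with the Python SDK; the get_weather function is hypothetical, and your application executes it when the run requests it:

```python
from openai import OpenAI

client = OpenAI()

# The model decides when to call get_weather and returns the arguments
# for your application to execute and feed back into the run.
assistant = client.beta.assistants.create(
    model="gpt-4-turbo",
    instructions="You are a helpful assistant. Use tools when needed.",
    tools=[
        {"type": "code_interpreter"},
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical function you host
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        },
    ],
)
```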
OpenAI’s GPTs are a use case of the Assistants API.
Custom Models
OpenAI announced a Custom Models Program, allowing selected organisations to work with dedicated OpenAI teams to create domain-specific models.
Davinci (Legacy)
Updated @ 25/11/2023: Davinci models are considered legacy models (2020-2022).
The text-davinci models are for single-turn text completion.
As of 15/03/2023, gpt-3.5-turbo and text-davinci-003 are on par in terms of capability and performance. However, the former costs only 10% of the price per token of the latter and should be used for most use cases.
See the complete list of OpenAI Models.