Latent Space // MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

Key Points

  • MosaicML released MPT-7B in part as a demo: proof that their platform can train a strong model efficiently, hands-free, and at a reasonable cost.
  • Evaluation of large language models is incredibly hard and current metrics do not truly capture what is expected from the models in practice.
  • FlashAttention and the ALiBi position encoding are built into MPT-7B and provide significant benefits to training and inference speed, stability, and the ability to handle longer context lengths (a minimal ALiBi sketch follows this list).
  • Releasing a model that can generate creative content has caused some controversy, due to copyright concerns around the training data and issues with open-source licenses.
  • The Mosaic platform offers stable training infrastructure and pre-configured models (like MPT-7B) to help customers with their training needs, but customers should have good evaluation metrics and quality data before starting.
  • Mosaic's goal is to make it cheaper to train models so that we can run more experiments and understand how to make them work better.
  • There's a need for coexistence between open-source and closed-source AI models as different models serve different purposes.
  • Mosaic was initially reluctant to offer inference, but customers convinced them: they wanted a good way to serve the models they had trained and they liked the ecosystem. The team is now excited to have an inference product and plans to put everything into it.
  • Training is not a one-time cost; there will always be a need for future models. Training hardware is probably more expensive than inference hardware, and the new hardware coming out this year could significantly reduce the cost of training. The team expects an exciting year for training efficiency.
  • Mosaic shares its research as open models because doing science in the open is part of the company's mission, even though it also has a business imperative to make money. The open models serve the business as demos, and the recent inference product makes the decision to share them even easier.
  • The unsolved question Jonathan is most interested in is how efficiently we can get models that are as good as the big ones; Avi is interested in how small we can go with quantization, perhaps down to analog or even binary weights. Both suggest staying balanced and focused on the science, letting the data be a guide in creating useful tools for the world, and doing research in the open.
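
To make the ALiBi point above concrete, here is a minimal, illustrative sketch of the idea (our own simplification, not MosaicML's implementation): instead of adding position embeddings to the token inputs, each attention head adds a penalty to its attention scores that grows linearly with the query-key distance, which is what lets the model extrapolate to contexts longer than it saw during training.

```python
# Minimal ALiBi sketch (illustrative, not MosaicML's code). Each head gets a
# fixed slope; attention logits are penalized in proportion to how far apart
# the query and key positions are.
import torch


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (n_heads, seq_len, seq_len) bias to add to attention logits."""
    # Per-head slopes form a geometric sequence, as in the ALiBi paper
    # (exact for power-of-two head counts).
    start = 2 ** (-8 / n_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i: zero on the diagonal, negative for earlier keys.
    distance = pos[None, :] - pos[:, None]
    # Farther-away keys get a larger negative bias; future positions (j > i)
    # are handled by the usual causal mask.
    return slopes[:, None, None] * distance[None, :, :]


# Usage: scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(n_heads, seq_len)
```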

Episode Description

We are excited to be the first podcast in the world to release an in-depth interview on the new SOTA in commercially licensed open source models - MosaicML MPT-7B!

One of GPT3’s biggest limitations is context length - you can only send it up to 4000 tokens (3k words, 6 pages) before it throws a hard error, requiring you to bring in LangChain and other retrieval techniques to process long documents and prompts. But MosaicML recently open sourced MPT-7B, the newest addition to their Foundation Series, with context length going up to 84,000 tokens (63k words, 126 pages).
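
For a sense of what that workaround looks like in practice, here is a generic chunking sketch (illustrative only, not LangChain's API; the 3,000-token budget and the 1.33 tokens-per-word estimate are rough assumptions):

```python
# Rough sketch of the chunk-then-retrieve workaround a 4k context forces.
# Token counts are approximated from word counts; a real pipeline would use
# the model's tokenizer and an embedding-based retriever.
def chunk_text(text: str, max_tokens: int = 3000, tokens_per_word: float = 1.33) -> list[str]:
    """Split text into chunks that stay under a rough token budget."""
    words = text.split()
    words_per_chunk = int(max_tokens / tokens_per_word)
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# At query time you would embed the chunks, retrieve the few most relevant to
# the question, and stuff only those into the prompt -- machinery that a
# 65k-84k context window lets you skip for many documents.
```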

This transformer model, trained from scratch on 1 trillion tokens of text and code (compared to 300B for Pythia and OpenLLaMA, and 800B for StableLM), matches the quality of LLaMA-7B. It was trained on the MosaicML platform in 9.5 days on 440 GPUs with no human intervention, costing approximately $200,000. Unlike many open models, MPT-7B is licensed for commercial use and it’s optimized for fast training and inference through FlashAttention and FasterTransformer.
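
A quick back-of-envelope check on those numbers (the roughly $2/GPU-hour rate implied below is our own rough assumption for A100 cloud pricing, not MosaicML's actual rate):

```python
# Sanity-check the quoted training cost from the figures above.
gpus = 440
days = 9.5
total_cost = 200_000  # approximate, per the announcement

gpu_hours = gpus * days * 24                          # ~100,320 GPU-hours
print(f"{gpu_hours:,.0f} GPU-hours")
print(f"${total_cost / gpu_hours:.2f} per GPU-hour")  # ~$1.99, i.e. about $2/A100-hour
```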

They also released 3 finetuned models starting from the base MPT-7B:

* MPT-7B-Instruct: finetuned on dolly_hhrlhf, a dataset built on top of dolly-15k (see our Dolly episode for more details).

* MPT-7B-Chat: finetuned on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets.

* MPT-7B-StoryWriter-65k+: finetuned with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. While 65k is the advertised size, the team has gotten responses of up to 84k tokens when running on a single node of A100-80GB GPUs. ALiBi is the dark magic that makes this possible. Turns out The Great Gatsby is only about 68k tokens, so the team used the model to create new epilogues for it! (See the quick token-count sketch below.)
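
If you want to try something like the Gatsby experiment yourself, the first step is just counting tokens with the model's tokenizer. A quick sketch (the file path is a placeholder, and the exact count depends on the edition of the text):

```python
# Count tokens in a book with MPT-7B's tokenizer to see whether it fits in
# StoryWriter's 65k window. "gatsby.txt" is a placeholder path.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")

with open("gatsby.txt") as f:
    text = f.read()

n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens:,} tokens")  # The Great Gatsby comes out around 68k
```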

On top of the model checkpoints, the team also open-sourced the entire codebase for pretraining, finetuning, and evaluating MPT via their new MosaicML LLM Foundry. The table we showed above was created using LLM Foundry's in-context-learning eval framework itself!
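
To make the in-context-learning eval idea concrete (and to set up the benchmark skepticism below), here is a generic sketch of how multiple-choice evals typically score a model: compute the log-likelihood the model assigns to each answer option given the prompt, then pick the highest. This uses plain Hugging Face transformers and is not LLM Foundry's actual interface; the model name in the usage comment is just an example.

```python
# Generic multiple-choice ICL scoring sketch (not LLM Foundry's API): rank
# answer options by the total log-probability the model assigns to them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def score_choice(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of log-probabilities of `choice`'s tokens, conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                    # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row t predicts token t+1
    # Choice tokens sit at positions prompt_len .. seq_len-1 (a simplification:
    # retokenizing prompt+choice can shift the boundary by a token).
    return sum(
        log_probs[pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )


# Usage (model name as an example):
#   model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
#   tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
#   best = max(["Paris", "London"],
#              key=lambda c: score_choice(model, tokenizer, "Q: What is the capital of France? A:", c))
```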

In this episode, we chatted with the leads of MPT-7B at Mosaic: Jonathan Frankle, Chief Scientist, and Abhinav Venigalla, Research Scientist who spearheaded the MPT-7B training run. We talked about some of the innovations they’ve brought into the training process to remove the need for 2am on-call PagerDutys, why the LLM dataset mix is such an important yet dark art, and why some of the traditional multiple-choice benchmarks might not be very helpful for the type of technology we are building.
