Feed the LLaMAs and Falcons the right data

What's the right amount and type of data required to train language models for enterprise use cases?

LLMs are trained on massive amounts of data collected from the internet, covering practically every topic from farming to aliens. But do we need all of this information encoded in a single neural network? Probably not.


Consider how we design course subjects in elementary education, and how those subjects and fields of study evolve into specializations by the time we graduate from university. First, we learn language and basic math skills. Then, later in our education, we pick specializations like medicine or mathematics. We should design the training plan and data for language models in a similar fashion: the training process should be structured into well-defined steps, and the data should be categorized into logical groups.

From our experiments and deployments, we have learned that a relatively small model like Flan-T5 suffices as the base model for further fine-tuning on most tasks. We don't need a 100+ billion parameter model to tackle many enterprise use cases. A large language model is needed only to front-end user interaction: understanding a user's query is difficult and error-prone, but once the query is understood or decomposed, a smaller specialized model can handle the rest of the request.
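To make this concrete, here is a minimal fine-tuning sketch using Hugging Face Transformers, assuming a JSONL file of input/target pairs. The file name, field names, and hyperparameters are illustrative, not a recommended recipe.

```python
# Minimal sketch: fine-tuning a small Flan-T5 model on task-specific
# input/target pairs. File name, field names, and hyperparameters are
# placeholders chosen for illustration.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Expects a JSONL file with {"input": ..., "target": ...} records.
dataset = load_dataset("json", data_files="task_data.jsonl")["train"]

def preprocess(example):
    model_inputs = tokenizer(example["input"], max_length=512, truncation=True)
    labels = tokenizer(text_target=example["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-specialized",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```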

Research from MIT and others has shown that training specialized models is far more efficient than training a single very large language model.

From a talk by Prof. Peter Szolovits (MIT)

So, how do we plan the training and data for such specialized models? Let's split the skills needed by a specialized model into sub-groups, broadly categorized as foundational and specialized skills.


Foundational skills:

Foundational skills are required for any model to understand and process requests at a basic level.

Language: This is the most important training data. It needs to cover language usage in many forms: news, speech, chat, poetry, discussions, and the 100+ other forms of language use. The basic skill any language model needs is to understand user input and to produce output in any given form and language.

Reasoning: Reasoning is required for the model to take the right decisions and steps. Chain-of-Thought datasets covering basic math problems and complex scenario-based reasoning are needed to train this skill.
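As a rough illustration, a Chain-of-Thought training record pairs a question with the reasoning steps and the final answer. The format below is a common convention, not a specific dataset we use.

```python
# Illustrative Chain-of-Thought record: the target contains the reasoning
# steps followed by the final answer.
cot_example = {
    "input": "Q: A crate holds 12 boxes and each box holds 8 widgets. "
             "How many widgets are in 3 crates? Let's think step by step.",
    "target": "Each crate has 12 * 8 = 96 widgets. "
              "3 crates have 3 * 96 = 288 widgets. The answer is 288."
}
```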

Question answering: QA itself has multiple forms.

  • Unstructured text-based QA: extracting answers from documents, e.g., a PPT, a PDF, or an online blog post
  • Structured data QA: extracting answers from structured content like spreadsheets and tables (see the example records after this list)
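Here are illustrative training records for both forms; the field names and phrasing are ours, for illustration only.

```python
# Unstructured QA: the passage comes from a document such as a PDF or blog post.
unstructured_qa = {
    "input": "Answer the question using the passage.\n"
             "Passage: The warranty covers parts and labour for 24 months "
             "from the date of purchase.\n"
             "Question: How long does the warranty last?",
    "target": "24 months from the date of purchase."
}

# Structured QA: the table is linearized row by row before being fed to the model.
structured_qa = {
    "input": "Answer the question using the table.\n"
             "Table: region | q1_sales | q2_sales\n"
             "North | 120 | 150\n"
             "South | 90 | 110\n"
             "Question: Which region had higher Q2 sales?",
    "target": "North, with 150."
}
```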

Prompting: Prompt training can be limited to a specific set of prompts given the business use case, and that set can be pre-defined since we are dealing with only a subset of scenarios. In most enterprise use cases where the model sits in a backend system, generative models can be trained with very specific, fixed prompts, which also helps the model generate the same output structure every time. Note that this applies only to generative text-completion models; models needed solely for a specific task, such as an NER classifier, do not require prompt training.
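A sketch of what such a fixed backend prompt might look like is below; the wording never changes, only the ticket text is substituted, so the model learns to emit the same structure every time. The schema, field names, and categories are illustrative.

```python
# Illustrative fixed backend prompt: only {ticket_text} varies between calls,
# so the model can be trained to always return the same JSON structure.
FIXED_PROMPT = (
    "Extract the following fields from the support ticket and reply as JSON "
    'with keys "product", "issue_type", and "urgency".\n'
    "Ticket: {ticket_text}"
)

def build_prompt(ticket_text: str) -> str:
    return FIXED_PROMPT.format(ticket_text=ticket_text)

print(build_prompt("The invoicing module crashes every time I export to PDF."))
```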


Specialized skills:


This is a long list, and each business use case will need a different selection of specialized-skill datasets: math, proteins, farming, space exploration, customer service, coding, deployment skills, and so on. These can be combined much like we combine specializations in our education.
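One simple way to express such a combination is a weighted mixture over foundational and specialized datasets. The dataset names and weights below are placeholders, not a published recipe.

```python
# Illustrative skill mixture for a customer-service assistant: foundational
# skills plus specialized data, with sampling weights that sum to 1.
mixture = {
    "language_general":        0.35,
    "reasoning_cot":           0.15,
    "qa_unstructured":         0.15,
    "qa_structured":           0.10,
    "prompting_fixed_formats": 0.05,
    "customer_service_logs":   0.20,
}
assert abs(sum(mixture.values()) - 1.0) < 1e-9
```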


We have compiled a list of datasets and combinations for each of the enterprise use cases we have come across. We will publish our findings and the list of datasets used for various use cases soon. Watch our blog or subscribe to our newsletter for updates.

Advantages:

Most LLMs currently available are prohibitively expensive to train and deploy. Given the speed at which AI technology has been moving over the last few months, it is important for enterprises to have training and deployment operations that can produce new or updated models quickly. And since these specialized models can run on a single GPU, the financial and operational overhead is also reduced.

Mathew Samuel
Updated: August 8, 2023
Category: AI, LLM