Sep 9, 2024

How to Generate Synthetic Data for RAG

Pratik Bhavsar

Galileo Labs

Learn to create and filter synthetic data

High-quality AI models rely heavily on large, diverse, and well-curated datasets for training and evaluation. But acquiring such datasets can be a significant challenge due to data scarcity, privacy concerns, and the cost of data collection and annotation. Synthetic data has emerged as a promising solution to these challenges. This blog explores its benefits and challenges, and how it can be used effectively for training and evaluating LLMs.

What is a Synthetic Dataset?

Synthetic data refers to artificially generated data that mimics the characteristics and patterns of real-world data. Unlike real data, collected from actual events or observations, synthetic data is created through algorithms, generative models, or simulations. This data type can be tailored to specific requirements, ensuring a balanced representation of different classes and introducing controlled variations to improve model performance and generalization.

Synthetic Data in Building LLMs

The development and training of LLMs have evolved significantly, incorporating both pre-training and post-training methodologies. Here are some key ways synthetic data can be utilized in the training pipelines of modern LLMs, as illustrated by recent advancements in models like Alibaba's Qwen 2, Apple's AFM, Google's Gemma 2, and Meta AI's Llama 3.1.

Pre-Training with Synthetic Data

Here are some examples of how synthetic data is used for pre-training LLMs.

Data Augmentation and Diversity

Qwen 2: The Qwen models leveraged synthetic data generated by previous iterations of Qwen models to augment their pre-training datasets. This approach helped enhance data diversity and improve the model's ability to handle various tasks.

AFM: Apple's AFM models included synthetic data in their pre-training stages, particularly for context lengthening. They augmented their datasets with synthetic long-context Q&A data to improve the model's performance on tasks requiring long-term dependencies.

Gemma 2: Google's Gemma models also utilized synthetic data generated through knowledge distillation. Smaller models were trained using outputs from larger teacher models, enriching the training data and improving the efficiency of the training process.
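For intuition, here is a minimal sketch of the distillation objective such approaches typically build on: the student is trained to match the teacher's softened output distribution. This is a generic illustration of knowledge distillation, not Gemma's exact recipe.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD loss: KL divergence between the softened teacher and
    student distributions over the vocabulary, scaled by T^2."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)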

Improving Data Quality

Qwen 2: Emphasized improving the data filtering pipeline to remove low-quality data and enhance data mixing, ensuring the synthetic data used was of high quality.

AFM: Focused on using high-quality synthetic data for continued pre-training, particularly for math and code tasks. This ensured that the model received high-quality signals during training.

Context Lengthening

Qwen 2: Performed long-context training in the later stages of pre-training, using high-quality, lengthy synthetic data to increase the context length from 4,096 to 32,768 tokens.

AFM: Included a dedicated pre-training stage for context lengthening, where synthetic data was used to train the model on longer sequences, enhancing its ability to handle extended contexts.

Post-Training with Synthetic Data

Similarly, here are examples of how synthetic data is leveraged for post-training of LLMs.

Supervised Instruction Fine-Tuning (SFT)

Qwen 2: Used synthetic data to create instruction-response pairs, particularly for "high-quality literary data," to refine the model's response accuracy in predetermined scenarios.

AFM: Leveraged both human-annotated and synthetic data for SFT, fine-tuning the data mixture through multiple experiments to achieve the optimal balance.

Reinforcement Learning with Human Feedback (RLHF)

Qwen 2: Employed a two-stage alignment phase using Direct Preference Optimization (DPO) on an existing dataset and in real-time during training. Synthetic data played a role in forming the preference pairs for optimization.
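The core DPO objective is simple enough to sketch directly. Given summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, the loss looks like the following (a generic sketch, not Qwen's exact implementation):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs. Each argument is a
    tensor of summed log-probs for a full response."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()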

AFM: Introduced new algorithms like Rejection Sampling Fine-tuning with Teacher Committee (iTeC) and RLHF with Mirror Descent Policy Optimization, using synthetic data to generate multiple responses and select the best ones for training.

Rejection Sampling

Qwen 2 and AFM: Used synthetic data to generate multiple responses during training, with a reward model selecting the preferred response. This approach, often called rejection sampling, helps refine the model's alignment with human preferences.
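In code, rejection sampling reduces to sampling several candidates and keeping the one a reward model prefers. A minimal sketch, where generate and reward_model are placeholders for your own sampling and scoring functions:

def rejection_sample(prompt, generate, reward_model, n_candidates=8):
    """Sample n candidates for a prompt and keep the highest-reward one.

    generate(prompt) -> str              : samples one response (placeholder)
    reward_model(prompt, resp) -> float  : scalar preference score (placeholder)
    """
    candidates = [generate(prompt) for _ in range(n_candidates)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    # The (prompt, best response) pair becomes a synthetic SFT example
    return prompt, candidates[best]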

Model Distillation

Gemma 2: Applied knowledge distillation during both pre-training and post-training, using synthetic data generated by teacher models to train smaller models. This method, combined with model averaging techniques, helped stabilize and improve performance over time.

Llama 3.1: Employed model averaging not only for the reward models but also for the SFT and DPO models, using synthetic data to enhance the training process.

In addition, other state-of-the-art models, such as Phi-3 and NuminaMath, leverage synthetic data heavily to build high-performance SLMs (small language models).

Benefits of Synthetic Datasets

Synthetic data offers several benefits, making it an attractive option for training and evaluating AI models.

Limitless Data

Synthetic data can be generated at scale, providing an abundant supply of training and testing data for AI models. This is particularly valuable in domains where real-world data is scarce or difficult to obtain. For example, generating synthetic chat data can help improve chat models by covering various conditions and scenarios.

Absence of Closed Domain Data

In certain domains, such as healthcare, obtaining real-world data can be challenging due to privacy concerns and regulatory restrictions. For example, patient data in healthcare is highly sensitive and subject to strict privacy laws. Synthetic data can be used to create datasets that mimic real-world data without compromising privacy, enabling the development of AI models in these sensitive domains. This allows researchers and developers to work with realistic data while adhering to privacy regulations.

Improved Model Performance

The limits of human creativity can restrict the diversity and variety of data that can be generated manually. For example, creating diverse and realistic scenarios for training autonomous vehicles or medical diagnosis systems can be challenging. Synthetic data can overcome this limitation by using algorithms to create diverse and varied datasets that capture a wide range of scenarios and conditions, enhancing the robustness and generalization of AI models.

Cheaper Data

One of the primary challenges in creating high-quality datasets is the cost of data annotation. Annotating large volumes of data requires significant human effort and resources, making it costly and time-consuming. Synthetic data generation shifts most of this work to models, so human effort can be reserved for reviewing and correcting a much smaller sample.

Privacy Compliant

Synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information. This is crucial in domains such as healthcare, where patient privacy is of utmost importance.

Limitations of Synthetic Datasets

Despite its promise, synthetic data is not perfect and has limitations that must be addressed.

Ensuring Quality

One of the main challenges is ensuring the factuality and fidelity of synthetic data. Models trained on false, hallucinated, or biased synthetic data may fail to generalize to real-world scenarios. For example, if a language model is trained on synthetic data that contains factual errors, it may produce inaccurate or misleading responses. Researchers must develop sophisticated generative models and evaluation metrics to create synthetic data that accurately reflects real-world data's complex patterns and relationships.

Bias Amplification

Synthetic data can amplify or introduce biases if not carefully designed and validated. For instance, the resulting AI model may exhibit biased behavior if the synthetic data generation process is biased towards certain demographics. Rigorous testing and fairness assessments are necessary to mitigate these risks and ensure that synthetic data does not perpetuate or exacerbate biases.

Evaluation Contamination

Using synthetic data in model training poses significant challenges to fair evaluation. Evaluation benchmarks are often created by referring to public text sources, which can lead to contamination if the synthetic data includes rephrased versions of the benchmark data. This can result in inflated performance metrics and misleading conclusions about the model's capabilities. Developing robust evaluation protocols and contamination detection techniques is essential to address this challenge.
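One simple and widely used contamination check is n-gram overlap between synthetic samples and benchmark text. A rough sketch (13-grams are a popular heuristic; the choice of n is an assumption to tune):

def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample, benchmark_texts, n=13):
    """Flag a synthetic sample if it shares any n-gram with a benchmark item."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(b, n) for b in benchmark_texts)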

Survey on Synthetic Datasets

Recent advancements in synthetic data generation have led to various synthetic datasets for different domains. For example, synthetic data has been used to improve performance on math-related tasks, code reasoning, tool-using abilities, and multilingual language models. These datasets have demonstrated synthetic data's potential in enhancing AI models' capabilities.

Reasoning Tasks

Synthetic data has been effectively used in reasoning tasks such as mathematical problem-solving and code generation. For example, models like Minerva, Llemma, and DeepSeekMath have been trained on synthetic math-targeted pre-training data, improving their performance on math-related tasks. Similarly, synthetic data has been used to generate complex questions and answers, enhancing the reasoning capabilities of language models.

Tool-Using and Planning

LLMs are increasingly used to build agents, which must select and invoke tools. Synthetic data has also enabled language models to learn these tool-using and planning skills. For example, models like GPT-4o, Claude 3.5 Sonnet, and Toolformer have been trained on interaction data annotated with calls to appropriate tools, enabling them to use calculators, search engines, and machine translators effectively. Synthetic trajectories in simulated environments have been used to teach models planning skills, such as decomposing complex tasks into subtasks and completing them in a reward-optimal way.
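Synthetic tool-use data often takes the form of conversations annotated with structured tool calls. A sketch of one such training record (the schema is illustrative, not any particular model's format):

tool_use_example = {
    "messages": [
        {"role": "user", "content": "What is 37 * 48?"},
        {"role": "assistant",
         "tool_call": {"name": "calculator",
                       "arguments": {"expression": "37 * 48"}}},
        {"role": "tool", "name": "calculator", "content": "1776"},
        {"role": "assistant", "content": "37 * 48 = 1776."},
    ]
}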

Multimodality

In multimodal tasks, synthetic data has been used to align visual input with language models. For example, Pix2Struct and MatCha have been trained on synthetic image-caption pairs generated from HTML code and tabular data, respectively. This has enabled these models to accurately ground visual input to language, improving their performance on tasks such as derendering screenshots and converting webpage screenshots into code.

Multilingual Data

Multilingual models are hard to build due to a lack of annotated data. Synthetic data has been key to improving multilingual language models by creating synthetic parallel training data from monolingual sources. Techniques such as back-translation have been used to generate synthetic multilingual question-answer pairs, improving performance on multilingual and cross-lingual question answering tasks.
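A minimal back-translation sketch using Hugging Face MarianMT (the checkpoint name is an assumption; swap it for your language pair): translate target-language monolingual text into English, then pair the original with the translation as synthetic parallel data.

from transformers import pipeline

# Assumed checkpoint for French -> English
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(monolingual_fr):
    """Turn monolingual French sentences into synthetic (en, fr) pairs."""
    pairs = []
    for fr_sentence in monolingual_fr:
        en = fr_to_en(fr_sentence)[0]["translation_text"]
        pairs.append({"source": en, "target": fr_sentence})
    return pairs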

A Framework for High-Quality Synthetic Datasets

Creating high-quality synthetic datasets is a multi-faceted challenge that requires a systematic and meticulous approach. This section outlines a robust framework to guide synthetic data generation, ensuring it meets the highest standards of accuracy, diversity, and applicability.

Step 1: Prompt Engineering

Task Specification: The first step in prompt engineering is clearly defining the task. This involves providing the necessary context, background information, and specific instructions that the model needs to understand the task. For instance, if the task is to generate synthetic medical records, the prompt should include details about the type of medical conditions, patient demographics, and the format of the records.

Generation Conditions: Next, define the attributes and characteristics of the desired data. This could include specifying the length of the text, the style of writing, and any particular focus areas. For example, in generating synthetic legal documents, the conditions might specify the inclusion of certain legal terminologies and the structure of the document.

In-Context Demonstrations: Providing examples or demonstrations within the prompt can significantly enhance the model's understanding and performance. These examples act as a guide, showing the model the desired output format and content. For instance, if the task is to generate customer service interactions, including a few example dialogues can help the model produce more accurate and relevant responses.
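Putting the three ingredients together, a generation prompt might look like the following sketch; the domain, conditions, and demonstration are illustrative placeholders.

PROMPT_TEMPLATE = """You are generating synthetic customer-support dialogues.

Task: Write one realistic dialogue between a customer and a support agent.
Conditions:
- 4 to 8 turns, plain conversational English
- The issue concerns a delayed order
- End with a concrete resolution

Example:
Customer: Hi, my order #1042 hasn't arrived yet.
Agent: Sorry about that! It ships tomorrow, and I've applied a 10% refund.

Now write a new dialogue:
"""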

Step 2: Multi-Step Generation

Decomposition of Tasks: For complex data generation tasks, it is often beneficial to break down the task into smaller, manageable sub-tasks. This step-by-step approach can help ensure that each component of the data is generated accurately. For example, generating a synthetic research paper might involve separate steps for creating the abstract, introduction, methodology, results, and conclusion.

Iterative Refinement: Multi-step generation allows for iterative refinement, where the output from one step can be reviewed and improved before moving on to the next. This iterative process helps in catching and correcting errors early, ensuring higher quality in the final dataset. For instance, in generating synthetic financial reports, the initial draft can be reviewed for accuracy and completeness before adding detailed financial statements.

Contextual Conditioning: Each step of the multi-step generation can be conditioned on the outputs of previous steps. This ensures coherence and logical flow in the generated data. For example, in generating synthetic dialogues, each turn in the conversation can be conditioned on the previous turns, maintaining context and relevance.
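A sketch of the pattern: each stage is a separate LLM call conditioned on the outputs of earlier stages. Here llm_call stands in for whatever completion API you use.

def generate_paper(topic, llm_call):
    """Multi-step generation: later sections are conditioned on earlier ones.
    llm_call(prompt) -> str is a placeholder for any completion API."""
    abstract = llm_call(f"Write an abstract for a paper on {topic}.")
    intro = llm_call(f"Abstract:\n{abstract}\n\nWrite the introduction.")
    method = llm_call(f"Abstract:\n{abstract}\nIntro:\n{intro}\n\n"
                      "Write the methodology.")
    return {"abstract": abstract, "introduction": intro, "methodology": method}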

Step 3: Data Curation

High-Quality Sample Filtering: After generating the synthetic data, it is crucial to filter out low-quality samples. This can be achieved using heuristic metrics such as confidence scores, influence functions, and generation probabilities. For instance, samples with low confidence scores or high uncertainty can be discarded to ensure only high-quality data is retained.
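For example, with APIs that return token log-probabilities, the mean log-prob is a cheap confidence proxy. A sketch, where the threshold and the sample schema are assumptions to adapt to your data:

def mean_logprob(token_logprobs):
    """Average per-token log-probability of a generated sample."""
    return sum(token_logprobs) / len(token_logprobs)

def filter_samples(samples, threshold=-1.5):
    """Keep samples whose generation confidence clears the threshold.
    Each sample is a dict with 'text' and 'token_logprobs' keys (assumed)."""
    return [s for s in samples if mean_logprob(s["token_logprobs"]) > threshold]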

Label Enhancement: Generated labels can be verified and improved through human intervention or by using auxiliary models for knowledge distillation. For example, in a synthetic dataset of annotated images, human reviewers can verify and correct the labels, or a student model can be used to refine the annotations based on feedback from the teacher model.

Re-Weighting Strategies: Instead of discarding low-quality data, re-weighting strategies can be employed to assign varying importance to different samples. This ensures that influential and correctly annotated samples have a larger impact on the training process. For instance, in a synthetic text dataset, samples with higher relevance and accuracy can be given more weight during model training.
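In a training loop, re-weighting is just a per-sample weight on the loss. A minimal PyTorch sketch, assuming quality scores come from your curation step:

import torch

def weighted_loss(per_sample_losses, quality_scores):
    """Scale each sample's loss by a normalized quality weight so cleaner
    samples contribute more to the gradient."""
    weights = torch.tensor(quality_scores, dtype=per_sample_losses.dtype)
    weights = weights / weights.sum()
    return (weights * per_sample_losses).sum()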

Bias Mitigation: Conduct fairness assessments and use techniques that balance the representation of different classes and demographics. For instance, in a synthetic dataset for customer feedback analysis, ensure balanced representation of positive, negative, and neutral sentiments across demographic groups.

By following this comprehensive framework, researchers and practitioners can generate high-quality synthetic datasets that are accurate, diverse, and applicable to a wide range of AI tasks.

Build your Synthetic RAG Dataset

Here is a recipe to build your own synthetic dataset for training and evaluating RAG systems.

First, use any GPT model to create the dataset, and then filter it based on the context adherence score. This process ensures the dataset is purely synthetic, since we don't rely on any pre-existing datasets.

Here are the steps for the complete process (a condensed code sketch follows the list):

  1. Define the prompt to generate synthetic data

  2. Create OpenAI batch with the prompts

  3. Download the batch output and parse it

  4. Convert data to RAG prompt and response

  5. Evaluate dataset with Galileo

  6. Download the data and remove samples with a low context adherence score

  7. Save the final high-quality dataset
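A condensed sketch of steps 1-4 using the OpenAI Batch API; the prompt wording, model name, and file path are assumptions, and the Galileo evaluation in steps 5-6 happens through its own console or SDK.

import json
from openai import OpenAI

client = OpenAI()

# Steps 1-2: build one QA-generation request per document chunk and
# submit them all as a single OpenAI batch.
def build_batch_file(chunks, path="batch_input.jsonl"):
    with open(path, "w") as f:
        for i, chunk in enumerate(chunks):
            request = {
                "custom_id": f"sample-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{
                        "role": "user",
                        "content": "Based only on the passage below, write one "
                                   "question and its answer as JSON with keys "
                                   f"'question' and 'answer'.\n\nPassage:\n{chunk}",
                    }],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path

input_path = build_batch_file(["<your document chunks>"])
batch_file = client.files.create(file=open(input_path, "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Steps 3-4: once the batch completes, download the output file, parse each
# (chunk, question, answer) triple into a RAG-style prompt/response pair,
# then run steps 5-7: score context adherence, drop low scorers, and save.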

Future of Synthetic Datasets

The future of synthetic data looks very promising, with several key areas warranting further exploration.

Self-Improvement Capability

An intriguing question arises: can a model generate synthetic data that is better than the data it was trained on, thus enabling it to improve itself? This concept of self-improvement through synthetic data generation is an exciting avenue for future research. If a model can generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.

This process was leveraged in building Alibaba's Qwen 2 and Meta AI's Llama 3.1. Although there has been some success here, it remains to be proven whether something like GPT-5 could be created from GPT-4-generated synthetic data. Papers like "The Curse of Recursion: Training on Generated Data Makes Models Forget" have raised concerns that LLMs become brittle when trained this way.

Scaling Laws of Synthetic Data

Future research should investigate the scaling laws for synthetic data and determine the optimal balance between the quantity and quality of synthetic samples. Understanding how to scale synthetic data effectively can help maximize its benefits for AI model training and evaluation.

Improving Quality and Diversity

There is still room for improvement in creating high-quality, attributed synthetic samples that closely mimic real-world data. Future research should focus on developing new advanced techniques to control and manipulate specific attributes of the generated data. This will naturally improve as the instruction-following capability of LLMs improves.

Conclusion

Synthetic data has emerged as a promising solution to address the challenges of data scarcity, privacy concerns, and high costs in AI development.

Many companies are already generating realistic and diverse synthetic datasets to enable the training and evaluation of AI models at scale across various domains. Despite the challenges, the potential benefits of synthetic data in advancing AI research are substantial. Are you ready to generate some high-quality synthetic data with Galileo?

References

Best Practices and Lessons Learned on Synthetic Data for Language Models

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Cosmopedia: how to create large-scale synthetic data for pre-training

New LLM Pre-training and Post-training Paradigms

High quality AI models heavily rely on large, diverse, and high-quality datasets for training and evaluation. But acquiring these datasets can be a significant challenge due to data scarcity, privacy concerns, and data collection and annotation costs. Synthetic data has emerged as a promising solution to address these challenges. This blog will explore its benefits, challenges, and how it can be effectively used for training and evaluating LLMs.

What is a Synthetic Dataset?

Synthetic data refers to artificially generated data that mimics the characteristics and patterns of real-world data. Unlike real data, collected from actual events or observations, synthetic data is created through algorithms, generative models, or simulations. This data type can be tailored to specific requirements, ensuring a balanced representation of different classes and introducing controlled variations to improve model performance and generalization.

Synthetic Data in Building LLMs

The development and training of LLMs have evolved significantly, incorporating both pre-training and post-training methodologies. Here are some key ways synthetic data can be utilized in the training pipelines of modern LLMs, as illustrated by recent advancements in models like Alibaba's Qwen 2, Apple's AFM, Google's Gemma 2, and Meta AI's Llama 3.1.

Pre-Training with Synthetic Data

Here are some examples of how synthetic data is used for pretraining LLMs.

Data Augmentation and Diversity

Qwen 2: The Qwen models leveraged synthetic data generated by previous iterations of Qwen models to augment their pre-training datasets. This approach helped enhance data diversity and improve the model's ability to handle various tasks.

AFM: Apple's AFM models included synthetic data in their pre-training stages, particularly for context lengthening. They augmented their datasets with synthetic long-context Q&A data to improve the model's performance on tasks requiring long-term dependencies.

Gemma 2: Google's Gemma models also utilized synthetic data generated through knowledge distillation. Smaller models were trained using outputs from larger teacher models, enriching the training data and improving the efficiency of the training process.

Improving Data Quality

Qwen 2: Emphasized improving the data filtering pipeline to remove low-quality data and enhance data mixing, ensuring the synthetic data used was of high quality.

AFM: Focused on using high-quality synthetic data for continued pre-training, particularly for math and code tasks. This ensured that the model received high-quality signals during training.

Context Lengthening

Qwen 2: Performed long-context training in the later stages of pre-training, using high-quality, lengthy synthetic data to increase the context length from 4,096 to 32,768 tokens.

AFM: Included a dedicated pre-training stage for context lengthening, where synthetic data was used to train the model on longer sequences, enhancing its ability to handle extended contexts.

Post-Training with Synthetic Data

Similarly, here are examples of how synthetic data is leveraged for post-training of LLMs.

Supervised Instruction Fine-Tuning (SFT)

Qwen 2: Used synthetic data to create instruction-response pairs, particularly for "high-quality literary data," to refine the model's response accuracy in predetermined scenarios.

AFM: Leveraged both human-annotated and synthetic data for SFT, fine-tuning the data mixture through multiple experiments to achieve the optimal balance.

Reinforcement Learning with Human Feedback (RLHF)

Qwen 2: Employed a two-stage alignment phase using Direct Preference Optimization (DPO) on an existing dataset and in real-time during training. Synthetic data played a role in forming the preference pairs for optimization.

AFM: Introduced new algorithms like Rejection Sampling Fine-tuning with Teacher Committee (iTeC) and RLHF with Mirror Descent Policy Optimization, using synthetic data to generate multiple responses and select the best ones for training.

Rejection Sampling

Qwen 2 and AFM: Used synthetic data to generate multiple responses during training, with a reward model selecting the preferred response. This approach, often called rejection sampling, helps refine the model's alignment with human preferences.

Model Distillation

Gemma 2: Applied knowledge distillation during both pre-training and post-training, using synthetic data generated by teacher models to train smaller models. This method, combined with model averaging techniques, helped stabilize and improve performance over time.

Llama 3.1: Employed model averaging not only for the reward models but also for the SFT and DPO models, using synthetic data to enhance the training process.

In addition, other state-of-the-art models, such as Phi-3 and NuminaMath, leverage synthetic data heavily to develop high-performance SLM.

Benefits of Synthetic Datasets

Synthetic data offers several benefits, making it an attractive option for training and evaluating AI models.

Limitless Data

Synthetic data can be generated at scale, providing an abundant supply of training and testing data for AI models. This is particularly valuable in domains where real-world data is scarce or difficult to obtain. For example, generating synthetic chat data can help improve chat models by covering various conditions and scenarios.

Absence of Closed Domain Data

In certain domains, such as healthcare, obtaining real-world data can be challenging due to privacy concerns and regulatory restrictions. For example, patient data in healthcare is highly sensitive and subject to strict privacy laws. Synthetic data can be used to create datasets that mimic real-world data without compromising privacy, enabling the development of AI models in these sensitive domains. This allows researchers and developers to work with realistic data while adhering to privacy regulations.

Improved Model Performance

The limits of human creativity can restrict the diversity and variety of data that can be generated manually. For example, creating diverse and realistic scenarios for training autonomous vehicles or medical diagnosis systems can be challenging. Synthetic data can overcome this limitation by using algorithms to create diverse and varied datasets that capture a wide range of scenarios and conditions, enhancing the robustness and generalization of AI models.

Cheaper Data

One of the primary challenges in creating high-quality datasets is the cost associated with data annotation. Annotating large volumes of data requires significant human effort and resources, making it costly and time-consuming.

Privacy Compliant

Synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information. This is crucial in domains such as healthcare, where patient privacy is of utmost importance.

Limitations of Synthetic Datasets

Despite its promise, synthetic data is not perfect and has limitations that must be addressed.

Ensuring Quality

One of the main challenges is ensuring the factuality and fidelity of synthetic data. Models trained on false, hallucinated, or biased synthetic data may fail to generalize to real-world scenarios. For example, if a language model is trained on synthetic data that contains factual errors, it may produce inaccurate or misleading responses. Researchers must develop sophisticated generative models and evaluation metrics to create synthetic data that accurately reflects real-world data's complex patterns and relationships.

Bias Amplification

Synthetic data can amplify or introduce biases if not carefully designed and validated. For instance, the resulting AI model may exhibit biased behavior if the synthetic data generation process is biased towards certain demographics. Rigorous testing and fairness assessments are necessary to mitigate these risks and ensure that synthetic data does not perpetuate or exacerbate biases.

Evaluation Contamination

Using synthetic data in model training poses significant challenges to fair evaluation. Evaluation benchmarks are often created by referring to public text sources, which can lead to contamination if the synthetic data includes rephrased versions of the benchmark data. This can result in inflated performance metrics and misleading conclusions about the model's capabilities. Developing robust evaluation protocols and contamination detection techniques is essential to address this challenge.

Survey on Synthetic Datasets

Recent advancements in synthetic data generation have led to various synthetic datasets for different domains. For example, synthetic data has been used to improve performance on math-related tasks, code reasoning, tool-using abilities, and multilingual language models. These datasets have demonstrated synthetic data's potential in enhancing AI models' capabilities.

Reasoning Tasks

Synthetic data has been effectively used in reasoning tasks such as mathematical problem-solving and code generation. For example, models like Minerva, Llemma, and DeepSeekMath have been trained on synthetic math-targeted pre-training data, improving their performance on math-related tasks. Similarly, synthetic data has been used to generate complex questions and answers, enhancing the reasoning capabilities of language models.

Tool-Using and Planning

LLMs are used to build agents which require tool selection and usage. Synthetic data has also enabled language models to learn these tool-using abilities and planning skills. For example, models like GPT4o, Claude 3.5 Sonnet and Toolformer have been trained on interaction data annotated with calls to appropriate tools, enabling them to use calculators, search engines, and machine translators effectively. Synthetic trajectories in simulated environments have been used to teach models planning skills, such as decomposing complex tasks into subtasks and completing them in a reward-optimal way.

Multimodality

In multimodal tasks, synthetic data has been used to align visual input with language models. For example, Pix2Struct and MatCha have been trained on synthetic image-caption pairs generated from HTML code and tabular data, respectively. This has enabled these models to accurately ground visual input to language, improving their performance on tasks such as derendering screenshots and converting webpage screenshots into code.

Multilingual Data

Multilingual models are complicated to build due to lack of annotated data. Synthetic data has been key in improving multilingual language models by creating synthetic parallel training data from monolingual data sources. Techniques such as back-translation have been employed to generate synthetic multilingual question-answer pairs, enhancing the performance of language models on multilingual and cross-lingual question answering tasks.

A Framework for High-Quality Synthetic Datasets

Creating high-quality synthetic datasets is a multi-faceted challenge that requires a systematic and meticulous approach. This section outlines a robust framework to guide synthetic data generation, ensuring it meets the highest standards of accuracy, diversity, and applicability.

Step 1: Prompt Engineering

Task Specification: The first step in prompt engineering is clearly defining the task. This involves providing the necessary context, background information, and specific instructions that the model needs to understand the task. For instance, if the task is to generate synthetic medical records, the prompt should include details about the type of medical conditions, patient demographics, and the format of the records.

Generation Conditions: Next, define the attributes and characteristics of the desired data. This could include specifying the length of the text, the style of writing, and any particular focus areas. For example, in generating synthetic legal documents, the conditions might specify the inclusion of certain legal terminologies and the structure of the document.

In-Context Demonstrations: Providing examples or demonstrations within the prompt can significantly enhance the model's understanding and performance. These examples act as a guide, showing the model the desired output format and content. For instance, if the task is to generate customer service interactions, including a few example dialogues can help the model produce more accurate and relevant responses.

Step 2: Multi-Step Generation

Decomposition of Tasks: For complex data generation tasks, it is often beneficial to break down the task into smaller, manageable sub-tasks. This step-by-step approach can help ensure that each component of the data is generated accurately. For example, generating a synthetic research paper might involve separate steps for creating the abstract, introduction, methodology, results, and conclusion.

Iterative Refinement: Multi-step generation allows for iterative refinement, where the output from one step can be reviewed and improved before moving on to the next. This iterative process helps in catching and correcting errors early, ensuring higher quality in the final dataset. For instance, in generating synthetic financial reports, the initial draft can be reviewed for accuracy and completeness before adding detailed financial statements.

Contextual Conditioning: Each step of the multi-step generation can be conditioned on the outputs of previous steps. This ensures coherence and logical flow in the generated data. For example, in generating synthetic dialogues, each turn in the conversation can be conditioned on the previous turns, maintaining context and relevance.

Step 3: Data Curation

High-Quality Sample Filtering: After generating the synthetic data, it is crucial to filter out low-quality samples. This can be achieved using heuristic metrics such as confidence scores, influence functions, and generation probabilities. For instance, samples with low confidence scores or high uncertainty can be discarded to ensure only high-quality data is retained.

Label Enhancement: This can be done through human intervention or by using auxiliary models for knowledge distillation. For example, in a synthetic dataset of annotated images, human reviewers can verify and correct the labels, or a student model can be used to refine the annotations based on feedback from the teacher model.

Re-Weighting Strategies: Instead of discarding low-quality data, re-weighting strategies can be employed to assign varying importance to different samples. This ensures that influential and correctly annotated samples have a larger impact on the training process. For instance, in a synthetic text dataset, samples with higher relevance and accuracy can be given more weight during model training.

Bias Mitigation: This involves conducting fairness assessments and using techniques to balance the representation of different classes and demographics. For instance, in a synthetic dataset for customer feedback analysis, ensure balanced representation of positive, negative, and neutral sentiments across different demographic groups.

By following this comprehensive framework, researchers and practitioners can generate high-quality synthetic datasets that are accurate, diverse, and applicable to a wide range of AI tasks.

Build your Synthetic RAG Dataset

Here is a recipe to build your own synthetic dataset for training and evaluating RAG systems.

First, use any GPT to create the dataset, and then filter it based on context adherence score. This process ensures that our dataset is purely synthetic, as we won't rely on any pre-existing datasets.

Here are the steps for the complete process:

  1. Define the prompt to generate synthetic data

  2. Create OpenAI batch with the prompts

  3. Download the output of batch and parse it

  4. Convert data to RAG prompt and response

  5. Evaluate dataset with Galileo

  6. Download data and remove samples with low context adherence score

  7. Save the final high quality dataset

Future of Synthetic Datasets

The future of synthetic data looks very promising, with several key areas warranting further exploration.

Self-Improvement Capability

An intriguing question arises: can a model generate synthetic data that is better than the data it was trained on, thus enabling it to improve itself? This concept of self-improvement through synthetic data generation is an exciting avenue for future research. If a model can generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.

This process was leveraged in building Alibaba's Qwen 2 and Meta AI's Llama 3.1. Although there has been some success here, it’s still to be proven whether something like GPT-5 could be created from GPT-4 generated synthetic data. Papers like The Curse of Recursion: Training on Generated Data Makes Models Forget have raised concerns about LLMs being brittle if trained in such a manner.

Scaling Laws of Synthetic Data

Future research should investigate the scaling laws for synthetic data and determine the optimal balance between the quantity and quality of synthetic samples. Understanding how to scale synthetic data effectively can help maximize its benefits for AI model training and evaluation.

Improving Quality and Diversity

There is still room for improvement in creating high-quality, attributed synthetic samples that closely mimic real-world data. Future research should focus on developing new advanced techniques to control and manipulate specific attributes of the generated data. This will naturally improve as the instruction-following capability of LLMs improves.

Conclusion

Synthetic data has emerged as a promising solution to address the challenges of data scarcity, privacy concerns, and high costs in AI development.

Many companies are already generating realistic and diverse datasets, synthetic data to enable the training and evaluation of AI models at scale across various domains. Despite the challenges, the potential benefits of synthetic data in advancing AI research are unparalleled. Are you ready to generate some high-quality synthetic data with Galileo?

References

Best Practices and Lessons Learned on Synthetic Data for Language Models

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Cosmopedia: how to create large-scale synthetic data for pre-training

New LLM Pre-training and Post-training Paradigms

High quality AI models heavily rely on large, diverse, and high-quality datasets for training and evaluation. But acquiring these datasets can be a significant challenge due to data scarcity, privacy concerns, and data collection and annotation costs. Synthetic data has emerged as a promising solution to address these challenges. This blog will explore its benefits, challenges, and how it can be effectively used for training and evaluating LLMs.

What is a Synthetic Dataset?

Synthetic data refers to artificially generated data that mimics the characteristics and patterns of real-world data. Unlike real data, collected from actual events or observations, synthetic data is created through algorithms, generative models, or simulations. This data type can be tailored to specific requirements, ensuring a balanced representation of different classes and introducing controlled variations to improve model performance and generalization.

Synthetic Data in Building LLMs

The development and training of LLMs have evolved significantly, incorporating both pre-training and post-training methodologies. Here are some key ways synthetic data can be utilized in the training pipelines of modern LLMs, as illustrated by recent advancements in models like Alibaba's Qwen 2, Apple's AFM, Google's Gemma 2, and Meta AI's Llama 3.1.

Pre-Training with Synthetic Data

Here are some examples of how synthetic data is used for pretraining LLMs.

Data Augmentation and Diversity

Qwen 2: The Qwen models leveraged synthetic data generated by previous iterations of Qwen models to augment their pre-training datasets. This approach helped enhance data diversity and improve the model's ability to handle various tasks.

AFM: Apple's AFM models included synthetic data in their pre-training stages, particularly for context lengthening. They augmented their datasets with synthetic long-context Q&A data to improve the model's performance on tasks requiring long-term dependencies.

Gemma 2: Google's Gemma models also utilized synthetic data generated through knowledge distillation. Smaller models were trained using outputs from larger teacher models, enriching the training data and improving the efficiency of the training process.

Improving Data Quality

Qwen 2: Emphasized improving the data filtering pipeline to remove low-quality data and enhance data mixing, ensuring the synthetic data used was of high quality.

AFM: Focused on using high-quality synthetic data for continued pre-training, particularly for math and code tasks. This ensured that the model received high-quality signals during training.

Context Lengthening

Qwen 2: Performed long-context training in the later stages of pre-training, using high-quality, lengthy synthetic data to increase the context length from 4,096 to 32,768 tokens.

AFM: Included a dedicated pre-training stage for context lengthening, where synthetic data was used to train the model on longer sequences, enhancing its ability to handle extended contexts.

Post-Training with Synthetic Data

Similarly, here are examples of how synthetic data is leveraged for post-training of LLMs.

Supervised Instruction Fine-Tuning (SFT)

Qwen 2: Used synthetic data to create instruction-response pairs, particularly for "high-quality literary data," to refine the model's response accuracy in predetermined scenarios.

AFM: Leveraged both human-annotated and synthetic data for SFT, fine-tuning the data mixture through multiple experiments to achieve the optimal balance.

Reinforcement Learning with Human Feedback (RLHF)

Qwen 2: Employed a two-stage alignment phase using Direct Preference Optimization (DPO) on an existing dataset and in real-time during training. Synthetic data played a role in forming the preference pairs for optimization.

AFM: Introduced new algorithms like Rejection Sampling Fine-tuning with Teacher Committee (iTeC) and RLHF with Mirror Descent Policy Optimization, using synthetic data to generate multiple responses and select the best ones for training.

Rejection Sampling

Qwen 2 and AFM: Used synthetic data to generate multiple responses during training, with a reward model selecting the preferred response. This approach, often called rejection sampling, helps refine the model's alignment with human preferences.

Model Distillation

Gemma 2: Applied knowledge distillation during both pre-training and post-training, using synthetic data generated by teacher models to train smaller models. This method, combined with model averaging techniques, helped stabilize and improve performance over time.

Llama 3.1: Employed model averaging not only for the reward models but also for the SFT and DPO models, using synthetic data to enhance the training process.

In addition, other state-of-the-art models, such as Phi-3 and NuminaMath, leverage synthetic data heavily to develop high-performance SLM.

Benefits of Synthetic Datasets

Synthetic data offers several benefits, making it an attractive option for training and evaluating AI models.

Limitless Data

Synthetic data can be generated at scale, providing an abundant supply of training and testing data for AI models. This is particularly valuable in domains where real-world data is scarce or difficult to obtain. For example, generating synthetic chat data can help improve chat models by covering various conditions and scenarios.

Absence of Closed Domain Data

In certain domains, such as healthcare, obtaining real-world data can be challenging due to privacy concerns and regulatory restrictions. For example, patient data in healthcare is highly sensitive and subject to strict privacy laws. Synthetic data can be used to create datasets that mimic real-world data without compromising privacy, enabling the development of AI models in these sensitive domains. This allows researchers and developers to work with realistic data while adhering to privacy regulations.

Improved Model Performance

The limits of human creativity can restrict the diversity and variety of data that can be generated manually. For example, creating diverse and realistic scenarios for training autonomous vehicles or medical diagnosis systems can be challenging. Synthetic data can overcome this limitation by using algorithms to create diverse and varied datasets that capture a wide range of scenarios and conditions, enhancing the robustness and generalization of AI models.

Cheaper Data

One of the primary challenges in creating high-quality datasets is the cost associated with data annotation. Annotating large volumes of data requires significant human effort and resources, making it costly and time-consuming.

Privacy Compliant

Synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information. This is crucial in domains such as healthcare, where patient privacy is of utmost importance.

Limitations of Synthetic Datasets

Despite its promise, synthetic data is not perfect and has limitations that must be addressed.

Ensuring Quality

One of the main challenges is ensuring the factuality and fidelity of synthetic data. Models trained on false, hallucinated, or biased synthetic data may fail to generalize to real-world scenarios. For example, if a language model is trained on synthetic data that contains factual errors, it may produce inaccurate or misleading responses. Researchers must develop sophisticated generative models and evaluation metrics to create synthetic data that accurately reflects real-world data's complex patterns and relationships.

Bias Amplification

Synthetic data can amplify or introduce biases if not carefully designed and validated. For instance, the resulting AI model may exhibit biased behavior if the synthetic data generation process is biased towards certain demographics. Rigorous testing and fairness assessments are necessary to mitigate these risks and ensure that synthetic data does not perpetuate or exacerbate biases.

Evaluation Contamination

Using synthetic data in model training poses significant challenges to fair evaluation. Evaluation benchmarks are often created by referring to public text sources, which can lead to contamination if the synthetic data includes rephrased versions of the benchmark data. This can result in inflated performance metrics and misleading conclusions about the model's capabilities. Developing robust evaluation protocols and contamination detection techniques is essential to address this challenge.

Survey on Synthetic Datasets

Recent advancements in synthetic data generation have led to various synthetic datasets for different domains. For example, synthetic data has been used to improve performance on math-related tasks, code reasoning, tool-using abilities, and multilingual language models. These datasets have demonstrated synthetic data's potential in enhancing AI models' capabilities.

Reasoning Tasks

Synthetic data has been effectively used in reasoning tasks such as mathematical problem-solving and code generation. For example, models like Minerva, Llemma, and DeepSeekMath have been trained on synthetic math-targeted pre-training data, improving their performance on math-related tasks. Similarly, synthetic data has been used to generate complex questions and answers, enhancing the reasoning capabilities of language models.

Tool-Using and Planning

LLMs are used to build agents which require tool selection and usage. Synthetic data has also enabled language models to learn these tool-using abilities and planning skills. For example, models like GPT4o, Claude 3.5 Sonnet and Toolformer have been trained on interaction data annotated with calls to appropriate tools, enabling them to use calculators, search engines, and machine translators effectively. Synthetic trajectories in simulated environments have been used to teach models planning skills, such as decomposing complex tasks into subtasks and completing them in a reward-optimal way.

Multimodality

In multimodal tasks, synthetic data has been used to align visual input with language models. For example, Pix2Struct and MatCha have been trained on synthetic image-caption pairs generated from HTML code and tabular data, respectively. This has enabled these models to accurately ground visual input to language, improving their performance on tasks such as derendering screenshots and converting webpage screenshots into code.

Multilingual Data

Multilingual models are complicated to build due to lack of annotated data. Synthetic data has been key in improving multilingual language models by creating synthetic parallel training data from monolingual data sources. Techniques such as back-translation have been employed to generate synthetic multilingual question-answer pairs, enhancing the performance of language models on multilingual and cross-lingual question answering tasks.

A Framework for High-Quality Synthetic Datasets

Creating high-quality synthetic datasets is a multi-faceted challenge that requires a systematic and meticulous approach. This section outlines a robust framework to guide synthetic data generation, ensuring it meets the highest standards of accuracy, diversity, and applicability.

Step 1: Prompt Engineering

Task Specification: The first step in prompt engineering is clearly defining the task. This involves providing the necessary context, background information, and specific instructions that the model needs to understand the task. For instance, if the task is to generate synthetic medical records, the prompt should include details about the type of medical conditions, patient demographics, and the format of the records.

Generation Conditions: Next, define the attributes and characteristics of the desired data. This could include specifying the length of the text, the style of writing, and any particular focus areas. For example, in generating synthetic legal documents, the conditions might specify the inclusion of certain legal terminologies and the structure of the document.

In-Context Demonstrations: Providing examples or demonstrations within the prompt can significantly enhance the model's understanding and performance. These examples act as a guide, showing the model the desired output format and content. For instance, if the task is to generate customer service interactions, including a few example dialogues can help the model produce more accurate and relevant responses.

Step 2: Multi-Step Generation

Decomposition of Tasks: For complex data generation tasks, it is often beneficial to break down the task into smaller, manageable sub-tasks. This step-by-step approach can help ensure that each component of the data is generated accurately. For example, generating a synthetic research paper might involve separate steps for creating the abstract, introduction, methodology, results, and conclusion.

Iterative Refinement: Multi-step generation allows for iterative refinement, where the output from one step can be reviewed and improved before moving on to the next. This iterative process helps in catching and correcting errors early, ensuring higher quality in the final dataset. For instance, in generating synthetic financial reports, the initial draft can be reviewed for accuracy and completeness before adding detailed financial statements.

Contextual Conditioning: Each step of the multi-step generation can be conditioned on the outputs of previous steps. This ensures coherence and logical flow in the generated data. For example, in generating synthetic dialogues, each turn in the conversation can be conditioned on the previous turns, maintaining context and relevance.

Step 3: Data Curation

High-Quality Sample Filtering: After generating the synthetic data, it is crucial to filter out low-quality samples. This can be achieved using heuristic metrics such as confidence scores, influence functions, and generation probabilities. For instance, samples with low confidence scores or high uncertainty can be discarded to ensure only high-quality data is retained.

Label Enhancement: This can be done through human intervention or by using auxiliary models for knowledge distillation. For example, in a synthetic dataset of annotated images, human reviewers can verify and correct the labels, or a student model can be used to refine the annotations based on feedback from the teacher model.

Re-Weighting Strategies: Instead of discarding low-quality data, re-weighting strategies can be employed to assign varying importance to different samples. This ensures that influential and correctly annotated samples have a larger impact on the training process. For instance, in a synthetic text dataset, samples with higher relevance and accuracy can be given more weight during model training.

Bias Mitigation: This involves conducting fairness assessments and using techniques to balance the representation of different classes and demographics. For instance, in a synthetic dataset for customer feedback analysis, ensure balanced representation of positive, negative, and neutral sentiments across different demographic groups.

By following this comprehensive framework, researchers and practitioners can generate high-quality synthetic datasets that are accurate, diverse, and applicable to a wide range of AI tasks.

Build your Synthetic RAG Dataset

Here is a recipe to build your own synthetic dataset for training and evaluating RAG systems.

First, use any GPT to create the dataset, and then filter it based on context adherence score. This process ensures that our dataset is purely synthetic, as we won't rely on any pre-existing datasets.

Here are the steps for the complete process:

  1. Define the prompt to generate synthetic data

  2. Create OpenAI batch with the prompts

  3. Download the output of batch and parse it

  4. Convert data to RAG prompt and response

  5. Evaluate dataset with Galileo

  6. Download data and remove samples with low context adherence score

  7. Save the final high quality dataset

Future of Synthetic Datasets

The future of synthetic data looks very promising, with several key areas warranting further exploration.

Self-Improvement Capability

An intriguing question arises: can a model generate synthetic data that is better than the data it was trained on, thus enabling it to improve itself? This concept of self-improvement through synthetic data generation is an exciting avenue for future research. If a model can generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.

This process was leveraged in building Alibaba's Qwen 2 and Meta AI's Llama 3.1. Although there has been some success here, it’s still to be proven whether something like GPT-5 could be created from GPT-4 generated synthetic data. Papers like The Curse of Recursion: Training on Generated Data Makes Models Forget have raised concerns about LLMs being brittle if trained in such a manner.

Scaling Laws of Synthetic Data

Future research should investigate the scaling laws for synthetic data and determine the optimal balance between the quantity and quality of synthetic samples. Understanding how to scale synthetic data effectively can help maximize its benefits for AI model training and evaluation.

Improving Quality and Diversity

There is still room for improvement in creating high-quality, attributed synthetic samples that closely mimic real-world data. Future research should focus on developing new advanced techniques to control and manipulate specific attributes of the generated data. This will naturally improve as the instruction-following capability of LLMs improves.

Conclusion

Synthetic data has emerged as a promising solution to address the challenges of data scarcity, privacy concerns, and high costs in AI development.

Many companies are already generating realistic and diverse datasets, synthetic data to enable the training and evaluation of AI models at scale across various domains. Despite the challenges, the potential benefits of synthetic data in advancing AI research are unparalleled. Are you ready to generate some high-quality synthetic data with Galileo?

References

Best Practices and Lessons Learned on Synthetic Data for Language Models

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Cosmopedia: how to create large-scale synthetic data for pre-training

New LLM Pre-training and Post-training Paradigms


Reinforcement Learning with Human Feedback (RLHF)

Qwen 2: Employed a two-stage alignment phase using Direct Preference Optimization (DPO): offline on an existing preference dataset, and online in real time during training. Synthetic data played a role in forming the preference pairs used for optimization.
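
To make this concrete, a DPO preference pair is just a prompt with a "chosen" and a "rejected" response. Below is a minimal sketch of forming such pairs from synthetic generations; `generate` and `judge` are hypothetical stand-ins for a generator model and a scoring model:

```python
# Minimal sketch: building a DPO preference pair from synthetic responses.
# `generate` and `judge` are hypothetical stand-ins for your generator
# and reward/judge models.

def build_preference_pair(prompt: str, generate, judge, n: int = 4) -> dict:
    # Sample several candidate responses for the same prompt.
    candidates = [generate(prompt) for _ in range(n)]
    # Rank candidates with the judge (e.g., a reward model or LLM grader).
    ranked = sorted(candidates, key=judge, reverse=True)
    # Best response becomes "chosen", worst becomes "rejected".
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```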

AFM: Introduced new algorithms like Rejection Sampling Fine-tuning with Teacher Committee (iTeC) and RLHF with Mirror Descent Policy Optimization, using synthetic data to generate multiple responses and select the best ones for training.

Rejection Sampling

Qwen 2 and AFM: Used synthetic data to generate multiple responses during training, with a reward model selecting the preferred response. This approach, often called rejection sampling, helps refine the model's alignment with human preferences.
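
A minimal sketch of that loop, assuming a hypothetical `generate` function and a `reward_model` with a `score` method:

```python
# Rejection-sampling sketch: keep only the highest-reward response per
# prompt as a fine-tuning target. `generate` and `reward_model` are
# hypothetical stand-ins.

def rejection_sample(prompts, generate, reward_model, k: int = 8):
    dataset = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(k)]
        best = max(responses, key=lambda r: reward_model.score(prompt, r))
        dataset.append({"prompt": prompt, "response": best})
    return dataset
```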

Model Distillation

Gemma 2: Applied knowledge distillation during both pre-training and post-training, using synthetic data generated by teacher models to train smaller models. This method, combined with model averaging techniques, helped stabilize and improve performance over time.

Llama 3.1: Employed model averaging not only for the reward models but also for the SFT and DPO models, using synthetic data to enhance the training process.
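
Model averaging itself is simple to sketch: corresponding parameters from several checkpoints are averaged element-wise. Here is a minimal PyTorch illustration, not the recipe used by Llama 3.1 or Gemma 2:

```python
# Minimal model-averaging sketch: element-wise mean of parameters across
# checkpoints. Illustrative only.
import torch

def average_state_dicts(state_dicts: list[dict]) -> dict:
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged
```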

In addition, other state-of-the-art models, such as Phi-3 and NuminaMath, leverage synthetic data heavily to develop high-performance small language models (SLMs).

Benefits of Synthetic Datasets

Synthetic data offers several benefits, making it an attractive option for training and evaluating AI models.

Limitless Data

Synthetic data can be generated at scale, providing an abundant supply of training and testing data for AI models. This is particularly valuable in domains where real-world data is scarce or difficult to obtain. For example, generating synthetic chat data can help improve chat models by covering various conditions and scenarios.

Absence of Closed Domain Data

In certain domains, such as healthcare, obtaining real-world data can be challenging due to privacy concerns and regulatory restrictions: patient data is highly sensitive and subject to strict privacy laws. Synthetic data can be used to create datasets that mimic real-world data without compromising privacy, enabling researchers and developers to build AI models in these sensitive domains while adhering to privacy regulations.

Improved Model Performance

The limits of human creativity can restrict the diversity and variety of data that can be generated manually. For example, creating diverse and realistic scenarios for training autonomous vehicles or medical diagnosis systems can be challenging. Synthetic data can overcome this limitation by using algorithms to create diverse and varied datasets that capture a wide range of scenarios and conditions, enhancing the robustness and generalization of AI models.

Cheaper Data

One of the primary challenges in creating high-quality datasets is the cost associated with data annotation. Annotating large volumes of data requires significant human effort and resources, making it costly and time-consuming. Synthetic data sidesteps much of this cost: once a generation pipeline is in place, labeled examples can be produced at a fraction of the price of manual annotation.

Privacy Compliant

Synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information. This is crucial in domains such as healthcare, where patient privacy is of utmost importance.

Limitations of Synthetic Datasets

Despite its promise, synthetic data is not perfect and has limitations that must be addressed.

Ensuring Quality

One of the main challenges is ensuring the factuality and fidelity of synthetic data. Models trained on false, hallucinated, or biased synthetic data may fail to generalize to real-world scenarios. For example, if a language model is trained on synthetic data that contains factual errors, it may produce inaccurate or misleading responses. Researchers must develop sophisticated generative models and evaluation metrics to create synthetic data that accurately reflects real-world data's complex patterns and relationships.

Bias Amplification

Synthetic data can amplify or introduce biases if not carefully designed and validated. For instance, the resulting AI model may exhibit biased behavior if the synthetic data generation process is biased towards certain demographics. Rigorous testing and fairness assessments are necessary to mitigate these risks and ensure that synthetic data does not perpetuate or exacerbate biases.

Evaluation Contamination

Using synthetic data in model training poses significant challenges to fair evaluation. Evaluation benchmarks are often created by referring to public text sources, which can lead to contamination if the synthetic data includes rephrased versions of the benchmark data. This can result in inflated performance metrics and misleading conclusions about the model's capabilities. Developing robust evaluation protocols and contamination detection techniques is essential to address this challenge.
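
As one illustration, a simple contamination check looks for long n-gram overlap between synthetic samples and benchmark text; the 13-gram window below is a common heuristic, not a standard:

```python
# Minimal contamination check: flag synthetic samples that share long
# n-grams with benchmark text. The window size and the check itself are
# illustrative heuristics.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_texts: list[str], n: int = 13) -> bool:
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(text, n) for text in benchmark_texts)
```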

Survey on Synthetic Datasets

Recent advancements in synthetic data generation have led to various synthetic datasets for different domains. For example, synthetic data has been used to improve performance on math-related tasks, code reasoning, tool-using abilities, and multilingual language models. These datasets have demonstrated synthetic data's potential in enhancing AI models' capabilities.

Reasoning Tasks

Synthetic data has been effectively used in reasoning tasks such as mathematical problem-solving and code generation. For example, models like Minerva, Llemma, and DeepSeekMath have been trained on synthetic math-targeted pre-training data, improving their performance on math-related tasks. Similarly, synthetic data has been used to generate complex questions and answers, enhancing the reasoning capabilities of language models.

Tool-Using and Planning

LLMs are increasingly used to build agents, which require selecting and using tools. Synthetic data has also enabled language models to learn these tool-using abilities and planning skills. For example, models like GPT-4o, Claude 3.5 Sonnet, and Toolformer have been trained on interaction data annotated with calls to appropriate tools, enabling them to use calculators, search engines, and machine translators effectively. Synthetic trajectories in simulated environments have been used to teach models planning skills, such as decomposing complex tasks into subtasks and completing them in a reward-optimal way.
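
For a sense of what such interaction data can look like, here is an illustrative Toolformer-style training example, where a tool call and its result are annotated inline (the exact format is hypothetical, not from any released dataset):

```python
# Illustrative synthetic tool-use example in the spirit of Toolformer:
# the response embeds an annotated calculator call and its result.
example = {
    "instruction": "What is 23.4% of 1,850?",
    "response": "[Calculator(1850 * 0.234) -> 432.9] 23.4% of 1,850 is 432.9.",
}
```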

Multimodality

In multimodal tasks, synthetic data has been used to align visual input with language models. For example, Pix2Struct and MatCha have been trained on synthetic image-caption pairs generated from HTML code and tabular data, respectively. This has enabled these models to accurately ground visual input to language, improving their performance on tasks such as derendering screenshots and converting webpage screenshots into code.

Multilingual Data

Multilingual models are difficult to build due to the lack of annotated data. Synthetic data has been key to improving multilingual language models by creating synthetic parallel training data from monolingual data sources. Techniques such as back-translation have been employed to generate synthetic multilingual question-answer pairs, enhancing the performance of language models on multilingual and cross-lingual question answering tasks.
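
A minimal back-translation sketch: monolingual target-language text is machine-translated back into the source language, yielding synthetic parallel pairs. The `translate` function and language codes are hypothetical:

```python
# Back-translation sketch: build synthetic parallel data by translating
# monolingual target-language text into the source language. `translate`
# is a hypothetical machine-translation function.

def back_translate(target_sentences, translate):
    pairs = []
    for tgt in target_sentences:
        synthetic_src = translate(tgt, src_lang="de", tgt_lang="en")  # illustrative
        pairs.append({"source": synthetic_src, "target": tgt})
    return pairs
```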

A Framework for High-Quality Synthetic Datasets

Creating high-quality synthetic datasets is a multi-faceted challenge that requires a systematic and meticulous approach. This section outlines a robust framework to guide synthetic data generation, ensuring it meets the highest standards of accuracy, diversity, and applicability.

Step 1: Prompt Engineering

Task Specification: The first step in prompt engineering is clearly defining the task. This involves providing the necessary context, background information, and specific instructions that the model needs to understand the task. For instance, if the task is to generate synthetic medical records, the prompt should include details about the type of medical conditions, patient demographics, and the format of the records.

Generation Conditions: Next, define the attributes and characteristics of the desired data. This could include specifying the length of the text, the style of writing, and any particular focus areas. For example, in generating synthetic legal documents, the conditions might specify the inclusion of certain legal terminologies and the structure of the document.

In-Context Demonstrations: Providing examples or demonstrations within the prompt can significantly enhance the model's understanding and performance. These examples act as a guide, showing the model the desired output format and content. For instance, if the task is to generate customer service interactions, including a few example dialogues can help the model produce more accurate and relevant responses.
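
Putting the three elements together, a generation prompt for the synthetic-medical-records example might look like the sketch below; every detail is illustrative:

```python
# Illustrative prompt combining task specification, generation conditions,
# and an in-context demonstration. All details are hypothetical.

DEMO = (
    "Patient: 54-year-old male. Condition: type 2 diabetes.\n"
    "Note: Presents for routine follow-up; HbA1c 7.2%; continue metformin."
)

prompt = f"""You are generating synthetic medical records for model training.

Task: Write one fictional outpatient visit note. Do not copy real patients.
Conditions:
- Length: 3 to 5 sentences, clinical register.
- Include: age, sex, condition, and one medication.

Example:
{DEMO}

Now generate a new record for a different condition."""
```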

Step 2: Multi-Step Generation

Decomposition of Tasks: For complex data generation tasks, it is often beneficial to break down the task into smaller, manageable sub-tasks. This step-by-step approach can help ensure that each component of the data is generated accurately. For example, generating a synthetic research paper might involve separate steps for creating the abstract, introduction, methodology, results, and conclusion.

Iterative Refinement: Multi-step generation allows for iterative refinement, where the output from one step can be reviewed and improved before moving on to the next. This iterative process helps in catching and correcting errors early, ensuring higher quality in the final dataset. For instance, in generating synthetic financial reports, the initial draft can be reviewed for accuracy and completeness before adding detailed financial statements.

Contextual Conditioning: Each step of the multi-step generation can be conditioned on the outputs of previous steps. This ensures coherence and logical flow in the generated data. For example, in generating synthetic dialogues, each turn in the conversation can be conditioned on the previous turns, maintaining context and relevance.
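
The sketch below combines decomposition, iterative refinement, and contextual conditioning for the research-paper example; `llm` is a hypothetical completion function:

```python
# Multi-step generation sketch: each section is generated with all
# previously generated sections in context, then refined once.
# `llm` is a hypothetical text-completion function.

SECTIONS = ["abstract", "introduction", "methodology", "results", "conclusion"]

def generate_paper(topic: str, llm) -> dict:
    paper = {}
    for section in SECTIONS:
        context = "\n\n".join(f"{name}:\n{text}" for name, text in paper.items())
        prompt = (
            f"Topic: {topic}\n\nSections written so far:\n{context}\n\n"
            f"Write the {section}, staying consistent with the sections above."
        )
        draft = llm(prompt)
        # Iterative refinement: review and improve the draft before moving on.
        paper[section] = llm(f"Improve this {section} for accuracy and clarity:\n{draft}")
    return paper
```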

Step 3: Data Curation

High-Quality Sample Filtering: After generating the synthetic data, it is crucial to filter out low-quality samples. This can be achieved using heuristic metrics such as confidence scores, influence functions, and generation probabilities. For instance, samples with low confidence scores or high uncertainty can be discarded to ensure only high-quality data is retained.
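
As a minimal example, generation probability can be used directly: keep only samples whose mean token log-probability clears a threshold (the cutoff below is illustrative, not tuned):

```python
# Confidence-based filtering sketch: retain samples with a mean token
# log-probability above an illustrative threshold.

def keep_high_confidence(samples, min_avg_logprob: float = -1.0):
    kept = []
    for s in samples:  # each sample: {"text": ..., "token_logprobs": [...]}
        logprobs = s["token_logprobs"]
        if logprobs and sum(logprobs) / len(logprobs) >= min_avg_logprob:
            kept.append(s)
    return kept
```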

Label Enhancement: Generated labels often need verification or refinement. This can be done through human intervention or by using auxiliary models for knowledge distillation. For example, in a synthetic dataset of annotated images, human reviewers can verify and correct the labels, or a student model can be used to refine the annotations based on feedback from the teacher model.

Re-Weighting Strategies: Instead of discarding low-quality data, re-weighting strategies can be employed to assign varying importance to different samples. This ensures that influential and correctly annotated samples have a larger impact on the training process. For instance, in a synthetic text dataset, samples with higher relevance and accuracy can be given more weight during model training.
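
A toy re-weighting scheme, assuming each sample carries a quality score in [0, 1] from the filtering step; the linear mapping is illustrative. During training, the per-sample loss is simply multiplied by this weight:

```python
# Re-weighting sketch: derive a per-sample training weight from a quality
# score instead of discarding low-quality samples outright.

def assign_weights(samples: list[dict]) -> list[dict]:
    for s in samples:  # each sample: {"text": ..., "quality": float in [0, 1]}
        s["weight"] = 0.1 + 0.9 * s["quality"]  # floor keeps weak samples in play
    return samples
```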

Bias Mitigation: Generated data should be checked for skewed representation. This involves conducting fairness assessments and using techniques to balance the representation of different classes and demographics. For instance, in a synthetic dataset for customer feedback analysis, ensure balanced representation of positive, negative, and neutral sentiments across different demographic groups.

By following this comprehensive framework, researchers and practitioners can generate high-quality synthetic datasets that are accurate, diverse, and applicable to a wide range of AI tasks.

Build your Synthetic RAG Dataset

Here is a recipe to build your own synthetic dataset for training and evaluating RAG systems.

First, use an LLM (any GPT-style model) to create the dataset, and then filter it based on the context adherence score. This process ensures that the dataset is purely synthetic, as we won't rely on any pre-existing datasets.

Here are the steps for the complete process; a condensed code sketch follows the list:

  1. Define the prompt to generate synthetic data

  2. Create an OpenAI batch with the prompts

  3. Download the batch output and parse it

  4. Convert the data to RAG prompts and responses

  5. Evaluate the dataset with Galileo

  6. Download the data and remove samples with a low context adherence score

  7. Save the final high-quality dataset
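
The sketch below covers steps 1 through 4 with the OpenAI Batch API; the model name, file paths, topics, and prompt are illustrative, and steps 5 through 7 are left as comments since the Galileo evaluation runs through its own console and SDK:

```python
# Sketch of the recipe above. Model, paths, and prompts are illustrative;
# steps 5-7 (Galileo evaluation and adherence filtering) are placeholders.
import json
from openai import OpenAI

client = OpenAI()

# Step 1: a prompt that asks for a passage plus a grounded Q&A pair.
GEN_PROMPT = (
    "Write a short factual passage on the topic '{topic}', then a question "
    "answerable only from the passage, and its answer. Return JSON with "
    "keys: context, question, answer."
)

topics = ["solar panels", "coffee roasting", "container shipping"]

# Step 2: write batch requests as JSONL and submit them as an OpenAI batch.
with open("requests.jsonl", "w") as f:
    for i, topic in enumerate(topics):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": GEN_PROMPT.format(topic=topic)}],
                "response_format": {"type": "json_object"},
            },
        }) + "\n")

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Step 3: once batch.status == "completed", download the output file and
# parse each JSON line into {context, question, answer}.
# Step 4: format each record as a RAG pair, e.g.
#   prompt = f"Context: {context}\n\nQuestion: {question}"; response = answer.
# Steps 5-7: evaluate with Galileo, drop rows with a low context adherence
# score, and save the remainder as the final dataset.
```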

Future of Synthetic Datasets

The future of synthetic data looks very promising, with several key areas warranting further exploration.

Self-Improvement Capability

An intriguing question arises: can a model generate synthetic data that is better than the data it was trained on, thus enabling it to improve itself? This concept of self-improvement through synthetic data generation is an exciting avenue for future research. If a model can generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.

This process was leveraged in building Alibaba's Qwen 2 and Meta AI's Llama 3.1. Although there has been some success here, it remains to be proven whether something like GPT-5 could be created from GPT-4-generated synthetic data. Papers like "The Curse of Recursion: Training on Generated Data Makes Models Forget" have raised concerns that LLMs trained this way can become brittle.

Scaling Laws of Synthetic Data

Future research should investigate the scaling laws for synthetic data and determine the optimal balance between the quantity and quality of synthetic samples. Understanding how to scale synthetic data effectively can help maximize its benefits for AI model training and evaluation.

Improving Quality and Diversity

There is still room for improvement in creating high-quality, attributed synthetic samples that closely mimic real-world data. Future research should focus on developing more advanced techniques to control and manipulate specific attributes of the generated data. This will naturally improve as the instruction-following capability of LLMs improves.

Conclusion

Synthetic data has emerged as a promising solution to address the challenges of data scarcity, privacy concerns, and high costs in AI development.

Many companies are already generating realistic and diverse synthetic datasets to enable the training and evaluation of AI models at scale across various domains. Despite the challenges, the potential benefits of synthetic data in advancing AI research are substantial. Are you ready to generate some high-quality synthetic data with Galileo?

References

Best Practices and Lessons Learned on Synthetic Data for Language Models

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Cosmopedia: how to create large-scale synthetic data for pre-training

New LLM Pre-training and Post-training Paradigms
