Mastering Data: Generate Synthetic Data for RAG in Just $10

Pratik Bhavsar, Galileo Labs
13 min read · September 10, 2024

High-quality AI models rely heavily on large, diverse, high-quality datasets for training and evaluation. But acquiring these datasets can be a significant challenge due to data scarcity, privacy concerns, and the cost of data collection and annotation. Synthetic data has emerged as a promising way to address these challenges. This blog explores its benefits, its limitations, and how it can be used effectively for training and evaluating LLMs.

What is a Synthetic Dataset?

Synthetic data refers to artificially generated data that mimics the characteristics and patterns of real-world data. Unlike real data, which is collected from actual events or observations, synthetic data is created through algorithms, generative models, or simulations. This type of data can be tailored to specific requirements, ensuring a balanced representation of different classes and introducing controlled variations to improve model performance and generalization.

Synthetic Data in Building LLMs

The development and training of LLMs have evolved significantly, incorporating both pre-training and post-training methodologies. Here are some key ways synthetic data can be utilized in the training pipelines of modern LLMs, as illustrated by recent advancements in models like Alibaba's Qwen 2, Apple's AFM, Google's Gemma 2, and Meta AI's Llama 3.1.

Pre-Training with Synthetic Data

Here are some examples of how synthetic data is used for pre-training LLMs.

Data Augmentation and Diversity

Qwen 2: The Qwen models leveraged synthetic data generated by previous iterations of Qwen models to augment their pre-training datasets. This approach helped enhance data diversity and improve the model's ability to handle various tasks.

AFM: Apple's AFM models included synthetic data in their pre-training stages, particularly for context lengthening. They augmented their datasets with synthetic long-context Q&A data to improve the model's performance on tasks requiring long-term dependencies.

Gemma 2: Google's Gemma models also utilized synthetic data generated through knowledge distillation. Smaller models were trained using outputs from larger teacher models, enriching the training data and improving the efficiency of the training process.

Improving Data Quality

Qwen 2: Emphasized improving the data filtering pipeline to remove low-quality data and enhance data mixing, ensuring the synthetic data used was of high quality.

AFM: Focused on using high-quality synthetic data for continued pre-training, particularly for math and code tasks. This ensured that the model received high-quality signals during training.

Context Lengthening

Qwen 2: Performed long-context training in the later stages of pre-training, using high-quality, lengthy synthetic data to increase the context length from 4,096 to 32,768 tokens.

AFM: Included a dedicated pre-training stage for context lengthening, where synthetic data was used to train the model on longer sequences, enhancing its ability to handle extended contexts.

Post-Training with Synthetic Data

Similarly, here are examples of how synthetic data is leveraged for post-training of LLMs.

Supervised Instruction Fine-Tuning (SFT)

Qwen 2: Used synthetic data to create instruction-response pairs, particularly for "high-quality literary data," to refine the model's response accuracy in predetermined scenarios.

AFM: Leveraged both human-annotated and synthetic data for SFT, fine-tuning the data mixture through multiple experiments to achieve the optimal balance.

Reinforcement Learning with Human Feedback (RLHF)

Qwen 2: Employed a two-stage alignment phase using Direct Preference Optimization (DPO) on an existing dataset and in real-time during training. Synthetic data played a role in forming the preference pairs for optimization.

AFM: Introduced new algorithms like Rejection Sampling Fine-tuning with Teacher Committee (iTeC) and RLHF with Mirror Descent Policy Optimization, using synthetic data to generate multiple responses and select the best ones for training.

Rejection Sampling

Qwen 2 and AFM: Used synthetic data to generate multiple responses during training, with a reward model selecting the preferred response. This approach, often called rejection sampling, helps refine the model's alignment with human preferences.
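
To make the mechanics concrete, here is a minimal sketch of best-of-n rejection sampling. The generate_response and reward_model_score functions are hypothetical placeholders, not components of either model's actual training stack:

# Minimal best-of-n rejection sampling sketch.
def generate_response(prompt: str) -> str:
    return "candidate response for: " + prompt  # placeholder for an LLM call

def reward_model_score(prompt: str, response: str) -> float:
    return float(len(response))  # placeholder for a reward model score

def rejection_sample(prompt: str, n: int = 8) -> dict:
    # Sample several candidates and keep the one the reward model prefers.
    candidates = [generate_response(prompt) for _ in range(n)]
    scores = [reward_model_score(prompt, c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    # The (prompt, best response) pair becomes a fine-tuning / preference sample.
    return {"prompt": prompt, "response": best}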

Model Distillation

Gemma 2: Applied knowledge distillation during both pre-training and post-training, using synthetic data generated by teacher models to train smaller models. This method, combined with model averaging techniques, helped stabilize and improve performance over time.

Llama 3.1: Employed model averaging not only for the reward models but also for the SFT and DPO models, using synthetic data to enhance the training process.

In addition, other state-of-the-art models, such as Phi-3 and NuminaMath, leverage synthetic data heavily to develop high-performance small language models (SLMs).

Benefits of Synthetic Datasets

Synthetic data offers several benefits, making it an attractive option for training and evaluating AI models.

Limitless Data

Synthetic data can be generated at scale, providing an abundant supply of training and testing data for AI models. This is particularly valuable in domains where real-world data is scarce or difficult to obtain. For example, generating synthetic chat data can help improve chat models by covering various conditions and scenarios.

Absence of Closed Domain Data

In certain domains, such as healthcare, obtaining real-world data can be challenging due to privacy concerns and regulatory restrictions; patient data, for example, is highly sensitive and subject to strict privacy laws. Synthetic data can mimic real-world data without exposing sensitive information, enabling researchers and developers to build AI models in these domains while adhering to privacy regulations.

Improved Model Performance

The limits of human creativity can restrict the diversity and variety of data that can be generated manually. For example, creating diverse and realistic scenarios for training autonomous vehicles or medical diagnosis systems can be challenging. Synthetic data can overcome this limitation by using algorithms to create diverse and varied datasets that capture a wide range of scenarios and conditions, enhancing the robustness and generalization of AI models.

Cheaper Data

One of the primary challenges in creating high-quality datasets is the cost of data annotation. Annotating large volumes of data requires significant human effort and resources, making it costly and time-consuming. Synthetic data generation automates much of this work, producing labeled examples at a fraction of the cost of manual annotation.

Privacy Compliant

Synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information. This is crucial in domains such as healthcare, where patient privacy is of utmost importance.

Limitations of Synthetic Datasets

Despite its promise, synthetic data is not perfect and has limitations that must be addressed.

Ensuring Quality

One of the main challenges is ensuring the factuality and fidelity of synthetic data. Models trained on false, hallucinated, or biased synthetic data may fail to generalize to real-world scenarios. For example, if a language model is trained on synthetic data that contains factual errors, it may produce inaccurate or misleading responses. Researchers must develop sophisticated generative models and evaluation metrics to create synthetic data that accurately reflects real-world data's complex patterns and relationships.

Bias Amplification

Synthetic data can amplify or introduce biases if not carefully designed and validated. For instance, the resulting AI model may exhibit biased behavior if the synthetic data generation process is biased towards certain demographics. Rigorous testing and fairness assessments are necessary to mitigate these risks and ensure that synthetic data does not perpetuate or exacerbate biases.

Evaluation Contamination

Using synthetic data in model training poses significant challenges to fair evaluation. Evaluation benchmarks are often created by referring to public text sources, which can lead to contamination if the synthetic data includes rephrased versions of the benchmark data. This can result in inflated performance metrics and misleading conclusions about the model's capabilities. Developing robust evaluation protocols and contamination detection techniques is essential to address this challenge.
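
As one simple illustration (not a standard benchmark protocol), contaminated samples could be flagged with a word n-gram overlap check against benchmark items:

# Rough contamination check: flag a synthetic sample if it shares a large
# fraction of its word 8-grams with any benchmark example.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(synthetic_text, benchmark_texts, threshold=0.5):
    sample_ngrams = ngrams(synthetic_text)
    if not sample_ngrams:
        return False
    for bench in benchmark_texts:
        overlap = len(sample_ngrams & ngrams(bench)) / len(sample_ngrams)
        if overlap >= threshold:
            return True
    return False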

Survey on Synthetic Datasets

Recent advancements in synthetic data generation have led to various synthetic datasets for different domains. For example, synthetic data has been used to improve performance on math-related tasks, code reasoning, tool-using abilities, and multilingual language models. These datasets have demonstrated synthetic data's potential in enhancing AI models' capabilities.

Reasoning Tasks

Synthetic data has been effectively used in reasoning tasks such as mathematical problem-solving and code generation. For example, models like Minerva, Llemma, and DeepSeekMath have been trained on synthetic math-targeted pre-training data, improving their performance on math-related tasks. Similarly, synthetic data has been used to generate complex questions and answers, enhancing the reasoning capabilities of language models.

Tool-Using and Planning

LLMs are increasingly used to build agents, which must select and use tools. Synthetic data has enabled language models to learn these tool-using abilities and planning skills. For example, models like GPT-4o, Claude 3.5 Sonnet, and Toolformer have been trained on interaction data annotated with calls to appropriate tools, enabling them to use calculators, search engines, and machine translators effectively. Synthetic trajectories in simulated environments have been used to teach models planning skills, such as decomposing complex tasks into subtasks and completing them in a reward-optimal way.

Multimodality

In multimodal tasks, synthetic data has been used to align visual input with language models. For example, Pix2Struct and MatCha have been trained on synthetic image-caption pairs generated from HTML code and tabular data, respectively. This has enabled these models to accurately ground visual input to language, improving their performance on tasks such as derendering screenshots and converting webpage screenshots into code.

Multilingual Data

Multilingual models are difficult to build due to the lack of annotated data. Synthetic data has been key in improving multilingual language models by creating synthetic parallel training data from monolingual data sources. Techniques such as back-translation have been employed to generate synthetic multilingual question-answer pairs, enhancing the performance of language models on multilingual and cross-lingual question answering tasks.

A Framework for High-Quality Synthetic Datasets

Creating high-quality synthetic datasets is a multi-faceted challenge that requires a systematic and meticulous approach. This section outlines a robust framework to guide synthetic data generation, ensuring it meets the highest standards of accuracy, diversity, and applicability.

Step 1: Prompt Engineering

Task Specification: The first step in prompt engineering is clearly defining the task. This involves providing the necessary context, background information, and specific instructions that the model needs to understand the task. For instance, if the task is to generate synthetic medical records, the prompt should include details about the type of medical conditions, patient demographics, and the format of the records.

Generation Conditions: Next, define the attributes and characteristics of the desired data. This could include specifying the length of the text, the style of writing, and any particular focus areas. For example, in generating synthetic legal documents, the conditions might specify the inclusion of certain legal terminologies and the structure of the document.

In-Context Demonstrations: Providing examples or demonstrations within the prompt can significantly enhance the model's understanding and performance. These examples act as a guide, showing the model the desired output format and content. For instance, if the task is to generate customer service interactions, including a few example dialogues can help the model produce more accurate and relevant responses.

Step 2: Multi-Step Generation

Decomposition of Tasks: For complex data generation tasks, it is often beneficial to break down the task into smaller, manageable sub-tasks. This step-by-step approach can help ensure that each component of the data is generated accurately. For example, generating a synthetic research paper might involve separate steps for creating the abstract, introduction, methodology, results, and conclusion.

Iterative Refinement: Multi-step generation allows for iterative refinement, where the output from one step can be reviewed and improved before moving on to the next. This iterative process helps in catching and correcting errors early, ensuring higher quality in the final dataset. For instance, in generating synthetic financial reports, the initial draft can be reviewed for accuracy and completeness before adding detailed financial statements.

Contextual Conditioning: Each step of the multi-step generation can be conditioned on the outputs of previous steps. This ensures coherence and logical flow in the generated data. For example, in generating synthetic dialogues, each turn in the conversation can be conditioned on the previous turns, maintaining context and relevance.
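
Here is a minimal sketch of contextual conditioning in a multi-step pipeline; call_llm is a hypothetical stand-in for a real LLM call, and the report structure is illustrative:

# Minimal multi-step generation sketch: each step's prompt includes
# the outputs of earlier steps, keeping the generated document coherent.
def call_llm(prompt: str) -> str:
    return "LLM output for: " + prompt  # placeholder for a real LLM call

def generate_report(topic: str) -> str:
    # Step 1: generate an outline.
    outline = call_llm(f"Write a section outline for a short report on {topic}.")
    # Step 2: generate each section, conditioned on the outline and prior sections.
    sections = []
    for heading in outline.splitlines():
        so_far = "\n\n".join(sections)
        section = call_llm(
            "Outline:\n" + outline
            + "\n\nSections written so far:\n" + so_far
            + "\n\nWrite the section titled '" + heading + "'."
        )
        sections.append(section)
    return "\n\n".join(sections)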

Step 3: Data Curation

High-Quality Sample Filtering: After generating the synthetic data, it is crucial to filter out low-quality samples. This can be achieved using heuristic metrics such as confidence scores, influence functions, and generation probabilities. For instance, samples with low confidence scores or high uncertainty can be discarded to ensure only high-quality data is retained.

Label Enhancement: This can be done through human intervention or by using auxiliary models for knowledge distillation. For example, in a synthetic dataset of annotated images, human reviewers can verify and correct the labels, or a student model can be used to refine the annotations based on feedback from the teacher model.

Re-Weighting Strategies: Instead of discarding low-quality data, re-weighting strategies can be employed to assign varying importance to different samples. This ensures that influential and correctly annotated samples have a larger impact on the training process. For instance, in a synthetic text dataset, samples with higher relevance and accuracy can be given more weight during model training.

Bias Mitigation: This involves conducting fairness assessments and using techniques to balance the representation of different classes and demographics. For instance, in a synthetic dataset for customer feedback analysis, ensure balanced representation of positive, negative, and neutral sentiments across different demographic groups.
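
As an illustration of the filtering and re-weighting ideas above, per-sample quality scores can drive both steps; the column names, scores, and threshold here are arbitrary examples, not part of the pipeline described later:

import pandas as pd

# Each row is a synthetic sample with a quality score in [0, 1]
# (e.g., a model confidence or an evaluation metric).
df = pd.DataFrame({
    "text": ["sample a", "sample b", "sample c"],
    "quality": [0.95, 0.40, 0.75],
})

# Filtering: drop clearly low-quality samples.
df = df[df.quality > 0.5].copy()

# Re-weighting: give the remaining samples a training weight
# proportional to their quality instead of treating them equally.
df["sample_weight"] = df.quality / df.quality.sum()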

By following this comprehensive framework, researchers and practitioners can generate high-quality synthetic datasets that are accurate, diverse, and applicable to a wide range of AI tasks.

Build your Synthetic RAG Dataset

Synthetic data generation pipeline

Enough with the theory — let's build our own synthetic dataset for training and evaluating RAG systems! Our goal is to generate prompt-and-response pairs entirely from scratch.

First, we'll use GPT-4o to create the dataset and then filter it based on ChainPoll's context adherence score. This process ensures that our dataset is purely synthetic, as we won't rely on any pre-existing datasets.

Here are the steps for the complete process:

  1. Define the prompt to generate synthetic data
  2. Create an OpenAI batch with the prompts
  3. Download the batch output and parse it
  4. Convert the data to RAG prompts and responses
  5. Evaluate the dataset with Galileo
  6. Analyze errors in the evaluated samples
  7. Download the data and remove samples with low context adherence scores
  8. Save the final high-quality dataset

Let's get started!

1. Define the Prompt to Generate Synthetic Data

We define the imports needed for generating the data.

import random, json, yaml
from string import Template

from dotenv import load_dotenv
from openai import OpenAI
import pandas as pd
from tqdm import tqdm

from fortune_500 import companies

load_dotenv("../.env")
random.seed(42)


Now, let's configure the generation LLM, system prompt, and instruction prompt for producing RAG samples. The prompt will outline detailed requirements to ensure high-quality, diverse outputs. We will instruct the LLM to generate context, questions, and answers related to financial text for companies listed in the S&P 500. We specify various types of questions for a given context to achieve sufficient diversity.

MODEL = "gpt-4o-2024-08-06"
SYSTEM = "You are an expert in data annotation for machine learning models, specifically in the areas of LLM and generative AI."

PROMPT = Template("""**Task Overview**
Your job is to create structured data in a specific format (YAML) that includes:

1. **Context:** This should be a collection of at least 10 paragraphs from quarterly and yearly reports of various companies in the S&P 500 list. The paragraphs can vary in length (10-15 sentences) and should contain both text and a table in markdown format with at least *10* rows and *5* columns. 

2. **Questions:** You need to create a list of complex questions based on the context. These questions should require deep reasoning, involve multiple references, or be based on the information in the table. Include questions for each of these types:
- reasoning: questions that require synthesizing information from multiple paragraphs.
- cannot answer: questions that cannot be answered with the provided context. Say "I do not have the information to answer this question."
- tabular: questions that specifically ask for information from the table. 
- extractive: questions that require extracting specific entities from the context.
- math: questions that involve different type of relevant math calculations.

3. **Answers:** Provide a concise answer based on the context for each question. If a question cannot be answered with the given information, state that you do not have the information. For math questions, show the calculations used to arrive at the answer.

# Schema of yaml output
Sector: $sector
Company: $company_name
Context: List[str] 
Questions : List[Tuple[type: str, question: str]] 
Answers: List[str]

Don't generate anything after generating the YAML output.
""")

2. Create an OpenAI Batch with the Prompts

While some applications require synchronous requests, in many scenarios immediate responses are not essential, or rate limits prevent running a large number of queries quickly. In such situations, batch processing jobs can be highly beneficial, particularly for tasks like:

- Running evaluations

- Classifying large datasets

- Embedding content repositories

The Batch API provides a user-friendly set of endpoints that enable you to compile a set of requests into a single file, initiate a batch processing job to execute these requests, monitor the status of the batch as the individual requests are processed, and ultimately retrieve the results once the batch is complete.

Compared to using the standard endpoints directly, the Batch API offers several advantages:

- Better cost efficiency: a 50% cost reduction compared to synchronous APIs.

- Higher rate limits: significantly higher capacity compared to synchronous APIs.

- Fast completion times: each batch is completed within 24 hours, often sooner.

Now, we will create an OpenAI batch to process 1000 rows.

client = OpenAI()

# Create a file with requests
file_path = "../data/syn_data_rag/input/data.jsonl"
print(file_path)
with open(file_path, "w+") as f:
    for i in range(2000):
        sector, company_name = random.choice(companies)
        input = {"custom_id": str(i),
                 "method": "POST", "url": "/v1/chat/completions",
                 "body": {"model": MODEL,
                          "messages": [{"role": "system", "content": SYSTEM},
                                       {"role": "user", "content": PROMPT.substitute({'sector': sector, 'company_name': company_name})}],
                          "max_tokens": 5000,
                          "temperature": 1.0}}
        f.write(json.dumps(input))
        f.write("\n")

# Upload the file to OpenAI
batch_input_file_id = client.files.create(
    file=open(file_path, "rb"),
    purpose="batch"
).id

# Create a batch
client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": f"synthetic data {i}"
    }
)

# Save the output (once the batch has completed; see the polling sketch below)
file_response = client.files.content(client.batches.list(limit=1).data[0].output_file_id)
with open("../data/syn_data_rag/output/test.jsonl", "w+") as f:
    f.write(file_response.text)
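
Note that a batch takes time to finish, so the output file is not available immediately after creation. Here is a minimal polling sketch, reusing the client from above and the most recent batch (as the code above does); the output path and sleep interval are illustrative:

import time

# Fetch the most recent batch and wait until it reaches a terminal status.
batch = client.batches.list(limit=1).data[0]
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

# Download the results once the batch has completed.
if batch.status == "completed":
    file_response = client.files.content(batch.output_file_id)
    with open("../data/syn_data_rag/output/data.jsonl", "w+") as f:
        f.write(file_response.text)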

3. Process the Batch Response

Now, we download the batch output. Occasionally, the LLM might produce data that doesn't conform to our defined schema, which can cause processing errors. To address this, we create Pydantic schemas to validate the response and reject those with errors.

from typing import List
from pydantic import BaseModel, ValidationError

class Question(BaseModel):
    type: str
    question: str

class Schema(BaseModel):
    Sector: str
    Company: str
    Context: List[str]
    Questions: List[Question]
    Answers: List[str]

def validate_dict(data: dict) -> bool:
    try:
        validated_data = Schema(**data)
        return True
    except ValidationError as e:
        # print(e.json())
        return False


Now we load the batch output and parse it. Each response often contains multiple question and answer pairs, so we structure them into individual context, question, and answer sets. This process effectively transforms 1,000 outputs into 7,000 distinct pairs.

# format the context
def get_context_for_prompt(x):
    context = ''
    for i, p in enumerate(x):
        context += f"\n{i+1}. {p}"
    return context

# read the file
with open("../data/syn_data_rag/output/data.jsonl", "r") as f:
    data = f.readlines()

# parse the data
data_formatted = []
for line in data:
    d = json.loads(line)['response']['body']['choices'][0]['message']['content'].replace("```yaml\n", "").replace("\n```", "")
    try:
        d = yaml.load(d, Loader=yaml.FullLoader)
        if validate_dict(d):
            for question, answer in zip(d['Questions'], d['Answers']):
                random.shuffle(d['Context'])  # shuffle every time to create variability
                context = get_context_for_prompt(d['Context'])
                data_formatted.append((d['Sector'], context, question['type'], question['question'], answer))
    except:
        pass

# create a dataframe
df = pd.DataFrame(data_formatted, columns=['sector', 'context', 'question_type', 'question', 'answer'])

df.to_parquet("../data/syn_data_rag/extracted/data.parquet", index=False)

4. Get RAG Prompt and Response

Now we create a new notebook and add the necessary imports.

import time
from uuid import uuid4

from dotenv import load_dotenv
import pandas as pd
import promptquality as pq
from promptquality import NodeType, NodeRow

load_dotenv("../.env")
pq.login('https://console.demo.rungalileo.io')


We set the project and run names specifically for Galileo, enabling only groundedness while disabling all other metrics. Groundedness is defined as adherence to context based on ChainPoll.

project_name = "rag-syn-data-cleaning"
run_name = "gpt-4o-2024-08-06-v1"

config = {
    "groundedness": True,
    "toxicity": False,
    "factuality": False,
    "context_relevance": False,
    "sexist": False,
    "pii": False,
    "prompt_perplexity": False,
    "chunk_attribution_utilization_gpt": False,
    "completeness_gpt": False,
    "tone": False,
    "prompt_injection": False
}

df = pd.read_parquet("../data/syn_data_rag/extracted/data.parquet").sample(1000, random_state=0)


Next, we use the context, question, and response to form pairs of prompts and responses.

instruction = "Answer the question using the information in the context."
template = """Context: {context}

Task: {instruction}
Question: {question}"""

df["prompt"] = df.apply(lambda x: template.format(instruction=instruction, question=x["question"], context=x["context"]), axis=1)


Here is a sample prompt containing the context with many paragraphs, the task description, and the question.

Prompt

5. Evaluate Dataset with Galileo

Now, we construct the chain and its nodes using the prompt and response. Finally, we upload the data to the Galileo console.

# create nodes
nodes = []
for prompt, response in zip(df.prompt, df.answer):
    root_id = uuid4()
    llm_id = uuid4()

    nodes.append(NodeRow(
        node_id=root_id,
        chain_root_id=root_id,
        node_type=NodeType.chain,
        step=0,
        node_input=prompt,
        node_output=response,
    ))
    nodes.append(NodeRow(
        node_id=llm_id,
        chain_root_id=root_id,
        chain_id=root_id,
        node_type=NodeType.llm,
        step=1,
        node_input=prompt,
        node_output=response,
    ))

# push to galileo
pq.chain_run(
    rows=nodes,
    scorers_config=pq.ScorersConfiguration(**config),
    project_name=project_name,
    run_name=run_name
)


6. Error Analysis

After completing the run, we can log in and navigate to the project in the console. Here, we can view the statistics for the run, including the average context adherence.

Below are the scores for various runs. The Run Insight panel indicates that 86 out of 1000 samples exhibit low context adherence, which suggests a high likelihood of hallucinations. Ideally, these samples should be excluded from the dataset.

Galileo Run Insights

Let's look at some of these samples to understand the types of errors.

Error example 1 (wrong tabular understanding)

In this sample, the LLM made an error while understanding the table.


Question: Which metric showed the highest year-over-year growth for Huntington Bancshares in Q3 2023?

Response: The metric that showed the highest year-over-year growth for Huntington Bancshares in Q3 2023 is ROA with 18.2% growth.

Context adherence explanation: Upon reviewing the figures, the highest year-over-year growth is actually in 'Deposit Growth' at 33.3%. Although the ROA grew by 18.2%, it is not the highest; therefore, the claim made in the response is incorrect.

Error example 2 (wrong calculation)

In this sample, the LLM made an error while calculating the average.

Question: What is the average revenue growth rate for Axon Enterprise across all quarters?

Response: The average revenue growth rate for Axon Enterprise across all quarters is calculated as follows: (12% + 4.44% + 14.89% + 1.85%) / 4 = 8.795%.

Context adherence explanation: The correct average is (12% + 4.44% + 14.89% + 1.85%) / 4 = 8.295%, not the 8.795% stated in the response.

Error example 3 (wrong tabular understanding)

In this sample, the LLM made an error while understanding the table.

Question: Who is Sysco's closest competitor in terms of market share?

Response: US Foods, with a market share of 25%.

Context adherence explanation: US Foods is correct, but the market share should be 30%, not 25%.

These examples give us an idea of the issues with our generation LLM. The errors are often hard to spot manually, but the explanations significantly reduce the time needed to find them.

7. Filter Low Context Adherence Samples

To obtain the scores from the console and filter out samples with low adherence, we can use the promptquality library along with the project and run names to retrieve the data and then apply a threshold.

project_name = "rag-syn-data-cleaning"
run_name = "gpt-4o-2024-08-06-v1"

def get_run_data(project_name, run_name):
    print(f"Getting data for {project_name}: {run_name}")
    project_id = pq.get_project_from_name(project_name).id
    run_id = pq.get_run_from_name(run_name, project_id).id
    rows = pq.get_rows(project_id=project_id, run_id=run_id, task_type=None, config=None, starting_token=0, limit=10000)
    rows = [row for row in rows if row.has_children]
    context_adherence_scores = [row.metrics.context_adherence for row in rows]
    prompts = [row.node_input for row in rows]
    responses = [row.node_output for row in rows]
    return pd.DataFrame({"prompt": prompts, "response": responses, "context_adherence": context_adherence_scores})

df = get_run_data(project_name, run_name)

Let's have a quick look at the distribution of the scores.

# histogram plot
df.context_adherence.hist()


We can see that around 87% of the samples have scores above 0.8. The remaining scores are distributed between 0, 0.33, and 0.66.

Context Adherence vs Number of Samples

print(len(df))
df = df[df.context_adherence > 0.8]
print(len(df))

# Output
# 1000
# 872


8. Save the Final High-Quality RAG Dataset

Finally, we save the samples with a high context adherence score, which we can later use for training and evaluating models.

df.to_parquet("../data/syn_data_rag/filtered/data.parquet", index=False)

Cost of Creating the Synthetic Dataset

Here is the calculation of the GPT-4o-2024-08-06 generation cost using the OpenAI Batch API.

Generation Cost

- Average prompt length: 380 tokens
- Average response length: 1,800 tokens
- Prompt cost: 380 tokens * $1.875 / 1M tokens = $0.000712
- Response cost: 1,800 tokens * $7.5 / 1M tokens = $0.013500
- Total generation cost per response: $0.000712 + $0.013500 = $0.014212


Cost per Question and Answer Pair

- Each response generates 10 question-answer pairs.
- Cost per pair: $0.014212 / 10 = $0.001421
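
As a quick sanity check of the per-response and per-pair figures above, here is the same arithmetic in code; the token counts and per-token rates are simply the assumptions listed above:

PROMPT_TOKENS = 380
RESPONSE_TOKENS = 1800
INPUT_RATE = 1.875 / 1_000_000   # $ per input token (batch rate assumed above)
OUTPUT_RATE = 7.5 / 1_000_000    # $ per output token (batch rate assumed above)
PAIRS_PER_RESPONSE = 10

cost_per_response = PROMPT_TOKENS * INPUT_RATE + RESPONSE_TOKENS * OUTPUT_RATE
cost_per_pair = cost_per_response / PAIRS_PER_RESPONSE

print(f"Cost per response: ${cost_per_response:.6f}")  # ~$0.014212
print(f"Cost per Q&A pair: ${cost_per_pair:.6f}")      # ~$0.001421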


Cost for 1,000 Pairs for Evaluation

- Generation cost for 1,000 pairs: 1000 * $0.0014212 = $0.14212
- Evaluation cost (from the Galileo console): $0.6051
- Total cost for generating and evaluating 1,000 samples: $0.14212 + $0.6051 = $0.74722


Context Adherence Adjustment

- Out of 1,000 samples, 86 had low context adherence scores and were removed.
- Remaining samples: 914


Cost per Sample Calculation

- Cost per sample: $0.74722 / 914 = $0.00081753
- For evaluation, we can start with 10k samples: $0.00081753 * 10k = $8.175
- For training an LLM, we need around 1M samples: $0.00081753 * 1M = $817.52


This shows that we can create a 10k-sample evaluation dataset for less than $10 and a 1M-sample training dataset for less than $1,000. Isn't that incredible?


Future of Synthetic Datasets

The future of synthetic data looks very promising, with several key areas warranting further exploration.

Self-Improvement Capability

An intriguing question arises: can a model generate synthetic data that is better than the data it was trained on, thus enabling it to improve itself? This concept of self-improvement through synthetic data generation is an exciting avenue for future research. If a model can generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.

This process was leveraged in building Alibaba's Qwen 2 and Meta AI's Llama 3.1. Although there has been some success here, it’s still to be proven whether something like GPT-5 could be created from GPT-4 generated synthetic data. Papers like The Curse of Recursion: Training on Generated Data Makes Models Forget have raised concerns about LLMs being brittle if trained in such a manner.

Scaling Laws of Synthetic Data

Future research should investigate the scaling laws for synthetic data and determine the optimal balance between the quantity and quality of synthetic samples. Understanding how to scale synthetic data effectively can help maximize its benefits for AI model training and evaluation.

Improving Quality and Diversity

There is still room for improvement in creating high-quality, attributed synthetic samples that closely mimic real-world data. Future research should focus on developing new advanced techniques to control and manipulate specific attributes of the generated data. This will naturally improve as the instruction-following capability of LLMs improves.

Conclusion

Synthetic data has emerged as a promising solution to address the challenges of data scarcity, privacy concerns, and high costs in AI development.

Many companies are already generating realistic and diverse synthetic datasets to enable the training and evaluation of AI models at scale across various domains. Despite the challenges, the potential benefits of synthetic data in advancing AI research are unparalleled. Are you ready to generate some high-quality synthetic data with Galileo?


References

Best Practices and Lessons Learned on Synthetic Data for Language Models

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Cosmopedia: how to create large-scale synthetic data for pre-training

New LLM Pre-training and Post-training Paradigms