10 min read ~1.5 hour project ~$2 Beginner
This example demonstrates how you can quickly, easily, and inexpensively generate synthetic data using the Sutro Python SDK.
The Goal
Our goal today will be to generate a dataset of 20,000 high-quality, synthetic product reviews. This could be useful for:
- Training/evaluating sentiment analysis models, recommendation systems, spam classifiers, and other machine learning models
- Market research simulations, A/B testing, customer segmentation
- … and more!
We’ll get there in four steps:
- Start by generating a basic dataset of 100 product reviews.
- Add structure and randomness (diversity) to the reviews.
- Add representation to the reviews, so that they are representative of underlying real-world data.
- Scale up to 20,000 reviews, seamlessly and inexpensively.
Baby Steps
First, make sure you have the Sutro Python SDK installed. This will include all dependencies required for the examples. Let’s start by creating a basic dataset of 100 product reviews.
As a p0 job, this should take a few minutes to run. You should see something like the following when you run the job (GIF sped up for brevity):
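The basic 100-review job described above can be sketched roughly as follows. This is a minimal sketch, not the guide's original code: the system prompt wording, the `build_inputs` helper, and the `job_config` keys are all hypothetical, and the actual submission call is deliberately left out because its exact name and arguments should come from the Sutro SDK documentation.

```python
# Hypothetical system prompt; the wording is illustrative only.
SYSTEM_PROMPT = (
    "You are a customer writing an honest product review. "
    "Write one short, realistic review for a consumer product."
)


def build_inputs(n_reviews: int) -> list[str]:
    """Return one user prompt per review we want generated."""
    return [f"Write product review number {i + 1}." for i in range(n_reviews)]


inputs = build_inputs(100)

# Job settings gathered in one place. The model name comes from this guide;
# the dictionary keys are placeholders, not documented SDK arguments.
job_config = {
    "model": "qwen-3-4b",
    "system_prompt": SYSTEM_PROMPT,
    "num_inputs": len(inputs),
}

# Submitting the job is left out on purpose: check the Sutro SDK docs for the
# exact function name and arguments before wiring `inputs` and `job_config` in.
print(job_config["num_inputs"], "prompts ready, e.g.:", inputs[0])
```

With every prompt nearly identical, the reviews that come back tend to look alike, which is what the next section sets out to fix.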


Let’s Walk - Adding Structure & Randomness
To begin, let’s see if we can introduce some more diversity and structure into our reviews. We’ll do this in a few ways:
- Update the system prompt to include specific fields.
- Add a random numerical seed to each of the inputs to increase diversity.
- Modify the temperature sampling parameter to increase randomness.
- Add a Pydantic model to enforce a schema structure of the reviews we want to generate.
- Increase the model size from qwen-3-4b to qwen-3-14b to sample from more world knowledge.
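Here is one way the changes listed above might fit together. Treat it as a sketch under assumptions: the review fields are hypothetical (the guide does not list them), the temperature value of 1.2 is arbitrary, and a standard-library dataclass stands in for the Pydantic model so the snippet runs without extra dependencies. The same fields could be declared on a Pydantic `BaseModel` instead.

```python
import json
import random
from dataclasses import dataclass


# Hypothetical review fields; the guide does not list the actual ones.
@dataclass
class Review:
    product_name: str
    rating: int
    title: str
    body: str

    def __post_init__(self) -> None:
        # Reject ratings outside the usual 1-5 star range.
        if not 1 <= self.rating <= 5:
            raise ValueError(f"rating must be between 1 and 5, got {self.rating}")


# System prompt updated to name the specific fields we expect back.
SYSTEM_PROMPT = (
    "Write a realistic product review. Respond with a JSON object containing "
    "the fields: product_name, rating (integer 1-5), title, body."
)


def build_seeded_inputs(n_reviews: int, rng: random.Random) -> list[str]:
    """Attach a random integer to each prompt so identical prompts diverge."""
    return [
        f"Write a product review. Random seed: {rng.randint(0, 1_000_000)}"
        for _ in range(n_reviews)
    ]


def parse_review(raw_json: str) -> Review:
    """Parse one response; fails on missing or unexpected keys and bad ratings."""
    return Review(**json.loads(raw_json))


rng = random.Random(42)  # fixed only so the example is reproducible
inputs = build_seeded_inputs(100, rng)

# A higher temperature makes sampling more random; 1.2 is an arbitrary choice.
sampling_params = {"temperature": 1.2}
model = "qwen-3-14b"  # the larger model named in this guide

example = parse_review(
    '{"product_name": "Desk Lamp", "rating": 4, '
    '"title": "Bright and sturdy", "body": "Does the job well."}'
)
print(example.rating, inputs[0])
```

Validating each response against a fixed set of fields means malformed outputs fail loudly instead of quietly polluting the dataset.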

Time to Run - Adding Representation
For most valuable use cases, we want a final dataset that is representative of real-world data. To achieve this, we want not just diversity for its own sake, but diversity that follows the real-world distribution of the data we’re trying to represent. There are various levels of complexity to achieve this, but for this example we’ll take a simple approach by using two other “seed” datasets to produce the representation we’re looking for.
In our example, we’re creating product reviews, so we probably need a set of products to review, right? In the real world, if you’re running an e-commerce business, you’d probably want to use your own product dataset for this. However, for our toy example, we’ll use an Amazon Products sample dataset from Hugging Face. It works well for our purposes: it contains 33,000 products with associated product names, descriptions, and prices.
This will certainly help us get more product representation, but how about reviewer representation? For that, we can use a personas dataset, in this case our very own Synthetic Humans 50k dataset.
This dataset contains 50,000 personas sampled from actual US demographics, and contains qualitative descriptions of each persona.
For this example, let’s say we want to generate product reviews from 18-30 year olds living in popular US cities over all of the products in the Amazon Products dataset.
To do this, we’ll need to merge the two datasets, sampling a random persona for each product. We’ll then run our previous inference job over the merged dataset. Let’s do that now.
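The filter-and-merge step described above could look something like the sketch below. It uses tiny in-memory stand-ins rather than the real datasets, and every field name (`age`, `city`, `description`, and so on), the list of "popular" cities, and the prompt layout are assumptions made for illustration.

```python
import random

# Tiny stand-ins for the two seed datasets. All field names are hypothetical.
products = [
    {"name": "Trail Backpack", "description": "30L hiking pack", "price": 59.99},
    {"name": "Desk Lamp", "description": "LED lamp with dimmer", "price": 24.50},
    {"name": "Espresso Maker", "description": "Stovetop, 6 cups", "price": 34.00},
]
personas = [
    {"age": 24, "city": "Chicago", "description": "Grad student who bikes everywhere."},
    {"age": 45, "city": "Chicago", "description": "Accountant and weekend angler."},
    {"age": 29, "city": "New York", "description": "Line cook saving for a food truck."},
    {"age": 19, "city": "Boise", "description": "College freshman, avid gamer."},
]

# Illustrative choice of "popular" cities; the guide does not list them.
TARGET_CITIES = {"Chicago", "New York", "Los Angeles"}


def filter_personas(rows, min_age=18, max_age=30, cities=TARGET_CITIES):
    """Keep personas in the age range who live in one of the target cities."""
    return [r for r in rows if min_age <= r["age"] <= max_age and r["city"] in cities]


def merge_with_random_persona(product_rows, persona_rows, rng):
    """Pair every product with one persona drawn uniformly at random."""
    merged = []
    for product in product_rows:
        persona = rng.choice(persona_rows)
        merged.append({
            "product": product,
            "persona": persona,
            "prompt": (
                f"Reviewer: {persona['description']} "
                f"(age {persona['age']}, {persona['city']})\n"
                f"Product: {product['name']} - {product['description']} "
                f"(${product['price']:.2f})\n"
                "Write this reviewer's review of the product."
            ),
        })
    return merged


rng = random.Random(0)  # fixed only so the example is reproducible
eligible = filter_personas(personas)
merged_rows = merge_with_random_persona(products, eligible, rng)
print(merged_rows[0]["prompt"])
```

Each merged row carries both the product and the persona, so the resulting `prompt` values can be fed to the same inference job as before.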

Scaling Up
Scaling up to our 20,000 product reviews is dead simple! We just need to make a couple of changes to our code above.
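The exact changes are not reproduced here, so the following is only a guess at one plausible piece of them: drawing 20,000 rows from the roughly 33,000 merged product/persona rows. It also works out the per-review cost implied by the ~$2 total quoted in this guide. The row ids are stand-ins for the real merged rows.

```python
import random

TARGET_REVIEWS = 20_000
TOTAL_PRODUCTS = 33_000  # size of the product dataset quoted in this guide
BUDGET_USD = 2.00        # approximate total cost quoted in this guide

# Stand-in for the merged product/persona rows; only the row ids matter here.
merged_row_ids = list(range(TOTAL_PRODUCTS))

rng = random.Random(0)  # fixed only so the example is reproducible
# Sample without replacement so no product is reviewed twice.
selected_ids = rng.sample(merged_row_ids, TARGET_REVIEWS)

# Rough per-review cost implied by the quoted total.
cost_per_review = BUDGET_USD / TARGET_REVIEWS
print(f"{len(selected_ids)} rows selected, about ${cost_per_review:.4f} per review")
```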
Recap
In this example, we demonstrated how you can easily create synthetic data with LLMs using the Sutro Python SDK. Our final 20,000 product review dataset:
✅ Created using a few dozen lines of code.
✅ Representative of our underlying real-world data.
✅ Required zero infrastructure setup.
✅ In less than an hour.
✅ For less than $2!
With the Sutro Python SDK, you can easily create synthetic data with LLMs for your own use cases. Try it out today by requesting access to Sutro!
Addendum
If you want to generate even more variations, you can set the n sampling parameter, which will produce n samples for each input.
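To make the effect on dataset size concrete, here is a small sketch of how `n` multiplies the number of outputs. The `temperature` and `n` names come from this guide; the dictionary layout, the value 3, and the `expand_samples` helper are assumptions for illustration, not the SDK's documented interface.

```python
# Sampling settings as a plain dictionary; the layout is an assumption.
sampling_params = {"temperature": 1.2, "n": 3}


def expand_samples(inputs: list[str], n: int) -> list[tuple[int, int]]:
    """List the (input index, sample index) pairs a job with n samples yields."""
    return [(i, s) for i in range(len(inputs)) for s in range(n)]


inputs = ["prompt A", "prompt B"]
pairs = expand_samples(inputs, sampling_params["n"])
print(len(pairs), "outputs for", len(inputs), "inputs")
```

Keep the multiplier in mind when estimating cost, since `n` samples per input means `n` times as many generated reviews.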
