You may find yourself in a situation where you’d like to generate mock data, for example when writing a blog post series on taking a pipeline and model to production. Luckily, numpy and pandas make this task incredibly easy.
For my use case, I wanted to generate a Pandas DataFrame with one independent column, temperature_celsius, and one dependent column, ice_cream_sales_euro. The goal was to make a data set where temperature_celsius would affect ice_cream_sales_euro.
On the first attempt, the relationship between the two columns turned out to be too easy to recover, so we generate some noise and add it in. This introduces variance that temperature alone cannot explain.
First we’ll import our required packages:
import datetime
import pandas as pd
import numpy as np
It’s a good idea to set a random seed so that our code is reproducible.
# Call the function; assigning to np.random.seed would overwrite it without seeding anything
np.random.seed(42)
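To check that seeding behaves as expected, here is a minimal sketch that draws the same numbers twice. The seed has to be passed as a function call: assigning a value to np.random.seed replaces the function and seeds nothing. The second half uses np.random.default_rng, numpy's newer Generator interface, which this post does not otherwise use; it is shown only as an alternative, and the variable names are illustrative.

```python
import numpy as np

# Seed the global generator, draw, then reseed and draw again.
np.random.seed(42)
first_draw = np.random.uniform(-1, 3, size=5)

np.random.seed(42)
second_draw = np.random.uniform(-1, 3, size=5)

# Same seed, same call sequence: the two arrays are identical.
print(np.array_equal(first_draw, second_draw))

# Alternative (not used in this post): a local Generator object keeps the
# random state out of numpy's global namespace.
rng = np.random.default_rng(42)
rng_draw = rng.uniform(-1, 3, size=5)
print(rng_draw)
```

A local generator is handy when several parts of a script need independent, reproducible streams.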
Now we can create our index, which holds one date (year, month and day) per row, and list our column names.
how_many_days = 365
today = datetime.date.today()
# Start how_many_days ago and generate exactly how_many_days daily entries
index = pd.date_range(today - datetime.timedelta(days=how_many_days), periods=how_many_days, freq='D')
columns = ["temperature_celsius", "ice_cream_sales_euro"]
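A quick way to sanity-check the index is to look at its length and end points. The sketch below rebuilds the index on its own; start_date is a name introduced here for illustration. Because the range starts 365 days ago and contains 365 daily entries, it ends on the day before today.

```python
import datetime

import pandas as pd

how_many_days = 365

# start_date is an illustrative name, not part of the script in this post.
start_date = datetime.date.today() - datetime.timedelta(days=how_many_days)
index = pd.date_range(start_date, periods=how_many_days, freq="D")

print(len(index))        # number of rows the DataFrame will have
print(index[0].date())   # first day in the range
print(index[-1].date())  # last day in the range (the day before today)
```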
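Now we can create the data for our temperature_celsius column and the noise we mentioned above. The temperature is a linear trend of 0.1 degrees per day plus a uniform random offset between -1 and 3 degrees, and the noise is drawn from a normal distribution with a mean of 25000 and a standard deviation of 10000.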
temperature_celsius_data_x = np.arange(how_many_days)
temperature_celsius_data_delta = np.random.uniform(-1, 3, size=(how_many_days,))
temperature_celsius_data = (.1 * temperature_celsius_data_x) + temperature_celsius_data_delta
noise_data = np.random.normal(loc=25000, scale=10000, size=(how_many_days,))
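It can help to look at the ranges these formulas produce before using them. The sketch below recomputes the same quantities with shorter, illustrative names and a local generator (an assumption made so the example is self-contained), then prints the temperature bounds and the noise statistics.

```python
import numpy as np

# Assumption: a local Generator instead of the global seed used in the post.
rng = np.random.default_rng(42)
how_many_days = 365

trend = 0.1 * np.arange(how_many_days)            # 0.0 on day 0 up to 36.4 on day 364
delta = rng.uniform(-1, 3, size=how_many_days)    # daily offset in [-1, 3)
temperature = trend + delta

noise = rng.normal(loc=25000, scale=10000, size=how_many_days)

# Temperature stays between -1 and 39.4 by construction.
print(temperature.min(), temperature.max())
# The sample mean and standard deviation should land near 25000 and 10000.
print(noise.mean(), noise.std())
```

Note that the temperature rises steadily over the year rather than following a seasonal curve, which is fine for a mock data set but worth knowing when plotting it.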
Now we can combine the temperature and the noise to create our ice_cream_sales_euro data. Every degree adds 1200 euro of sales on top of the noise.
ice_cream_sales_euro_data = (temperature_celsius_data * 1200) + noise_data
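To see whether the noise hides the relationship enough, we can measure the correlation and fit a straight line. This is a rough sketch with shortened, illustrative names and a local generator (an assumption); np.corrcoef and np.polyfit are used here only as a quick check, not as part of the data generation.

```python
import numpy as np

# Assumption: a local Generator so the example runs on its own.
rng = np.random.default_rng(42)
how_many_days = 365

temperature = 0.1 * np.arange(how_many_days) + rng.uniform(-1, 3, size=how_many_days)
noise = rng.normal(loc=25000, scale=10000, size=how_many_days)
sales = temperature * 1200 + noise

# Pearson correlation between the two columns.
correlation = np.corrcoef(temperature, sales)[0, 1]

# Least-squares line: the slope estimates the euro-per-degree factor,
# the intercept estimates the mean of the noise.
slope, intercept = np.polyfit(temperature, sales, deg=1)

print(f"correlation: {correlation:.2f}")
print(f"slope: {slope:.0f}, intercept: {intercept:.0f}")
```

With these parameters the correlation should come out clearly positive but well below 1. Raising the scale of the noise weakens the relationship, and lowering it makes the relationship easier to recover.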
At this point we’re ready to stack both columns into a single numpy array, wrap it in a Pandas DataFrame and save that to a CSV file.
data = np.array([temperature_celsius_data, ice_cream_sales_euro_data]).T
df = pd.DataFrame(data, index=index, columns=columns)
df.to_csv("ice_cream_shop.csv")
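When loading the file again, the date index comes back as plain text unless we ask pandas to parse it. The sketch below shows one way to do the round trip, using a tiny stand-in DataFrame with made-up values and a temporary directory so that it does not touch the real file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A tiny stand-in frame with the same column names (values are made up).
index = pd.date_range("2024-01-01", periods=3, freq="D")
df = pd.DataFrame(
    {
        "temperature_celsius": [1.5, 2.25, 3.0],
        "ice_cream_sales_euro": [26800.0, 27700.5, 28600.0],
    },
    index=index,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "ice_cream_shop.csv")
    df.to_csv(csv_path)
    # index_col=0 restores the first CSV column as the index;
    # parse_dates=True converts that index back into datetimes.
    loaded = pd.read_csv(csv_path, index_col=0, parse_dates=True)

print(loaded.dtypes)
print(loaded.index)
```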
And the full script:
import datetime
import pandas as pd
import numpy as np
np.random.seed(42)
how_many_days = 365
today = datetime.date.today()
index = pd.date_range(today - datetime.timedelta(days=how_many_days), periods=how_many_days, freq='D')
columns = ["temperature_celsius", "ice_cream_sales_euro"]
temperature_celsius_data_x = np.arange(how_many_days)
temperature_celsius_data_delta = np.random.uniform(-1, 3, size=(how_many_days,))
temperature_celsius_data = (.1 * temperature_celsius_data_x) + temperature_celsius_data_delta
noise_data = np.random.normal(loc=25000, scale=10000, size=(how_many_days,))
ice_cream_sales_euro_data = (temperature_celsius_data * 1200) + noise_data
data = np.array([temperature_celsius_data, ice_cream_sales_euro_data]).T
df = pd.DataFrame(data, index=index, columns=columns)
df.to_csv("ice_cream_shop.csv")
Additional Approaches
One could also use a mock-data library, such as the popular Faker package.