Time series data is a sequence of data points that are collected or recorded at intervals over a period of time. What makes a time series dataset unique is the sequence or order in which these data points occur. This ordering is vital to understanding any trends, patterns, or seasonal variations that may be present in the data.
In a time series, data points are often correlated and dependent on previous values in the series. For example, when a financial stock price moves every fraction of a second, its movements are based on previous positions and trends. Time series data becomes a valuable asset in predicting future values based on these past patterns, a process known as forecasting.
Time series forecasting employs specialized statistical techniques to effectively model and generate future predictions. It is commonly used in business, finance, environmental science, and many other areas for decision-making and strategic planning.
Time series data can be categorized in various ways, each with its own characteristics and analytical approaches.
When measurements are taken at regular intervals, these are known as time series metrics. Metrics are crucial for observing trends, detecting anomalies, and forecasting future values based on historical patterns.
This type of time series data is commonly seen in financial datasets, where stock prices are recorded at consistent intervals, or in environmental monitoring, where temperature, pressure, or humidity data is collected periodically.
Event-based time series data captures occurrences that happen at specific points in time, but not necessarily at regular intervals. While this data can still be aggregated into snapshots over regular reporting periods, in its raw form it represents a more complex series of related activities.
Examples include system logging in IT networks, where each entry records an event like a system error or a transaction. Electronic health records capture patient interactions with doctors, with medical devices capturing complex health telemetry over time. City-wide sensor networks capture the telemetry from millions of individual transport journeys, including bus, subway, and taxi routes.
Event-based data is vital to understanding the sequences and relationships between occurrences that help drive decision-making in cybersecurity, customer behavioral analysis, and many other domains.
Time series data can also be categorized based on how the patterns within the time series behave over time. Linear time series data is more straightforward to model and forecast, with consistent behavior from one time period to the next.
Stock prices are a classic example of a linear time series. The value of a company’s shares is recorded at regular intervals, reflecting the latest market valuation. Analyzing this data over extended periods helps investors make informed decisions about buying and selling stocks based on historical performance and predicted trends.
In contrast, non-linear time series data is often more complex, with changes that do not follow a predictable pattern. Such time series are often found in more dynamic systems when external factors force changes in behavior that may be short-lived.
For example, short-term demand modeling for public transport after an event or incident will likely follow a complex pattern that combines the time of day, geolocation information, and other factors, making reliable predictions more complicated. With IoT wearables for health, athletes are constantly monitored for early warning signals of injury or fatigue. These data points do not follow a traditional linear time series model; instead, they require a broader range of inputs to assess and predict areas of concern.
Capturing time series data around user interactions or consumer patterns produces behavioral datasets that can provide insights into habits, preferences, or individual decisions. Behavioral time series data is becoming increasingly important to social scientists, designers, and marketers to better understand and predict human behavior in various contexts.
From measuring whether daily yoga practice can impact device screen time habits to analyzing over 285 million user events from an eCommerce website, behavioral time series data can exist as either metrics- or event-based time series datasets.
Metrics-based behavioral analytics are widespread in financial services, where customer activity over an extended period is used to assess suitability for loans or other services. Event-based behavioral analytics are often deployed as prescriptive analytics against sequences of events that represent transactions, visits, clicks, or other actions.
Organizations use behavioral analytics at scale to provide customers visiting websites, applications, or even brick-and-mortar stores with a “next best action” that will add value to their experience.
Despite the immense growth of behavioral data captured through digital transformation and investment programs, there are still major challenges to driving value from this largely untapped data asset class.
Since behavioral data typically stores thousands of data points per customer, individuals are increasingly likely to be re-identified, resulting in privacy breaches. Legacy data anonymization techniques, such as data masking, fail to provide strong enough privacy controls or remove so much from the data that it loses its utility for analytics altogether.
Let’s explore some common examples of time series data from public sources.
The Federal Reserve Economic Data (FRED) platform, maintained by the Federal Reserve Bank of St. Louis, collects time series data related to populations, employment, socioeconomic indicators, and many other categories.
Some of FRED’s most popular time series datasets include:
Category | Source | Frequency | Data Since |
---|---|---|---|
Population | US Bureau of Economic Analysis | Monthly | 1959 |
Employment (Nonfarm Private Payroll) | Automatic Data Processing, Inc. | Weekly | 2010 |
National Accounts (Federal Debt) | US Department of the Treasury | Quarterly | 1966 |
Environmental (Jet Fuel CO2 Emissions) | US Energy Information Administration | Annually | 1973 |
Beyond socioeconomic and political indicators, time series data plays a critical role in the decision-making processes behind financial services, especially banking activities such as trading, asset management, and risk analysis.
Category | Source | Frequency | Data Since |
---|---|---|---|
Interest Rates (e.g., 3-Month Treasury Bill Secondary Market Rates) | Federal Reserve | Daily | 2018 |
Exchange Rates (e.g., USD to EUR Spot Exchange Rate) | Federal Reserve | Daily | 2018 |
Consumer Behavior (e.g., Large Bank Consumer Credit Card Balances) | Federal Reserve Bank of Philadelphia | Quarterly | 2012 |
Markets Data (e.g., commodities, futures, equities, etc.) | Bloomberg, Reuters, Refinitiv, and many others | Real-Time | N/A |
The website kaggle.com provides an extensive repository of publicly available datasets, many recorded as time series.
Category | Source | Frequency | Data Range |
---|---|---|---|
Environmental (Jena Climate Dataset) | Max Planck Institute for Biogeochemistry | Every 10 minutes | 2009-2016 |
Transportation (NYC Yellow Taxi Trip Data) | NYC Taxi & Limousine Commission (TLC) | Monthly updates, with individual trip records | 2009- |
Public Health (COVID-19) | World Health Organization | Daily | 2020- |
An emerging category of time series data relates to the growing use of Internet of Things (IoT) devices that capture and transmit information for storage and processing. IoT devices, such as smart energy meters, have become extremely popular in both industrial applications (e.g., manufacturing sensors) and commercial settings.
Category | Source | Frequency | Data Range |
---|---|---|---|
IoT Consumer Energy (Smart Meter Telemetry) | Jaganadh Gopinadhan (Kaggle) | Minute | 12-month period |
IoT Temperature Measurements | Atul Anand (Kaggle) | Second | 12-month period |
Once time series data has been captured, there are several popular options for storing, processing, and querying these datasets using standard components in a modern data stack or via more specialist technologies.
Storing time series data in file formats like CSV, JSON, and XML is common due to their simplicity and broad compatibility. CSV files in particular are ideal for smaller datasets, where ease of use and portability are critical.
Formats such as Parquet have become increasingly popular for storing large-scale time series datasets, offering efficient compression and high performance for analysis. However, Parquet can be more complex and resource-intensive than simpler file formats, and managing large numbers of Parquet files, especially in a rapidly changing time series context, can become challenging.
When more complex data structures are involved, JSON and XML formats provide a structured way to store time series data, complete with associated metadata, especially when using APIs to transfer information between systems. JSON and XML typically require additional processing to “flatten” the data for analysis and are not ideal for large datasets.
For most time series stored in files, it’s recommended to use the more straightforward CSV format where possible, switching to Parquet when data volumes affect storage efficiency and read/write speeds, typically at the gigabyte or terabyte scale. Likewise, a synthetically generated time series can be easily exported to tabular CSV or Parquet format for downstream analysis in various tools.
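As a brief sketch (assuming pandas with the pyarrow engine installed, and a hypothetical smart-meter dataset), the same series can be written to either format:

```python
import pandas as pd

# Hypothetical sensor readings recorded every minute
readings = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1_000, freq="min"),
    "sensor_id": "meter-001",
    "kwh": 0.25,
})

# CSV: simple and portable, well suited to smaller datasets
readings.to_csv("readings.csv", index=False)

# Parquet: columnar and compressed, better at gigabyte/terabyte scale
# (requires pyarrow or fastparquet)
readings.to_parquet("readings.parquet", index=False)

# Reading back for downstream analysis
df = pd.read_parquet("readings.parquet")
```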
Dedicated time series databases, such as kdb+ from KX (formerly Kx Systems), are specifically designed to manage and analyze sequences of data points indexed over time. These databases are optimized for handling large volumes of data that are constantly changing or being updated, making them ideal for applications such as high-frequency trading in financial markets, IoT sensor data, or real-time monitoring.
Graph databases like Neo4j offer a unique approach to storing time series data by representing it as a network of interconnected nodes and relationships. Graph databases allow for the modeling of complex relationships, providing insights that might be difficult to extract from traditional relational data models.
The ability to explore relationships efficiently in graph databases makes them suitable for analyses that require a deep understanding of interactions over time, adding a rich layer of context to the time series data.
For example, Neo4j can represent events in a “TimeTree” graph data model for risk and compliance analysis, making it possible to explore emails sent at different times to different parties, along with any associated events from that period.
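As a rough sketch of the idea, rather than the exact model referenced above, the snippet below uses the official Neo4j Python driver to attach email events to a simple year/month/day time tree; the connection details, labels, and properties are illustrative assumptions:

```python
from neo4j import GraphDatabase

# Hypothetical connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Attach an email event to a year -> month -> day "TimeTree"
CREATE_EVENT = """
MERGE (y:Year {value: $year})
MERGE (y)-[:HAS_MONTH]->(m:Month {value: $month})
MERGE (m)-[:HAS_DAY]->(d:Day {value: $day})
CREATE (e:Email {subject: $subject, sender: $sender, recipient: $recipient})
CREATE (e)-[:SENT_ON]->(d)
"""

def add_email_event(tx, year, month, day, subject, sender, recipient):
    tx.run(CREATE_EVENT, year=year, month=month, day=day,
           subject=subject, sender=sender, recipient=recipient)

with driver.session() as session:
    session.execute_write(add_email_event, 2024, 3, 15,
                          "Quarterly risk review",
                          "alice@example.com", "bob@example.com")
driver.close()
```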
For decades, relational database management systems (RDBMS) such as Postgres, along with more recent cloud data warehouses like Snowflake and Redshift, have been used to store, process, and analyze time series data. One of the most popular relational data models for time series analysis is the star schema, where a central fact table (containing the time series data such as events, transactions, or behaviors) is connected to several dimension tables (e.g., customer, store, product) that provide rich analytical context.
By capturing events at a granular level, the time series data can be sliced and diced in many different ways, giving analysts a great deal of flexibility to answer questions and explore business performance. Usually, a date dimension table contains all the relevant context for a time series analysis, with attributes such as day of the week, month, and quarter, as well as valuable references to prior periods for comparison.
In a well-designed star schema model, the number of dimensions associated with a transactional fact table generally ranges between six and 15. These dimensions, which provide the contextual details necessary to understand and analyze the facts, depend on the specific analysis needs and the complexity of the business domain. MOSTLY AI can generate highly realistic synthetic data that fully retains the correlations from the original dimensions and fact tables across star schema data models with three or more entities.
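As a minimal sketch (with hypothetical table and column names, using pandas in place of a SQL engine), joining a sales fact table to a date dimension and slicing revenue by quarter looks like this:

```python
import pandas as pd

# Hypothetical star schema tables
fact_sales = pd.DataFrame({
    "date_key": [20240105, 20240214, 20240420],
    "customer_key": [1, 2, 1],
    "revenue": [120.0, 80.0, 200.0],
})
dim_date = pd.DataFrame({
    "date_key": [20240105, 20240214, 20240420],
    "calendar_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-04-20"]),
    "day_of_week": ["Friday", "Wednesday", "Saturday"],
    "quarter": ["2024-Q1", "2024-Q1", "2024-Q2"],
})

# Join the fact table to the date dimension, then slice by quarter
sales = fact_sales.merge(dim_date, on="date_key", how="left")
revenue_by_quarter = sales.groupby("quarter")["revenue"].sum()
print(revenue_by_quarter)
```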
Before analyzing a time series, there are several essential terms and concepts to review.
A trend is a long-term value increase or decrease within a time series. Trends do not have to be linear and may reverse direction over time.
Seasonality is a pattern that occurs in a time series dataset at a fixed interval, such as the time of year or day of the week. Most commonly associated with physical properties such as temperature or rainfall, seasonality is also applied to consumer behavior driven by public holidays or promotional events.
Data retention over extended periods allows analysts to observe long-term patterns and variations. This historical perspective is essential for distinguishing between one-time anomalies and consistent seasonal fluctuations, providing valuable insights for forecasting and strategic planning.
A cyclic pattern occurs when observations rise and fall at non-fixed frequencies. Cycles often span multiple years, and their duration cannot always be determined in advance.
The final component of a time series is random noise: whatever remains once any trends, seasonality, or cyclic signals have been accounted for. A time series that contains too much random noise will be challenging to forecast or analyze.
Once a time series dataset has been collected, it is vital to ensure there are no missing dates within the sequence. Review the granularity of the dataset and impute any missing elements to produce a continuous sequence. The imputation approach will vary depending on the dataset, but a common technique is to fill gaps with an average or interpolated value based on the nearest data points.
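A minimal sketch of this step with pandas, assuming a daily series with a couple of missing dates, might look like this:

```python
import pandas as pd

# Hypothetical daily series with two missing dates
ts = pd.Series(
    [10.0, 12.0, 11.0, 14.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]),
)

# Re-index to a complete daily range, exposing the gaps as NaN
ts = ts.asfreq("D")

# Fill gaps by interpolating between the nearest observed values
ts_filled = ts.interpolate(method="time")
print(ts_filled)
```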
The next step in time series analysis is to explore different univariate plots of the data to determine how to develop a forecasting model.
A time series plot can help assess whether the original time series data needs to be transformed or whether any outliers are present.
A seasonal plot helps analysts explore whether seasonality exists within the dataset, its frequency, and cyclic behaviors.
A trend analysis explores the magnitude of the change identified over the course of the time series and is used in conjunction with the seasonal plot to highlight areas of interest in the data.
Finally, a residual analysis shows any information remaining once seasonality and trend have been taken into account.
Time series decomposition plots of this type are available in most data science environments, including R and Python.
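In Python, for instance, statsmodels offers a classical decomposition that produces the trend, seasonal, and residual components described above; the monthly series below is generated purely for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Illustrative monthly series: trend + yearly seasonality + noise
idx = pd.date_range("2018-01-01", periods=72, freq="MS")
values = (
    np.linspace(100, 160, 72)                       # upward trend
    + 10 * np.sin(2 * np.pi * np.arange(72) / 12)   # yearly seasonality
    + np.random.default_rng(0).normal(0, 2, 72)     # random noise
)
series = pd.Series(values, index=idx)

# Decompose into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
result.plot()  # requires matplotlib for plotting
```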
As explored previously, time series records have strong relationships with previous points in the data. The strength of these relationships can be measured through a statistical tool called autocorrelation.
An autocorrelation function (ACF) measures how much current data points in a time series are correlated to previous ones over different periods. It’s a method to understand how past values in the series influence current values.
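As a quick illustration (assuming statsmodels is available and using a synthetic daily series purely for demonstration), the ACF can be computed and plotted like this:

```python
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf

# Illustrative daily series with a weekly pattern
idx = pd.date_range("2024-01-01", periods=200, freq="D")
series = pd.Series(
    5 * np.sin(2 * np.pi * np.arange(200) / 7)
    + np.random.default_rng(1).normal(0, 1, 200),
    index=idx,
)

# Autocorrelation values for the first 14 lags
print(acf(series, nlags=14))

# Correlogram: spikes at lags 7 and 14 reveal the weekly pattern
plot_acf(series, lags=28)  # requires matplotlib for plotting
```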
When generating synthetic data, it’s important to preserve these underlying patterns and correlations. Accurate synthetic datasets can mimic these patterns, successfully retaining both the statistical properties as well as the time-lagged behavior of the original time series.
Once the exploration of a time series is complete, analysts can use their findings to build predictive models against the dataset to forecast future values.
ARIMA (AutoRegressive Integrated Moving Average) is a popular statistical method that is effective for time series data showing patterns or trends over time. It combines three key components: an autoregressive (AR) term that models the dependence of current values on past values, an integrated (I) term that differences the series to remove trends and make it stationary, and a moving average (MA) term that models the dependence on past forecast errors.
An alternative approach is the Error, Trend, Seasonality (ETS) family of models, which decomposes a time series into its error, trend, and seasonal components to predict future values.
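As a rough sketch (assuming statsmodels, and using an illustrative monthly series rather than real data), both model families can be fitted and used to forecast the next twelve months:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative monthly series with trend and yearly seasonality
idx = pd.date_range("2018-01-01", periods=72, freq="MS")
series = pd.Series(
    np.linspace(100, 160, 72)
    + 10 * np.sin(2 * np.pi * np.arange(72) / 12)
    + np.random.default_rng(0).normal(0, 2, 72),
    index=idx,
)

# ARIMA(p, d, q): autoregressive order 1, first differencing, MA order 1
arima_fit = ARIMA(series, order=(1, 1, 1)).fit()
arima_forecast = arima_fit.forecast(steps=12)

# ETS-style exponential smoothing with additive trend and seasonality
ets_fit = ExponentialSmoothing(
    series, trend="add", seasonal="add", seasonal_periods=12
).fit()
ets_forecast = ets_fit.forecast(12)

print(arima_forecast.head())
print(ets_forecast.head())
```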
Once one or more models have been created, their forecasts can be visualized alongside historical data to inspect how closely they follow the pattern of the existing time series data.
A quantitative approach to comparing time series forecasting models often employs either the AIC (Akaike Information Criterion) or the AICc (corrected Akaike Information Criterion), defined as follows:
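With k estimated parameters, a maximized likelihood of L̂, and n observations, the two criteria are commonly written as:

```latex
\mathrm{AIC}  = 2k - 2\ln(\hat{L}), \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}
```

Lower values indicate a better trade-off between goodness of fit and model complexity, with the AICc correction becoming important for small sample sizes.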
Modern approaches to anonymization, such as synthetic data, offer a solution to the privacy concerns described earlier. The synthesization process involves a series of steps designed to ensure that the resulting synthetically generated time series data retains the statistical properties of the original data while protecting individual privacy.
Synthesizing time series data makes a lot of sense when dealing with behavioral data, which is notoriously difficult to anonymize. Understanding the key concepts of data subjects is a crucial step in learning how to generate synthetic data in a privacy-preserving manner.
A subject is an entity or individual whose privacy you will protect. Behavioral event data must be prepared in advance so that each subject in the dataset (e.g., a customer, website visitor, hospital patient, etc.) is stored in a dedicated table, each with a unique row identifier. These subjects can have additional reference information stored in separate columns, including attributes that ideally don’t change during the captured events.
For data practitioners, the concept of the subject table is similar to a “dimension” table in a data warehouse, where common attributes related to the subjects are provided for context and further analysis.
The behavioral event data is prepared and stored in a separate linked table referencing a unique subject. In this way, one subject will have zero, one, or (likely) many events captured in this linked table.
Records in the linked table must be pre-sorted in chronological order for each subject to capture the time-sensitive nature of the original data. This model suits various types of event-based data, including insurance claims, patient health, eCommerce, and financial transactions.
In the example of a customer journey, our tables may look like this.
We see customers stored in our subject table with their associated demographic attributes.
ID | ZONE | STATE | GENDER | AGE_CAT | AGE |
---|---|---|---|---|---|
1 | Pacific | Oregon | M | Young | 26 |
2 | Eastern | New Jersey | M | Medium | 36 |
3 | Central | Minnesota | M | Young | 17 |
4 | Eastern | Michigan | M | Medium | 56 |
5 | Eastern | New Jersey | M | Medium | 46 |
6 | Mountain | New Mexico | M | Medium | 35 |
In the corresponding linked table, we have captured events relating to the purchasing behavior of each of our subjects.
USER_ID | DATE | NUM_CDS | AMT |
---|---|---|---|
1 | 1997-01-01 | 1 | 11.77 |
2 | 1997-01-12 | 1 | 12 |
2 | 1997-01-12 | 5 | 77 |
In this example, user 1 visited the website on January 1st, 1997, purchasing 1 CD for $11.77. User 2 visited the website twice on January 12th, 1997, purchasing six CDs across those visits for a total of $89.
These consumer buying behaviors can be aggregated into standard metrics-based time series, such as purchases per week, month, or quarter, revealing general buying trends over time. Alternatively, the behavioral data in the linked table can be treated as discrete purchasing events happening at specific intervals in time.
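For illustration, the linked purchase events above can be rolled up into a weekly metrics-based series, or kept as per-subject event sequences, with a few lines of pandas (column names follow the hypothetical tables shown earlier):

```python
import pandas as pd

# Linked table of purchase events (as in the example above)
purchases = pd.DataFrame({
    "USER_ID": [1, 2, 2],
    "DATE": pd.to_datetime(["1997-01-01", "1997-01-12", "1997-01-12"]),
    "NUM_CDS": [1, 1, 5],
    "AMT": [11.77, 12.0, 77.0],
})

# Metrics-based view: total spend and CDs purchased per week
weekly = (
    purchases.set_index("DATE")
    .resample("W")[["NUM_CDS", "AMT"]]
    .sum()
)

# Event-based view: per-subject sequences ordered chronologically
sequences = purchases.sort_values(["USER_ID", "DATE"]).groupby("USER_ID")

print(weekly)
print(sequences.size())
```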
Customer-centric organizations obsess over the behaviors that drive revenue and retention, looking beyond simple statistics. Analysts constantly ask questions about customer return rates, spending habits, and overall customer lifetime value.
Defining the relationship between customers and their purchases is an essential first step in synthetic data modeling. Ensuring that primary and foreign keys are identified between subject and linked tables enables synthetic data generation platforms to understand the context of each behavioral record (e.g., purchases) in terms of the subject (e.g., customers).
Additional configurations, such as smart imputation, dataset rebalancing, or rare category protection, can be defined at this stage.
A time series sequence refers to a captured set of data over time for a subject within the dataset. For synthetic data models, generating the next element in a sequence given a previous set of features is a critical capability known as sequence continuation.
Defining sequence lengths in synthetic data models involves specifying the number of time steps or data points to be considered in each sequence within the dataset. This decision determines how much historical data the synthetic model will use to predict or generate the next element in the sequence.
For instance, if you're working with daily store revenue data and set a sequence length of 30, the model will use the data from the past 30 days to predict or generate the store revenue for the 31st day.
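A minimal sketch of that windowing step, using a hypothetical daily revenue series, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical daily store revenue
revenue = pd.Series(
    np.random.default_rng(42).normal(1000, 100, 365),
    index=pd.date_range("2023-01-01", periods=365, freq="D"),
)

SEQ_LEN = 30  # use the previous 30 days as context

# Build (input window, next value) pairs for sequence continuation
windows, targets = [], []
for i in range(len(revenue) - SEQ_LEN):
    windows.append(revenue.iloc[i : i + SEQ_LEN].to_numpy())
    targets.append(revenue.iloc[i + SEQ_LEN])

X = np.stack(windows)   # shape: (num_samples, 30)
y = np.array(targets)   # revenue on the "31st day" of each window
print(X.shape, y.shape)
```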
The choice of sequence length depends significantly on the nature of the data and the specific application. Longer sequence lengths can capture more long-term patterns and dependencies but require more computational resources and may be less responsive to recent changes. Conversely, a shorter sequence length is more sensitive to recent trends but might overlook longer-term patterns.
In synthetic modeling, selecting a sequence length that strikes a balance between capturing sufficient historical or behavioral context and maintaining computational efficiency and performance is essential.
Synthetic data generation can produce realistic and representative behavioral time series data that mimics the original distribution found in the source data without the possibility of re-identification. With privacy-safe behavioral data, it’s possible to democratize access to datasets such as these, developing more sophisticated behavioral models and deeper insights beyond basic metrics, “average” customers, and crude segmentation methods.
Synthetic data is quickly becoming a critical tool for organizations to unlock the value of sensitive customer data while keeping the privacy of their customers protected and in compliance with data protection regulations such as GDPR and CCPA. It can be generated quickly in abundance and has been proven to drastically improve machine learning performance. As a result, it is often used for advanced analytics and AI training, such as predictive algorithms, fraud detection and pricing models.
According to Gartner, by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.
MOSTLY AI pioneered the creation of synthetic data for AI model development and software testing. With things moving so quickly in this space, here are four trends that we see shaping AI and synthetic data in 2022:
Most of the machine learning and AI algorithms currently in production, interacting with customers and making decisions about people, have never been audited for fairness and discrimination, and their training data has never been augmented to fix embedded biases. It is often only through major scandals that companies find out, learning the hard way that they need to pay more attention to biased data and to use fair synthetic data instead.
Regulations all over the world are getting stricter every day; many countries now have a personal data protection policy in place. Using customer data is also getting increasingly difficult for other reasons - people are more privacy-conscious and increasingly likely to refuse consent to the use of their data for analytics purposes. As a result, companies are running out of relevant and usable data assets, and more of them will come to see synthetic data as the way out of this dilemma.
Synthetic data is better than real data when it comes to AI training, and it can be shared freely across teams and organizations. AI and machine learning algorithms simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data, picking up on patterns more efficiently without overfitting.
Not all synthetic data is created equal. To start with, there is a world of difference between what we call structured and unstructured synthetic data. Unstructured data means, for example, images and text, while structured data is mainly tabular in nature. There are many open-source and proprietary synthetic data providers for both kinds of synthetic data, and the quality of their generators varies widely. It’s high time to establish a synthetic data standard to make sure that synthetic data users get consistently high-quality results. We are already working on standards for structured synthetic data.
2021 passed in the blink of an eye, yet MOSTLY AI can look back with pride on a revolutionary year of extraordinary achievements. While we are already excited about what 2022 holds for us, we are taking a step back to look at the highlights and major milestones we accomplished in 2021.
Our developers had a busy start to the year with the new upgrade of our category-leading synthetic data generator, MOSTLY AI 1.5. Alongside many shiny new features, the big buzz was about our synthetic data generator now supporting the synthesis of geolocation data with latitude and longitude encoding types. Say goodbye to harmful digital footprints and hello to privacy-safe synthetic geodata!
This was not enough for our very ambitious team; so in the second half of the year, they pushed the boundaries even further by truly revolutionizing software testing. With this new version of our platform, MOSTLY AI 2.0 became the first synthetic data platform that can automatically synthesize complex data structures, making it ideal for software testing. By expanding the capabilities to multi-table data structures, MOSTLY AI now enables anyone – not just data scientists – to create synthetic data from databases, automatically. This improves security and compliance and accelerates time to data. Our team truly deserves a toast for this!
We’ll be soon celebrating the first birthday of “The Data Democratization Podcast”, which we started back in January 2021. With over 2000 downloads in 2021, the podcast was an absolute hit! Our listeners had the opportunity to get so many insights from knowledgeable AI and privacy experts working in top-notch companies who shared their experiences, advice, and real-life case studies. We are entering the new year with even more enthusiasm and are preparing some special surprise guests for you. Stay tuned!
In 2021 we also launched our professional services and training program intended to help create the next generation of synthetic data superusers within enterprises. Several clients have already leveraged this first-of-its-kind program to kickstart their synthetic data journeys, with very positive results. As synthetic data pioneers, we have the most experienced team in the world. Our top engineers, architects, consultants, and data scientists have seen it all. They know what makes or breaks a company's synthetic data adoption, no matter the use case. From scaling ethical and explainable AI to providing on-demand, privacy-safe test data, the know-how is here.
Despite COVID-19, we managed to attend multiple conferences. While most of them happened virtually, we participated in Slush 2021 in person! Our Co-Founder & Chief Strategy Officer Michael Platzer rocked the stage presenting at this year's event in Helsinki, Finland. We are proud to have been invited to present our synthetic data solution to the world and - while staying safe - connect and exchange ideas with some of the most brilliant minds.
With data privacy and information security at the heart of everything we do, our efforts to ensure the privacy and integrity of our customers’ sensitive data by following strict security policies and procedures were officially recognized this year. In March, we received the SOC 2 Type 2 certification, an audit report capturing how a company safeguards customer data and how well its internal controls are operating. Later, in November, we were awarded the ISO 27001 certification, a globally recognized information security standard.
Thanks to both SOC2 and ISO certifications, our customers and partners can now speed up vetting processes and immediately get a clear picture of the advanced level of information security standards we hold.
All this wouldn’t be possible without MOSTLY AI’s most important asset – our team (or Mostlies as we like to call them). In 2021, we welcomed quite a few new Mostlies to the team - amongst them new executives to strengthen our product, marketing and sales activities.
The first one to join the team this year was Andreas Ponikiewicz as our Vice President of Global Sales, who took the lead for MOSTLY AI's international sales team across Europe, North America and Asia and has brought our communication with the clients to the next level. Shortly afterward, we welcomed our new CTO, Kerem Erdem, onboard. As a true captain, he is leading us on the way to accelerate our tech performance and enable organizations to thrive in an ethical, responsible way with smart and safe synthetic data. To help get the word out, in early May, Sabine Klisch joined the team as VP Global Marketing and is now leading our creative marketing team on our journey to position MOSTLY AI as the global leader for smart synthetic data. And to spice up the story even more, we have added a special Italian ingredient – Mario Scriminaci, our new CPO who is making sure our synthetic data platform is the number one solution and provides our customers with better-than-real data.
As already mentioned, Mostlies are the most important part of MOSTLY AI and it seems we are doing something right since we made it to the top 3 of Great Place to Work and received Austria's Best Employers 2021 award.
The MOSTLY AI team is truly diverse, with more than 15 different nationalities represented. Almost 40 members strong, we are organized into several teams, including data science, engineering, product, marketing, sales, and operations. The majority of us are based at our headquarters in Vienna, but an increasing number are working remotely, spread across the entire world. What started as a necessity because of COVID-19 has now become an integral part of our company culture.
Looking back, we can say this year has exceeded our expectations by far. One team of devoted professionals all united with the same vision – to empower people with data and build a smarter and fairer future together.
What’s next? 2022 is said to be the year of synthetic data. According to Gartner, by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. 2022 will also be the year of MOSTLY AI and we will have exciting news to share with you very soon.
If you want to stay in the loop with the latest developments in the world of synthetic data, join our newsletter group for monthly updates.