Master’s thesis experience in data science

This article outlines key aspects of my Master’s thesis, “Inventory Planning with Time Series Forecasting at Scale,” which I presented for the Master’s in CSS at Universidad Carlos III de Madrid.

Supply chains became a fascinating subject for me after I advised numerous companies seeking to improve their efficiency, especially in the wake of one of the largest disruptions to affect Global Value Chains: the upheaval caused between 2020 and 2022 by Covid-19 and the Russia-Ukraine war. This disruption significantly interrupted global production, distribution, and transportation processes.

This context sparked my interest in the topic. After studying the supply chain domain, I set out to address one of its primary questions with Data Science techniques: how many units should be stocked for the following month? Good inventory planning is crucial to avoid stockouts, reduce overinvestment (ordering more than necessary), and make better use of storage. A lean supply process, in turn, helps reduce operational costs.

Case Study

The case study was a Spanish home store that manages an inventory of hundreds of references (products), which calls for forecasting at scale (many products at once). What were the challenges?

The forecast had to cover an entire year of sales/stock.
The prediction had to encompass more than 600 products, not just one.
The analysis had to draw on 12 years’ worth of historical sales data.

Time Series Usage

A demand forecast translates directly into the inventory stock required. Producing one is not trivial, because constructing explanatory variables for demand behavior is an almost impossible task. Hence, I opted for time series analysis. This technique analyzes the historical behavior of sales (the target variable) and identifies trends, seasonality, cyclic fluctuations, and random movements. From these components, future values of the variable can be constructed with practically no information other than the sales variable itself.
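To make the idea of time series components more concrete, here is a minimal sketch of a classical additive decomposition in plain Python. It is not the code from the thesis: the function name `decompose_additive`, the synthetic monthly series, and the 12-month period are all assumptions chosen for illustration, and cyclic fluctuations are not modeled separately.

```python
from statistics import mean


def decompose_additive(y, period):
    """Split y into trend, seasonal indices and residuals (y = T + S + R)."""
    n = len(y)
    half = period // 2
    trend = [None] * n  # undefined at the edges, where the window does not fit
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        if period % 2 == 0:
            # Centered moving average: the two end points get half weight.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values by position within the season.
    raw = []
    for k in range(period):
        values = [y[t] - trend[t] for t in range(n)
                  if trend[t] is not None and t % period == k]
        raw.append(mean(values))
    offset = mean(raw)
    seasonal = [r - offset for r in raw]  # force the indices to sum to zero

    residual = [None if trend[t] is None
                else y[t] - trend[t] - seasonal[t % period]
                for t in range(n)]
    return trend, seasonal, residual


# Hypothetical data: three years of monthly sales with a linear trend
# and a fixed seasonal pattern that sums to zero.
pattern = [5, -3, 8, -10, 0, 2, -4, 6, -1, -2, 3, -4]
sales = [100 + 2 * t + pattern[t % 12] for t in range(36)]

trend, seasonal, residual = decompose_additive(sales, period=12)
```

On this noise-free series the seasonal indices match the pattern that generated the data, and the residuals are essentially zero. Real sales data would leave a non-trivial residual component.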

Models and StatsForecast Pipeline

Upon analyzing the company’s portfolio, I observed high heterogeneity across its time series. Naturally, products within a company exhibit very different behaviors, although some similarities exist. This led me to apply a variety of models to each reference.

I employed basic models such as the mean, random walk, and window average; medium- and advanced-complexity models such as Autoregressive Integrated Moving Average (ARIMA), Exponential Smoothing (ETS), Theta models, and Prophet (a time series model developed by Facebook); and intermittent demand models such as the Aggregate-Disaggregate Intermittent Demand Approach (ADIDA), the Intermittent Multiple Aggregation Prediction Algorithm (IMAPA), and Croston Classic.
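As an illustration of the simplest end of this range, the sketch below implements the three basic models and a textbook version of Croston's method in plain Python. These are not the StatsForecast implementations. The function names, the smoothing parameter `alpha=0.1`, the initialization of Croston's method, and the sample data are assumptions made for the example.

```python
from statistics import mean


def historic_average(y, h):
    """Mean model: repeat the average of all past observations."""
    return [mean(y)] * h


def naive(y, h):
    """Random walk forecast: repeat the last observed value."""
    return [y[-1]] * h


def window_average(y, h, window):
    """Repeat the average of the last `window` observations."""
    return [mean(y[-window:])] * h


def croston_classic(y, h, alpha=0.1):
    """Croston's method for intermittent demand.

    Non-zero demand sizes and the intervals between them are smoothed
    separately; the forecast is smoothed size / smoothed interval.
    """
    sizes, intervals = [], []
    gap = 0
    for value in y:
        gap += 1
        if value > 0:
            sizes.append(value)
            intervals.append(gap)
            gap = 0
    if not sizes:  # no demand observed at all
        return [0.0] * h
    z, p = sizes[0], intervals[0]  # assumed initialization: first observations
    for size, interval in zip(sizes[1:], intervals[1:]):
        z = alpha * size + (1 - alpha) * z
        p = alpha * interval + (1 - alpha) * p
    return [z / p] * h


# Hypothetical monthly sales for a regular and an intermittent product.
regular_sales = [10, 12, 14, 16]
intermittent_sales = [0, 4, 0, 4, 0, 4]

forecasts = {
    "historic_average": historic_average(regular_sales, h=2),
    "naive": naive(regular_sales, h=2),
    "window_average": window_average(regular_sales, h=2, window=2),
    "croston_classic": croston_classic(intermittent_sales, h=2),
}
```

All four produce flat forecasts over the horizon. That makes them cheap to fit across hundreds of products and useful as benchmarks for the more complex models.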

I used the Python library StatsForecast to apply this battery of models to each product. Additionally, I improved the pipeline by incorporating temporal cross-validation and parallelization.
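The sketch below illustrates the idea behind temporal cross-validation and per-product parallelization using only the Python standard library. It does not use StatsForecast and is not the thesis pipeline. The model set, the error metric (MAE), the window settings, and the sample series are assumptions for illustration, and a thread pool is used only to keep the example portable (CPU-bound work would normally call for processes).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def naive(train, h):
    return [train[-1]] * h


def historic_average(train, h):
    return [mean(train)] * h


# Hypothetical candidate models, keyed by name.
MODELS = {"naive": naive, "historic_average": historic_average}


def rolling_origin_mae(y, model, h, n_windows):
    """Average MAE over n_windows consecutive test windows at the end of y.

    Each window is forecast using only the observations that precede it,
    so the model never sees the future it is evaluated on.
    """
    errors = []
    for w in range(n_windows, 0, -1):
        cutoff = len(y) - w * h
        train, test = y[:cutoff], y[cutoff:cutoff + h]
        forecast = model(train, h)
        errors.append(mean([abs(f - a) for f, a in zip(forecast, test)]))
    return mean(errors)


def best_model(y, h=3, n_windows=2):
    """Return the name of the lowest-error model and all model scores."""
    scores = {name: rolling_origin_mae(y, model, h, n_windows)
              for name, model in MODELS.items()}
    return min(scores, key=scores.get), scores


# Hypothetical sales histories for two products.
series_by_product = {
    "trending": [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32],
    "flat": [5, 7, 5, 7, 5, 7, 5, 7, 5, 7, 5, 7],
}

# Each product is evaluated independently, so the work parallelizes trivially.
with ThreadPoolExecutor() as pool:
    outcomes = pool.map(best_model, series_by_product.values())
    results = dict(zip(series_by_product, outcomes))
```

Here the naive model wins for the trending product and the historic average wins for the flat one, which mirrors the point that heterogeneous series call for per-product model selection.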

Reflections

This work provided significant learning experiences. First, tackling and solving a real-world business case is an excellent opportunity to test data science methodologies and techniques, because models must adapt to business needs.

Second, building scalable projects requires a pipeline design that allows processes to be rerun efficiently. Encapsulating code blocks into functions facilitates this and can improve the use of computational resources.

You can access the final thesis documents here: