With more than 70 Spark jobs and hundreds of gigabytes of data processed per day, Spark is a critical piece of our data pipelines.
At Teads, we use the official open-source Spark package and spawn AWS EMR clusters to run our jobs, so optimizing those jobs is essential to keeping our bill down. With its promise of a two-fold speed-up, we had to give Spark 3 a try.
In this article, we first cover the main breaking changes we had to take into account to make our code compile with Spark 3. …
The sequence prediction problem consists of finding the next element of an ordered sequence by only looking at the sequence’s items.
This problem arises in a variety of domains, with applications such as product recommendation, forecasting, and web page prefetching.
Many approaches have been studied for this problem; popular ones include PPM (Prediction by Partial Matching), Markov chains, and, more recently, LSTM (long short-term memory) networks.
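To make the problem concrete, here is a minimal sketch of one of the baseline approaches named above, a first-order Markov chain, which predicts the next item from the last item of the sequence only. It is not the CPT algorithm, and the function names (`train_markov`, `predict_next`) and the toy browsing sessions are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def train_markov(sequences):
    """Count, for each item, how often each other item directly follows it."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        # Pair every item with its immediate successor.
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, sequence):
    """Return the most frequent successor of the last item, or None if unseen."""
    if not sequence or sequence[-1] not in transitions:
        return None
    return transitions[sequence[-1]].most_common(1)[0][0]


# Hypothetical training data: page-visit sessions on a web shop.
training = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product", "reviews"],
    ["home", "product", "cart"],
]

model = train_markov(training)
prediction = predict_next(model, ["home", "search", "product"])
print(prediction)
```

Because this model only looks at the last item, it ignores the rest of the sequence; that limitation is one reason higher-order and tree-based approaches have been studied.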
The Compact Prediction Tree (CPT) is an approach published in 2015 that aims to match the accuracy of popular algorithms while outperforming them in training and prediction time, using a lossless compression of the entire training set. …