Breaking changes and expected improvements: a production point of view


With more than 70 jobs running with Spark and hundreds of gigabytes of data processed per day, Spark is a critical piece of our data pipelines.

At Teads, we use the official open-source Spark package and spawn AWS EMR clusters to run our jobs. Hence, it is essential to optimize our jobs to reduce our bill. With the exciting promise of a two-fold speed-up, we had to give Spark 3 a try.

In this article, we first cover the main breaking changes we had to take into account to make our code compile with Spark 3. …


Hands-on Tutorials

A Lossless Model for Accurate Sequence Prediction over a finite alphabet

Image by Bela Geletneky from Pixabay

What is a Sequence Prediction problem?

The sequence prediction problem consists of finding the next element of an ordered sequence by only looking at the sequence’s items.

This problem has applications in a variety of domains, such as product recommendation, forecasting, and web page prefetching.

Many different approaches have been studied for this problem. Popular ones include PPM (Prediction by Partial Matching), Markov chains, and, more recently, LSTM (Long Short-Term Memory) networks.
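
To make the problem concrete, here is a minimal sketch of one of the simplest approaches mentioned above: a first-order Markov chain that predicts the next item from the last observed item only. The function names (`train_markov`, `predict_next`) and the toy training sequences are assumptions made for illustration; this is not the CPT algorithm, nor code from any particular library.

```python
from collections import Counter, defaultdict


def train_markov(sequences):
    """Count, for each item, how often each item directly follows it."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        # Pair every item with its immediate successor.
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, sequence):
    """Return the most frequent successor of the last item, or None if unknown."""
    if not sequence or sequence[-1] not in transitions:
        return None
    return transitions[sequence[-1]].most_common(1)[0][0]


# Hypothetical training set over the finite alphabet {A, B, C, D}.
training = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "B", "C"],
    ["B", "C", "A"],
]

model = train_markov(training)
prediction = predict_next(model, ["A", "B"])  # "B" is followed by "C" three times, by "D" once
print(prediction)
```

A model like this only looks one step back, so it forgets everything before the last item. Higher-order models keep more context at the cost of more memory.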

The Compact Prediction Tree (CPT) is an approach published in 2015 that aims to match the accuracy and exceed the performance (training and prediction time) of popular algorithms by using a lossless compression of the entire training set. …

About

Louis Fruleux

Passionate software developer and data engineer at Teads. Data enthusiast. LinkedIn: https://www.linkedin.com/in/louis-fruleux-04b48a127/
