10 Lessons Learned In 10 Years Of Data [1/2]
From 2012 to 2022, what went wrong in the data world?
It’s the end of 2022, and a common tradition in the data community is to predict trends for 2023. But what do you need for predictions? Data. And looking solely at 2022 won’t help much in making accurate predictions. So let’s go back to 2012.
I’ll highlight my lessons learned, and you can draw your own predictions for 2023. Don’t worry; some will be obvious. And, of course, there will be memes.
2012 ⏱️ Meet Bob, the Big data engineer
Bob is happy. His company just invested in an on-premise Hadoop cluster. No more proprietary BI tools. They will be dead in a few years, anyway (right!?). Bob is happy to care about distributed systems rather than business value.
A few months, several system engineers, and thousands of $$$ later, the cluster is finally ready.
Bob is thinking: “Oh, it would be nice to have a service that does that for us, but what will we do then? It will steal our jobs!”
✔️ Lesson #1: Cloud didn’t take our job
Technology doesn’t replace people; it changes the way we work. So if you are scared by all these ChatGPT highlights, look at the past and think twice.
You will definitely need to adapt, as many companies did for the cloud, but you will still have a job to do.
2013 ⏱️ Another day in Bob’s Big Data Engineer life
Today Bob has a big batch job to run that will probably take all resources from the cluster for a while. He kindly warns his teammates. They are ready for a long coffee break.
✔️ Lesson #2: Unlimited cloud resources can be painful
Nowadays, you no longer bother your colleagues about on-premise resource limits; you just burn through your credit card.
This is a blessing and a curse. Without any limit on resources, we tend to skip optimizing our data pipelines and SQL queries. That lasts until the money runs out and your CTO turns to the data team to find savings.
Sound familiar? Given the tough economic times, I believe this cost pressure is here to stay.
2016 ⏱️ Bob, the Big Data Engineer turned Data Scientist
Data is liquid gold, and all we need is a bunch of Ph.D. Data Scientists to make this happen. At least, that’s what we thought.
✔️ Lesson #3: Data Science was a dream
I like the meme above because I feel many companies were blinded by the data maturity of big tech companies. They were fooled into thinking they could easily do the same. Note that we should probably talk more about failures than successes at conferences and meetups.
Today, everybody knows the hard truth: you need a strong data foundation before doing anything fancier than basic analytics.
We have learned to be humble about our data maturity, and that’s okay.
2018 ⏱️ Bob likes Notebooks
Bob is happy with Jupyter notebooks. No need to know software engineering: just a few lines of Python, and it’s working.
✔️ Lesson #4: Meet the users where they are but not too much
The “it works on my machine” black hole. Notebooks are great because they lower the technical barrier to entry for data work. But when a tool is easy to use, it often hides complexity elsewhere. Jupyter notebooks, in this case, bypass most software engineering best practices, like versioning, testing, and code reusability. Yes, there are workarounds. Yes, tons of SaaS companies are working on this.
But in 2018, we thought notebooks were the holy grail until we tried to go into production. So the bottom line is: yes, we need more tools that are easy to use, but users need to upskill at least enough to understand that handling data requires software engineering foundations.
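To make “software engineering foundations” a bit more concrete, here is a minimal sketch of the kind of step that helps a notebook survive production: moving cell logic into a plain function that can be imported, versioned, and tested. The function name, the "amount" field, and the cleaning rule are all hypothetical and only there for illustration.

```python
# Hypothetical example: logic that would otherwise live in a notebook cell,
# moved into a plain function so it can be imported, versioned, and tested.

def clean_amounts(rows):
    """Return rows whose 'amount' is a valid, non-negative number (as float)."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric amount
        if amount >= 0:
            cleaned.append({**row, "amount": amount})
    return cleaned


def test_clean_amounts():
    # A small test that can run in CI instead of relying on "it works on my machine".
    rows = [{"amount": "10.5"}, {"amount": "oops"}, {"amount": -3}, {}]
    assert clean_amounts(rows) == [{"amount": 10.5}]
```

The notebook can then import clean_amounts from a module instead of redefining it in every copy, and the test can run under a test runner such as pytest.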
2019 ⏱️ Bob, the Data Scientist turned Data Engineer
At this point, Bob is just following the market buzzword, which is fair. Most data engineers in 2022 who started before 2019 did the same. At least I did.
✔️ Lesson #5: Data engineer role is too wide
Why? Probably because many data engineers who started before 2019 as data scientists took on the new responsibilities while still keeping the old ones.
Add to that the explosion of tooling and frameworks, and data engineer became the default role for every new data responsibility.
Infrastructure? Data engineer. Data pipelines? Data engineer. Analytics? Data engineer. MLOps? Data engineer. Data observability? Data engineer.
And the list goes on. If you look at job offers today, you will find a lot of different definitions. I touch on this topic in the video above, where I explain which role names we can use to navigate this mess.
🔄 Recap
✔️ Lesson #1: Cloud didn’t take our job
✔️ Lesson #2: Unlimited cloud resources can be painful
✔️ Lesson #3: Data Science was a dream
✔️ Lesson #4: Meet the users where they are but not too much
✔️ Lesson #5: Data engineer role is too wide
Alright, that’s all for Part I, folks. There are already way too many memes in this blog post.
Part II will cover 2019 to 2022, which feels like a decade in data: so many things happened, and so many lessons were learned too.
May the data be with you.