A couple of years ago, I started (read: was made) to adopt scrum in my work.
I didn’t like it. The concept of estimation was vague to me: How do we estimate effort for data exploration or research? And after we move something from
In Progress to
Done, can we move it back? This happens often (in data science) where we need to revisit an upstream step, such as data preparation or feature engineering, to improve downstream modelling.
Despite my initial concerns (and violent objections), Scrum grew on me. Now, I find it almost indispensable. Some data science folks find this unusual, so I thought I’d pen some of these views here. Specifically, I’ll discuss time-boxed iterations, prio, demo, and retro.
But first, a disclaimer: I’m no Agile expert. I don’t have proper training on Agile or a Scrum certificate. Also, we modified Scrum so it worked for us, thus it’s likely different from conventional Scrum–what works for you might be different too.
Scrum is organised in sprints, usually 1 - 2 weeks each, with the goal of shipping each sprint; Kanban operates on continuous delivery and release. I’ve added resources at the end that have a more in-depth discussion.
A common issue in data science projects is getting derailed by trips down the rabbit hole of research and premature optimization: The model would perform better if we had that data. Should we try a different technique? Or add more layers? Or build an ensemble—yeah, ensembling rocks!
There’s always something.
However, research and improving model performance—on its own—doesn’t serve customers. It needs to be implemented, integrated, and deployed. Then, we can measure how much it helps customers and the organization. Here’s where time-boxing shines. Each iteration has a clearly defined deliverable and a bounded timeline. This helps the team make progress towards the finish line and not get lost in the woods.
They don’t have to be the same as regular engineering deliverables; I mean, they can be, but they don’t have to. Engineering-oriented tasks and deliverables are relatively straightforward and definitely achievable. They kinda look like this:
On the other hand, research-oriented tasks have a more open problem space and a higher chance of failure:
For some engineering PMs, these research-oriented tasks can take some getting used to. These tasks can lead to non-positive outcomes such as:
While such “failures” make it seem like there was no progress, IMHO, the learning (from research and experimentation) is the deliverable.
"I haven't failed–I've just found 10,000 that won't work." - Thomas Edison
I apply time-boxing like how I would have a budget for bets, especially research-oriented ones. At the start, we might bet one or two iterations to figure out the problem space. If it results in “The target metric is 95% accuracy; we achieved 85%”, we can double down. But if the gap is too wide to close—with a reasonable amount of resources—we might choose to move on.
Time-boxed iterations provide for regular check-ins and feedback, as well as a way to limit loss—think of it as a stop-loss. It makes clear how many bets you can make in a year. Or if you’re a start-up with limited runway, how many bets you have remaining.
Zen of Python line 2:
Explicit is better than implicit. This is what we try to achieve with sprint “prio”, which involves representatives from the business and the data team.
For our stakeholders, prio gives them a say on the importance and urgency of each task. This ensures alignment between business goals and our data science efforts. It gives stakeholders transparency on the effort for each task, which helps to earn trust. They also become aware of the costs involved with frequently changing priorities and context switching, and generally try to tone it down.
Each business function has a budget (i.e., data science effort) they can allocate to their projects. Prio makes clear that—with limited resources—not everything is important and urgent. And changing priorities mid-sprint is gonna cost ya. Sometimes, business leaders deliberately remove tasks (from their backlog) to set constraints on their own team.
“Deciding what not to do is as important as deciding what to do.” - Steve Jobs
For the data team, prio provides good input and feedback. Business leaders have strong intuition on which efforts will move the needle most, and which are simply a waste of time. Through prio (and further interactions), this intuition rubs off on the data team. Over time, we gain the ability to quickly cut to the crux of the data science (and business) problem.
Perhaps more importantly, prio ensures alignment. The data team is made aware of organizational needs and the associated level of urgency. This discipline prevents distraction by the “latest and greatest” tech or fun ideas that may not be essential.
Also, we learn that, sometimes, deploying something that’s 80% done in one - two months is 10x better than deploying something 99% done in six months. It can even be a matter of keeping the lights on, due to sudden policy changes (e.g., GDPR) or events (e.g., COVID-19).
Truth be told, I still haven’t fully figured it out. You would expect that estimation error (i.e.,
actual - estimated) has a normal distribution. But in reality, it has a fat long tail. A two-week estimate is more likely to take two months than two days. (Welp, I guess we can’t use linear regression as errors are not homoscedastic.)
Nonetheless, estimations are useful in prioritization. If someone needs a project done in two weeks but we estimate it at 2 - 3 months effort, it’s a clear non-starter. I treat estimations like projected spending and time boxes as hard budgets. (Please, don’t take estimations as deadlines carved in stone; see previous paragraph.)
(Note: Past estimates and actuals are useful in guiding future estimates. Do we always underestimate certain tasks (e.g., data acquisition)? What was the actual effort required?)
People enjoy demos. Both giving and attending them. Demo-ers are excited to show (off) their work—hey, they take pride in it! Demos are also a venue to get feedback and suggestions for improvement. I’ve seen many types of demos, including:
Demos also promote accountability. The teams I’ve been in operate with high autonomy and responsibility: We trust everyone to do our best work and don’t micromanage. (This is similar to Netflix’s Freedom and Responsibility.)
“With great freedom comes great responsibility.” - Uncle Ben Parker (maybe)
What winds up happening is members striving to give regular demos—peer pressure perhaps? That said, it’s not necessary for everyone to give a weekly demo. Demos are usually done after a chunk of work or milestone, which usually happens every 2 - 8 weeks.
The entire organization has an open invite to data science demos. For the most part, we have to lure them with 🍻 beer, 🍕 pizza, and 🍿 chips. But when they show up, they understand better what we do in our geek cave. Sometimes, ideas are sparked, leading to new initiatives and collaboration.
Finally, demos increase the team’s bus factor. A better understanding of one another’s work increases the data replication of tribal knowledge. It’s also a great way for new members to get up to speed on what the team is doing, as well as gain context on data science initiatives and data.
Retrospectives provide an open and safe venue for team feedback. It’s simple—just divide a whiteboard into three sections:
Puzzling and you’re good to go. We reflect on the past sprint and list aspects of our work or events that fall into those buckets. Often, multiple people face the same issue and we just
Enjoyable bucket is often used to celebrate wins and express gratitude. We also share improvements to our workflow (e.g., “no meetings from x - y o’clock ”) or infra (e.g., “more Hadoop nodes, yay!”). I find that, often, these enjoyable tasks have an element of challenging people’s abilities and helping them grow.
Puzzling buckets contain challenges and feedback that leaders should pay close attention to. They’re often non-technical (e.g., “Too many meetings!”, “Not enough innovation!”). Puzzling issues may require help from the team to deep dive the root cause (e.g., “Our DB chokes every couple days though restarting it solves the issue”). Disciplined follow-up on these feedback beats back the entropy that creeps in over time.
Yes, there’s some cost associated with what I’ve mentioned. Prio takes an hour, demos take an hour (often more, as people have so much to share), and retros take 30 - 60 mins. That’s about 3 - 4 hours each iteration. IMHO though, the resultant benefits of alignment, learning, and continuous growth far outweigh the cost.
Better to move in the right direction, albeit slower, than fast on the wrong path.
I’ve also gotten feedback that time-boxing can constraint data science and innovation. I acknowledge this; it won’t work in all teams (e.g., academia, research-focused.) It does well in lean start-up environments though, where we have to quickly figure out what works for consumers (and Mr Market).
From experience, it’s really difficult to estimate the impact of a new model or improved features. Users may not take to that sophisticated hybrid recommender. Half the time, something that took three months—or even three weeks—had more impact than a year-long project. Yeap, that’s a coin toss—thus, I prefer to toss the coin more often.
At first glance, it seems awkward to apply Scrum to data science. However, I’ve grown to enjoy certain aspects of it, and seen how it helps teams to stay focused, ship, and grow. Hopefully, this essay has convinced you a bit.
If you’re a Scrum Master reading this, you might not consider this “true” Scrum. I accept that. But hey, as the Agile Manifesto suggests—People over Process. Adapt it so it works for your team and you.
Join 3,200+ readers getting updates on data science, data/ML systems, and career.
Welcome gift: 5-day email course on How to be an Effective Data Scientist 🚀