Spark+AI Summit 2020 took place on 24-26 June. I’m only halfway through and it’s filled with gems of practical knowledge.
Figuring out which talks to attend can be overwhelming though. For example, 24th June alone has close to 100 sessions, not including keynotes and forums. A ctrl+f over the session listing gives the following matches:
“beginner”: 4
“intermediate”: 56
“advanced”: 7
“sponsored”: 24
“ask me anything”: 6
Most people don’t have the time to sift through and attend these talks. Thus, I’m tidying up my notes and sharing a high-level summary of selected talks. This write-up focuses on the application-agnostic talks; the next one will cover the application-specific ones.
Nick Pentreath, Principal Engineer at IBM’s Center for Open Data and AI, shares four ways to improve model training and performance efficiency.
Architecture improvements: Inception V3 uses the standard convolution block and has 24 million parameters. In contrast, MobileNet V1 uses depth-wise convolution blocks and has only 4 million parameters. However, accuracy also suffers (78.8% vs 70.9% on ImageNet).
Trends in deep learning are changing. In the past, CNN models focused on accuracy (getting higher on the y-axis). Recently, the focus has shifted towards accuracy and efficiency (keeping the x-axis and the size of the models low).
Google has also shared EfficientNet, which uses neural architecture search to find optimal architectures. However, searching for the right architecture consumes a lot of resources.
Model pruning. Many of the weights in a neural network are not used. Here’s an example of how we can prune as much as half the weights in Inception V3 and MobileNet V1 with minimal loss in accuracy on ImageNet.
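The numbers above come from pruning ImageNet models; as an illustration of the general idea (not the exact tooling from the talk), here’s a minimal sketch of magnitude-based pruning using PyTorch’s built-in pruning utilities:

```python
import torch
import torch.nn.utils.prune as prune
from torchvision import models

# A pre-trained CNN standing in for the talk's ImageNet examples
model = models.mobilenet_v2(pretrained=True)

# Prune 50% of the weights (smallest L1 magnitude) in every conv layer
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Make the pruning permanent by folding the masks into the weights
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.remove(module, "weight")
```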
Quantization. We can reduce the numerical precision of weights by binning the values (e.g., reducing from 32-bit to 16-bit while still retaining the histogram distribution).
Quantization has two flavours: post-training and training-aware. Post-training quantization avoids retraining the model, but comes with some drop in accuracy (77.2% vs 63.7%). Training-aware quantization is more complex, though it loses less accuracy (77.5% vs 70.9%).
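For instance, the post-training flavour can be a few lines with TensorFlow Lite (a sketch assuming a trained SavedModel at a placeholder path; the talk itself wasn’t tied to a specific toolkit):

```python
import tensorflow as tf

# "saved_model_dir" is a placeholder for a trained model's directory
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantize weights post-training
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```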
Model distillation. From the examples on pruning, it’s clear that our large neural networks are usually over parameterised. One way to slim them down is to train a smaller (student) model to mimic the larger (teacher) model’s predictions.
The smaller model can be trained on both the teacher’s soft-labels and the ground truth labels. The soft-labels usually have higher entropy and provide more information and less variance. Thus, the smaller model can be trained on less data at a higher learning rate.
In some cases, the distilled model is stronger than the original model, providing higher accuracy and efficiency (further demonstrating that most models are over-parameterised). Some successful distilled models include DistilBERT and TinyBERT.
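Here’s a minimal sketch of what a distillation loss can look like in PyTorch (the temperature and alpha values are illustrative, not from the talk), blending the teacher’s soft labels with the ground-truth labels:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    # Soft targets: match the teacher's temperature-softened distribution
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: the usual cross-entropy on ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```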
Instagram’s Feed recommendation system uses a stack of models, with the first layer being a distilled model. Here’s how it works during inference:
They shared how the distillation model is trained: First, they record the (candidate) input and respective output from the teacher models. Then, the distillation model is trained on the recorded data to replicate the results, optimising for NDCG ranking loss over the teacher model’s output.
Further Reading:
Sean Owen, Principal Solutions Architect at Databricks, shares six tips on how to scale the training of deep learning models, in the following order:
These steps use a 10% sample of the data for quick experimentation and iteration (Step 0 is “Use a sample and work in memory”):
These steps use the full data set to optimize for model performance:
Petastorm to iterate over large data: Petastorm was designed by Uber to iteratively feed Parquet-based data into deep learning frameworks (initially TensorFlow, now PyTorch too) almost as efficiently as if it were in memory. While we’re using 10x the data, epoch times are only 11x longer, and accuracy goes up to 83% (see the sketch after this list).
Horovod across multiple machines: If multiple GPUs on a single machine aren’t enough, consider using Horovod, which helps to scale training on GPUs across multiple machines.
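Here’s a rough sketch of the Petastorm pattern with PyTorch (the Parquet path and batch size are placeholders):

```python
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# "file:///data/train.parquet" is a placeholder for your Parquet dataset
with make_batch_reader("file:///data/train.parquet", num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=1024)
    for batch in loader:
        # batch holds the Parquet columns as tensors; feed them to your
        # model's forward pass, compute the loss, and step the optimiser
        pass
```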
Further Reading:
Joe Spisak, PyTorch Product Lead at Facebook, drops knowledge on PyTorch and its ecosystem, which help simplify the training and deployment of models at scale.
To reduce model size and computation requirements, PyTorch comes with built-in model pruning and quantization. Facebook has been deploying quantized models for years; it helps with inference at scale, especially on mobile devices.
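As a small sketch of the built-in quantization (dynamic quantization here, which is most useful for Linear/LSTM-heavy models; the toy model is hypothetical):

```python
import torch
import torch.nn as nn

# A toy model standing in for a trained network
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Quantize the Linear layers' weights to 8-bit integers post-training
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```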
To train models at scale, there’s TorchElastic. It enables distributed PyTorch jobs in a fault-tolerant and elastic manner. This ensures that jobs don’t get disrupted when machine(s) go down, or when your AWS spot instance gets outbid.
To deploy models at scale, there’s TorchServe, which was jointly developed with AWS. It makes it easy to achieve lightweight serving, and provides default handlers for common tasks, model versioning for A/B testing, etc.
Further Reading:
TorchElastic on GitHub
TorchServe on GitHub
Yeshwanth Vijayakumar, Project Lead/Architect at Adobe, shares three probabilistic data structures for quick queries on large datasets. These data structures work because they’re monoids. Wait, what are monoids?
Here’s a simplified explanation. Monoids have the following properties:
int(1) + int(2) = int(3) (combining two values returns a value of the same type)
1 + (2 + 3) = (1 + 2) + 3 (the operation is associative)
1 + 0 == 1, where 0 is the identity value
Bloom filters check for membership in a probabilistic way (e.g., set.exist(item)). They can be used to answer questions like: Has the user viewed this product before?
If the result is False, the item definitely does not exist in the set. If the result is True, there’s still a possibility that the item does not exist in the set (i.e., false positives are possible).
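Here’s a toy Bloom filter (a simplified sketch, not production code) that shows the “no false negatives, possible false positives” behaviour:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive multiple hash positions from md5 with different salts
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def exists(self, item):
        # False => definitely not in the set; True => probably in the set
        return all(self.bits[pos] for pos in self._positions(item))

viewed = BloomFilter()
viewed.add("user_1:product_42")
print(viewed.exists("user_1:product_42"))  # True
print(viewed.exists("user_1:product_99"))  # almost certainly False
```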
HyperLogLog counts the number of distinct elements in a set (e.g., set.count_distinct()). This can be used to answer questions like: How many unique users bought items A, B, and C?
Because HyperLogLog is a monoid, we can count the number of unique users for each product, then perform a union across all the products.
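As one way to see the monoid property in code, here’s a sketch using the datasketch library (the user IDs are made up): build a sketch per product independently, then merge them to get the overall unique count.

```python
from datasketch import HyperLogLog

# One HLL sketch per product, built independently (e.g., per partition)
hll_a, hll_b = HyperLogLog(), HyperLogLog()
for user in ["u1", "u2", "u3"]:
    hll_a.update(user.encode("utf8"))
for user in ["u2", "u3", "u4", "u5"]:
    hll_b.update(user.encode("utf8"))

# Union the sketches (the monoid "merge") to count unique users overall
hll_a.merge(hll_b)
print(round(hll_a.count()))  # approximately 5 unique users
```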
Count-min sketch can be thought of as a frequency table of events in a stream of data. This can be used to answer questions like: How many items did this seller transact today?
It uses hash functions to map events to frequencies, so it needs sub-linear space instead of O(n). However, because it uses hash functions, it can overcount some events due to collisions.
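Here’s a toy count-min sketch (again a simplified sketch, not production code) showing how hashed counters can only overcount, never undercount:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._positions(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Take the minimum across rows; collisions can only inflate counts
        return min(self.table[row][col] for row, col in self._positions(item))

sketch = CountMinSketch()
for seller in ["seller_1", "seller_1", "seller_2"]:
    sketch.add(seller)
print(sketch.estimate("seller_1"))  # 2 (or slightly more under collisions)
```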
Further Reading:
Jianneng Li, Software Engineer at Workday, shares practical insights and experiment results from optimising broadcast joins.
Is the broadcast join always faster (if the data can fit into memory)? Not necessarily. Broadcast joins come with significant overhead, including:
What’s the I/O cost difference between the (regular) sortMergeJoin and the broadcastHashJoin? Let’s assume we have two tables: A is the bigger table and B is the smaller table to be broadcasted. We also have n cores.
For sortMergeJoin (SMJ), the total I/O cost is 3(A/n) + 3(B/n):
A: Read (A/n), Sort, Write (A/n)
B: Read (B/n), Sort, Write (B/n)
For broadcastHashJoin (BHJ), the total I/O cost is A/n + B/n + 2B:
B: Read (B/n), Build hash table, Write (B)
Simplifying the above, the cost difference between SMJ and BHJ is 2(A + B)/n - 2B.
We can further simplify this by expressing A as a multiple of B: if A has 10 million rows and B has 2 million rows, then B = 1 and A = 5. The cost difference then becomes 2(A + 1)/n - 2, so whether SMJ or BHJ wins comes down to comparing (A+1)/n against 1.
When (A+1)/n is less than 1 (i.e., you have many cores), then SMJ will be faster. But if (A+1)/n is much greater than 1 (i.e., your left table is much bigger than the right, broadcasted table), then BHJ will be faster.
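To make the crossover concrete, here’s a back-of-the-envelope calculation using the cost formulas above (units are arbitrary; B is normalised to 1 and A = 5, matching the 10-million vs 2-million row example):

```python
# Toy cost model from the formulas above (B normalised to 1)
A, B = 5, 1

def smj_cost(n):
    return 3 * (A / n) + 3 * (B / n)

def bhj_cost(n):
    return A / n + B / n + 2 * B

for n in [2, 4, 6, 8, 16]:
    winner = "SMJ" if smj_cost(n) < bhj_cost(n) else "BHJ"
    print(f"n={n:>2}  SMJ={smj_cost(n):.2f}  BHJ={bhj_cost(n):.2f}  -> {winner}")
# With A + 1 = 6, SMJ pulls ahead once n exceeds 6 cores
```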
In an experiment where A = 60 million and B = 15 million (thus, normalising B to 1, A + B = 5), we see that with more than 5 cores, SMJ does better (i.e., takes less time). Otherwise, BHJ does better.
Similarly, keeping n (the number of cores) constant at 18, we see that BHJ outperforms SMJ when A + B exceeds 20.
However, with automatic broadcasting (via adjusting the broadcast threshold), make sure the smaller table is on the right. Even when the smaller table is on the left, Spark will try to broadcast the bigger table on the right, leading to inefficiency.
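In PySpark, you can sidestep this by broadcasting the small table explicitly with a hint, or by tuning the auto-broadcast threshold (the table names and threshold below are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Raise the size limit (in bytes) under which Spark auto-broadcasts a table
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

# Hypothetical tables: transactions is large, sellers is small
transactions = spark.table("transactions")
sellers = spark.table("sellers")

# Explicitly broadcast the small table regardless of which side it's on
joined = transactions.join(broadcast(sellers), on="seller_id")
```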
Workday is currently working on executor-side BHJ (Spark currently has driver-side BHJ) which has shown tremendous improvement in broadcast performance (SPARK-17556). Hopefully, this will be merged soon.
There were other excellent sessions on optimising Spark, but I didn’t have time to tidy up my notes on them. I recommend:
Further Reading:
The practical knowledge from Spark+AI Summit 2020 is invaluable. I’m eager to apply it to my work on Spark pipelines and deep learning. Next week, I’ll share notes on some application-specific talks from the conference.
Spark+AI Summit 2020 was epic. Some great talks👇:
Improving Broadcast Joins In Apache Spark (25 June)
• broadcastHashJoin is not always faster; building hashtable has overhead
• More cores: sortMergeJoin performs better
• Bigger left table: broadcastHashJoin performs better
— Eugene Yan (@eugeneyan) July 1, 2020
P.S., Are there any good sessions I missed? Let me know in the comments below =)
If you found this useful, please cite this write-up as:
Yan, Ziyou. (Jun 2020). My Notes From Spark+AI Summit 2020 (Application-Agnostic Talks). eugeneyan.com. https://eugeneyan.com/writing/notes-from-sparkai-summit-application-agnostic/.
or
@article{yan2020spark,
title = {My Notes From Spark+AI Summit 2020 (Application-Agnostic Talks)},
author = {Yan, Ziyou},
journal = {eugeneyan.com},
year = {2020},
month = {Jun},
url = {https://eugeneyan.com/writing/notes-from-sparkai-summit-application-agnostic/}
}