Loss engineering and uncertainty for multi-task learning

Robust Regression

Understanding the challenges and strategies in multi-task learning
data science
Author

Oren Bochman

Published

Monday, September 12, 2022

Keywords

machine learning, ml engineering, negative transfer, multi task learning, robust regression

Note: Loss engineering TLDR

Our assignment today is to model some related scientific phenomena (a.k.a. tasks) using a single model, hoping that the multiple tasks will create a synergy: by providing several training signals they help the model generalize, separate signal from noise, and capture more of the hidden structure in the problem. Since each task has its own loss function (a metric of how well it performs, evaluated on unseen data), we just need to combine them and we are done.

The reason why not all models are multi-task learning models is that, in reality, any number of forces may act to frustrate the synergy we are hoping to create. Losses may be on different scales or time frames. The tasks may learn better when cooperating or when competing. Finally, the underlying processes may be causally related in subtle ways.

Two ideas are commonly used in multi-task learning - enforcing sparsity across tasks through norm regularization, and modelling the relationships between tasks. Causal inference in the Bayesian modelling framework seems to be one way to resolve some of these issues. It can certainly help by pointing out when the underlying mechanisms governing the processes generating each effect/task are at odds. I recommend McElreath’s “Statistical Rethinking” for learning about that. But even if we have not been able to do a causal analysis, or the causal picture seems fine, we still need to engineer a loss that works for the particular collection of tasks.

It is also worth mentioning that, since the loss sits at the top of the model and is the source of all gradients, using a different loss function may require engineering a suitable architecture.

Multitask loss function

The naive solution is a linear combination: L = \sum_i \alpha_i L_i

  • where:
    • L_i are the individual task losses
    • \alpha_i is the importance (weight) of each task.
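
As a concrete illustration, here is a minimal PyTorch sketch of this weighted combination. The two task heads, their loss functions, and the \alpha_i values are hypothetical placeholders; in practice the weights would be tuned on validation data or set by one of the adaptive schemes discussed below.

```python
import torch
import torch.nn as nn

# Hypothetical two-task setup: a regression head and a classification head
# sharing one backbone. Task names and alpha values are illustrative only.
regression_loss = nn.MSELoss()
classification_loss = nn.CrossEntropyLoss()

alphas = {"regression": 1.0, "classification": 0.5}  # per-task importance weights

def combined_loss(reg_pred, reg_target, cls_logits, cls_target):
    """Naive linear combination L = sum_i alpha_i * L_i."""
    losses = {
        "regression": regression_loss(reg_pred, reg_target),
        "classification": classification_loss(cls_logits, cls_target),
    }
    return sum(alphas[name] * losses[name] for name in losses)

# Stand-in tensors for model outputs and labels, just to show the call.
reg_pred = torch.randn(8, 1, requires_grad=True)
cls_logits = torch.randn(8, 3, requires_grad=True)
loss = combined_loss(reg_pred, torch.randn(8, 1),
                     cls_logits, torch.randint(0, 3, (8,)))
loss.backward()  # one scalar, so gradients from all tasks flow together
```

The weakness is already visible here: if the individual losses live on very different scales, the \alpha_i mostly end up compensating for scale rather than expressing how much we care about each task.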

I suppose it is worth checking out what people use in multi-task learning. I recall that multi-task learning is common in large language model papers.

One option is a per-connection adaptive gain: each weight w_{ij} gets a gain g_{ij} that multiplies its learning rate, and the gain is increased additively while the gradient keeps its sign but decayed multiplicatively when the sign flips:

if the gradient sign for w_{ij} is unchanged: g_{ij}(t) = g_{ij}(t-1) + .05
else: g_{ij}(t) = g_{ij}(t-1) \cdot .95

  • Ways to make this work better (see the sketch after this list):
    • limit the gains to a reasonable range to avoid instabilities
    • use large batches - the rule was designed for full-batch learning
    • combine with momentum
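
A minimal NumPy sketch that puts the three points above together: the gains are clipped to a range, and the gain-scaled gradient feeds a momentum update. The +.05 and \cdot .95 constants follow the rule above; the function name, the clipping bounds, and the toy quadratic objective are assumptions made up for the example.

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gains, velocity,
                       lr=0.01, momentum=0.9, gain_bounds=(0.1, 10.0)):
    """One update with per-parameter adaptive gains, gain clipping, and momentum.

    Gains grow additively (+0.05) where the gradient keeps its sign and
    shrink multiplicatively (*0.95) where it flips.
    """
    same_sign = np.sign(grad) == np.sign(prev_grad)
    gains = np.where(same_sign, gains + 0.05, gains * 0.95)
    gains = np.clip(gains, *gain_bounds)                # limit gains to avoid instabilities
    velocity = momentum * velocity - lr * gains * grad  # combine with momentum
    return w + velocity, gains, velocity

# Toy full-batch usage on a quadratic bowl f(w) = 0.5 * w @ A @ w (illustrative only).
A = np.diag([1.0, 10.0])
w = np.array([3.0, 3.0])
gains = np.ones_like(w)
velocity = np.zeros_like(w)
prev_grad = np.zeros_like(w)
for _ in range(300):
    grad = A @ w
    w, gains, velocity = adaptive_gain_step(w, grad, prev_grad, gains, velocity)
    prev_grad = grad
print(w, gains)  # w should approach the origin; gains adapt per coordinate
```

Run for a few hundred steps, the gain should grow along the shallow direction and shrink along the stiff, oscillating one, which is exactly the per-parameter rescaling the rule is after.
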
Important: Credit

I used Leo F. Isikdogan’s “Multi-Task Learning Explained in 5 Minutes” as my starting point, as it mentioned the paper that kept coming up.

Citation

BibTeX citation:
@online{bochman2022,
  author = {Bochman, Oren},
  title = {Loss Engineering and Uncertainty for Multi-Task Learning},
  date = {2022-09-12},
  url = {https://orenbochman.github.io/posts/2022/2022-09-16-loss-engineering/2022-09-16-loss-engineering.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2022. “Loss Engineering and Uncertainty for Multi-Task Learning.” September 12, 2022. https://orenbochman.github.io/posts/2022/2022-09-16-loss-engineering/2022-09-16-loss-engineering.html.