Feature selection problem
The first idea comes from looking at regression and hyperparameter optimization tasks.
First we want to investigate the idea of finding the best values of individual parameters or hyperparameters. Next, if there are second-order effects corresponding to the interaction of two or more parameters, we will want to investigate whether some pairs of parameters or hyperparameters are more important than others.
- One way to organize this search might be characterized as a reductionist search, where we start with a simple model and add complexity as needed.
- Another way to organize this search might be characterized as covariance matrix bingo, where we conceive of all the possible interactions as entries of the covariance matrix. In reality the covariance matrix is symmetric and sparse; we may discover that certain features are correlated with each other, or better yet collinear, and we may want to remove these features and retain a smaller, roughly orthogonal set. In other words, we might be able to plan our model selection by watching how the covariance matrix evolves as we add features to the model. We end up with a subset of the matrix similar to a bingo card, where we wish to reach this winning combination of features faster than the other players. A minimal sketch of this idea appears below.
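Here is a minimal sketch of the "bingo card" idea, assuming a plain numpy feature matrix; the feature names, the synthetic data, and the collinearity threshold are all illustrative, not part of any specific method.

```python
import numpy as np

def bingo_card(X, names, collinear_threshold=0.95):
    """Compute the feature correlation matrix and flag near-collinear pairs.

    X     : (n_samples, n_features) array of observed features
    names : list of feature names, one per column
    """
    corr = np.corrcoef(X, rowvar=False)          # the symmetric "bingo card"
    n = corr.shape[0]
    redundant = []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= collinear_threshold:
                # one feature of the pair carries little extra information
                redundant.append((names[i], names[j], corr[i, j]))
    return corr, redundant

# toy usage: x2 is (almost) collinear with x0, so the pair is flagged
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = 2.0 * x0 + 0.01 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2])
_, redundant = bingo_card(X, ["x0", "x1", "x2"])
print(redundant)
```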
A second aspect of this issue, addressable by RL, is that higher-order effects tend to be increasingly sparse, so if we have a hint that certain slots in the covariance matrix are non-zero we should explore those effects more than others. This in fact suggests an improvement to Bayesian search, sketched below.
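One way to read the "explore the non-zero slots first" suggestion, as a hedged sketch: rank which interaction pairs a search should try next by the magnitude of the correlations already observed. The ranking scheme here is my own illustration, not the API of any Bayesian-optimization library.

```python
import numpy as np

def propose_interactions(corr, k=3):
    """Rank candidate pairwise interactions by |correlation|, largest first."""
    n = corr.shape[0]
    iu = np.triu_indices(n, k=1)                  # upper triangle, no diagonal
    scores = np.abs(corr[iu])
    order = np.argsort(scores)[::-1][:k]
    return [(int(iu[0][i]), int(iu[1][i]), float(scores[i])) for i in order]

# usage: explore the (0, 2) interaction first, since its slot is clearly non-zero
corr = np.array([[1.0, 0.1, 0.8],
                 [0.1, 1.0, 0.0],
                 [0.8, 0.0, 1.0]])
print(propose_interactions(corr, k=2))   # [(0, 2, 0.8), (0, 1, 0.1)]
```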
This aspect of feature selection is not particularly exciting, but it suggests a more abstract way of thinking about the problem, which leads to a second, perhaps more powerful, idea.
Localized subspace embedding
The idea of embeddings, which allow us to use a distributed representation instead of a one-hot encoding, is very powerful. In one sense it is the opposite of the reductionist notion mentioned above: instead of trying to find the best features we are trying to find the best representation of the features. However, anyone familiar with PCA or SVD knows that the best representation is often some linear combination of the actual features we observe. A reductionist might view these as generalized coordinates that are more useful for investigating the feature space.
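To make the PCA/SVD point concrete, a small sketch using numpy's SVD on synthetic data; the shapes and the choice of keeping two directions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                  # centre the features

# SVD of the centred data: the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each principal component is a linear combination of the observed features;
# the coefficients of that combination are the entries of the corresponding
# row of Vt -- the "generalized coordinates" mentioned above.
print("first principal direction (loadings):", Vt[0])

# Projecting onto the top-2 directions gives a 2-d representation
Z = Xc @ Vt[:2].T
print("embedded shape:", Z.shape)        # (200, 2)
```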
If our features are built from such embeddings, we learn weights that correspond to the importance of the features in the model. This representation is still subject to the bias-variance tradeoff, and neural networks generally have many parameters. One way to view this is to try to organize the model so that it can use embeddings of subspaces, corresponding to features that balance generalization and discrimination.
A second point is that in different contexts we might need to use different features. This is much easier to see in RL and NLP. Building an embedding that is localized to just some features and some observations/states might allow the model to generalize better than by considering all the features at once. This is an analogue of the idea of factoring a distribution into a product of marginals, particularly in the case of a much larger Bayesian network. In this case, though, we might be talking about using two such factorizations, with some discriminator selecting which observations are used in which context.
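A hedged sketch of the "two factorizations with a discriminator" idea in PyTorch; the module name, sizes, and sigmoid gate are my own illustration, not a reference design.

```python
import torch
import torch.nn as nn

class GatedLocalEmbedding(nn.Module):
    """Two small context-specific embeddings, mixed by a learned discriminator."""

    def __init__(self, in_dim, sub_dim):
        super().__init__()
        self.embed_a = nn.Linear(in_dim, sub_dim)   # localized subspace A
        self.embed_b = nn.Linear(in_dim, sub_dim)   # localized subspace B
        self.gate = nn.Sequential(nn.Linear(in_dim, 1), nn.Sigmoid())

    def forward(self, x):
        g = self.gate(x)                            # which context are we in?
        return g * self.embed_a(x) + (1 - g) * self.embed_b(x)

# usage: the gate learns to route observations to the subspace that fits them
model = GatedLocalEmbedding(in_dim=16, sub_dim=4)
z = model(torch.randn(8, 16))
print(z.shape)   # torch.Size([8, 4])
```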
We might think of a neural network as an evolving system that learns to bifurcate the distributed representation of the input features into any number of smaller and more localized subspaces. The more localized subspaces are more likely to be linearly separable and thus easier to learn.
However, all this happens by breaking symmetries using the random aspects of the learning algorithm. Minibatches present many random samples which carry different payloads of information. One such payload may reinforce the current network weights, while the next may require a bifurcation of the representation into two to minimize the loss. Another might require many bifurcations, and yet another might lead to no new bifurcations or reinforcements at all. Dropout breaks symmetries by shutting down parts of the network temporarily.
Other problems with generalization in RL and NLP might be resolved using a local subspace embedding of the state space. These are features that conflate states that are similar and distinguish states that are different. By avoiding a full embedding we reduce the variance of the function approximation and thus improve generalization.
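A minimal sketch of what "conflate similar states, distinguish different ones" could look like for linear function approximation, assuming a hand-written state-aggregation function; tile coding or a learned embedding would be the more serious versions.

```python
import numpy as np

def local_state_features(state, n_bins=10, low=0.0, high=1.0):
    """Map a continuous 1-d state to a coarse one-hot feature vector.

    Nearby states fall into the same bin (conflated); distant states fall
    into different bins (distinguished). The bin count trades off
    generalization against discrimination.
    """
    idx = int(np.clip((state - low) / (high - low) * n_bins, 0, n_bins - 1))
    phi = np.zeros(n_bins)
    phi[idx] = 1.0
    return phi

# states 0.31 and 0.33 share a feature vector; 0.31 and 0.71 do not
print(np.array_equal(local_state_features(0.31), local_state_features(0.33)))  # True
print(np.array_equal(local_state_features(0.31), local_state_features(0.71)))  # False
```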
Questions
- How can we encourage the NN model to learn localized subspaces?
- How can we query the model to learn/interpret the localized subspaces?
- If the model learns a powerful feature, how can we give other parts of the model access to it? (This would reduce learning time and increase generalization.)
- All discriminators following the feature would have one use of the feature.
- Could we let all other nodes in the network have a residual connection to the feature?
- If we wanted to further refine the representation of a localized subspace of verbs down to intransitive verbs, how would we do that?
- We would like to learn a discriminator that can tell the difference between transitive and intransitive verbs and then use it to gate the input to the verb feature.
- However, we might prefer to do better than that and learn a better representation of the transitive and intransitive verbs. This would require learning different weights for the different verbs. This means we want to bifurcate the verb feature subnetwork into two replicates, but add the discriminator as a gate to the input of the two sub-networks (see the sketch after this list).
- Another point worth considering is that once we have learned a good representation of both the transitive and intransitive verbs, we can use these as features in the next layers of the network. We should be able to combine them to get a better representation than just the original verb.
- I reckon this happens many times in LLMs. What we might want is some way for the model to attend to all the subspaces it has learned and use them to guide its learning.
- The challenge seems to be in identifying the subspaces. We may be using a different basis for each subspace, which may make it difficult to reuse them.
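A hedged sketch of the verb bifurcation idea from the list above, in PyTorch, with the discriminator acting as a soft gate over two replicated sub-networks. All class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class BifurcatedVerbFeature(nn.Module):
    """Replicate a verb-feature subnetwork and gate it by a learned discriminator.

    The discriminator decides (softly) whether an input looks transitive or
    intransitive and routes it to the corresponding replicate. Both outputs
    stay available to later layers so they can be recombined.
    """

    def __init__(self, in_dim, feat_dim):
        super().__init__()
        self.discriminator = nn.Sequential(nn.Linear(in_dim, 1), nn.Sigmoid())
        self.transitive = nn.Linear(in_dim, feat_dim)     # replicate 1
        self.intransitive = nn.Linear(in_dim, feat_dim)   # replicate 2

    def forward(self, x):
        g = self.discriminator(x)
        trans = g * self.transitive(x)
        intrans = (1 - g) * self.intransitive(x)
        # later layers can combine both refined features, as discussed above
        return torch.cat([trans, intrans], dim=-1)

verb_block = BifurcatedVerbFeature(in_dim=32, feat_dim=8)
out = verb_block(torch.randn(4, 32))
print(out.shape)   # torch.Size([4, 16])
```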
Attention heads
It seems, though, that attention heads are a good way to organize and increase the diversity of localized subspaces. Attention heads can be trained to focus on different parts of the input, and can thus be used to create a localized subspace embedding. This is a very powerful idea and is the basis of the transformer architecture.
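A minimal sketch of how each attention head works in its own low-dimensional subspace, using plain PyTorch; the head count and sizes are illustrative.

```python
import torch
import torch.nn as nn

# 2 heads over an 8-dimensional model: each head attends within its own 4-d
# subspace defined by its projection matrices -- a localized subspace embedding.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 8)                 # a sequence of 5 tokens
out, weights = mha(x, x, x, average_attn_weights=False)

print(out.shape)       # torch.Size([1, 5, 8])
print(weights.shape)   # torch.Size([1, 2, 5, 5]) -- one attention map per head
```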
Residual rerouting
Guided bifurcation alongside spontaneous symmetry breaking
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Two Ideas on Generalization},
date = {2024-07-01},
url = {https://orenbochman.github.io/posts/2024/2024-07-01-generalization-in-ML/},
langid = {en}
}