Deep Neural Networks - Some Questions

Open Issues

Unresolved questions on deep learning.
deep learning
neural networks
hyper-parameter tuning
notes
coursera
Author

Oren Bochman

Published

Thursday, December 21, 2017

Some questions I have posed on DNNs

Q1. Is there a way to assess the impact of a training case or a mini-batch on the model as a whole, on specific layers, and on specific units?

A1. In the years since I posed this question I have noticed that researchers have in fact looked at it.

- At first glance it seems impossible to assess the impact, since SGD works on mini-batches or on the full data.
- But when we analyse MNIST errors we can take the worst misclassifications and look at the activations they generate at each level; we can see the activations that lead to a misclassification. So it turns out to be possible (a sketch follows below).
- Hinton also described using MCMC for full Bayesian learning, and MacKay put neural networks on more or less solid Bayesian footing. I have not implemented it so I cannot speak to the details, but intuitively, with a posterior it should be possible to condition on a single point.
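As a concrete illustration of the second bullet, here is a minimal sketch (assuming a small Keras dense model on MNIST; the layer sizes and names are placeholders, not anything from the original notes) that picks the most confidently misclassified test cases and dumps the activations they produce at each layer:

```python
import numpy as np
import tensorflow as tf

# Load MNIST and flatten the images.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# A small placeholder classifier.
inputs = tf.keras.Input(shape=(784,))
h1 = tf.keras.layers.Dense(256, activation="relu", name="hidden1")(inputs)
h2 = tf.keras.layers.Dense(128, activation="relu", name="hidden2")(h1)
outputs = tf.keras.layers.Dense(10, activation="softmax", name="probs")(h2)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1, verbose=0)

# Rank test cases by how confidently the model was wrong.
probs = model.predict(x_test, verbose=0)
pred = probs.argmax(axis=1)
wrong = np.where(pred != y_test)[0]
worst = wrong[np.argsort(-probs[wrong, pred[wrong]])][:10]

# A second model exposing every intermediate activation for those cases.
activation_model = tf.keras.Model(inputs, [h1, h2, outputs])
for name, act in zip(["hidden1", "hidden2", "probs"],
                     activation_model.predict(x_test[worst], verbose=0)):
    print(name, act.shape, "mean activation:", act.mean())
```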

Let's imagine we could be advised by a "demon" that can assess the overall contribution to signal or noise of different aspects of our model, according to the following typology:

- First kind – the overall model
- Second kind – each hyper-parameter
- Third kind – each layer
- Fourth kind – each unit (neuron)
- Fifth kind – the individual weights
- Sixth kind – the parts of a training item (pixels/sounds/words) that activate neurons

I'm considering an analytics platform based on collecting data from Jason Yosinski's visualization toolbox.

One way to do this is to have a procedure that can quickly unlearn/forget a training set and then diff the model (which might not be very useful if there are millions of weights); a rough sketch follows below. We may also need a measure of uncertainty, perhaps from non-parametric methods, that describes whether adding more training points is fitting the manifold at genuinely new points (learning new atoms or their relations) or just moving the surface back and forth around a single location and its neighbourhood.
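A rough sketch of the unlearn-and-diff idea (an approximation added here purely for illustration, not an exact unlearning procedure): take the trained model, run a few gradient-ascent steps on one batch to approximately forget it, measure how far the weights moved, and restore the snapshot. The `model` and `loss_fn` arguments are assumed placeholders.

```python
import tensorflow as tf

def unlearn_and_diff(model, loss_fn, x_batch, y_batch, steps=3, lr=1e-3):
    """Approximately forget one batch and report how far the weights moved."""
    before = [w.numpy().copy() for w in model.trainable_variables]
    opt = tf.keras.optimizers.SGD(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = loss_fn(y_batch, model(x_batch, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Gradient *ascent* on this batch pushes the model away from it.
        opt.apply_gradients(zip([-g for g in grads], model.trainable_variables))
    # Diff: how much did "forgetting" this batch move the weights?
    diff = sum(float(tf.norm(w - b)) for w, b in
               zip(model.trainable_variables, before))
    # Restore the snapshot so other batches can be scored from the same model.
    for w, b in zip(model.trainable_variables, before):
        w.assign(b)
    return diff
```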

For example, learn the feature that differentiates birds from bees (generalizing) rather than modelling different points of view for each instance of bird and bee (modelling noise).

For each row in the data set, what do we learn from it?

- more atomic concepts
- relations on atomic concepts
- better smoothing – fitting missing data
- short-term relationships a>b
- long-distance relations a>b>…>c>d

Neural networks love more data – more features, more layers, more observations – but the model can grow very big, and if we use lots of data we will need to train for a very long time.

I would like to explore the following ideas:

running some parametric algorithm on the data to bootstrap the neural net's prior distributions closer to their final values (see the first sketch below)

similar to the above, I'd like to train the NN dynamically and possibly non-parametrically (you can have more CPU, memory, storage, data, etc., but you get penalized for it). The TF graph should be expanded or contracted during training: layer membership increased or decreased, layers added or removed, and hyper-parameters adjusted (see the second sketch below).
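For the first idea, a minimal sketch of my own (PCA is just one choice of parametric bootstrap, and whether it really lands nearer the final weights is exactly the open question):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_initial_weights(x, n_hidden):
    """Return (W, b) for a first Dense layer with n_hidden units."""
    pca = PCA(n_components=n_hidden).fit(x)   # x: (n_samples, n_features)
    W = pca.components_.T                     # (n_features, n_hidden)
    b = -x.mean(axis=0) @ W                   # roughly centre the projections
    return W.astype("float32"), b.astype("float32")

# Hypothetical usage with a Keras model whose first layer is Dense(n_hidden):
# W0, b0 = pca_initial_weights(x_train, n_hidden=128)
# model.get_layer("hidden1").set_weights([W0, b0])
```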
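For the second idea, a minimal numpy sketch of expanding the graph during training by widening one hidden layer; new units get zero outgoing weights, so the network's function is unchanged at the moment of growth (related in spirit to Net2Net's "wider" operator, though much cruder):

```python
import numpy as np

def widen_hidden_layer(w_in, b, w_out, extra_units,
                       rng=np.random.default_rng(0)):
    """w_in: (n_in, n_hidden), b: (n_hidden,), w_out: (n_hidden, n_out)."""
    n_in, _ = w_in.shape
    # New incoming weights start small and random.
    new_in = rng.normal(scale=0.01, size=(n_in, extra_units))
    w_in = np.concatenate([w_in, new_in], axis=1)
    b = np.concatenate([b, np.zeros(extra_units)])
    # New outgoing weights start at zero, so predictions do not jump when the
    # layer is widened; training can then move them away from zero.
    w_out = np.concatenate(
        [w_out, np.zeros((extra_units, w_out.shape[1]))], axis=0)
    return w_in, b, w_out
```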

Bayesian methods allow choices to be made about where in input space new data should be collected in order that it be the most informative (MacKay, 1992c). Such use of the model itself to guide the collection of data during training is known as active learning.

MacKay, D. J. C. (1992c). Information-based objective functions for active data selection. Neural Computation 4 (4), 590-604.
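This is not MacKay's information-based criterion, but a crude and common proxy for the same active-learning idea: query labels where the current model's predictive entropy over a pool of unlabelled candidates is highest. A minimal sketch (the `model` is assumed to output class probabilities):

```python
import numpy as np

def most_informative(model, x_pool, n_query=32):
    """Return indices of the pool points the model is least certain about."""
    probs = model.predict(x_pool, verbose=0)              # (n_pool, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:n_query]                 # ask for these labels next
```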

The relative importance of different inputs can be determined using the Bayesian technique of automatic relevance determination (MacKay, 1994a, 1995b; Neal, 1994), based on the use of a separate regularization coefficient for each input. If a particular coefficient acquires a large value, this indicates that the corresponding input is irrelevant and can be eliminated.

Neal, R. M. (1994). Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto, Canada.

MacKay, D. J. C. (1994a). Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen, and K. Schulten (Eds.), Models of Neural Networks III, Chapter 6. New York: Springer-Verlag.

MacKay, D. J. C. (1995b). Bayesian non-linear modelling for the 1993 energy prediction competition. In G. Heidbreder (Ed.), Maximum Entropy and Bayesian Methods
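For the linear case, the ARD idea of a separate regularization coefficient per input is available off the shelf; a minimal scikit-learn example (this is not the neural-network version MacKay and Neal describe, just an illustration of "large precision means the input is pruned as irrelevant"):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two inputs matter; the other three are irrelevant.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

ard = ARDRegression().fit(X, y)
print(ard.coef_)    # roughly [3, -2, 0, 0, 0]
print(ard.lambda_)  # per-coefficient precisions: large => input deemed irrelevant
```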

Question: In your own words, describe a neural network.

A neural network consists of a graph with the inputs on one side, the outputs on the other, and hidden units between them. The nodes are connected by edges, and the connection strength of the edge joining two units is called its weight. Generally the connections between adjacent layers form a bipartite graph, so the network can be organized into layers.
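A minimal numpy sketch of that description: weights sit on the edges between layers, and each layer applies an affine map followed by a non-linearity (the shapes below are arbitrary placeholders):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """layers: list of (W, b) pairs; x: (batch, n_inputs)."""
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)          # hidden layers
    W, b = layers[-1]
    return h @ W + b                 # linear output layer

rng = np.random.default_rng(0)
layers = [(0.1 * rng.normal(size=(4, 8)), np.zeros(8)),
          (0.1 * rng.normal(size=(8, 3)), np.zeros(3))]
print(forward(rng.normal(size=(2, 4)), layers).shape)   # (2, 3)
```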

The graph can be trained so that the outputs approximate the desired targets on the training data.

- Weights – these live on the edges (not the vertices).
- Actions – the nodes? What are these exactly? Model selection? Chaos?
- What is the importance of relative weights – within the same layer, and between layers? Given answers to the above, should we use them for bootstrapping the weights instead of starting from random weights?

Geometry of second-order methods: won't using mini-batched steps help where the loss surface has complex structure?

What if there are many local minima in our loss surface – how can we learn them all if we are always going downhill? What happens if we have a chaotic surface – I think we can get this with the logistic map – and what about oscillation?

What is the difference between first- and second-order learning methods?
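For reference, the two update rules this question contrasts, with learning rate $\eta$ and Hessian $H = \nabla^2 L(\theta_t)$:

$$
\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) \quad \text{(first order)}
\qquad
\theta_{t+1} = \theta_t - H^{-1}\,\nabla L(\theta_t) \quad \text{(second order, Newton)}
$$

The second-order step rescales the gradient by the local curvature, which is why it behaves differently in narrow ravines and near saddle points, but computing or inverting $H$ is expensive for large networks.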

In reinforcement learning models, the game being played is a Markov decision process.

Do GANs take this concept one step further?

For DNNs, which filters/kernels are initially selected? Are some basis functions going to work better than others?
Also, how about making some basis functions size-independent by adding 3×3, 5×5, 7×7, etc. versions (see the sketch below)?
For video, how about filters that are time-dependent? Also, what about using a non-orthogonal basis?
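A sketch of the size-independent basis idea: run 3×3, 5×5 and 7×7 filters in parallel and concatenate their responses, similar in spirit to an Inception block (the filter counts and input shape here are placeholder assumptions):

```python
import tensorflow as tf

def multi_scale_block(x, filters=16):
    """Apply the same kind of filter at several sizes and stack the results."""
    branches = [
        tf.keras.layers.Conv2D(filters, k, padding="same", activation="relu")(x)
        for k in (3, 5, 7)
    ]
    return tf.keras.layers.Concatenate()(branches)

inputs = tf.keras.Input(shape=(32, 32, 3))
features = multi_scale_block(inputs)
outputs = tf.keras.layers.GlobalAveragePooling2D()(features)
model = tf.keras.Model(inputs, outputs)
model.summary()
```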

Also, what about forcing the system to drop basis functions which are redundant?

For DNNs we usually see square-on-square configurations to reduce and mix the data. What about triangular or hexagonal architectures? How about looking at RGB plus greyscale channels?

Postscript:

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, 2015) – Algorithm 1, the batch-normalizing transform: given the values x_1…x_m of an activation over a mini-batch B and learned parameters γ and β, compute the mini-batch mean μ_B = (1/m)·Σ x_i and variance σ²_B = (1/m)·Σ (x_i − μ_B)², normalize x̂_i = (x_i − μ_B)/√(σ²_B + ε), and output y_i = γ·x̂_i + β.
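A minimal numpy rendering of that transform, training-time statistics only (the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: learned scale and shift per feature."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift
```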

Pix2Pix

Attention – “Attention Is All You Need”

Group Equivariant Convolutional Networks

Steerable CNNs

logarithmic spiral

fractal affine embeddings

simulate stereo vision modes

Visualization

Distill journal

Activation-atlas

  • https://aiyproject.withgoogle.com/open_speech_recording
  • https://github.com/petewarden/open-speech-recording
  • https://distill.pub/2016/augmented-rnns/

Attention and Augmented Recurrent Neural Networks

  • http://colah.github.io/
  • https://github.com/sunnyyeti/Solutions-to-Neural-Network-for-machine-learning-by-Geoffrey-Hinton-on-Coursera
  • https://github.com/BradNeuberg/hinton-coursera/blob/master/assignment3/a3.m
  • https://github.com/Chouffe/hinton-coursera/tree/master/hw3
  • https://github.com/tensorflow/compression/blob/master/examples/bls2017.py
  • https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/

NLP

  • https://arxiv.org/abs/1803.06643
  • https://arxiv.org/abs/1811.00937

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:
@online{bochman2017,
  author = {Bochman, Oren},
  title = {Deep {Neural} {Networks} - {Some} {Questions}},
  date = {2017-12-21},
  url = {https://orenbochman.github.io/notes/dnn/dnn-references/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2017. “Deep Neural Networks - Some Questions.” December 21, 2017. https://orenbochman.github.io/notes/dnn/dnn-references/.