rl
state, action, policy, reward (short term), value / return (long term)
dqn
This post shows that dqn works by training a Q-network, which predicts the maximum expected discounted future return for taking each action a in state s (assuming optimal behaviour afterwards). When deployed, the Q-network gives the optimal action at each step by taking the argmax over actions.
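For concreteness, here is a minimal PyTorch sketch of those two points; the network size, state dimension, and action count are made up for illustration and are not from the post:

```python
import torch
import torch.nn as nn

# Hypothetical Q-network: maps a 4-d state to one Q-value per action (2 actions here).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

state = torch.randn(1, 4)
q_values = q_net(state)            # predicted discounted return for each action
action = q_values.argmax(dim=1)    # at deployment: take the greedy (optimal) action

# During training, the regression target for Q(s, a) comes from the Bellman equation:
#   target = r + gamma * max_a' Q(s', a')   (just r at terminal steps)
```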
policy gradient
This post shows that policy gradients work by leaving the reward out of the differentiable path entirely: during inference the action is simply sampled at random from the policy's output distribution, and the observed return is fed back to the network as a training signal only after it can be determined (the estimator is written out below).
- policy gradients require some positive feedback (at least a few rollouts that happen to get rewarded), which might be hard to achieve.
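Written out, the estimator being described is the standard score-function (REINFORCE) gradient: the action \( a \) is sampled from the policy during the rollout, and the return \( R \), known only afterwards, weights the gradient of its log-probability:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, R \right]
\]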
implications
Notice that the reward does not have to be differentiable, so we could use rl to help train a network that has non-differentiable components, e.g. perform the action randomly and reward the samples with good outcomes (a minimal sketch follows the references below).
- Gradient Estimation Using Stochastic Computation Graphs
- Reinforcement Learning Neural Turing Machines - Revised
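A minimal sketch of that idea (toy problem; names and sizes are mine): the reward below is an arbitrary, non-differentiable Python function, yet the network still trains because gradients only flow through \( \log \pi(a) \), never through the reward.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(8, 10)                        # logits over 10 discrete choices
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

def black_box_reward(action: int) -> float:
    # Non-differentiable: could just as well be a simulator or a human label.
    return 1.0 if action == 3 else 0.0

for step in range(200):
    x = torch.randn(1, 8)
    dist = torch.distributions.Categorical(logits=net(x))
    action = dist.sample()                    # perform the action randomly
    reward = black_box_reward(action.item())  # reward the good outcomes
    loss = -dist.log_prob(action) * reward    # score-function / REINFORCE loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```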
fusion
bn with conv
\[
\begin{gather}
s = \frac{\gamma_{bn}}{\sqrt{\mathrm{var}_{bn} + \epsilon}} \\
W_c \leftarrow W_c \cdot s \\
B_c \leftarrow (B_c - m_{bn}) \cdot s + B_{bn}
\end{gather}
\]
where \( \gamma_{bn}, B_{bn}, m_{bn}, \mathrm{var}_{bn} \) are the BN scale, bias, running mean, and running variance, \( W_c, B_c \) are the conv weight and bias, and \( s \) is applied per output channel.
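As a sanity check, the fold above can be written directly against torch.nn modules. This is a minimal sketch assuming a Conv2d immediately followed by a BatchNorm2d in eval mode; the helper name is mine, not a library API:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    s = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # per-channel scale
    fused.weight.copy_(conv.weight * s.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((b - bn.running_mean) * s + bn.bias)
    return fused

# Quick check: the fused conv should match conv -> bn in eval mode.
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```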
- known bn problems:
  - some channels might have values around zero
  - the gamma term might be negative
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py
https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py#L2041
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Normalization.cpp
imagenet preparation
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
A copy of the script is kept here
Notice that there are some problematic images in ImageNet.
moving average
- simple moving average
- \( m = \frac{1}{n} \sum_{i=0}^{n-1} p_{-i} \)
- cumulative moving average
- \( m_{n+1} = \frac{x_{n+1} + n m_n}{n+1} = m_n + \frac{x_{n+1} - m_n}{n+1} \)
- exponential moving average
- \( m_{n+1} = m_n + \alpha (x_{n+1} - m_n) \)
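A direct transcription of the three recurrences in plain Python (function names are mine):

```python
def simple_moving_average(xs, n):
    # Mean of the last n points.
    return sum(xs[-n:]) / n

def cumulative_moving_average(m, x, n):
    # m is the mean of the first n points; fold in the (n+1)-th point x.
    return m + (x - m) / (n + 1)

def exponential_moving_average(m, x, alpha):
    # Same shape of update, but with a fixed weight alpha.
    return m + alpha * (x - m)

# e.g. BatchNorm-style running statistics use the exponential form
m = 0.0
for x in [1.0, 2.0, 3.0, 4.0]:
    m = exponential_moving_average(m, x, alpha=0.1)
```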
detection
- box size + position vs ground truth box: box loss
- box confidence score vs ground truth iou: obj loss
- classification loss
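Per matched prediction, the three terms roughly look like the sketch below. This is a hedged, YOLO-style illustration only: anchor matching is assumed to have already happened, and real implementations weight the terms and often use IoU-based box losses instead of MSE.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_box, pred_obj, pred_cls, gt_box, gt_iou, gt_cls):
    box_loss = F.mse_loss(pred_box, gt_box)                          # box size + position vs gt box
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj, gt_iou)  # confidence vs gt iou
    cls_loss = F.cross_entropy(pred_cls, gt_cls)                     # classification
    return box_loss + obj_loss + cls_loss

N, C = 8, 20   # 8 matched predictions, 20 classes (made-up sizes)
loss = detection_loss(torch.randn(N, 4), torch.randn(N), torch.randn(N, C),
                      torch.rand(N, 4), torch.rand(N), torch.randint(0, C, (N,)))
```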
nms
- many proposals can come out of the network
- nms: remove proposals that duplicate others (sketched in code below)
  - accept the proposal with the highest confidence
  - a proposal is rejected if it has a high iou with any accepted proposal
  - repeat until no proposals remain
- soft-nms: instead of removing an overlapping box completely, decay its confidence score by the amount of iou overlap
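A numpy sketch of the greedy procedure above (boxes as [x1, y1, x2, y2]; the threshold value is arbitrary):

```python
import numpy as np

def iou(box, boxes):
    # IoU of one box against an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]   # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]                # accept the most confident proposal
        keep.append(int(best))
        rest = order[1:]
        # reject everything with high overlap against the accepted proposal
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # keeps boxes 0 and 2; box 1 is suppressed
```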
todo section
genetic
https://towardsdatascience.com/reinforcement-learning-without-gradients-evolving-agents-using-genetic-algorithms-8685817d84f
parameter initialization
asymmetry in kernel? http://cs231n.github.io/neural-networks-2/#init
batch size vs number of iteration
small batches -> noisier gradient estimates, so smaller steps toward convergence https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network
regularizer
https://stackoverflow.com/questions/46615623/do-we-need-to-add-the-regularization-loss-into-the-tot
adam
https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/