short course on neural networks (tsinghua)

[TOC]

The Taylor expansion of an analytic function around $x_0$ is given by $f ( x ) = sum _ { n = 0 } ^ { infty } frac { 1 } { n ! } f ^ { ( n ) } left( x _ { 0 } right) left( x - x _ { 0 } right) ^ { n }$, where $f ^ { ( n ) } left( x _ { 0 } right)$ denotes the $n ^ { t h }$ derivative of $f$ at $x_0$.

Taylor-expand the function $f(x)=(1-x)^{-m}$ up to second order ($n=2$) around $x_0=0$, where $m in ,mathbb{R}$ is:

$1 + m cdot x + frac { m cdot ( m + 1 ) } { 2 } cdot x ^ { 2 }$

$1 - m cdot x + frac { m ^ { 2 } } { 2 } cdot x ^ { 2 }$

$x - 1 - frac { 1 } { 2 } cdot ( x - 1 ) ^ { 2 }$

0

Ref:https://math.stackexchange.com/questions/1246136/maclaurin-series-expansion-for-e-1-x2

Linear Algebra

Matrix determinant:

$A = left[ begin{array} { c c } { 0 } & { i } \ { - i } & { 0 } end{array} right]$

$det(A)=-1$ (note: $i^2=-1$)

Matrix exponential

$e^{itheta A}=$

$left[ begin{array} { c c } { cos ( theta ) } & { - sin ( theta ) } \ { sin ( theta ) } & { cos ( theta ) } end{array} right]$

Solution: https://math.stackexchange.com/questions/773359/eulers-identity-in-matrix-form

$e ^ { mathbb { I } theta } =R ( theta ) = left( begin{array} { c c } { cos theta } & { - sin theta } \ { sin theta } & { cos theta } end{array} right) (mathbb { I } = left( begin{array} { c c } { 0 } & { - 1 } \ { 1 } & { 0 } end{array} right))$

Homework

3D Boolean functions

In mathematics and logic, a (finitary) Boolean function (or switching function) is a function) of the form $f: mathbf{B}^krightarrow mathbf{B}$, where B= {0, 1} is a Boolean domain and k is a non-negative integer called the arity of the function. In the case where k = 0, the “function” is essentially a constant element of B.

How many 3-dimensional Boolean functions are there?

For $k$ dimensional inputs, there are $2^{ 2^ k }$ Boolean functions. The number of inputs are $2^k$, for each input, the output can be 0/1.

Count the number of symmetries of 3-dimensional Boolean functions that map 4 of the possible input patterns to 1.

ANS: 6. All other patterns can be mapped by rotation or reflection. The reflection may not only reflect by the facet of axes, but by the facet of diagonals.

Symmetry: can be mapped to each other by rotation and/or reflection.

How many linearly separable 3-dimensional Boolean functions are there?

If the pattern contains facet that have crossed diagonals by the back and white balls, it is not linear separable. For example, the left down pattern in the figure above is not linearly separable.

By checking whether is pattern is like ‘0101xxxx’ or 15 others (12 from each facet of the cube, 4 from the diagonals), we get the answer is 104.

The code:

def ():
    
    pattern_list = []
    for i in range(2 ** 8):
        pattern = '0' * (8 - len(bin(i)[2:])) + bin(i)[2:]
        pattern_list.append(pattern)
    return pattern_list

def pattern_like(pattern,feature):
    # feature pattern is the pattern like '0101xxxx'
    for i, f in enumerate(feature):
        if f != 'x':
            if pattern[i] != feature[i]:
                return False
    return True

pattern_list = get_all_patterns()
XOR_features = ['0101xxxx', '1010xxxx', '01xx10xx', '10xx01xx', 'x10xx01x', 'x01xx10x', 'xxxx0101', 'xxxx1010', '1xx00xx1', '0xx11xx0', 'xx01xx10', 'xx10xx01', '1x0x0x1x', '0x1x1x0x', 'x1x0x0x1', 'x0x1x1x0', '10xxxx10', '01xxxx01', 'xx1010xx', 'xx0101xx', '1xx0x01x', '0xx1x10x', 'x10x0xx1', 'x01x1xx0']
xor_count = 0 # number of not linear separable boolean functions
for pattern in pattern_list:
    flag = False
    for feature in XOR_features:
        if pattern_like(pattern, feature):
            xor_count += 1
            break
print(len(pattern_list) - xor_count)

Linear separability of 4-dimensional Boolean functions

Describe: 4 inputs, 1 output.

$O ^ { ( mu ) } = tanh left[ frac { 1 } { 2 } left( sum _ { i = 1 } ^ { 4 } w _ { i } x _ { i } ^ { ( mu ) } - theta right) right]$ $H = frac { 1 } { 2 } sum _ { mu } left( t ^ { ( mu ) } - O ^ { ( mu ) } right) ^ { 2 }$ $g(x)=tanh(x/2)$ $delta w_i=-etafrac{partial H}{partial w_i}=eta sum_{mu}(t^{(mu)}-O^{(mu)})x_i^{(mu)}g'(sum_{j}(w_j x_j^{(mu)}-theta))$ $deltatheta=-eta sum_{mu}(t^{(mu)}-O^{(mu)})g'(-theta)$

#coding=utf8
import numpy as np

np.random.seed(0) # make the random number repeatable
class Node:
    '''
    The Neuron Node Class.
    '''
    def __init__(self):
        self.input_node_list = [] # Inputs of the node
        self.output_vector = [] # Output vector
        self.threshold = 0.0
        self.weight = 1.0
        self.input_vector = [] # Input vector
        self.activation_func = None # Activation function

    def calc_output(self):
        self.output_vector = []
        for mu in range(len(self.input_node_list[0].input_vector)):
            # for each pattern
            O = 0.0
            for input_node in self.input_node_list:
                O += input_node.input_vector[mu] * input_node.weight
            self.output_vector.append(self.activation_func(O - self.threshold))
        return self.output_vector

class InputNode(Node):
    '''
    Input Node Class
    '''
    def __init__(self):
        Node.__init__(self)
        self.activation_func = lambda x:x

class OutputNode(Node):
    '''
    Output Node Class
    '''
    def __init__(self):
        Node.__init__(self)
        self.activation_func = lambda x: np.tanh(x/2.0)

class Perceptron:
    def __init__(self):
        self.eta = 0.02
        self.input_list = []
        self.output = OutputNode()
        self.target = []
        self.loss_function = 0.0

    def gradient_descent(self):
        delta_w_vector = []
        for input_node_i in self.input_list:
            delta_wi = 0.0
            for mu in range(len(self.input_list[0].input_vector)):
                b_mu = 0.0
                for input_node_j in self.input_list:
                    b_mu += input_node_j.weight * input_node_j.input_vector[mu]
                b_mu -= self.output.threshold
                delta_wi += self.eta * (self.target[mu] - self.output.output_vector[mu]) * input_node_i.input_vector[mu] * (1 - np.tanh(b_mu / 2.0) ** 2) * 0.5
            delta_w_vector.append(delta_wi)
            # update weight
        for i, input_node_i in enumerate(self.input_list):
            input_node_i.weight += delta_w_vector[i]
        # Then update threshold at the output
        delta_theta = 0.0
        for mu in range(len(self.input_list[0].input_vector)):
            delta_theta += self.eta * (self.target[mu] - self.output.output_vector[mu]) * (1 - np.tanh(self.output.threshold / 2) ** 2) / 2.0
        self.output.threshold -= delta_theta

    def calc_loss_function(self):
        H = 0.0
        for mu in range(len(self.input_list[0].input_vector)):
            H += (self.target[mu] - self.output.output_vector[mu]) ** 2 / 2
        self.loss_function = H

perceptron = Perceptron()

for i in range(4):
    input_node = Node()
    input_node.weight = np.random.uniform(-0.2, 0.2, 1)
    perceptron.input_list.append(input_node)
    perceptron.output.input_node_list.append(input_node)

perceptron.output.threshold = np.random.uniform(-1, 1, 1)

with open('data/input_data_numeric.csv', 'r') as data:
    for line in data:
        line = line.split(',')
        for i in range(4):
            perceptron.input_list[i].input_vector.append(int(line[i+1]))

A = [1, 1, 1, 1, 1, 1, -1, 1, 1, 1, -1, -1, 1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, -1, 1, 1, -1, 1, -1, 1, 1, -1, -1]
C = [1, -1, 1, -1, 1, 1, -1, 1, -1, -1, 1, 1, 1, 1, 1, -1]
D = [-1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1]
E =  [1, 1, -1, 1, -1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
F =  [-1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 1]
perceptron.target = C
epoch = 10**5
for i in range(epoch):
    perceptron.output.calc_output()
    perceptron.gradient_descent()
    perceptron.calc_loss_function()
    if i % int(epoch/10**2) == 0:
        print(i, perceptron.loss_function)

After $10^5$ updates, example results:

Target A: loss function: 4.000; output: [0.99933648,0.99933648,0.99933648,4.27588495e-05,0.99933648,0.99933648,4.27588495e-05,0.99933648,4.27588495e-05,0.99933648,4.27588495e-05,4.27588495e-05,0.99933648,4.27588495e-05,4.27588495e-05,4.27588495e-05]

Target D: loss function: 0.07137127; output:

[-0.9983045970305993,0.8467004934374702,-0.9999998804920586,-0.9999825727485746,-0.8473914883414136,-0.9983045970305993,-0.7797995354533719,0.99829629025566,-0.999999998772594,-0.999988364023615,-0.9983045970305993,-0.9999825727485746,-0.8473914883414136,0.8467004934374702,-0.9999998804920586,-0.9983045970305993]

True-False Questions

Two hidden layers are necessary to approximate any real valued-function with N inputs and one output in terms of a perceptron.

False. In general, for $N$ inputs, two hidden layers are sufficient, with $2N$ units in the first layer, and one unit per basis function in second layer. Yet it is not always necessary to use two layers for real-valued functions. For continuous functions, one hidden layer is sufficient. This is ensured by the universal approximation theore. This theorem says any continuous function can be approximated to arbitrary accuracy by a network with a single hidden layer, for sufficiently many neurons in the hidden layer.

All Boolean functions with N inputs can be represented by a perceptron with one hidden layer.

True

How the weights are initialised for backpropagation does not matter (provided that they are not all zero) because one usually iterates for many epochs so that the initial conditions are forgotten.

False. If the weights are large, then the gradient goes to zero.

Nesterov’s accelerated gradient scheme is often more efficient than the simple momentum scheme because the weighting factor of the momentum term increases as a function of iteration number.

False. Nesterov’s accelerated-gradient method is more efficient than the simple
momentum method, because the accelerated-gradient method evaluates the gradient at an extrapolated point, not at the initial point.

$L_1$-regularisation reduces small weights more than $L_2$-regularisation.

True.

The number of $N$-dimensional Boolean functions is $2^N$

False. The answer should be $2^{2^N}$

尝试：单选第四个错误、单选第5个错误、45错误、345错误、35错误、

Two-layer perceptron

First hidden layer:

$V _ { j } ^ { ( 1 , mu ) } = tanh left( - theta _ { j } ^ { ( 1 ) } + sum _ { k = 1 } ^ { 2 } w _ { j k } ^ { ( 1 ) } x _ { k } ^ { ( mu ) } right)$

Second hidden layer:

$V _ { i } ^ { ( 2 , mu ) } = tanh left( - theta _ { i } ^ { ( 2 ) } + sum _ { j = 1 } ^ { M _ { 1 } } w _ { i j } ^ { ( 2 ) } V _ { j } ^ { ( 1 , mu ) } right)$

Output:

$O ^ { ( mu ) } = tanh left( - theta ^ { ( 3 ) } + sum _ { i = 1 } ^ { M _ { 2 } } w _ { i } ^ { ( 3 ) } V _ { i } ^ { ( 2 , mu ) } right)$

Classification error:

$C = frac { 1 } { 2 p _ { mathrm { val } } } sum _ { mu = 1 } ^ { p _ { mathrm { val } } } left| operatorname { sgn } left[ O ^ { ( mu ) } right] - t ^ { ( mu ) } right|$

short course on neural networks (tsinghua)

OpenTA

Quiz

Analysis: Taylor expansion

Linear Algebra

Homework

3D Boolean functions

Linear separability of 4-dimensional Boolean functions

True-False Questions

Two-layer perceptron

Certificate

近期文章

近期评论

标签

热门

文章归档

分类目录

功能