Short course on neural networks (Tsinghua)

[TOC]

OpenTA

Quiz

Analysis: Taylor expansion

The Taylor expansion of an analytic function around $x_0$ is given by $f(x) = \sum_{n=0}^{\infty} \frac{1}{n!} f^{(n)}(x_0)\,(x - x_0)^n$, where $f^{(n)}(x_0)$ denotes the $n^{\text{th}}$ derivative of $f$ at $x_0$.
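As a concrete example (a standard one, not taken from the quiz): expanding $e^x$ around $x_0 = 0$, every derivative equals $e^{0} = 1$, so

$$
e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!} = 1 + x + \frac{x^{2}}{2} + \frac{x^{3}}{6} + \cdots
$$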

Linear Algebra

  • Matrix determinant:

$\det(A) = -1$ (note: $i^2 = -1$)

  • Matrix exponential

$e^{i\theta A} = \;?$

Solution: https://math.stackexchange.com/questions/773359/eulers-identity-in-matrix-form
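The linked answer is Euler's identity in matrix form. Assuming the quiz matrix satisfies $A^2 = I$ (the matrix itself is not reproduced here), the power series splits into even and odd terms:

$$
e^{i\theta A} = \sum_{n=0}^{\infty} \frac{(i\theta A)^{n}}{n!}
= I\sum_{k=0}^{\infty} \frac{(-1)^{k}\theta^{2k}}{(2k)!}
\;+\; iA\sum_{k=0}^{\infty} \frac{(-1)^{k}\theta^{2k+1}}{(2k+1)!}
= \cos(\theta)\, I + i\sin(\theta)\, A .
$$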

Homework

3D Boolean functions

In mathematics and logic, a (finitary) Boolean function (or switching function) is a function of the form $f: \mathbf{B}^k \rightarrow \mathbf{B}$, where $\mathbf{B} = \{0, 1\}$ is a Boolean domain and $k$ is a non-negative integer called the arity of the function. In the case where $k = 0$, the "function" is essentially a constant element of $\mathbf{B}$.

  • How many 3-dimensional Boolean functions are there?

For $k$-dimensional inputs there are $2^{2^k}$ Boolean functions: there are $2^k$ possible input patterns, and for each input pattern the output can independently be 0 or 1. A short brute-force check is sketched below.
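A minimal sanity check (my addition, not part of the homework): enumerate every truth table for small $k$ and compare the count with $2^{2^k}$.

```python
from itertools import product

# Count k-dimensional Boolean functions by enumerating all truth tables explicitly.
for k in range(4):
    inputs = list(product((0, 1), repeat=k))            # the 2^k input patterns
    tables = list(product((0, 1), repeat=len(inputs)))  # one output bit per pattern
    print(k, len(tables), 2 ** (2 ** k))                # the two counts agree
```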

  • Count the number of symmetries of 3-dimensional Boolean functions that map 4 of the possible input patterns to 1.

    ANS: 6. Every pattern that maps exactly 4 of the 8 input patterns to 1 can be obtained from one of these 6 representatives by a rotation and/or reflection of the cube. Note that the reflections include not only those through planes parallel to the cube's faces, but also those through its diagonal planes.

Symmetry: two Boolean functions are considered equivalent if one can be mapped to the other by a rotation and/or reflection of the cube. A brute-force check of the count is sketched below.
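A brute-force verification (my addition, assuming equivalence under the full symmetry group of the cube, i.e. all 48 coordinate permutations and mirrorings): enumerate the 70 ways of mapping 4 of the 8 input patterns to 1 and count orbits.

```python
from itertools import permutations, product, combinations

def vertex_symmetries():
    # The 48 symmetries of the cube (rotations and reflections), written as
    # permutations of the 8 vertices; vertex v has coordinates (bit 0, bit 1, bit 2).
    syms = []
    for perm in permutations(range(3)):           # permute the three axes
        for flips in product((0, 1), repeat=3):   # optionally mirror each axis
            mapping = []
            for v in range(8):
                bits = [(v >> i) & 1 for i in range(3)]
                new_bits = [bits[perm[i]] ^ flips[i] for i in range(3)]
                mapping.append(sum(b << i for i, b in enumerate(new_bits)))
            syms.append(mapping)
    return syms

syms = vertex_symmetries()
four_one_patterns = combinations(range(8), 4)     # 70 ways to map 4 inputs to 1
orbit_representatives = set()
for pattern in four_one_patterns:
    # canonical form: the lexicographically smallest image over all symmetries
    canonical = min(tuple(sorted(s[v] for v in pattern)) for s in syms)
    orbit_representatives.add(canonical)
print(len(orbit_representatives))                 # prints 6
```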

  • How many linearly separable 3-dimensional Boolean functions are there?

If the pattern contains a facet whose corners carry an XOR colouring (black and white balls alternating around the facet), it is not linearly separable. For example, the lower-left pattern in the figure above is not linearly separable.

By checking whether the pattern matches '0101xxxx' or one of the other 23 XOR templates (two per facet of the cube and two per rectangle through its diagonals, 24 templates in total), we get the answer 104.

The code:

```python
def get_all_patterns():
    # Enumerate all 2^8 possible output patterns of a 3-dimensional Boolean
    # function, each written as an 8-character bit string.
    pattern_list = []
    for i in range(2 ** 8):
        pattern = bin(i)[2:].zfill(8)
        pattern_list.append(pattern)
    return pattern_list

def pattern_like(pattern, feature):
    # A feature is a template like '0101xxxx', where 'x' means "don't care".
    for i, f in enumerate(feature):
        if f != 'x' and pattern[i] != f:
            return False
    return True

pattern_list = get_all_patterns()
XOR_features = ['0101xxxx', '1010xxxx', '01xx10xx', '10xx01xx', 'x10xx01x', 'x01xx10x',
                'xxxx0101', 'xxxx1010', '1xx00xx1', '0xx11xx0', 'xx01xx10', 'xx10xx01',
                '1x0x0x1x', '0x1x1x0x', 'x1x0x0x1', 'x0x1x1x0', '10xxxx10', '01xxxx01',
                'xx1010xx', 'xx0101xx', '1xx0x01x', '0xx1x10x', 'x10x0xx1', 'x01x1xx0']
xor_count = 0  # number of non-linearly-separable Boolean functions
for pattern in pattern_list:
    for feature in XOR_features:
        if pattern_like(pattern, feature):
            xor_count += 1
            break
print(len(pattern_list) - xor_count)  # number of linearly separable functions: 104
```

Linear separability of 4-dimensional Boolean functions

Description: 4 inputs, 1 output. Train a simple perceptron by gradient descent on each target function and check whether the loss can be driven to zero, i.e. whether the target is linearly separable.

```python
# coding=utf8
import numpy as np

np.random.seed(0)  # make the random numbers repeatable


class Node:
    '''
    The Neuron Node Class.
    '''
    def __init__(self):
        self.input_node_list = []    # Inputs of the node
        self.output_vector = []      # Output vector
        self.threshold = 0.0
        self.weight = 1.0
        self.input_vector = []       # Input vector
        self.activation_func = None  # Activation function

    def calc_output(self):
        self.output_vector = []
        for mu in range(len(self.input_node_list[0].input_vector)):
            # for each pattern
            O = 0.0
            for input_node in self.input_node_list:
                O += input_node.input_vector[mu] * input_node.weight
            self.output_vector.append(self.activation_func(O - self.threshold))
        return self.output_vector


class InputNode(Node):
    '''
    Input Node Class
    '''
    def __init__(self):
        Node.__init__(self)
        self.activation_func = lambda x: x


class OutputNode(Node):
    '''
    Output Node Class
    '''
    def __init__(self):
        Node.__init__(self)
        self.activation_func = lambda x: np.tanh(x / 2.0)


class Perceptron:
    def __init__(self):
        self.eta = 0.02              # learning rate
        self.input_list = []
        self.output = OutputNode()
        self.target = []
        self.loss_function = 0.0

    def gradient_descent(self):
        delta_w_vector = []
        for input_node_i in self.input_list:
            delta_wi = 0.0
            for mu in range(len(self.input_list[0].input_vector)):
                b_mu = 0.0           # local field of the output neuron for pattern mu
                for input_node_j in self.input_list:
                    b_mu += input_node_j.weight * input_node_j.input_vector[mu]
                b_mu -= self.output.threshold
                delta_wi += self.eta * (self.target[mu] - self.output.output_vector[mu]) * input_node_i.input_vector[mu] * (1 - np.tanh(b_mu / 2.0) ** 2) * 0.5
            delta_w_vector.append(delta_wi)
        # update weights
        for i, input_node_i in enumerate(self.input_list):
            input_node_i.weight += delta_w_vector[i]
        # then update the threshold at the output
        delta_theta = 0.0
        for mu in range(len(self.input_list[0].input_vector)):
            delta_theta += self.eta * (self.target[mu] - self.output.output_vector[mu]) * (1 - np.tanh(self.output.threshold / 2) ** 2) / 2.0
        self.output.threshold -= delta_theta

    def calc_loss_function(self):
        H = 0.0
        for mu in range(len(self.input_list[0].input_vector)):
            H += (self.target[mu] - self.output.output_vector[mu]) ** 2 / 2
        self.loss_function = H


perceptron = Perceptron()

# four input nodes with small random initial weights
for i in range(4):
    input_node = Node()
    input_node.weight = np.random.uniform(-0.2, 0.2, 1)
    perceptron.input_list.append(input_node)
    perceptron.output.input_node_list.append(input_node)

perceptron.output.threshold = np.random.uniform(-1, 1, 1)

# read the 16 four-bit input patterns
with open('data/input_data_numeric.csv', 'r') as data:
    for line in data:
        line = line.split(',')
        for i in range(4):
            perceptron.input_list[i].input_vector.append(int(line[i + 1]))

# candidate target functions
A = [1, 1, 1, 1, 1, 1, -1, 1, 1, 1, -1, -1, 1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, -1, 1, 1, -1, 1, -1, 1, 1, -1, -1]
C = [1, -1, 1, -1, 1, 1, -1, 1, -1, -1, 1, 1, 1, 1, 1, -1]
D = [-1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1]
E = [1, 1, -1, 1, -1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
F = [-1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 1]
perceptron.target = C

epoch = 10 ** 5
for i in range(epoch):
    perceptron.output.calc_output()
    perceptron.gradient_descent()
    perceptron.calc_loss_function()
    if i % int(epoch / 10 ** 2) == 0:
        print(i, perceptron.loss_function)
```

After $10^5$ updates, example results:

Target A: loss function: 4.000; output: [0.99933648,0.99933648,0.99933648,4.27588495e-05,0.99933648,0.99933648,4.27588495e-05,0.99933648,4.27588495e-05,0.99933648,4.27588495e-05,4.27588495e-05,0.99933648,4.27588495e-05,4.27588495e-05,4.27588495e-05]

Target D: loss function: 0.07137127; output:

[-0.9983045970305993,0.8467004934374702,-0.9999998804920586,-0.9999825727485746,-0.8473914883414136,-0.9983045970305993,-0.7797995354533719,0.99829629025566,-0.999999998772594,-0.999988364023615,-0.9983045970305993,-0.9999825727485746,-0.8473914883414136,0.8467004934374702,-0.9999998804920586,-0.9983045970305993]
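To turn these continuous outputs into ±1 classifications, one can threshold at zero and count disagreements with the target. A minimal helper (my addition, not part of the assignment; `output` and `target` are placeholder names) might look like this:

```python
import numpy as np

def classification_errors(output, target):
    # Threshold the network outputs at zero to get +/-1 predictions,
    # then count how many patterns disagree with the target function.
    predictions = np.where(np.asarray(output, dtype=float) >= 0, 1, -1)
    return int(np.sum(predictions != np.asarray(target)))

# e.g. classification_errors(perceptron.output.output_vector, perceptron.target)
```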

True-False Questions

  • Two hidden layers are necessary to approximate any real-valued function with N inputs and one output in terms of a perceptron.

False. In general, for $N$ inputs two hidden layers are sufficient, with $2N$ units in the first layer and one unit per basis function in the second layer. But two layers are not always necessary for real-valued functions: for continuous functions, one hidden layer suffices. This is guaranteed by the universal approximation theorem, which says that any continuous function can be approximated to arbitrary accuracy by a network with a single hidden layer, provided the hidden layer contains sufficiently many neurons. A rough numerical illustration is sketched below.
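A rough illustration of the single-hidden-layer claim (my addition, not the construction used in the proof): fix random tanh hidden units and fit only the output weights to a continuous 1-D target by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]    # input samples
y = np.sin(3 * x).ravel()                       # a continuous target function
M = 50                                          # number of hidden neurons
W = rng.normal(scale=3.0, size=(1, M))          # fixed random hidden weights
theta = rng.uniform(-3.0, 3.0, size=M)          # fixed random hidden thresholds
H = np.tanh(x @ W - theta)                      # hidden-layer activations
v, *_ = np.linalg.lstsq(H, y, rcond=None)       # output weights by least squares
print(np.max(np.abs(H @ v - y)))                # error shrinks as M grows
```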

  • All Boolean functions with N inputs can be represented by a perceptron with one hidden layer.

True. Any Boolean function can be written in, e.g., disjunctive normal form, which maps onto a single hidden layer with one hidden neuron per input pattern that the function maps to 1.

  • How the weights are initialised for backpropagation does not matter (provided that they are not all zero) because one usually iterates for many epochs so that the initial conditions are forgotten.

False. If the initial weights are too large, the activation functions saturate and the gradients go to zero, so learning can stall no matter how many epochs are run; the initialisation therefore does matter.

  • Nesterov’s accelerated gradient scheme is often more efficient than the simple momentum scheme because the weighting factor of the momentum term increases as a function of iteration number.

False. Nesterov's accelerated-gradient method is indeed often more efficient than the simple momentum method, but the reason is that it evaluates the gradient at an extrapolated point rather than at the current point, not that the weighting factor of the momentum term increases with the iteration number. The two update rules are compared below.
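For reference (standard formulas, with $\gamma$ the momentum parameter, $\eta$ the learning rate and $H$ the loss):

$$
\text{momentum:}\quad \delta w_{t+1} = \gamma\,\delta w_{t} - \eta\,\nabla H(w_{t}), \qquad w_{t+1} = w_{t} + \delta w_{t+1},
$$

$$
\text{Nesterov:}\quad \delta w_{t+1} = \gamma\,\delta w_{t} - \eta\,\nabla H\!\left(w_{t} + \gamma\,\delta w_{t}\right), \qquad w_{t+1} = w_{t} + \delta w_{t+1}.
$$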

  • $L_1$-regularisation reduces small weights more than $L_2$-regularisation.

True. $L_1$-regularisation subtracts a fixed amount from the magnitude of every weight regardless of its size, driving small weights all the way to zero, whereas $L_2$-regularisation shrinks each weight in proportion to its magnitude, so small weights are reduced less; see the comparison below.
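In terms of the weight-decay part of the gradient-descent update (standard result, with learning rate $\eta$ and regularisation strength $\lambda$):

$$
L_2:\; w \leftarrow w - \eta\lambda\, w = (1-\eta\lambda)\,w, \qquad
L_1:\; w \leftarrow w - \eta\lambda\,\operatorname{sgn}(w),
$$

so for $|w| \ll 1$ the $L_1$ step removes a fixed amount $\eta\lambda$, while the $L_2$ step removes only $\eta\lambda\,|w|$.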

  • The number of $N$-dimensional Boolean functions is $2^N$

False. The answer should be $2^{2^N}$.

Attempted answer combinations: only the 4th statement false; only the 5th false; 4 and 5 false; 3, 4 and 5 false; 3 and 5 false.

Two-layer perceptron

First hidden layer:

Second hidden layer:

Output:

Classification error:

Certificate