## Initial Programming Exercises On Simple Neural Networks

### Introduction

- Read the following web-page of this web-site to learn the programming concepts needed to work with LOCEN (and, in general, to build and work with computational models of brain and behaviour, or artificial intelligence/robotic systems): web-page on programming concepts to learn
- Depending on the programming language you have to learn (agreed with your LOCEN supervisor), refer to suitable web-sites teaching that language. For the different programming languages, such web-sites are indicated in the web-pages linked from this web-page under "Programming and simulations --> Specific languages".
- Install an IDE (Integrated Development Environment) on your portable computer (or the computer at LOCEN) to program in the selected language; indications on this are given in the web-pages dedicated to the specific languages mentioned in the previous point
- Understand that there are various learning paradigms related to neural networks (see below for more details): this page gives some exercises related to supervised learning and reinforcement learning
- Do the exercises illustrated in detail below, in particular involving:
    - A first simple perceptron neural network able to learn the AND and OR functions through a supervised learning algorithm
    - The exercise of the previous point, but using the linear algebra data structures and operations of the language you use (or of libraries extending the language)
    - Implementation, based on linear algebra, of a multilayer perceptron able to learn the difficult X-OR function
    - Implementation of two leaky neurons
    - Implementation of an echo-state network
    - Implementation of a reinforcement learning model based on Q-Learning
    - Implementation of a reinforcement learning model based on an actor-critic (still under construction)

### Understand that there are various learning paradigms related to neural networks

Understand that there are various learning paradigms (classes of algorithms) through which neural networks can learn. The major ones are these:

- Supervised learning, with two very important sub-classes:
    - Linear regression
    - Classification
- Unsupervised learning
- Reinforcement learning

Search the internet for further information and understand the differences between these different learning paradigms, e.g.:

https://en.wikipedia.org/wiki/Artificial_neural_network#Learning_paradigms

https://en.wikipedia.org/wiki/Machine_learning

Also, ask your LOCEN supervisor for the book by Floreano on neural networks: an old book, but still one of the best for learning the basics of neural networks.

### Firing-rate neural-network models

### Exercise to program a first simple perceptron neural network able to learn the AND and OR functions through a supervised learning algorithm

This section gives a programming exercise on a simple neural network and some indications on how to accomplish it.

The goal of the exercise is to program a one-layer perceptron that is able to learn, in a supervised fashion, to produce the logical AND and OR functions. In particular, the network should respond with the following 'desired output' when given the following 'input' examples:

**OR function**

| Input1 | Input2 | Bias | Output |
|--------|--------|------|--------|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 |

**AND function**

| Input1 | Input2 | Bias | Output |
|--------|--------|------|--------|
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 |

https://en.wikipedia.org/wiki/Perceptron

http://computing.dcu.ie/~humphrys/Notes/Neural/single.neural.html

http://francesco-mannella.github.io/neunet-basics/perceptron-simple-simu...

http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/

There are various versions of such a simple perceptron. You should implement one where the output unit uses a sigmoid transfer function, and the network is considered to output 1 if the activation of the output unit is > 0.9, and 0 if the activation of the output unit is < 0.1. This leads you to create a more general neural network, e.g. useful when you pass to multi-layer perceptrons (exercise below); moreover, it is more correct/general from a machine learning point of view.

Your network should have 3 input units (the first being a 'bias') and one output unit.

The network should learn on-line (i.e., after the presentation of each example of the training set), not in batch (i.e., for all examples together).

Note that neural networks (usually) learn progressively, rather than by 'one shot', by being exposed to the same training examples several times. This affects the main program structure indicated below.

Your program should be structured as follows:

- Creation and initialisation of data structures
- Creation of data structures and data for training the neural network
- Loop over multiple epochs (one epoch is one learning pass of the net over the whole training set):
    - Loop over the examples of the data-set:
        - Network spreading
        - Network learning
        - Collection of data for the following data analysis
- Data analysis: plot of the learning curve of the neural network
- Data analysis: numerical output of network performance (e.g., desired and actual net output, errors, etc.)
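As an illustration, the structure above can be sketched in Python as follows. This is a minimal sketch: all names, the learning rate, and the number of epochs are illustrative choices, not prescribed by the exercise.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(dataset, epochs=10000, learn_rate=1.0, seed=0):
    """Online (per-example) delta-rule training of one sigmoid output unit.

    dataset: list of (inputs, desired), where inputs includes the bias as
    its first element, e.g. ([1, 0, 1], 1) for the OR function.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in dataset[0][0]]
    errors_per_epoch = []  # collected for the later data analysis / plots
    for _ in range(epochs):
        epoch_error = 0.0
        for inputs, desired in dataset:
            # network spreading: weighted sum passed through the sigmoid
            out = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
            # network learning: online delta rule (sigmoid derivative included)
            delta = (desired - out) * out * (1.0 - out)
            weights = [w + learn_rate * delta * x for w, x in zip(weights, inputs)]
            epoch_error += (desired - out) ** 2
        errors_per_epoch.append(epoch_error)
    return weights, errors_per_epoch

def respond(weights, inputs):
    """Discretise the output: > 0.9 counts as 1, < 0.1 as 0, else undecided."""
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    return 1 if out > 0.9 else (0 if out < 0.1 else None)

# bias first, as in the tables above
OR_SET = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
AND_SET = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w_or, errs_or = train_perceptron(OR_SET)
w_and, errs_and = train_perceptron(AND_SET)
```

The errors_per_epoch list is what you would plot as the learning curve in the data-analysis step.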

### Same exercise as the previous one, but using the linear algebra data structures and operations of the chosen language

Do the exercise above but:

- Use the programming language-dependent data structures to implement vectors (e.g., of input and output) and matrices (e.g., for connection weights)
- Use the programming language-dependent linear algebra operators; e.g., to implement the whole network spreading of activation, you should have something like:

vOutp = mConnWeig * vInpu

- Implement a neural network that performs the AND and the OR functions at the same time. The network will hence have 3 input and 2 output units.
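For instance, with NumPy (an assumption: the exercise leaves the language/library open), the whole training loop reduces to a few matrix operations. In this sketch the network has two output units, so it learns the OR and AND functions at the same time:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# each row of vInpu is (bias, input1, input2); one desired-output column per function
vInpu = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
vDesi = np.array([[0, 0], [1, 0], [1, 0], [1, 1]], dtype=float)  # columns: OR, AND

rng = np.random.default_rng(0)
mConnWeig = rng.uniform(-0.5, 0.5, size=(2, 3))  # 2 output units x 3 inputs

learn_rate = 1.0
for epoch in range(10000):
    for x, d in zip(vInpu, vDesi):
        vOutp = sigmoid(mConnWeig @ x)               # whole-network spreading in one product
        delta = (d - vOutp) * vOutp * (1.0 - vOutp)  # online delta rule
        mConnWeig += learn_rate * np.outer(delta, x)

vOutpAll = sigmoid(vInpu @ mConnWeig.T)  # responses of both units to all four patterns
```

Note how the spreading of the whole layer is a single matrix-vector product, and the weight update a single outer product.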

### Exercise to implement, using linear algebra programming, a multilayer perceptron learning the X-OR function

Implement a neural network that performs the X-OR function below. To this purpose, you will need a multilayer perceptron with 3 input units, 1 output unit, and a variable number of hidden units (initially, implement a network with only 1 hidden layer).

**X-OR function**

| Input1 | Input2 | Bias | Output |
|--------|--------|------|--------|
| 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 |

(Note: consistently with the parity-check function below, the output here is 1 when the number of inputs equal to 1 is even; this is the complement of the standard X-OR truth table, and the learning problem is equally non-linearly-separable.)

**Parity check function**

This is a generalisation of the X-OR function, where the output is 0 if the input pattern contains an odd number of digits equal to 1, and 1 if that number is even. So, for example, for 3 inputs you have:

| Input1 | Input2 | Input3 | Bias | Output |
|--------|--------|--------|------|--------|
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 |

The network needed to solve either task has a different architecture with respect to the perceptron. In particular, it has a second layer of two units, called "hidden units", between the input layer and the output layer. So the inputs and the bias feed the hidden units, and then the hidden units (and the bias) feed the output units.

The learning algorithm for the weights between the hidden and the output units is the same as for the simple perceptron. A second algorithm, called the "error backpropagation" algorithm, is needed to modify the weights between the input and the hidden units. Search the internet for the formulas of the error backpropagation algorithm. Make sure that you understand the algorithm very well, as it is not very simple.

Change your program so that after learning it plots:

- The curve of the squared error of the output with respect to the desired output during learning (put "epochs" on the x-axis, i.e. plot the error averaged over all the input patterns)
- The two 2D weight matrices (from input to hidden; from hidden to output) of your multilayer perceptron: this plot represents each connection weight as a coloured square, with the colour representing the size of the weight, from the most negative to the largest positive values.
- Also plot a similar graph with, on one row, the activation of one input pattern, the related activation of the hidden units, and the activation of the output unit(s); and, on the different rows, those activations for all the patterns.

Further change the program so that, at the end of the simulation, all the data needed for the previous graphs are saved into separate .txt files, and the graphs are then plotted by a second program.
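A possible sketch of the multilayer perceptron with error backpropagation, in Python/NumPy, is the following. The hyper-parameters, the seed, and the use of a few extra hidden units (which makes training more robust than the minimal two) are illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# the X-OR table used above: rows are (input1, input2, bias) -> desired output
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
D = np.array([[1], [0], [0], [1]], dtype=float)

rng = np.random.default_rng(1)
nHidd = 4  # two hidden units suffice in principle; a few more help avoid local minima
W_ih = rng.uniform(-1, 1, size=(nHidd, 3))      # input (incl. bias) -> hidden
W_ho = rng.uniform(-1, 1, size=(1, nHidd + 1))  # hidden + bias -> output

learn_rate = 1.0
for epoch in range(20000):
    for x, d in zip(X, D):
        # forward spreading
        h = sigmoid(W_ih @ x)
        hb = np.append(h, 1.0)  # the bias also feeds the output layer
        y = sigmoid(W_ho @ hb)
        # output-layer delta: the same rule as for the simple perceptron
        delta_o = (d - y) * y * (1.0 - y)
        # hidden-layer delta: the output error backpropagated through W_ho
        delta_h = (W_ho[:, :nHidd].T @ delta_o) * h * (1.0 - h)
        W_ho += learn_rate * np.outer(delta_o, hb)
        W_ih += learn_rate * np.outer(delta_h, x)

H = sigmoid(X @ W_ih.T)                                # hidden activations, all patterns
Y = sigmoid(np.hstack([H, np.ones((4, 1))]) @ W_ho.T)  # network outputs, all patterns
```

The matrices W_ih and W_ho, and the activations H and Y, are exactly the quantities to be visualised in the plots listed above.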

### Leaky-neuron neural-network models

### What are leaky neurons and how to simulate them

"Leaky neurons" are different from the "firing rate" neurons used so far. In particular, these are neurons that store information in time, e.g. they have an activation that accumulates, with a fixed input, or progressively fades away, with a zero activation. They are indeed behaving like a ''container of water with a leak when you put water in it": they slowly fill in when you put water, the progressively get empty when you stop.

First, see the part of slides related to the leaky neuron and the way to simulate it based on Euler approximation.

Use those slides to understand how leaky units work:

--> leaky element for computing the activation potential of the neural unit

--> tanh function, truncated to positive values, for computing the activation of the neural units

Also understand from the slides (and the internet) how to implement leaky units (or any other dynamical system) in a computer program through the "Euler approximation". The Euler approximation is necessary because we will now simulate time in more detail, as a continuous flow. A neural network will thus be a dynamical system working in continuous time, and the Euler approximation allows the approximate simulation of its dynamics in the discrete time steps of a computer program.

To suitably take account of continuous time, establish these critical variables in the simulations:

- **sfSimuTimeDura**: duration of the whole simulation, in seconds; e.g., set it to 10 seconds.
- **sfDeltaTime**: duration of the time step used to approximate continuous time; for example, set it to 0.01 (seconds). For a given duration of the simulation, the smaller the time step, the more accurate the approximation of the dynamics of the system. Each cycle of the simulation will now correspond to one sfDeltaTime.
- **sfSimuStepDura**: duration of the simulation in time steps, set equal to: sfSimuTimeDura / sfDeltaTime

Another very important thing to do in the implementations. With continuous time, causation between phenomena can happen only in different time steps (from t-1 to t). So, for any variable that affect another variable, e.g. a neuron that affects another neuron (or itself in the future), you have to do this:

- Store the "old value of the variable" and the "current value of the variable"
- At the beginning of the simulation cycle representing the elapsing of time (in sDeltaTime steps), you have to save the old value of all the variables affecting other variabels the next time step as in this example of three units:

sUnit1Old = sUnit1

sUnit2Old = sUnit2

sUnit3Old = sUnit3 - Then, for any causal dependency between the variables you have to consider the old values of the variables, for example:

sUnit1 = sUnit1Old + (sDeltaTime/Tau) * (-sUnit1Old + sWeig2to1 * sUnit2Old + sWeig3to1 * sUnit3Old)

sUnit2 = sUnit2Old + (sDeltaTime/Tau) * (-sUnit2Old + sWeig1to2 * sUnit1Old + sWeig3to2 * sUnit3Old)

sUnit3 = sUnit3Old + (sDeltaTime/Tau) * (-sUnit3Old + sWeig1to3 * sUnit1Old + sWeig2to3 * sUnit2Old)
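Putting the pieces together, a minimal runnable sketch of this Euler update loop for three mutually connected leaky units might look as follows. The connection weights, the external input, and the simulation duration are arbitrary illustrative values:

```python
# Euler-integration loop for three mutually connected leaky units, following
# the update equations above (weights and input are illustrative values).
sfDeltaTime = 0.01   # integration time step (seconds)
sfTau = 1.0          # time constant shared by the three leaky units
sfSimuTimeDura = 5.0
sfSimuStepDura = int(sfSimuTimeDura / sfDeltaTime)

sUnit1 = sUnit2 = sUnit3 = 0.0
sWeig2to1 = sWeig3to1 = 0.2
sWeig1to2 = sWeig3to2 = 0.2
sWeig1to3 = sWeig2to3 = 0.2
sInpu = 1.0  # constant external input injected into unit 1 (illustrative)

for t in range(sfSimuStepDura):
    # save the old values first: causation acts from t-1 to t
    sUnit1Old, sUnit2Old, sUnit3Old = sUnit1, sUnit2, sUnit3
    sUnit1 = sUnit1Old + (sfDeltaTime / sfTau) * (-sUnit1Old + sInpu + sWeig2to1 * sUnit2Old + sWeig3to1 * sUnit3Old)
    sUnit2 = sUnit2Old + (sfDeltaTime / sfTau) * (-sUnit2Old + sWeig1to2 * sUnit1Old + sWeig3to2 * sUnit3Old)
    sUnit3 = sUnit3Old + (sfDeltaTime / sfTau) * (-sUnit3Old + sWeig1to3 * sUnit1Old + sWeig2to3 * sUnit2Old)
```

With these small mutually excitatory weights the system settles on a stable fixed point; by symmetry, units 2 and 3 follow identical trajectories.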

Based on these general rules, do the exercise below.

### How one leaky neuron responds to an input

Make a simulation that lasts for 20 seconds in total and involves a network formed by 1 leaky unit.

Input I:

- 0 for 1 second
- 1 for 1 second
- 0 for 2 seconds
- 1 for 2 seconds
- 0 for 3 seconds
- 1 for 3 seconds
- 0 for 4 seconds
- 1 for 4 seconds

Simulate how the leaky neuron responds to that input. Plot a graph in time showing the input and the neuron activation. Try to change the tau of the neuron to see what happens. Try to change sfDeltaTime and see what changes.
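A minimal sketch of this simulation in Python (the value tau = 0.5 is an arbitrary choice; the plotting is left to you):

```python
# Simulation of one leaky unit driven by the input schedule above
# (tau = 0.5 is an arbitrary choice; try other values, and other sfDeltaTime).
sfDeltaTime = 0.01
sfTau = 0.5
segments = [(0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3), (0, 4), (1, 4)]

# expand the schedule into one input value per time step (20 s in total)
vInpu = []
for value, dura in segments:
    vInpu += [value] * int(dura / sfDeltaTime)

sActi = 0.0
vHistActi = []  # activation history, for the time plot
for sInpu in vInpu:
    sActi += (sfDeltaTime / sfTau) * (-sActi + sInpu)
    vHistActi.append(sActi)
```

Plotting vHistActi against time (e.g. with matplotlib) shows the activation charging toward the input value and discharging toward zero, more completely the longer the input segment.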

### A neural network of two leaky units representing an onset circuit

Make a simulation that lasts for 25 seconds in total and involves a network formed by 2 leaky units.

The input to the network during the simulation is as follows:

- 0 for 5 seconds
- 1 for 5 seconds
- 2 for 5 seconds
- 1 for 5 seconds
- 0 for 5 seconds

The architecture of the network is as follows:

- An input unit (sUnit1) formed by a first leaky neuron, and an input/output unit, formed by a second leaky neuron (sUnit2).
- Both sUnit1 and sUnit2 take the (external) input
- sUnit1 is connected to sUnit2 with a connection weight of -1
- sUnit2 activation gives the output of the network

Plot both the input and the activation of sUnit1 and sUnit2 in time.

Try to set the tau values of the two units (initially set them to 1) with this objective: sUnit2 shows a behaviour detecting the ''derivative'' of the signal in time.

Set sfDeltaTime initially to 0.01. Then set it to different values: you will see that these values only affect the accuracy of the simulation, not the overall dynamics of the model.
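A sketch of the onset circuit follows. The inhibitory weight of -1 is the one indicated above; the tau values (a slow sUnit1 and a fast sUnit2) are one illustrative setting that produces the derivative-detecting behaviour:

```python
# Onset circuit: two leaky units, both receiving the external input, with the
# slow sUnit1 inhibiting the fast sUnit2 (weight -1). sUnit2 then responds
# mainly to *increases* of the input, i.e. it approximates its derivative.
sfDeltaTime = 0.01
sfTau1, sfTau2 = 1.0, 0.1  # slow inhibitor, fast responder (illustrative taus)
segments = [(0, 5), (1, 5), (2, 5), (1, 5), (0, 5)]

vInpu = []
for value, dura in segments:
    vInpu += [value] * int(dura / sfDeltaTime)

sUnit1 = sUnit2 = 0.0
vHistUnit2 = []
for sInpu in vInpu:
    sUnit1Old, sUnit2Old = sUnit1, sUnit2
    sUnit1 = sUnit1Old + (sfDeltaTime / sfTau1) * (-sUnit1Old + sInpu)
    sUnit2 = sUnit2Old + (sfDeltaTime / sfTau2) * (-sUnit2Old + sInpu - 1.0 * sUnit1Old)
    vHistUnit2.append(sUnit2)
```

In the plot, sUnit2 should show a sharp transient at each upward step of the input and then fade back toward zero while the input stays constant.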

### A model of Amygdala

This is a simulation of the basic functionalities of the amygdala, which is important for affective regulation.

The model can:

- produce an innate behavioural response to a biologically salient event (ingestion of food)

- learn to anticipate such response if a cue is paired with the biologically salient event (conditioning)

- make those two responses dependent on the internal hunger/satiety of the organism.

**Schedule**.

The overall simulation lasts 16 seconds.

Set sfDeltaTime = 0.01.

The neural network undergoes a phase of Pavlovian conditioning (8 seconds) and then a phase of recall showing the effects of the conditioning.

In particular, the network receives three inputs during the 16 seconds of the simulation:

1) Cue (use vector vfInpuCue[] to encode the values of the Cue in the steps of the simulation):

= 1 in the time intervals 2-4 and 10-12

= 0 otherwise

2) Food (use vector vfInpuFood[] to encode the values of the Food in the steps of the simulation)

= 1 in the time intervals 4-6

= 0 otherwise

3) Satiety (use vector vfInpuSati[] to encode the values of Satiation in the steps of the simulation).

You will run two tests where this unit is activated in these ways:

Test 1: = 0 always (indicating the animal is hungry)

Test 2: = 1 always (indicating the animal is satiated)

**The architecture of the network is as follows.**

Units representing different neural populations, encoding the input stimuli and the motor response:

- A unit (vfActi[0]) formed by a leaky unit encoding the stimulus Cue, with tau = 0.3
- A unit (vfActi[1]) formed by a leaky unit encoding the stimulus Food, with tau = 0.1
- A unit (vfActi[2]) formed by a leaky unit encoding the internal states of Satiety, with tau = 0.1
- A unit (vfActi[3]) formed by a leaky unit encoding neuromodulatory neurons of Dopamine, with tau = 0.1
- A unit (vfActi[4]) formed by a leaky unit encoding a Motor action, with tau = 0.

For such units create these data structures:

vfActi[]

vfActiPote[]

vfActiOld[]

vfActiPoteOld[]

Connections from the external and internal stimuli to the model units:

- Cue-->Cue Unit (the input is simply injected into the unit)
- Food-->Food Unit (the input is simply injected into the unit)
- Food-->Dopamine Unit (the input is simply injected into the unit)
- Satiety-->Satiety Unit (the input is simply injected into the unit)

Connections between the model units (connections internal to the model):

- CueUnit-->FoodUnit (call the variable sfWeigCueFood, initialise it to 0): represents the association between the cue and the food, undergoing a Hebbian learning process
- FoodUnit-->MotorUnit (call the variable sfWeigFoodMoto, initialise it to +1): this innate excitatory connection implies that the activation of the Food Unit causes the production of the behaviour
- SatietyUnit-->FoodUnit (call the variable sfWeigSatiFood, initialise it to -1): this innate inhibitory connection implies that the presence of satiety prevents the Food Unit from activating even in the presence of the Food
- DopamineUnit-->affects the Hebbian learning of the plastic connection (call the variable sfWeigDA, initialise it to +1)

**Learning**

The Hebbian learning rule is as follows:

sfWeigCueFood = sfWeigCueFood + sfLearRate * (sfWeigDA * vfActi[3]) * vfActi[0] * max(0.0, (vfActi[1] - sfLearThrePost))

sfWeigCueFood = min(sfWeigMax,sfWeigCueFood)

where:

sfLearRate = 2.0

sfLearThrePost = 0.1

sfWeigMax = 2.0

**Data analysis**.

Collect the history of activation of the units in matrix mfHistActi[][]

Collect the history of the connection weights in matrix mfHistWeig[][]

Plot:

Figure 1: the graph of the three input stimuli in time

Figure 2: the graph of the units Cue, Food, Satiety in time

Figure 3: the graph of Dopamine and Motor action in time

Figure 4: the graph of the plastic connection weight

### A neural network of two leaky units forming a decision-making circuit

The input to the two units is as follows:

- (0, 0) for 5 seconds
- (0, 0) for 5 seconds
- (0, 1) for 5 seconds
- (0, 0) for 5 seconds
- (0.1, 0.9) for 5 seconds
- (0, 0) for 5 seconds
- (0.2, 0.8) for 5 seconds
- (0, 0) for 5 seconds
- (0.3, 0.7) for 5 seconds
- (0, 0) for 5 seconds
- (0.4, 0.6) for 5 seconds
- (0, 0) for 5 seconds

The architecture of the network:

- The network is formed by two leaky input/output units: sUnit1 and sUnit2
- The sUnit1 and sUnit2 take the respective input and each gives an output, so the network returns two output values
- sUnit1 is connected to sUnit2 with a connection weight of -1
- sUnit2 is connected to sUnit1 with a connection weight of -1

Plot both the two input values and the activation of sUnit1 and sUnit2 in time in one graph (so 4 curves).

Try to set to different values the tau of the two units (initially equal to 1), the lateral connection weights between the two units, and the intensity of the leak, with the objective of making sUnit2 tend to go to a high value (this is the "selected option", the "decision") while sUnit1 tends to go to zero.

### Echo State Network

Simulate the units of the ESN models below by using the same technique used to simulate the leaky neurons (see above), i.e. use the variables:

sfSimuTimeDura

sfDeltaTime

**First run of the network**

*Expected results*

- The network should first show a transient dynamics

**Eigenvalues / eigenvectors**

If the recurrent weight matrix W of the ESN is scaled so that the largest absolute value of its eigenvalues (the "spectral radius") is below 1, then:

- with any input, the network will not explode
- if the input units are turned to zero, the network will progressively relax to zero
- for a given sequence of input patterns, the network will go through a specific transient dynamics: this is the most important property, by which the network tends to respond to the current input, and also to the recent ones (with a decaying memory), in a specific deterministic way
- moreover, if the input changes a bit, the dynamics will change a bit, rather than a lot as in chaotic systems

To scale W in this way, use a library function that computes its *eigenvalues* (you do not need to know how this is done or what eigenvalues precisely are: roughly speaking, they indicate the size of some important dimensions, the eigenvectors, of the space covered by the column vectors of W). The set of eigenvalues is called the *spectrum* of the matrix. The function usually also returns a matrix of the eigenvectors, but you do not need it.

**Echo State with Perceptron as Output**

*Expected results*

You should see that:

- The ESN units dynamically respond to the input in time, as usual
- The perceptron reads out the inner activations of the network and reproduces the desired sinusoid output during the initial transient-dynamic phase
- If you try another desired output (e.g., the sum of different sinusoids with different phases and periods), the system should be quite good at reproducing it during the initial transient phase

### Reinforcement-learning neural-network models

### Q-Learning

Reinforcement learning is an important family of algorithms that allows the simulation of intelligent agents (e.g., simulated animals or intelligent robots) that learn to accomplish a task by trial-and-error while interacting with the world. Q-Learning is one of the two most important and basic reinforcement learning algorithms, the other being the actor-critic model. Q-Learning is based on a set of states s (s ∈ S, S = all states), a set of actions a (a ∈ A, A = all actions), and a set of "valuations" q (q ∈ Q, Q = all valuations), one for each state-action pair leading to a new state.

There are two different ways to implement Q-learning:

*Tabular Q-Learning*: This is based on a 2-entry table, not on a neural-network implementation, and so it is not useful as a biological model. However, we suggest taking a look at this page related to tabular Q-Learning, because it helps you start to understand which kinds of problems Q-Learning can solve and how:

http://mnemstudio.org/path-finding-q-learning-tutorial.htm

*Neural-network Q-Learning*: This is the target of this tutorial. The model is based on a simple "perceptron" neural-network architecture, i.e., a 2-layer feedforward linear network formed by an input and an output layer. The system takes as input the current state of the world and gives as output the q values, one for each action: the actions are encoded in a "localistic" way (not a "distributed" way), i.e. the system encodes a finite number of discrete actions where 1 action corresponds to 1 output unit.

We now explain how to implement neural-network Q-learning in detail.

**Simulation of the “world”**

At first, you have to program the simulation of the world with which the agent interacts. Here we assume a simple "grid world", i.e. a world formed by "positions", or "cells", where the agent can be at each step and through which the agent can navigate with its actions. You can consider a 1D arena (or "world": a number of contiguous cells, or positions, forming a 1D sequence; e.g., imagine a corridor divided into contiguous squares), with two possible actions for moving leftward or rightward along the positions, or a 2D world (a number of contiguous positions organised as a 2D grid), with four possible actions to move north, east, west, or south. In the case of the 1D world, you represent the state of the world, i.e. the position where the agent is, with a vector of 0s and a 1 corresponding to the position occupied by the agent. In the case of the 2D world, you represent the state of the world with a matrix of 0s and a 1 corresponding to the position occupied by the agent. We recommend implementing both the 1D and the 2D world in parallel, and parameterising your program to use one or the other at will, because this will be very instructive and will help you debug the program.

Choose a starting position in which you will position the agent at the beginning of each trial: this is represented as the initial “1” in your vector or matrix of zeros representing the state of the world.

Moreover, decide where to put the reward (r) in your world model. To this purpose, use another vector/matrix with the same dimensions as the world vector/matrix, with a 1 in the element corresponding to the rewarded position.

Importantly, some of the actions that the agent might choose would make it hit the "walls" of the arena where it navigates. For example, if the agent is at the top-left corner of the 2D arena and selects the action "west" or the action "north", these actions lead it against, respectively, the west and north walls of the arena, and so it remains where it is. In general, the agent is always free to select any action of its repertoire (at each time step, all actions of the repertoire are available for selection), but if it selects an action that leads it against a wall, the effect of executing the action is simply that the agent remains in the position from which it selected and performed the action. This situation is similar to that of a robot that tries to move against a wall: the effect is that it remains where it is.

**Activation of the neural network**

The agent is based on a simple neural network formed by 2 layers of units:

- The input layer is activated with the state of the world representing the position of the agent (as many input units as possible positions in the world). In the case of the 2D world, you should "unroll" the world matrix into the vector representing the input units.
- The connection weights represent the Q values. At the beginning, you can set them to zero or to small random values.
- The output units represent the q values of the possible actions (2 or 4 actions).

The output values do not pass through a step or sigmoid transfer function, as is usual in neural networks, but through a SoftMax function (also known as the "normalised exponential function"). The SoftMax function normalises the output values so that they sum up to 1 while keeping differences between them. This allows considering the outputs of the SoftMax as probability values, used to select the action to execute. This means that, with zero initial connection weights, at the beginning you will have two outputs with value 0.5 each; if you have four outputs, you will have four values of 0.25. The SoftMax function SM[.], allowing the computation of the probabilities p[a] of selecting the different actions a, is defined as follows:

For q_a ∈ Q and a ∈ A: p(a) = SM(q_a) = (e^(q_a/T)) / (Σ_j e^(q_j/T))

T is called the "temperature parameter", and e is the Euler number. For high temperatures, all output values tend to assume the same probability, whereas for low temperatures the values tend to show great differences (at the extreme: one value is close to 1 and all the other values are close to 0). So the temperature allows setting how much the probabilities of selecting the different actions differ between them for given Q values.
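A minimal SoftMax implementation (the max-subtraction is a standard numerical-stability trick that does not change the result):

```python
import math

def softmax(q_values, temperature=1.0):
    """Action-selection probabilities from q values, as defined above."""
    # subtracting the maximum is a standard trick against overflow and does
    # not change the result (the constant factor cancels in the ratio)
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

With zero q values, two actions get probability 0.5 each and four actions 0.25 each, as stated above; lowering the temperature makes the distribution sharper.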

We now explain the core equation of the algorithm, allowing the update of the network connection weights corresponding to the q values, and then the whole program of the simulation. As we are facing reinforcement learning, the neural network cannot be trained in a supervised learning fashion as we lack a vector of desired output values. Thus, we use a different learning rule to update step by step the Q values, depending on the quality of the chosen action in terms of reward. At each time step t, the learning rule updates only the connection weights that correspond to the action a[t-1] selected and executed at time t-1 (this action corresponded to the q value: q[s[t-1], a[t-1]]).

The update of the connection weight w[j,i], connecting input unit s_i to the output unit q_j corresponding to the selected action a[t-1], is as follows:

Delta w_ji = η * ( (r[t] + γ * Max_q[q[s[t], a[t]]]) – q[s[t-1], a[t-1]] ) * s_i[t-1]

Let us examine this equation closely:

- As mentioned, w[j,i] connects input unit s_i to the output unit of the network corresponding to the q-value q_j corresponding to the previously selected a[t-1]
- η is the learning rate, usually set equal to 0.1
- r[t] is the reward achieved in the current state s[t]
- γ is a "discount factor", usually set to 0.9: its function is to decrease the value attributed now to future q values and rewards
- Max_q[q[s[t], a[t]]] represents the maximum value within the set of 2 or 4 q values corresponding to the 2 or 4 actions
- s[t-1] is the previous state vector, and s_i[t-1] is one element of it

**Q-Learning algorithm**

This is how the simulation will broadly work. In the first trial, you will set your agent in the world and it will explore the environment randomly until it finds the reward. When it reaches the reward, the q value of the last action performed increases according to the learning equation. The first trial then ends and the second one starts, with the agent set again in the initial position. The agent will move again and will learn nothing until it reaches the position to which it assigned a value in the first trial (the one visited just before the reward). Thus, the q-values/weights matrix is progressively updated moving away from the rewarded position. Therefore, the larger the world, the larger the number of trials and steps the agent will need to learn what to do at each position.

Here is the pseudo-code of how the algorithm works:

- Initialise the relevant variables
- Loop over trials:
    - Reset relevant variables
    - Set the world to the initial state (agent at the starting position)
    - Loop over the steps of the trial:
        - Store the values (which become the "values at t-1") of critical variables before updating them: input, q values, action
        - Update the state of the world based on the action a[t-1]
        - Compute the reward r[t] returned to the agent by the world
        - Activate the input units of the network based on s[t] (this represents the activation of the sensors)
        - Implement the spreading of the net to compute the q[t] values
        - Compute the probabilities of the actions, p[a], with the SoftMax function
        - Select action a based on p[a]: this action will be executed at the next time step
        - If this is neither the first nor the last step of the trial: update the network weights based on the equation above
        - If this is the last step of the trial (r=1): update the network weights based on the equation above, but considering zero instead of Max_q, as "there is no future" in the trial

As you can see, regarding the learning rule:

- In the first step of each trial, the agent does not learn: the reason is that the algorithm needs two succeeding time steps to learn, as its learning is based on the consequences (in terms of reward) that the action selected at t-1 causes at t
- In the last step of the trial, the algorithm does not consider the maximum current q, corresponding to an estimate of the future rewards, but rather a zero value as the trial ends and so “there is no future” for the agent, hence no future rewards

To see if your program works well, try this simulation:

- Set a 1D world of nine positions (two actions)
- Set the reward for the last position on the right
- Set the start in the first position on the left
- Set a maximum of 200 Trials and a maximum of 50 steps per trial
- Set LearningRate = 0.1 and DiscountFactor = 0.9
- Set a random selection of actions (so not following the SoftMax probabilities, but rather a 0.5 probability for each action)
- Let it run!
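Because the input is "localistic" (one-hot), each connection weight is directly one q value, so the network weight update reduces to an update of one q value per step. The whole test simulation above can thus be sketched compactly as follows (variable names are illustrative):

```python
import random

# 1D grid world of nine positions; reward in the rightmost one. Because the
# input is one-hot, each connection weight is directly one q value, so the
# network update reduces to a per-position table update.
rng = random.Random(0)
nPosi, nActi = 9, 2                 # actions: 0 = East (right), 1 = West (left)
learn_rate, discount = 0.1, 0.9
mWeig = [[0.0] * nPosi for _ in range(nActi)]  # q values / connection weights

for trial in range(200):
    pos = 0                          # start at the leftmost position
    prev = None                      # (state, action) at t-1
    for step in range(50):
        if prev is not None:
            prev_pos, prev_act = prev
            if pos == nPosi - 1:     # reward reached: last step, r=1 and "no future"
                mWeig[prev_act][prev_pos] += learn_rate * (1.0 - mWeig[prev_act][prev_pos])
                break
            # ordinary step: r=0 plus the discounted best q of the new state
            target = discount * max(mWeig[a][pos] for a in range(nActi))
            mWeig[prev_act][prev_pos] += learn_rate * (target - mWeig[prev_act][prev_pos])
        act = rng.randrange(nActi)   # purely random exploration, p = 0.5 each action
        prev = (pos, act)
        new_pos = pos + 1 if act == 0 else pos - 1
        pos = min(nPosi - 1, max(0, new_pos))  # hitting a wall leaves the agent in place
```

After the run, the East row of mWeig should grow toward 1 as the position approaches the reward, approximating the powers of γ.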

If the program is correct, you should obtain q-values (i.e., network connection weights) similar to the following ones (horizontal 0-8: positions; vertical 0-1: the two actions, East and West). Note how the values of the West action stay close to 0, while the East values, read from the rewarded position backwards, approach γ^0 = 0.9^0 = 1, γ^1 = 0.9^1 = 0.9, γ^2 = 0.9^2 = 0.81, etc.: