Hello guys,
at the moment I am getting deeper into the topic of Reinforcement Learning.
Since I already know the basics of AI programming, I started with Q-Learning and neural nets in AutoIt (just because I wanted to show what's possible with AutoIt). That did not work, so I decided to write a really basic implementation of Q-Learning with a state/action table in AutoIt; that did not work either.
I thought it might be some AutoIt-specific problem, so I decided to implement an even more basic example in Java and tried again with core code similar to the AutoIt version.
In Java I wrote a little helper class called QLearner, which is not much more than a Hashtable with some extended features. QLearner by itself works like a charm (no problem with adding/getting values or finding the biggest value for a given state/action pair).
Then I wrote a basic class for the TicTacToe board, which works too.
After that I started implementing the classes AiPlayer and Main.
The Main class isn't very complex, so there should be no problems there either.
AiPlayer itself works great too, except for the learning: it always takes the action with the biggest Q-value, and so on.
But the learning itself seems to fail.
For learning I used this simple approach:
Remember a maximum of X moves.
After each move, update the Q-value of the move before the current move with reward = 0.
When the game ends, do the same one last time, but with reward = 100 (if won), reward = 0 (if draw), or reward = -100 (if lost).
That is the theory so far, and it seemed logical and legitimate to me. But something (maybe a code bug or a brain bug) makes it fail.
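In code, the update I am aiming for is the standard tabular Q-learning rule: Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a)), with alpha and gamma being the learning rate and discount factor (0.3 and 0.8 in my code below). Just as a minimal, self-contained sketch with made-up names (this is not my actual QLearner class):
Code:
import java.util.HashMap;
import java.util.Map;
// Sketch only: the table maps "state|action" keys to Q-values, all names here are made up.
public class QUpdateSketch
{
	static float alpha = 0.3f; // learning rate
	static float gamma = 0.8f; // discount factor
	static float getQ(Map<String, Float> table, String key)
	{
		return table.getOrDefault(key, 0.0f); // unseen state/action pairs start at 0
	}
	// One Q-learning step: Q(s,a) <- Q(s,a) + alpha * (reward + gamma * bestNextQ - Q(s,a))
	static void update(Map<String, Float> table, String stateAction, float reward, float bestNextQ)
	{
		float oldQ = getQ(table, stateAction);
		float target = reward + gamma * bestNextQ;
		table.put(stateAction, oldQ + alpha * (target - oldQ));
	}
	public static void main(String[] args)
	{
		Map<String, Float> table = new HashMap<String, Float>();
		// example: a winning final move gets terminal reward 100 and no next state (bestNextQ = 0)
		update(table, "1,2,0,0,1,0,0,2,0|4", 100.0f, 0.0f);
		System.out.println(table);
	}
}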
The two important classes are the following (learning is done in the train() method of the AiPlayer class):
Main.java
Code:
public class Main
{
// Create a TicTacToe-Board with 3x3 fields
static TicTacToe board=new TicTacToe(3,3);
// First player uses the learning AI
static AiPlayer player1=new AiPlayer(board,1,1);
// Second player uses the random Moves AI
static AiPlayer player2=new AiPlayer(board,2,2);
static int player_turn=(int)((float)Math.random()*2+1);
static int last_games_end=0;
public static void main(String[] args)
{
// This is the main loop: just let player 1 and 2 play endlessly against each other
while(true)
{
if(player_turn==1)
{
player1.playTraining();
player_turn=2;
}
else
{
player2.play();
player_turn=1;
}
last_games_end=board.getWinner();
if(last_games_end>0)
{
if(last_games_end==1)
{
player1.setWin();
player2.setLoose();
}
else if (last_games_end==2)
{
player2.setWin();
player1.setLoose();
}
else
{
player1.setDraw();
player2.setDraw();
}
board.resetBoard();
player_turn=(int)(Math.random()*2+1);
if (player1.games%100000==0)
{
System.out.println("Training Games played: "+player1.games);
// do some benchmarking to check if our AI is getting better
benchmark();
}
}
}
}
public static void benchmark()
{
int player1Wins=0;
int player1Looses=0;
int player1Draws=0;
int games;
int action=0;
for(games=0;games<500;)
{
if(player_turn==1)
{
action=player1.play();
player_turn=2;
}
else
{
action=player2.play();
player_turn=1;
}
last_games_end=board.getWinner();
if(last_games_end>0)
{
if(last_games_end==1)
{
player1Wins++;
}
else if (last_games_end==2)
{
player1Looses++;
}
else
{
player1Draws++;
}
games++;
board.resetBoard();
player_turn=(int)((float)Math.random()*2+1);
}
}
System.out.println("Wins: "+player1Wins+" = "+((float)player1Wins/games*100)+"% Winrate");
System.out.println("Looses: "+player1Looses+" = "+((float)player1Looses/games*100)+"% Looserate");
System.out.println("Draws: "+player1Draws+" = "+((float)player1Draws/games*100)+"% Drawrate");
}
}
AiPlayer.java
Code:
import java.util.Arrays;
import java.util.LinkedList;
public class AiPlayer
{
public int wins=0;
public int looses=0;
public int games=0;
private TicTacToe board;
private int player;
private int kiType;
private float reward=0;
private LinkedList<int[]> oldStates = new LinkedList<int[]>();
private LinkedList<Integer> oldActions = new LinkedList<Integer>();
private LinkedList<Float> oldQValues = new LinkedList<Float>();
private int maxHistoryEntries=2;
private float alpha=0.3f;
private float gamma=0.8f;
private float randFactor=0.2f;
public QLearner reinforcementLearner=new QLearner();
AiPlayer(TicTacToe board,int player,int kiType)
{
this.board=board;
this.player=player;
this.kiType=kiType;
}
public int play()
{
int indexToMark;
// if the AI we should use is of type 1, get best moves based on QLearner
if (kiType==1)
{
indexToMark=getBestMove();
}
else // otherwise do some random moves
{
indexToMark=getRandomMove();
}
board.setField(indexToMark,player);
return indexToMark;
}
public int getBestMove()
{
// get best index to mark for current state
return reinforcementLearner.getBestQAction(board.cells,board.getFreeCells());
}
public void addNewHistory(int action,float qValue)
{
// Remove last item if we reached maximum entry size
if(oldActions.size()>=maxHistoryEntries)
{
oldActions.removeLast();
oldQValues.removeLast();
oldStates.removeLast();
}
// add new item at the front so our latest move will always be at position 0
oldActions.addFirst(action);
oldQValues.addFirst(qValue);
oldStates.addFirst(Arrays.copyOf(board.cells,board.FIELDS));
}
public void train()
{
float deltaQ;
float q;
float qValueBefore;
float tdFactor=alpha;
// at position 0 there is always our latest action/state/qvalue (--> 0 is the last move of this player)
int latest=0;
// update the state/action pairs before the latest one, because the latest move has no next state yet
for(int i=1;i<oldActions.size();i++)
{
// get q Value of state and action i (which is a state before current state)
qValueBefore=oldQValues.get(i);
deltaQ=reward+gamma*oldQValues.get(latest)-qValueBefore;
q=qValueBefore+tdFactor*deltaQ;
// set new Q-Value for state/action pair i
reinforcementLearner.setQValue(oldStates.get(i), oldActions.get(i), q);
// Only needed if there are more than 2 entries
// spread new q value of latest state over all entries
tdFactor*=alpha;
}
}
public void playTraining()
{
int indexToMark=0;
// add some randomness to explore more states
if(Math.random()<randFactor)
{
indexToMark=getRandomMove();
}
else
{
indexToMark=getBestMove();
}
// add our move to the history
addNewHistory(indexToMark,reinforcementLearner.getQValue(board.cells, indexToMark));
// train with reward = 0 (because we did not finish game yet)
train();
// mark field on TicTacToe board
board.setField(indexToMark,player);
}
public int getRandomMove()
{
int action=0;
while (true)
{
action=(int)((float)Math.random()*board.FIELDS);
if(action>=9)
action=8;
if (board.isCellEmpty(action))
{
return action;
}
}
}
public void reset()
{
oldActions.clear();
oldQValues.clear();
oldStates.clear();
reward=0;
}
public void setWin()
{
wins++;
games++;
reward=100;
// train with reward != 0 (game ended now)
train();
reset();
}
public void setLoose()
{
games++;
looses++;
// train with reward != 0 (game ended now)
reward=-100;
train();
reset();
}
public void setDraw()
{
games++;
// train with the final reward (game ended now)
reward=0;
train();
reset();
}
}
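Since I did not post QLearner (or the TicTacToe class), here is roughly what QLearner does. This is only a simplified sketch with the signatures taken from how it is called above; in particular the List<Integer> type for the free cells is just an assumption about what board.getFreeCells() returns:
Code:
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
// Simplified sketch of QLearner (not the exact class): a Hashtable keyed by state+action.
public class QLearner
{
	private Hashtable<String, Float> table = new Hashtable<String, Float>();
	private String key(int[] state, int action)
	{
		return Arrays.toString(state) + "|" + action;
	}
	public float getQValue(int[] state, int action)
	{
		Float q = table.get(key(state, action));
		return (q == null) ? 0.0f : q; // unseen state/action pairs start at 0
	}
	public void setQValue(int[] state, int action, float q)
	{
		table.put(key(state, action), q);
	}
	// Return the free cell with the highest Q-value for the given state.
	// The List<Integer> parameter type is an assumption about getFreeCells().
	public int getBestQAction(int[] state, List<Integer> freeCells)
	{
		int bestAction = freeCells.get(0);
		float bestQ = getQValue(state, bestAction);
		for (int action : freeCells)
		{
			float q = getQValue(state, action);
			if (q > bestQ)
			{
				bestQ = q;
				bestAction = action;
			}
		}
		return bestAction;
	}
}
Because unseen pairs return 0, every legal move looks equally good at the start, and ties simply go to the first free cell.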
I really hope you can help me. I am sure something goes wrong in the Q-value calculation or in saving the old states (even though saving the old states by itself seems to work).
The learning AI does not seem to learn at all (maybe 2-4%, but not much more).
I let it play millions of games, but the learning AI still loses far too often (my output):
Main> Training Games played: 100000
Main> Wins: 273 = 54.6% Winrate
Main> Looses: 177 = 35.4% Looserate
Main> Draws: 50 = 10.0% Drawrate
Main> Training Games played: 200000
Main> Wins: 310 = 62.0% Winrate
Main> Looses: 151 = 30.199999% Looserate
Main> Draws: 39 = 7.8% Drawrate
Main> Training Games played: 300000
Main> Wins: 283 = 56.6% Winrate
Main> Looses: 178 = 35.600002% Looserate
Main> Draws: 39 = 7.8% Drawrate
...
Main> Training Games played: 32200000
Main> Wins: 314 = 62.800003% Winrate
Main> Looses: 149 = 29.800001% Looserate
Main> Draws: 37 = 7.4% Drawrate
Main> Training Games played: 32300000
Main> Wins: 278 = 55.6% Winrate
Main> Looses: 184 = 36.8% Looserate
Main> Draws: 38 = 7.6% Drawrate
Main> Training Games played: 32400000
Main> Wins: 301 = 60.2% Winrate
Main> Looses: 161 = 32.2% Looserate
Main> Draws: 38 = 7.6% Drawrate
Main> Training Games played: 32500000
Main> Wins: 303 = 60.600002% Winrate
Main> Looses: 143 = 28.600002% Looserate
Main> Draws: 54 = 10.8% Drawrate
Main> Training Games played: 32600000
Main> Wins: 301 = 60.2% Winrate
Main> Looses: 155 = 31.0% Looserate
Main> Draws: 44 = 8.8% Drawrate
Main> Training Games played: 32700000
Main> Wins: 309 = 61.799995% Winrate
Main> Looses: 158 = 31.600002% Looserate
Main> Draws: 33 = 6.6% Drawrate
Thanks in advance. :D
Edit:
I really found the problem; actually there were three problems:
1. Finding three in a row takes a very long time to learn if you do not do a simple check yourself.
2. My Q-value calculation was wrong.
3. The lose/draw/win rewards weren't ideal.
Fixes:
1. Before each move, also check whether you can win the game by placing the next marker, and if so, play that move.
2. Fixed in the code below (see the note and the new train()).
3. Lose reward = 0, draw reward = 0.5, win reward = 1.
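To make fix 2 a bit more concrete: before, the target in train() used the Q-value of the move I actually played last (which can be a random exploration move), while plain Q-learning wants the Q-value of the best move in that state. So I now also store the best Q-value per state in the history and build the target from it:
Code:
// before (wrong): the target uses the Q-value of the action actually played
deltaQ=reward+gamma*oldQValues.get(latest)-qValueBefore;
// after (fixed): the target uses the Q-value of the best action in the latest state
deltaQ=reward+gamma*oldBestQValues.get(latest)-qValueBefore;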
New code (only AiPlayer.java changed):
Code:
import java.util.Arrays;
import java.util.LinkedList;
public class AiPlayer
{
public int wins=0;
public int looses=0;
public int games=0;
private TicTacToe board;
private int player;
public int kiType;
private float reward=0;
private LinkedList<int[]> oldStates = new LinkedList<int[]>();
private LinkedList<Integer> oldActions = new LinkedList<Integer>();
private LinkedList<Float> oldSelectedQValues = new LinkedList<Float>();
private LinkedList<Float> oldBestQValues=new LinkedList<Float>();
private int maxHistoryEntries=5;
private float alpha=0.2f;
private float gamma=0.8f;
private float randFactor=0.1f;
public QLearner reinforcementLearner=new QLearner();
AiPlayer(TicTacToe board,int player,int kiType)
{
this.board=board;
this.player=player;
this.kiType=kiType;
}
public int play()
{
int indexToMark;
// if the AI we should use is of type 1, get best moves based on QLearner
if (kiType==1)
{
indexToMark=getBestMove();
}
else // otherwise do some random moves
{
indexToMark=getRandomMove();
}
board.setField(indexToMark,player);
return indexToMark;
}
public int getBestMove()
{
// get best index to mark for current state
for(int i=0;i<board.FIELDS;i++)
{
if(board.isCellEmpty(i))
{
board.cells[i]=player;
if(board.getWinner()==player)
{
board.setField(i,0);
return i;
}
board.cells[i]=0;
}
}
return reinforcementLearner.getBestQAction(board.cells,board.getFreeCells());
}
public void addNewHistory(int action,float qSelectedValue,float qBestValue)
{
// Remove last item if we reached maximum entry size
if(oldActions.size()>=maxHistoryEntries)
{
oldActions.removeLast();
oldBestQValues.removeLast();
oldSelectedQValues.removeLast();
oldStates.removeLast();
}
// add new item at the front so our latest move will always be at position 0
oldActions.addFirst(action);
oldSelectedQValues.addFirst(qSelectedValue);
oldBestQValues.addFirst(qBestValue);
oldStates.addFirst(Arrays.copyOf(board.cells,board.FIELDS));
}
public void train()
{
float deltaQ;
float q;
float qValueBefore;
float tdFactor=alpha;
// at position 0 there is always our latest action/state/qvalue (--> 0 is the last move of this player)
int latest=0;
// update the state/action pairs before the latest one, because the latest move has no next state yet
for(int i=1;i<oldActions.size();i++)
{
// get q Value of state and action i (which is a state before current state)
qValueBefore=oldSelectedQValues.get(i);
deltaQ=reward+gamma*oldBestQValues.get(latest)-qValueBefore;
q=qValueBefore+tdFactor*deltaQ;
// set new Q-Value for state/action pair i
reinforcementLearner.setQValue(oldStates.get(i), oldActions.get(i), q);
//System.out.println(Arrays.toString(oldStates.get(latest)));
//System.out.println(Arrays.toString(oldStates.get(i)));
// Only needed if there are more than 2 entries
// spread new q value of latest state over all entries
tdFactor*=alpha;
}
}
public void playTraining()
{
int indexToMark=0;
int bestIndex=0;
// add some randomness to explore more states
if(Math.random()<randFactor)
{
indexToMark=getRandomMove();
bestIndex=getBestMove();
}
else
{
indexToMark=getBestMove();
bestIndex=indexToMark;
}
// add our move to the history
addNewHistory(indexToMark,reinforcementLearner.getQValue(board.cells, indexToMark),reinforcementLearner.getQValue(board.cells, bestIndex));
// train with reward = 0 (because we did not finish game yet)
train();
// mark field on TicTacToe board
board.setField(indexToMark,player);
}
public int getRandomMove()
{
int action=0;
while (true)
{
action=(int)((float)Math.random()*board.FIELDS);
if(action>=9)
action=8;
if (board.isCellEmpty(action))
{
return action;
}
}
}
public void reset()
{
oldActions.clear();
oldBestQValues.clear();
oldSelectedQValues.clear();
oldStates.clear();
reward=0;
}
public void setWin()
{
wins++;
games++;
reward=1.0f;
// train with reward != 0 (game ended now)
train();
reset();
}
public void setLoose()
{
games++;
looses++;
// train with reward != 0 (game ended now)
reward=-1.0f;
train();
reset();
}
public void setDraw()
{
games++;
// train with the final reward (game ended now)
reward=0.0f;
train();
reset();
}
}
Edit 2:
After some more testing I found that the previous reward function (win = 1, draw = 0, lose = -1) works just as well, maybe even a little better. That's why I changed the posted code back to it.