Variable-length genomes in Genetic Algorithms

Note: this was originally written June/July 2022, but its release was delayed for personal reasons. PS: my brother did complete his master's!

prelude

I'm going to describe a way to modify a fixed-length-genome genetic algorithm to support variable-length genomes. Along the way, we are going to use problems that have nice, simple 2d representations, and you might be wondering what motivates these seemingly arbitrary problems. So in this first section let me give you the back story.

My brother is completing his master's in Architecture. He is investigating how the lack of explicit goals in city planning leads to gentrification, and therefore to negative consequences for communities, especially people of lesser means. As architecture becomes increasingly sophisticated, computerized tools now complete whole facets of proposals. Often artificial intelligence is the key to these tools. Genetic Algorithms may be used to flesh out plausible designs.

These tools are designed around a goal, and that goal can be pursued in ways that lack effective countermeasures to preserve the community. If this sounds like a parallel to bias in models and a topic for AI ethics, I think you are probably right. But that is not my concern here; simply exploring this space with my brother is. The objective function of such a tool, for example a genetic algorithm that finds the strongest supports for a structure deployed in a certain space, is often well-defined in the literature. Yet measures of good community-preserving design are only elaborated analytically; they are generally not defined numerically. Part of his research is to justify and then implement measures that can provide an initial basis for preserving these concerns in the fitness of a genetic algorithm, or in the objective function of AI tools more generally. He could benefit from a demonstration of how such rules can be incorporated into a fitness function while preserving the original objective.

City plans often repurpose an area, or even build out new development, but only need to incorporate certain intentional features. The rest are buildings that must fit requirements such as zoning but are otherwise inconsequential to the proposal. These usable spaces, and in general the design of these generic buildings in a proposal, can be denoted as floorage. The positioning of service buildings within this spatial allotment can matter for gentrification (walkability, for example, is a number that can be derived from such plans), so how you compose and distribute the spaces between these buildings matters. Yet, among the more common tools, we find little deliberation in the approach. For example, a popular plugin for SketchUp just uses randomization to create floorage. The architect generates floorage for a space through the tool, iterates until a few viable selections are drawn, then replaces some buildings with intentional ones, and may or may not have an eye on limiting costs to the community during this selection process. We propose a genetic algorithm to do these steps.

towards a problem definition

A target space can be transformed to fit into a rectangle, which itself can be subdivided. Within the divisions of space we can place a building. We incorporate some rules in the placement: for example, a building placed within a division of our rectangle that falls outside of the target space is illegal (it extends beyond the bounds of the target space), so we can give such a design zero points. Other rules can enforce objectives, such as a minimum distance, as well as rules orchestrating cardinal directionality. We use a distribution method to assign roles matching the design criteria to various buildings in our space. Finally, we will use a fitness function that incorporates community impact.

In this way, we can train a genetic algorithm to explore a 2d space for the allotment of buildings by generating cartesian coordinates. We can use something called Voronoi partitioning to convert those coordinates, treated as origin points, into the divisions of our space.

Figure 0 - an example Voronoi diagram

The height allowance and other designations can be assigned deterministically from the civil code. Or we can let the genetic algorithm explore this by assigning additional parameters to each coordinate pair. For a first step, using simple rules to qualify the 2d plans and how each fits in our target space looks like a good idea.
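
Before building anything, here is a minimal sketch of what that partitioning looks like with scipy (the points are made up, purely for illustration):

from scipy.spatial import Voronoi

# four origin points inside a 10x10 square
points = [(1, 1), (8, 2), (5, 9), (2, 7)]
vor = Voronoi(points)

print(vor.vertices)  # the Voronoi vertices where region boundaries meet
print(vor.regions)   # per-region vertex indices; -1 marks a vertex at infinity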

framework

A great approach, probably the correct approach for the general case in Python, is to use pyGAD, a nice, smallish library for genetic algorithms (we will look at it along the way). It is designed to be extensible, but the documentation and the way it is meant to be extended shape the types of extensions you can make with it. Still, from what personal experience I have, I would definitely recommend pyGAD for the general case.

We want to experiment, so let's start with a basic framework to test different approaches. We'll use an agent that we will define to be the middleman between our tests and the implementation.

import pprint
pp = pprint.PrettyPrinter(indent=4)

print("training model with settings")
pp.pprint(defaults)

agent = Agent()
agent.prepare(defaults)
agent.define()


agent.train()

render_results(agent)

I'm just going to use a pattern that vaguely reminds me of PyTorch. To be clear, I am definitely not intending to implement it.

The defaults variable is just a configuration object. It will include the fitness function we want to use, and other hyperparameters an agent we dream up might need.

So, now we can import implementation_x as Agent and build it. Let's make a pattern to match that:

from abc import ABC, abstractmethod


class Torch(ABC):
    @abstractmethod
    def prepare(self, options={}):
        pass

    @abstractmethod
    def define(self, options={}):
        pass

    @abstractmethod
    def train(self):
        pass

    @abstractmethod
    def evaluate(self):
        pass

    @abstractmethod
    def predict(self):
        pass

We can create a simple trial-and-error agent based on that:

import math
import random
import numpy as np
from scipy.spatial import distance

from models.torch import Torch


class TrialAndError(Torch):
    def __init__(self):
        super().__init__()
        self.sol=[]
        self.sol_fitness=-math.inf

    def prepare(self, options={}):
        self.options = options
        self.fitness = self.options['fitness_func']

    def define(self, options={}):
        self.options.update(options)

    def get_zones(self, zones=1):
        multiplier = self.options['init_range_high'] - \
            self.options['init_range_low']
        return [(random.random() * multiplier, random.random() * multiplier) \
            for _ in range(zones)]

    def train(self):
        for i in range(self.options['num_generations']):
            y = self.get_zones(self.options['num_genes'])
            y_fit = self.fitness(y)
            if y_fit > self.sol_fitness:
                self.sol = y
                self.sol_fitness = y_fit

    def evaluate(self):
        return self.fitness(self.sol)

    def predict(self):
        return self.sol

default_fitness_func = lambda x: sum([distance.euclidean(a, (0,0)) for a in x])

defaults = {
    "num_generations": 10000,
    "num_genes": 10,
    "init_range_low":0.0,
    "init_range_high":10.0,
    "fitness_func": default_fitness_func
}

Note that the fitness function is defined here so it can be passed in. For a first run, we will see how far we can push the points from the 0,0 point; we do this by summing their distances from it.

We had to commit to a size for the bounding rectangle. We're using a square between 0,0 and 10,10, but this could have been any size and, in fact, any polygon.

The get_zones method randomly generates points, and these act as the coordinates reflecting origin points in the Voronoi diagram that we will build. The bounds of each cell in the Voronoi diagram will describe the subzone within which we hope to one day place a building. But for now, we just want a test harness for divisions of 2d surfaces.

Now we need to think about what we will be rendering when we call render_results. I'm going to want to see the score and the specific subzones that it predicted. Since this is 2d data, it would also be nice to render the division of space proposed.

def render_score(score):
  print('Score:', score)

def render_zones(centers):
  """visualization of the zones"""
  pass

def render_plan(centers):
  """specific details/coordinates"""
  pass

def render_results(agent):
  render_score(agent.evaluate())
  zone_centers = agent.predict()
  render_zones(zone_centers)
  render_plan(zone_centers)

We can render the zone visually by plotting it.

from scipy.spatial import Voronoi, voronoi_plot_2d

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

def render_zones(points):
  vor = Voronoi(points)
  _fig = voronoi_plot_2d(vor)
  plt.show()

And... it looks great!

Figure 1 - first visualization

Let's see if it actually improves as we increase generations; we didn't do such a good job for a trial-and-error implementation if it won't even improve over time!

Figure 2 - trial-and-error implementation improves with number of generations

Although the overall score is only a little better, visually this appeals to me. The improvement came in the form of bunching the coordinates in the opposite corner from the 0,0 point.

Surely a genetic algorithm can improve on these results, but one thing I notice is that the number of zones stays fixed. We would get better results if we also searched over the number of points, rather than always just mutating their positioning.

We can change the number of genes in our trial-and-error implementation with a pretty small change, allowing it to randomize the number of zones using the value provided as a base. We just need to make the ceiling explicit in our implementation in some way. This is really a simple commit:

    def get_zones(self, min_zones=1):
        multiplier = self.options['init_range_high'] - \
            self.options['init_range_low']
        number_of_zones = random.randint(min_zones, min_zones * 2)
        return [(random.random() * multiplier, random.random() * multiplier) \
            for _ in range(number_of_zones)]

In this case we bound it to twice the number of genes given as input.

Figure 3 - mixed results with increasing the number of genes

The score certainly improved. Even so, the tradeoff between positioning and count is obvious. This is going to be the type of hyperparameter that has to carefully reflect the intent of the user... and it will matter, because each zone is going to roughly equal the lot for a building.

On closer review there are problems. It is not easy to get polygons from these Voronoi diagrams. The subzones will not pack within the zone: in figure 3, going clockwise from the top left corner, the second zone has 3 solid lines, but two of them only meet at some point beyond the bounds of our space. Another problem is that the space defined by our Voronoi diagram will have "vertices at infinity", which generate the dotted lines. We can't build finite polygons from these vertices, so we have to create our own approximations at some distance, which means incorporating an outer boundary well beyond our rendered boundary. Thankfully someone already solved this problem, so we just attribute credit where due and implement it in its own file:

# src https://stackoverflow.com/questions/23901943/voronoi-compute-exact-boundaries-of-every-region#answer-52727406

from collections import defaultdict
from shapely.geometry import Polygon
import numpy as np


def voronoi_polygons(voronoi, diameter):
    """
    Generate shapely.geometry.Polygon objects corresponding to the
    regions of a scipy.spatial.Voronoi object, in the order of the
    input points. The polygons for the infinite regions are large
    enough that all points within a distance 'diameter' of a Voronoi
    vertex are contained in one of the infinite polygons.
    """
    centroid = voronoi.points.mean(axis=0)

    # Mapping from (input point index, Voronoi point index) to list of
    # unit vectors in the directions of the infinite ridges starting
    # at the Voronoi point and neighbouring the input point.
    ridge_direction = defaultdict(list)
    for (p, q), rv in zip(voronoi.ridge_points, voronoi.ridge_vertices):
        u, v = sorted(rv)
        if u == -1:
            # Infinite ridge starting at ridge point with index v,
            # equidistant from input points with indexes p and q.
            t = voronoi.points[q] - voronoi.points[p] # tangent
            n = np.array([-t[1], t[0]]) / np.linalg.norm(t) # normal
            midpoint = voronoi.points[[p, q]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - centroid, n)) * n
            ridge_direction[p, v].append(direction)
            ridge_direction[q, v].append(direction)

    for i, r in enumerate(voronoi.point_region):
        region = voronoi.regions[r]
        if -1 not in region:
            # Finite region.
            yield Polygon(voronoi.vertices[region])
            continue
        # Infinite region.
        inf = region.index(-1)              # Index of vertex at infinity.
        j = region[(inf - 1) % len(region)] # Index of previous vertex.
        k = region[(inf + 1) % len(region)] # Index of next vertex.
        if j == k:
            # Region has one Voronoi vertex with two ridges.
            dir_j, dir_k = ridge_direction[i, j]
        else:
            # Region has two Voronoi vertices, each with one ridge.
            dir_j, = ridge_direction[i, j]
            dir_k, = ridge_direction[i, k]

        # Length of ridges needed for the extra edge to lie at least
        # 'diameter' away from all Voronoi vertices.
        length = 2 * diameter / np.linalg.norm(dir_j + dir_k)

        # Polygon consists of finite part plus an extra edge.
        finite_part = voronoi.vertices[region[inf + 1:] + region[:inf]]
        extra_edge = [voronoi.vertices[j] + dir_j * length,
                      voronoi.vertices[k] + dir_k * length]
        yield Polygon(np.concatenate((finite_part, extra_edge)))

Now to render it, our render_zones function becomes:

import numpy as np
from scipy.spatial import Voronoi #, voronoi_plot_2d
from shapely.geometry import Polygon

from voronoi_polygons import voronoi_polygons

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

def render_zones(points):
  vor = Voronoi(points)

  determine_boundary = np.array([[-1000,-1000], [-1000,1010], \
      [1010,1010], [1010,-1000]])
  render_boundary = np.array([[0,0], [0,10], [10,10], [10,0]])
  boundary_polygon = Polygon(render_boundary)
  bx, by = render_boundary.T

  plt.xlim(round(bx.min() - 1), round(bx.max() + 1))
  plt.ylim(round(by.min() - 1), round(by.max() + 1))
  plt.plot(*np.array(points).T, 'b.')

  diameter = np.linalg.norm(determine_boundary.ptp(axis=0))
  polys = []
  for p in voronoi_polygons(vor, diameter):
    polys.append(list(p.intersection(boundary_polygon).exterior.coords))
    x, y = zip(*p.intersection(boundary_polygon).exterior.coords)
    plt.plot(x, y, 'r-')

  plt.show()

and similarly our render_plan would be:

def render_plan(points):
  print("points:", points)
  vor = Voronoi(points)

  determine_boundary = np.array([[-1000,-1000], [-1000,1010], \
      [1010,1010], [1010,-1000]])
  render_boundary = np.array([[0,0], [0,10], [10,10], [10,0]])
  boundary_polygon = Polygon(render_boundary)
  bx, by = render_boundary.T

  diameter = np.linalg.norm(determine_boundary.ptp(axis=0))
  polys = []
  for p in voronoi_polygons(vor, diameter):
    polys.append(list(p.intersection(boundary_polygon).exterior.coords))

  print("polys:", polys)

That's a bit of code duplication and repetitive work at that, but we are doing this only once, with the final solution. We are hoping to keep this piece of code fixed while we iterate on the agent itself, so whether it matters to you or not is up to you. This at least results in a list of polygons that we can use for more complicated fitness functions.
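
As a teaser for where this is heading, a list of exterior coordinate lists like polys can be turned straight back into shapely polygons and scored. A minimal sketch (the area_spread scoring is just an invented placeholder here, not the fitness we will use later):

from shapely.geometry import Polygon

def area_spread(polys):
    # polys is the list of exterior coordinate lists collected above
    areas = [Polygon(coords).area for coords in polys]
    # a smaller spread means the space is divided more evenly
    return max(areas) - min(areas)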

Figure 4 - output with polygons

In order to render the entire output we've limited the number of genes to 3-6. While this worked in this case, there are conditions under which a Voronoi tessellation will fail to render with our approach with fewer than 4 points.
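
If you want to guard the harness against that, a small wrapper along these lines (a hypothetical helper, not part of the code above) can skip degenerate inputs:

def render_zones_safely(points):
    # scipy needs at least 4 points to build a 2d Voronoi diagram,
    # so skip rendering rather than crash on a tiny genome
    if len(points) < 4:
        print("not enough points for a Voronoi diagram:", points)
        return
    render_zones(points)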

pyGAD

Now that we have implemented a framework for testing different approaches, and even a first fitness function for comparison, let's see if we can't beat trial and error with a real approach. Here's the agent for PyGAD.

import numpy as np
import pygad
from scipy.spatial import distance, Voronoi

from models.torch import Torch
from naive_grouper import naive_grouper
from voronoi_polygons import voronoi_polygons


class PyGADTorch(Torch):
    def prepare(self, options):
        self.options=options

    def define(self, options={}):
        self.ga_instance = pygad.GA(**self.options)

    def train(self):
        self.ga_instance.run()

    def evaluate(self):
        if self.ga_instance is None:
            return
        solution, solution_fitness, solution_idx = \
            self.ga_instance.best_solution()
        self.sol = solution
        self.sol_fitness = solution_fitness

        return self.sol_fitness

    def predict(self):
        if self.sol is None:
            return
        return [(x,y) for x,y in naive_grouper(self.sol, 2)]


def default_fitness_func(solution, solution_idx):
    qualifier = (1 if min(solution) >= 0 else 0) * \
        (1 if max(solution) <= 10 else 0)
    fitness = sum([ distance.euclidean((x,y), (0,0))
        for x,y in naive_grouper(solution, 2) ]) * qualifier

    return fitness


defaults = {
    "num_generations": 100,
    "num_genes": 20,
    "num_parents_mating": 20,
    "sol_per_pop": 40,
    "init_range_low": 0.0,
    "init_range_high": 10.0,
    "mutation_probability": 0.1,
    "crossover_probability": 0.3,
    "fitness_func": default_fitness_func
}

This uses a helper function to just iterate an array in pairs of values, in our case representing coordinates:

def naive_grouper(inputs, n):
    num_groups = len(inputs) // n
    return [tuple(inputs[i*n:(i+1)*n]) for i in range(num_groups)]
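
For example, here is what it does with an odd-length input; note that a trailing unpaired value is silently dropped:

naive_grouper([1.0, 2.0, 3.0, 4.0, 5.0], 2)
# -> [(1.0, 2.0), (3.0, 4.0)]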

This gets radically better results for 10 zones (20 genes) in only 100 generations:

training model with settings
{   'crossover_probability': 0.3,
    'fitness_func': <function default_fitness_func at 0x110ce7040>,
    'init_range_high': 10.0,
    'init_range_low': 0.0,
    'mutation_probability': 0.1,
    'num_generations': 100,
    'num_genes': 20,
    'num_parents_mating': 20,
    'sol_per_pop': 40}
Score: 135.46610814907405
points: [(9.86950278405781, 9.956012269091907), (9.834902887272309, 9.664899839960928), (7.110317209876664, 9.997044752715272), (9.936319250933368, 9.762761903748917), (9.632203317836819, 9.831434398341612), (9.63933242412567, 9.618283462096848), (9.885028188722764, 7.767623668335636), (9.583362899472677, 9.822035885083402), (9.95019099872665, 9.725071998739425), (9.883825403228242, 9.733413897533586)]

Figure 5 - PyGAD results

Note that it is also much faster than our trial-and-error approach, at least with this fitness function. But it failed to outperform our modified trial-and-error algorithm with its 10000 generations. So the disadvantage here is the fixed number of genes: PyGAD is designed to be extensible, allowing you to provide custom functions to perform, for example, crossover, but the size of the result is fixed by the num_genes parameter -- i.e., a custom crossover function is not enough to explore a variable number of genes in your genomes.
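
To make that concrete, here is roughly the shape a user-defined crossover takes in pyGAD (a sketch based on my reading of its custom-operator support; the exact signature may differ between versions). The returned array has to match offspring_size, which already fixes the gene count per child:

import numpy as np

def fixed_length_crossover(parents, offspring_size, ga_instance):
    # offspring_size is (number_of_offspring, num_genes), so every child
    # we return is forced to have exactly num_genes genes
    offspring = np.empty(offspring_size)
    for k in range(offspring_size[0]):
        p1 = parents[k % parents.shape[0]]
        p2 = parents[(k + 1) % parents.shape[0]]
        point = np.random.randint(1, offspring_size[1])
        offspring[k, :point] = p1[:point]
        offspring[k, point:] = p2[point:]
    return offspring

# passed to the GA as: pygad.GA(..., crossover_type=fixed_length_crossover)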

We'll lose some speed doing this ourselves, but allowing a variable number of buildings seems pretty important, so let's implement our own. Luckily, that's already been explored.

the basic GA approach

I found Jason Brownlee's article Simple Genetic Algorithm From Scratch in Python invaluable as a no-nonsense, easy-to-implement GA. Since that basic GA has already been covered there, we'll just be extending it here.

Here is our crossover method:

from random import randint, random


# crossover two parents to create two children
def crossover(p1, p2, r_cross=0, r_segmental=0):
    terminate = False
    while not terminate:
        # children are copies of parents by default
        c1, c2 = p1.copy(), p2.copy()
        # check for recombination
        if random() < r_cross:
            if random() < r_segmental:
                pt_left_begin = randint(0, len(p1)-1) if len(p1)-1 > 1 else 1
                pt_left_end = randint(0, len(p1)-1) if len(p1)-1 > 1 else 1
                pt_right_begin = randint(0, len(p2)-1) if len(p2)-1 > 1 else 1
                pt_right_end = randint(0, len(p2)-1) if len(p2)-1 > 1 else 1

                left_group = p1[pt_left_begin:pt_left_end]
                right_group = p2[pt_right_begin:pt_right_end]

                c1 = p1[:pt_left_begin] + right_group + p1[pt_left_end:]
                c2 = p2[:pt_right_begin] + left_group + p2[pt_right_end:]
            else:
                # select crossover point that is not on the end of the string
                pt_left = randint(1, len(p1)-2) if len(p1)-2 > 1 else 1
                pt_right = randint(1, len(p2)-2) if len(p2)-2 > 1 else 1
                # perform crossover
                c1 = p1[:pt_left] + p2[pt_right:]
                c2 = p2[:pt_right] + p1[pt_left:]    
            terminate = len(c1) > 1 and len(c2) > 1 and c1 != p2 and c2 != p1 \
                and c1 != p1 and c2 != p2
        else:
            terminate = True
    return c1, c2

I must confess this is complicated, and we are playing with unbounded loops. It is about 3 times as many lines of code as the original. For this tradeoff, we allow recombination at the beginning, in the middle, or at the end, and of variable length.
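
A quick, illustrative sanity check (the exact lengths vary from run to run) shows the children drifting away from their parents' lengths, which is the whole point:

p1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
p2 = [7.0, 8.0, 9.0, 10.0]

# force both rates to 1.0 so the segmental branch always runs
c1, c2 = crossover(p1, p2, r_cross=1.0, r_segmental=1.0)
print(len(c1), len(c2))   # e.g. 8 3 -- different from both parents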

Here is the mutation operator:

from random import random


# mutation operator
def mutation(genome, r_mut, r_variadic_mut=0, value_ceiling=2):
    r_gene_mutation = r_mut/len(genome)
    skip = False
    for i in range(len(genome)):
        if skip:
            continue
        # check for a mutation
        if random() < r_gene_mutation:
            original = genome.copy()
            if random() < r_variadic_mut:
                genome, skip = variadic_gene_mutation(genome, value_ceiling, i)
            else:
                genome[i] = mutate_value(genome[i], value_ceiling)
            if len(genome) < 1: 
                #print('zero-length reached', genome, original)
                return original.copy()
    return genome

def variadic_gene_mutation(genome, value_ceiling, i):
    skip = False
    if random() < 0.5:
        # compress many to one
        j = i - 1 if random() < 0.5 else i + 1
        k = abs(j if random() < 0.5 else i) 
        if k < 0:
            k = 0
        if k > len(genome) - 1:
            k = i
        gene = mutate_value(genome[k], value_ceiling)
        genome[i] = gene
    else:
        # extend one to many
        skip = True
        j = (i - 1 if random() < 0.5 else i + 1) % len(genome) - 1
        if j > i:
            del genome[i]
            for gene in [genome[k] for k in range(i, j)]:
                genome.insert(i, mutate_value(gene, value_ceiling))
    return genome, skip

def mutate_value(value, value_ceiling):
    return (value + (random() * value_ceiling)) % value_ceiling

Again this is decidedly more complicated, in order to allow mutations that convert one gene to many, or the reverse. We must provide a value ceiling so we can limit the range of values we propose when replacing the current value. Note that in a binary genetic algorithm this would be obviated, but a binary encoding makes the code less legible and harder to reason about in a prototype.

The selection method was also changed:

from heapq import nlargest
import numpy as np
from numpy import random


# selection with replacement
def selection(pop, polarity=None, k=4):
    indices = list(range(len(pop)))
    selections = random.choice(indices, k, p=polarity).tolist()
    chosen = [t[1] for t in nlargest(2, \
        zip(np.array(polarity)[selections].tolist(), \
            np.array(pop, dtype=object)[selections].tolist()))]
    return chosen[0], chosen[1]

It selects the entire set in one go, and allows a probability, the polarity, to be associated with each genome in the generation. We could set this to just the normalized score associated with each index, and that is definitely the simplest solution. But that would never favor smaller solutions while the population is too large, nor larger ones while the population is too small. We can improve the specificity of the search through genome length by changing the polarity. It will be assigned based on the length of each genome, and on whether longer or shorter genomes were doing better most recently. We create this polarity vector by looking at scores in the parent class consuming these functions:

    def __find_polarity(self, pop, scores):
        mean_scores = mean(scores)
        sizes = [len(c) for c in pop]
        mean_size = mean(sizes)
        is_larger = True \
            if self.prev_mean_size != 0 and mean_size > self.prev_mean_size \
                else False
        size_modifier =  np.array(sizes)/mean_size \
            if is_larger else 1/(np.array(sizes)/mean_size)
        specificity_modifier = mean_scores / (self.prev_mean_scores \
            if self.prev_mean_scores != 0 else 1)
        modifier = size_modifier * specificity_modifier
        polarity = normalize((np.array(scores) * modifier).tolist())
        self.prev_mean_scores = mean_scores
        self.prev_mean_size = mean_size
        return polarity

This makes use of a helper normalize function.

from sklearn import preprocessing 

def normalize(raw):
  return preprocessing.normalize([raw], norm='l1').tolist()[0]

Here, if the population as a whole is both larger and performing better, then the larger individuals are predisposed to selection.

This is a peculiar measure of performance -- we are looking at the population's mean size and mean score and comparing them to the previous generation's. If the mean score is lower, we take the polarity as just the L1-normalized scores (so, for our distance-from-0,0 fitness function, a value from 0 to 1 per genome, summing to 1), because we gained no information from the change in size of this generation's offspring.

If the mean score is higher, then we take the polarity as the normalized scores weighted by a value reflecting size. If the current generation is_larger, that value is each offspring's size over the mean_size. Since the trend was both larger and better, we want to reward size, so larger offspring are rewarded in proportion to their size above the average. If it is not larger, we take the inverse of each offspring's size over the mean_size, so smaller offspring are rewarded in proportion to their size below the average.
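
Here is a rough worked example of that arithmetic with made-up numbers (the scalar specificity term cancels under the L1 normalization, so it is left out):

scores = [10.0, 12.0]                 # two genomes in the population
sizes = [4, 8]                        # the second genome is longer
mean_size = 6                         # mean(sizes)

# suppose the population got both larger and better, so is_larger is True
size_modifier = [4 / 6, 8 / 6]        # ~[0.67, 1.33]
weighted = [10.0 * 4 / 6, 12.0 * 8 / 6]           # ~[6.67, 16.0]
polarity = [w / sum(weighted) for w in weighted]  # ~[0.29, 0.71]
# the longer genome now receives roughly 71% of the selection probability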

The rest of the algorithm follows.

from math import inf
import numpy as np
from statistics import mean

from crossover import crossover
from mutation import mutation
from normalize import normalize
from selection import selection


class GeneticAlgorithm:
    def __init__(self, fitness, genome_length, num_generations, num_population, \
        max_gene, r_cross, r_segmental, r_mut, r_variadic_mut, \
            on_gene_change=lambda x: x):
        """genetic algorithm that admits variable-length solutions

        Args:
            fitness ([type]): fitness function
            genome_length (int): number of bits in gene
            num_generations (int): number of generations to run the model
            num_population (int): number of solutions in the population
            r_cross (number): chance of crossover
            r_segmental (number): chance of midpoint type of crossover
            r_mut (number): chance of mutation
            r_variadic_mut (number): change of variadic mutation
            max_gene (int): maximum value of gene
            on_gene_change ([type]): mutating function for solutions, return None to exclude solution
        Returns:
            [type]: [description]
        """
        self.fitness = fitness
        self.num_genes = genome_length
        self.num_generations = num_generations
        self.num_population = num_population
        self.r_cross = r_cross
        self.r_segmental = r_segmental
        self.r_mut = r_mut
        self.r_variadic_mut = r_variadic_mut
        self.max_gene = max_gene
        self.best = None
        self.best_eval = None
        self.on_gene_change = on_gene_change

    def run(self):
        self.prev_mean_size = 0
        self.prev_mean_scores = 0
        # initial population of random genomes
        pop = [(self.max_gene * np.random.random(self.num_genes)).tolist() \
            for _ in range(self.num_population)]
        # keep track of best solution
        self.best, self.best_eval = pop[0], self.fitness(pop[0])
        # enumerate generations
        for gen in range(self.num_generations):
            #print("Generation %d" % gen, [len(c) for c in pop])
            scores = self.__generate_scores(pop, gen)
            # record polarity
            polarity = self.__find_polarity(pop, scores)
            # prepare population array for next generation        
            pop = self.__generate_next_population(pop, polarity)
        return [self.best, self.best_eval]

    def __generate_scores(self, pop, gen):
        # evaluate all candidates in the population
        scores = [self.fitness(c) for c in pop]
        # check for new best solution
        if max(scores) > self.best_eval:
            i = np.argmax(scores)
            self.best, self.best_eval = pop[i], scores[i]
            print(">%d, new best f(%s) = %.3f" % (gen,  len(pop[i]), scores[i]))
        return scores

    def __generate_next_population(self, pop, polarity):
        # create the next generation
        children = set()
        while len(children) < self.num_population:
            # select two parents
            p1, p2 = selection(pop, polarity)
            # crossover
            c1, c2 = crossover(p1, p2, self.r_cross, self.r_segmental)
            # mutate
            genome1 = self.on_gene_change(mutation(c1, self.r_mut, \
                self.r_variadic_mut, self.max_gene))
            t1 = tuple(genome1 if genome1 is not None else [])
            genome2 = self.on_gene_change(mutation(c2, self.r_mut, \
                self.r_variadic_mut, self.max_gene))
            t2 = tuple(genome2 if genome2 is not None else [])
            # add to next generation
            if genome1 is not None:
                children.add(t1)
            if genome2 is not None:
                children.add(t2)
        # replace population
        return [list(c) for c in children]

You'll note that we provide an on_gene_change configuration option. This is because, even after a change in length, we still need pairs of genes, one x and one y value, to act as coordinates. Thus our agent is slightly more complex, as we need to enforce this regularization:

import numpy as np
from math import log
from scipy.spatial import Voronoi, distance

from models.torch import Torch
from genetic_algorithm import GeneticAlgorithm
from voronoi_polygons import voronoi_polygons
from naive_grouper import naive_grouper
from flatten import flatten


class GATorch(Torch):
    def prepare(self, options):
        self.options=options

    def define(self, options={}):
        self.ga_instance = GeneticAlgorithm(**self.options)

    def train(self):
        self.sol, self.sol_fitness = self.ga_instance.run()

    def evaluate(self):
        return self.sol_fitness

    def predict(self):
        if self.sol is None:
            return
        return [(x,y) for x,y in naive_grouper(self.sol, 2)]

default_fitness_func = lambda x: sum([distance.euclidean(a, (0,0)) for a in x])

def on_gene_change(x):
    # datapoints compose coordinates, we need pairs of each
    genes = (x[:-1] if len(x) % 2 == 1 else x)
    # there should be only up to 100 datapoints
    genes = genes[:100]
    coords = naive_grouper(genes, 2) 
    # zones must be unique
    unique_coords = set(coords)
    # the voronoi polygon should have at least 4 regions
    if len(unique_coords) < 4:
        return None

    return flatten([list(gene) for gene in unique_coords])

defaults = {
    "fitness": default_fitness_func,
    "genome_length": 20,
    "num_generations": 100,
    "num_population": 40,
    "r_cross": .1,
    "r_segmental": .3,
    "r_mut": .1,
    "r_variadic_mut": .3,
    "max_gene": 10,
    "on_gene_change": on_gene_change
}
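
One small helper is imported but not shown above: flatten. I'm assuming it simply undoes the grouping, turning a list of coordinate pairs back into a flat gene list; a minimal version under that assumption would be:

def flatten(list_of_lists):
    # inverse of naive_grouper: [[x1, y1], [x2, y2]] -> [x1, y1, x2, y2]
    return [value for group in list_of_lists for value in group]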

This is clearly not the fastest implementation on earth, with non-binary genomes, callbacks on each new offspring, and lots of additional work on top of the basic implementation. But it was easy enough to reason about, not only working in the general case but even allowing for the pairwise requirement of the cartesian coordinates we are generating. And we can run it already and get the best result yet, by a massive margin:

{ 'fitness': <function <lambda> at 0x121985ca0>, 'genome_length': 20, 'max_gene': 10, 'num_generations': 100, 'num_population': 40, 'on_gene_change': <function on_gene_change at 0x12ef42550>, 'r_cross': 0.1, 'r_mut': 0.1, 'r_segmental': 0.3, 'r_variadic_mut': 0.3}

>0, new best f(20) = 187.127

>1, new best f(34) = 263.373

>2, new best f(50) = 404.630

...

Score: 958.086746013197 points: [(7.93008831574214, 1.968483657711614), (9.902343010571661, 9.160986979252993), (4.9129301935292755, 4.862632752661364), (8.545732939471318, 5.503635000851346), (8.108713266966452, 5.891420181491518), (9.902343010571661, 6.724773576129348), (9.170647940288081, 7.42291280925392), (8.907123471629202, 0.3855717819589066), (7.42291280925392, 3.56205656460416), (9.64308492994326, 0.6377938308208675), (5.48646088598545, 4.720666321107755), (6.3522363894967615, 9.414705390237422), (5.261085208504133, 4.081837315034508), (9.141096429268622, 3.2900266548457724), (3.86613718251277, 7.633768370721032), (8.722511215646412, 0.9151817357610668), (2.07829290375665, 9.414705390237422), (8.545732939471318, 9.93678409454446), (7.135370849743405, 6.3963186582025395), (6.644813464829235, 8.057013487459445), (8.108713266966452, 9.341216662959416), (4.081837315034508, 6.650906731468247), (6.650906731468247, 7.335288016073247), (9.93678409454446, 7.42291280925392), (9.775019022995757, 5.48646088598545), (2.746686030695572, 5.48646088598545), (4.720666321107755, 5.261085208504133), (7.42291280925392, 5.891420181491518), (4.303571388849635, 9.429848778276954), (7.693108494029097, 9.745450727477307), (7.328113042521571, 9.182159258081919), (0.6377938308208675, 2.2914986465596376), (7.605325810148898, 4.081837315034508), (6.644813464829235, 9.745450727477307), (7.42291280925392, 9.739311199313953), (6.3522363894967615, 7.569316210024216), (8.596049646071824, 9.902343010571661), (9.160986979252993, 9.93678409454446), (5.680442484859791, 6.311487157044229), (2.2914986465596376, 7.818646538051066), (7.93008831574214, 4.794627733870749), (4.720666321107755, 4.794627733870749), (9.902343010571661, 7.852512145220334), (9.170647940288081, 7.515148725298934), (7.335288016073247, 8.730220876738965), (8.105172924420557, 0.24105659986792483), (7.302687737686564, 6.724773576129348), (7.135370849743405, 8.596049646071824), (8.087853589378009, 8.359570162796624), (8.108713266966452, 8.611020366032598)]

Figure 6 - our custom GAs results

Note the lines that say "new best". These include an "f(n)" statement where n is the number of genes in the genome. After just the first couple of generations you can already see it rapidly scaling up. By generation 2, with 50 genes, we have already more than doubled the starting value and are over 40% of the way to our final score.

Even considering that our on_gene_change method allowed a maximum of 100 data points, i.e. 50 coordinates, 2.5 times more than any other method we have tried so far, the score is nearly 4 times better than any previous attempt. This is clearly an overall improvement, and in rough terms it maintains the score-per-point ratio we got from PyGAD. It looks like we did a good job of balancing gene growth against point fitness.

better fitness

Let's not assume this is working just yet. We should look at problems where we need fewer buildings, with a new fitness function. We can race towards the 0,0 point instead of the 10,10 point by taking the inverse of the log of the summed distances, if we are careful to keep the argument of the log positive.

near_zero_fitness_func = lambda x: 0 \
    if len(x) < 8 \
        else 1/log(1 + sum([distance.euclidean(a, (0,0)) for a in x]))

Figure 7 - custom GA on inverse problem

This looks impressive, but we forgot to get a baseline so we can't judge it just yet. We see good results when we compare to pyGAD with the same modification to its fitness function:

Figure 8 - pyGAD on inverse problem

The results there are clearly pretty good, but not nearly as centered on the bottom corner. Let's actually check that trial-and-error baseline. We'll set num_genes to 4, which gives 4-8 zones.

Figure 9 - trial-and-error on inverse problem

At this point I think I am happy. Normally a data scientist or someone else would flip out that we are looking at pictures instead of actually doing math. But, well, I'll leave that as an exercise for the precise.

control

We can see that we have an approach that can find corners, but we will probably need zones with minimum sizes, even when there is a race toward the bottom in the number of genes. Let's work on one last test case before wrapping up this article (getting into the real fitness function makes this a much larger project, so we won't be going much further).

One way we can try to get there is to use the concept of the median. By shifting the focus to the largest median area, we still tend towards solutions that are smaller in count, because you will more quickly find better values when considering fewer points. Evidence for this hunch can be seen in figure 6, where, as the number of points grew, their precision dropped. Let's try it out.

A problem is that we are no longer just measuring the distance of an origin point within each subzone, but the actual area of the subzone itself. So we'll need to bring in some of the code from the rendering functions we wrote earlier.

For our custom_ga, the fitness function:

def median_area_fitness_func(solution):
    points = [ (x,y) for x,y in naive_grouper(solution, 2) ]
    vor = Voronoi(points)
    boundary = np.array([[0,0], [0,10], [10,10], [10,0]])
    diameter = np.linalg.norm(boundary.ptp(axis=0))

    return np.median([shape.area for shape in voronoi_polygons(vor, diameter)])

Note that we need to hard-code the boundary in the fitness function, which makes me very happy that this function is a variable that we pass to our implementation. Otherwise, our AI logic would have to be aware of the intended use case.
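
If the hard-coded constant ever becomes a problem, one possible refinement (a sketch, not what was used for the figures) is to build the fitness function from a boundary with a small factory, so the agent still only ever sees a plain function:

import numpy as np
from scipy.spatial import Voronoi

from voronoi_polygons import voronoi_polygons
from naive_grouper import naive_grouper


def make_median_area_fitness(boundary_coords):
    boundary = np.array(boundary_coords)
    diameter = np.linalg.norm(boundary.ptp(axis=0))

    def fitness(solution):
        points = [(x, y) for x, y in naive_grouper(solution, 2)]
        vor = Voronoi(points)
        return np.median([shape.area for shape in voronoi_polygons(vor, diameter)])

    return fitness

# usage: defaults["fitness"] = make_median_area_fitness([[0, 0], [0, 10], [10, 10], [10, 0]])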

On top of that, we just need to add bounds for the minimum and maximum values to the pyGAD agent's fitness function:

    qualifier = (1 if min(solution) >= 0 else 0) * \
        (1 if max(solution) <= 10 else 0)
    if qualifier == 0:
        return 0

Wrapping up, for trial-and-error we can just pass the points in directly from our agent; they are already grouped.
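
For reference, a sketch of that trial-and-error variant (same assumptions and imports as the custom GA's median fitness above, minus the grouping):

def median_area_fitness_func_grouped(points):
    # the trial-and-error agent already hands us (x, y) pairs
    vor = Voronoi(points)
    boundary = np.array([[0, 0], [0, 10], [10, 10], [10, 0]])
    diameter = np.linalg.norm(boundary.ptp(axis=0))

    return np.median([shape.area for shape in voronoi_polygons(vor, diameter)])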

These next three figures are trial-and-error, pyGAD, and custom GA in that order. I ran each several times and chose a typical run from each.

Figure 10 - trial-and-error median

Figure 11 - pyGAD median

Figure 12 - custom GA median

The trial-and-error approach, at least with 10000 generations of search, is able to home in on some pretty good-looking results, much like the custom GA. But over very many runs the custom GA basically never produces a result with more than 4 subzones. And every once in a while, you get a gem like this one, reflecting an aesthetic we didn't even think we were coding into this.

Figure 13 - custom GA beautifully linear result

Notice how these are getting really close to rectangles!

Using a GA to search across a variable number of genes is quite possible and approachable. I think it is uniquely approachable, and perhaps a great introduction to artificial intelligence. I hope you have enjoyed the presentation. If you were paying attention, you are likely now armed with a list of things that I did wrong, which you can do even better! Have fun.