Simulator class

Module containing simulator class.

class chromax.simulator.Simulator(genetic_map: Path | DataFrame, trait_names: List[str] | None = None, chr_column: str = 'CHR.PHYS', position_column: str = 'cM', recombination_column: str = 'RecombRate', mutation_probability: float = 0.0, h2: ndarray | None = None, seed: int | None = None, device: Device = None, backend: str | Client = None)[source]

Breeding simulator class. It can perform the most common operations of a breeding program.

Parameters:

genetic_map (Path or DataFrame) – the path to, or the DataFrame containing, the genetic map. It must contain every column listed in trait_names, the chromosome column (CHR.PHYS by default, holding the name of the marker's chromosome), and either cM or RecombRate.
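trait_names (List of strings) – column names in the genetic_map. The values of the columns are the marker effects on the trait for each marker. When left as None (the signature default), the examples on this page report a column for every trait in the sample genetic map rather than Yield alone.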
chr_column (str) – name of the column containing the chromosome identifier. The default value is CHR.PHYS.
position_column (str) – name of the column containing the position in cM of the marker. The default value is cM.
recombination_column (str) – name of the column containing the probability that a recombination happens before the current marker and after the previous one. The default value is RecombRate.
mutation_probability (float) – The probability of having a mutation in a marker.
h2 (array of float) – narrow-sense heritability value for each trait. The default value is 0.5 for each trait.
seed (int) – the random seed for reproducibility.
device (XLA Device) – the device on which simulations run. If not specified, it is selected automatically: the first available GPU or TPU, or the CPU if neither is present.
backend (str or XLA client) – the backend of the device. Common choices are gpu, cpu or tpu.
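
The genetic map may give either marker positions in cM or explicit recombination probabilities. The sketch below shows one common way to relate the two; it is an illustration under stated assumptions, not necessarily the conversion chromax applies. It assumes markers are sorted by chromosome and position, that a gap of 1 cM corresponds to roughly a 1% recombination chance, and that the first marker of each chromosome gets probability 0.5 (independent assortment). The helper name recomb_prob_from_cm is hypothetical.

```python
import numpy as np

def recomb_prob_from_cm(chromosome, position_cm):
    """Approximate per-marker recombination probabilities from cM positions.

    Hypothetical helper: assumes markers are sorted by chromosome and position.
    """
    chromosome = np.asarray(chromosome)
    position_cm = np.asarray(position_cm, dtype=float)
    # Distance to the previous marker; 1 cM is treated as a 1% recombination chance.
    prob = np.diff(position_cm, prepend=position_cm[0]) / 100.0
    # The first marker of every chromosome starts a new linkage group.
    first_of_chr = np.concatenate(([True], chromosome[1:] != chromosome[:-1]))
    prob[first_of_chr] = 0.5
    # A recombination probability cannot exceed 0.5.
    return np.clip(prob, 0.0, 0.5)

chromosome = ["1A", "1A", "1A", "1B", "1B"]
position_cm = [0.0, 2.0, 5.0, 1.0, 1.5]
probs = recomb_prob_from_cm(chromosome, position_cm)
print(probs)
```

The linear approximation is only reasonable for closely spaced markers; a mapping function such as Haldane's would be a natural replacement for larger gaps.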

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> f2, _ = simulator.random_crosses(f1, n_crosses=10, n_offspring=20)
>>> f2.shape
(10, 20, 9839, 2)

set_seed(seed: int)[source]

Set random seed for reproducibility.

Parameters:

seed (int) – random seed.

load_population(file_name: Path | str) → Bool[Array, 'n m d'][source]

Load a population from file.

Parameters:

file_name (path) – path of the file with the population genome.

Returns:

loaded population of shape (n, m, d), where n is the number of individuals, m is the total number of markers, and d is the ploidy of the population.

Return type:

ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> f1.shape
(371, 9839, 2)

save_population(population: Bool[Array, 'n m d'], file_name: Path | str)[source]

Save a population to file.

Parameters:

population (ndarray) – population to save.
file_name (path) – file path to save the population.

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> f2, _ = simulator.random_crosses(f1, n_crosses=10, n_offspring=20)
>>> simulator.save_population(f2, "pop_file")

cross(parents: Bool[Array, 'n 2 m d']) → Bool[Array, 'n m d'][source]

Main function that computes crosses from a list of parents.

Parameters:

parents (ndarray) – parents to cross. The shape of the parents is (n, 2, m, d), where n is the number of crosses (parent pairs), m is the number of markers, and d is the ploidy.

Returns:

offspring population of shape (n, m, d).

Return type:

ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> import numpy as np
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> parents_indices = np.array([
...     [1, 5],
...     [4, 7],
...     [5, 6]
... ])
>>> parents = f1[parents_indices]
>>> f2 = simulator.cross(parents)
>>> f2.shape
(3, 9839, 2)
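
To make the (n, 2, m, d) to (n, m, d) mapping concrete, here is a minimal NumPy sketch of a diploid cross. It is a conceptual illustration, not the chromax implementation: each parent contributes one gamete, and a gamete is built by walking along the markers and switching between the parent's two strands whenever a recombination event is sampled. The names make_gamete and cross_sketch, and the recombination probabilities used, are assumptions made for the example.

```python
import numpy as np

def make_gamete(parent, recomb_prob, rng):
    """Build one gamete of shape (m,) from a diploid parent of shape (m, 2)."""
    m = parent.shape[0]
    # Sample where a crossover occurs between consecutive markers.
    crossover = rng.random(m) < recomb_prob
    # The parity of the crossover count tells which strand is copied at each marker.
    strand = np.cumsum(crossover) % 2
    return parent[np.arange(m), strand]

def cross_sketch(parents, recomb_prob, rng):
    """Cross n parent pairs of shape (n, 2, m, 2) into offspring of shape (n, m, 2)."""
    n, _, m, d = parents.shape
    offspring = np.empty((n, m, d), dtype=parents.dtype)
    for i, (first_parent, second_parent) in enumerate(parents):
        offspring[i, :, 0] = make_gamete(first_parent, recomb_prob, rng)
        offspring[i, :, 1] = make_gamete(second_parent, recomb_prob, rng)
    return offspring

rng = np.random.default_rng(0)
m = 6
recomb_prob = np.full(m, 0.1)
recomb_prob[0] = 0.5  # random starting strand

# One fully homozygous parent of each kind, so the result is easy to check.
first_parent = np.ones((m, 2), dtype=bool)
second_parent = np.zeros((m, 2), dtype=bool)
parents = np.stack([first_parent, second_parent])[None]  # shape (1, 2, m, 2)
child = cross_sketch(parents, recomb_prob, rng)
print(child.shape)
```

Because both parents are homozygous here, the offspring always inherits the first parent's allele on one strand and the second parent's allele on the other, whatever crossovers are sampled.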

property differentiable_cross_func: Callable

Experimental feature that returns a differentiable version of the cross function.

The differentiable crossing function takes as input:

population (array): starting population from which to perform the crosses. The shape of the population is (n, m, d).
cross_weights (array): array of shape (l, n, d). It is used to compute l crosses, starting from a weighted average of the n possible parents. When the n-axis is all zeros except for a single element equal to one, this function is equivalent to the cross function.
random_key (JAX random key): random key used for recombination sampling.

It returns a population of shape (l, m, d).

Example:

>>> from chromax import Simulator, sample_data
>>> import numpy as np
>>> import jax
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> diff_cross = simulator.differentiable_cross_func
>>> def mean_gebv(pop, weights, random_key):
...     new_pop = diff_cross(pop, weights, random_key)
...     return simulator.GEBV(new_pop, raw_array=True).mean()
>>> grad_f = jax.grad(mean_gebv, argnums=1)
>>> f1 = simulator.load_population(sample_data.genome)
>>> weights = np.random.uniform(size=(10, len(f1), 2))
>>> weights /= weights.sum(axis=1, keepdims=True)
>>> random_key = jax.random.key(42)
>>> grad_value = grad_f(f1, weights, random_key)
>>> grad_value.shape
(10, 371, 2)
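
The equivalence between one-hot cross_weights and ordinary parent indexing can be sketched in NumPy. This only illustrates the weighted-average step, and it assumes that the last axis of cross_weights indexes the parent slot (first or second parent) of each cross; that reading is an assumption, not something the description above states explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 5, 8, 2
population = rng.random((n, m, d)) < 0.5

# Three crosses expressed as explicit parent indices.
parent_indices = np.array([[0, 3], [2, 4], [1, 1]])
n_crosses = len(parent_indices)

# The same crosses as one-hot weights of shape (n_crosses, n, 2).
cross_weights = np.zeros((n_crosses, n, 2))
for slot in range(2):
    cross_weights[np.arange(n_crosses), parent_indices[:, slot], slot] = 1.0

# Weighted average over the n candidate parents, one average per parent slot.
soft_parents = np.einsum("lnp,nmd->lpmd", cross_weights, population.astype(float))
hard_parents = population[parent_indices]  # shape (n_crosses, 2, m, d)

print(np.array_equal(soft_parents, hard_parents.astype(float)))
```

With weights that are not one-hot, soft_parents holds fractional allele values, which is what lets gradients flow back to the weights.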

double_haploid(population: Bool[Array, 'n m d'], n_offspring: int = 1) → Bool[Array, 'n n_offspring m d'][source]

Computes the double haploid of the input population.

Parameters:

population (ndarray) – input population of shape (n, m, 2).
n_offspring (int) – number of offspring per plant. The default value is 1.

Returns:

output population of shape (n, n_offspring, m, 2). This population will be homozygous.

Return type:

ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> dh = simulator.double_haploid(f1, n_offspring=10)
>>> dh.shape
(371, 10, 9839, 2)
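
Why the result is homozygous can be seen in a small NumPy sketch: a single gamete is produced from the individual and then copied onto both strands. This is a conceptual illustration with a hypothetical helper name (doubled_haploid_sketch) and made-up recombination probabilities, not the chromax implementation.

```python
import numpy as np

def doubled_haploid_sketch(individual, recomb_prob, n_offspring, rng):
    """Return doubled haploids of shape (n_offspring, m, 2) from one (m, 2) individual."""
    m = individual.shape[0]
    # Independent crossover draws for every offspring.
    crossover = rng.random((n_offspring, m)) < recomb_prob
    strand = np.cumsum(crossover, axis=1) % 2        # (n_offspring, m)
    gametes = individual[np.arange(m), strand]       # (n_offspring, m)
    # Duplicate each gamete so both strands are identical.
    return np.repeat(gametes[:, :, None], 2, axis=2)

rng = np.random.default_rng(1)
m = 6
recomb_prob = np.full(m, 0.1)
recomb_prob[0] = 0.5
individual = rng.random((m, 2)) < 0.5

dh = doubled_haploid_sketch(individual, recomb_prob, n_offspring=4, rng=rng)
print(dh.shape)
```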

diallel(population: Bool[Array, 'n m d'], n_offspring: int = 1) → Bool[Array, 'n*(n-1)/2 n_offspring m d'][source]

Diallel crossing function: crosses every possible pair of individuals, excluding self-crosses.

Parameters:

population (ndarray) – input population of shape (n, m, d).
n_offspring (int) – number of offspring per cross. The default value is 1.

Returns:

output population of shape (l, n_offspring, m, d), where l is the number of possible pairs, i.e. n * (n-1) / 2.

Return type:

ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)[:10]
>>> f2 = simulator.diallel(f1, n_offspring=10)
>>> f2.shape
(45, 10, 9839, 2)
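
The pair count n * (n-1) / 2 can be checked by enumerating the unordered pairs directly. The sketch below uses NumPy's upper-triangle indices to list them; the population array is random toy data, and the way pairs are enumerated here is an illustration rather than a statement about chromax internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 10, 4, 2
population = rng.random((n, m, d)) < 0.5

# Every unordered pair (i, j) with i < j, i.e. no self-crosses.
first, second = np.triu_indices(n, k=1)
pairs = np.stack([first, second], axis=1)   # shape (n * (n - 1) / 2, 2)

# Gathering by pair gives the (l, 2, m, d) parent layout used for crossing.
parents = population[pairs]

print(pairs.shape, parents.shape)
```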

random_crosses(population: Bool[Array, 'n m d'], n_crosses: int, n_offspring: int = 1) → Tuple[Bool[Array, 'n_crosses n_offspring m d'], Int[Array, 'n_crosses 2']][source]

Computes random crosses on a population.

Parameters:

population (ndarray) – input population of shape (n, m, d).
n_crosses (int) – number of random crosses to perform.
n_offspring (int) – number of offspring per cross. The default value is 1.

Returns:

output population of shape (n_crosses, n_offspring, m, d) and parent indices of shape (n_crosses, 2) of performed crosses.

Return type:

tuple of two ndarrays

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> f2, parent_ids = simulator.random_crosses(f1, 100, n_offspring=10)
>>> f2.shape
(100, 10, 9839, 2)
>>> parent_ids.shape
(100, 2)
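
As a rough picture of how the two return values relate, the sketch below draws parent index pairs and expands them to one parent pair per offspring. It assumes that the two parents of a cross are distinct and that crosses are drawn without repetition from the set of possible pairs; both are assumptions made for the example, and the actual sampling scheme in chromax may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, d = 8, 4, 2
n_crosses, n_offspring = 5, 3
population = rng.random((n, m, d)) < 0.5

# All candidate pairs without self-crosses, then a random subset of them.
first, second = np.triu_indices(n, k=1)
all_pairs = np.stack([first, second], axis=1)
chosen = rng.choice(len(all_pairs), size=n_crosses, replace=False)
parent_ids = all_pairs[chosen]                       # shape (n_crosses, 2)

# Repeat each parent pair once per requested offspring.
parents = np.repeat(population[parent_ids][:, None], n_offspring, axis=1)

print(parent_ids.shape, parents.shape)
```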

select(population: Bool[Array, '_g n m d'], k: int, f_index: Callable[[Bool[Array, 'n m d']], Float[Array, 'n']] | None = None) → Tuple[Bool[Array, '_g k m d'], Int[Array, '_g k']][source]

Function to select individuals based on their score (index).

Parameters:

population (ndarray) – input population of shape (n, m, d), or of shape (g, n, m, d) to select k individuals from each of the g groups.
k (int) – number of individuals to select.
f_index (Callable) – function that computes a score for each individual. It takes the population, i.e. an array of shape (n, m, d), and returns n floating-point scores. The default f_index is the conventional index, i.e. the sum of the marker effects masked with the SNPs from the genetic_map.

Returns:

output population of shape (k, m, d) or (g, k, m, d), depending on the input population, and the respective indices of shape (k,) or (g, k).

Return type:

tuple of two ndarrays

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map, trait_names=["Yield"])
>>> f1 = simulator.load_population(sample_data.genome)
>>> len(f1), simulator.GEBV(f1).mean().values
(371, [8.223844])
>>> f2, selected_indices = simulator.select(f1, k=20)
>>> len(f2), simulator.GEBV(f2).mean().values
(20, [14.595136])
>>> selected_indices.shape
(20,)
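
The contract of f_index (population in, one score per individual out) and the top-k selection can be sketched with plain NumPy. The helper names select_top_k and additive_index, and the marker effects, are hypothetical; the sketch covers only the ungrouped (n, m, d) case.

```python
import numpy as np

def select_top_k(population, k, f_index):
    """Return the k best-scoring individuals and their indices, best first."""
    scores = f_index(population)            # shape (n,)
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return population[top], top

# Hypothetical marker effects for a single trait, one value per marker.
effects = np.array([1.0, -2.0, 0.5])

def additive_index(pop):
    # Count the alleles set at each marker, then weight them by the effects.
    return pop.sum(axis=-1) @ effects

population = np.array([
    [[1, 1], [0, 0], [1, 0]],   # score  2.5
    [[0, 0], [1, 1], [0, 0]],   # score -4.0
    [[1, 0], [0, 0], [1, 1]],   # score  2.0
    [[1, 1], [1, 0], [0, 0]],   # score  0.0
], dtype=bool)

selected, indices = select_top_k(population, k=2, f_index=additive_index)
print(indices)
```

A custom f_index only needs to respect the same shapes, so a phenotype-based or multi-trait score could be substituted for additive_index.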

GEBV(population: Bool[Array, 'n m d'], *, raw_array: bool = False) → DataFrame | ndarray[source]

Computes the Genomic Estimated Breeding Values using the data from the genetic_map.

Parameters:

population (ndarray) – input population of shape (n, m, d).
raw_array (bool) – whether to return a raw array or a DataFrame. Default value is False.

Returns:

a DataFrame (or array) with n rows and a column for each trait. It contains the GEBV of each trait for each individual.

Return type:

DataFrame or ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map)
>>> f1 = simulator.load_population(sample_data.genome)
>>> simulator.GEBV(f1).mean()
Heading Date              0.196119
Protein Content          -0.228718
Plant Height             -5.888406
Thousand Kernel Weight   -1.029418
Yield                     8.223843
Fusarium Head Blight      5.318052
Spike Emergence Period   -0.933169
dtype: float32
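
For intuition about the shapes involved, an additive breeding value can be written as a product between allele counts and marker effects. The sketch below is a simplified picture on toy data: the marker effects are invented, and any centering or offsets that chromax may apply are not modelled.

```python
import numpy as np

# Two individuals, two markers, diploid.
population = np.array([
    [[1, 1], [0, 1]],
    [[0, 0], [1, 1]],
], dtype=bool)

# Hypothetical marker effects: one row per marker, one column per trait.
marker_effects = np.array([
    [1.0, 0.5],
    [-1.0, 2.0],
])

# Allele dosage per marker, shape (n, m).
dosage = population.sum(axis=-1)
# One value per individual and trait, shape (n, number of traits).
gebv = dosage @ marker_effects

print(gebv)
```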

create_environments(num_environments: int) → Float[Array, 'num_environments'][source]

Create environments to phenotype the population.

In practice, it generates random numbers from a normal distribution.

Parameters:

num_environments (int) – number of environments to create.

Returns:

array of floating point numbers. This output can be used for the function phenotype.

Return type:

ndarray

phenotype(population: Bool[Array, 'n m d'], *, num_environments: int | None = None, environments: ndarray | None = None, raw_array: bool = False) → DataFrame | ndarray[source]

Simulates the phenotype of a population.

This uses the Genotype-by-Environment model described in AlphaSimR.

Parameters:

population (ndarray) – input population of shape (n, m, d)
num_environments (int) – number of environments to test the population. Default value is 1.
environments (ndarray) – environments to test the population. Each environment must be represented by a floating-point number in the range (-1, 1). When drawing new environments, use a normal distribution to maintain heritability semantics.
raw_array (bool) – whether to return a raw array or a DataFrame. Default value is False.

Returns:

a DataFrame (or array) with n rows and a column for each trait. It contains the simulated phenotype for each individual.

Return type:

DataFrame or ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map, seed=42)
>>> f1 = simulator.load_population(sample_data.genome)
>>> envs = simulator.create_environments(4)
>>> simulator.phenotype(f1, environments=envs).mean()
Heading Date              0.105397
Protein Content          -0.172026
Plant Height             -5.813669
Thousand Kernel Weight   -1.372738
Yield                     8.306302
Fusarium Head Blight      4.286477
Spike Emergence Period   -0.575061
dtype: float32
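
The role of the h2 parameter in phenotyping can be illustrated with the textbook relation h2 = Var(G) / Var(P). The sketch below is not the Genotype-by-Environment model referenced above; it is a simplified single-environment, single-trait illustration on synthetic genetic values, showing how an environmental noise variance can be chosen so that the simulated phenotypes have the requested heritability.

```python
import numpy as np

rng = np.random.default_rng(0)
h2 = 0.5

# Synthetic genetic values standing in for the breeding values of one trait.
genetic_values = rng.normal(loc=8.0, scale=2.0, size=100_000)

# Solve h2 = var_g / (var_g + var_e) for the environmental variance.
var_g = genetic_values.var()
var_e = var_g * (1.0 - h2) / h2

# Phenotype = genetic value + environmental noise.
noise = rng.normal(scale=np.sqrt(var_e), size=genetic_values.shape)
phenotypes = genetic_values + noise

realized_h2 = var_g / phenotypes.var()
print(round(float(realized_h2), 2))
```

Averaging phenotypes over several environments shrinks the noise term, which is one reason to phenotype in more than one environment.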

corrcoef(population: Bool[Array, 'n m d']) → Float[Array, 'n'][source]

Computes the correlation coefficient of the population against its centroid.

It can be used as an indicator of variance in the population.

Parameters:

population (ndarray) – input population of shape (n, m, d)

Returns:

vector of length n, containing the correlation coefficient of each individual against the average of the population.

Return type:

ndarray

Example:

>>> from chromax import Simulator, sample_data
>>> simulator = Simulator(genetic_map=sample_data.genetic_map, seed=42)
>>> f1 = simulator.load_population(sample_data.genome)
>>> corrcoef = simulator.corrcoef(f1)
>>> corrcoef.shape
(371,)
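
The quantity described above can be sketched as a Pearson correlation between each individual and the population mean. The sketch assumes each individual is flattened to a vector of length m * d before comparison; how chromax flattens or aggregates the ploidy axis is not stated here, so treat that choice and the helper name corrcoef_to_centroid as assumptions.

```python
import numpy as np

def corrcoef_to_centroid(population):
    """Pearson correlation of each individual with the population average."""
    n = population.shape[0]
    flat = population.reshape(n, -1).astype(float)    # (n, m * d)
    centroid = flat.mean(axis=0)                      # (m * d,)
    # Center each individual and the centroid around their own means.
    flat_centered = flat - flat.mean(axis=1, keepdims=True)
    centroid_centered = centroid - centroid.mean()
    numerator = flat_centered @ centroid_centered
    denominator = (
        np.linalg.norm(flat_centered, axis=1) * np.linalg.norm(centroid_centered)
    )
    return numerator / denominator

rng = np.random.default_rng(0)
population = rng.random((6, 20, 2)) < 0.5
corr = corrcoef_to_centroid(population)
print(corr.shape)
```

Values close to 1 for every individual suggest a genetically uniform population, while lower values indicate more diversity.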