In the previous article we described a machine learning algorithm using a neural network. In this article, we will build Python code to implement this algorithm. We will use the terms and notations from the previous article.
To work with vectors and matrices, we will use the NumPy library, whose core is implemented in C, so matrix operations run much faster than equivalent pure-Python loops.
[toc]
Neural Network Structure
Our task is to design a neural network that can be easily initialized with global parameters, namely:
- The dimensionality of the input astrological feature vector $x_i$ — `input_dim`
- The dimensionality of each hidden layer, set via an array of numbers `hidden_dims`; for example, `hidden_dims=[16, 9]` would mean we have 2 hidden layers with 16 and 9 neurons respectively
- The gradient descent step $\lambda$ (learning rate, `lr`)
- The number of epochs for training the model — `epochs`
- The number of examples $N$ in a single subsample from the training data — `batch_size`
For this, we will create a class BernoulliNet that can be initialized with these parameters, for example:
model = BernoulliNet(
input_dim=8,
hidden_dims=[16, 9],
lr=0.05,
epochs=100,
batch_size=32,
)
Here, the parameters `input_dim` and `hidden_dims` determine the overall structure of the neural network:
dims = [input_dim, 16, 9, 1]
# ↑ ↑ ↑ ↑
# layer 0 1 2 3 (output)
The weight matrices will have the following dimensions $[\text{rows}\times \text{cols}]$ in this example:
- $W^{(1)}$: $[16\times \text{input\_dim}]$
- $W^{(2)}$: $[9\times 16]$
- $W^{(3)}$: $[1\times 9]$
We will make the weights and biases internal properties of the model, i.e.:
class BernoulliNet:
def __init__(self,
input_dim: int,
hidden_dims: list[int],
lr: float = 0.01,
epochs: int = 100,
batch_size: int = 32,
) -> None:
# Dimension of the entire neural network
dims: list[int] = [input_dim] + hidden_dims + [1]
# Weight initialization
self.W, self.b = self._init_weights(dims)
Since we will iterate over the values $W^{(1)}, W^{(2)}, W^{(3)}$, in computer notation these will be array elements with indices self.W[0], self.W[1], self.W[2]. Similarly, the biases $b^{(1)}, b^{(2)}, b^{(3)}$ will be stored in the array self.b[0], self.b[1], self.b[2].
In the general case:
- The weight matrix `self.W[layer-1]` corresponds to $W^{(l)}$ with dimensions $[n^{(l)}\times n^{(l-1)}]$.
- The bias column vector `self.b[layer-1]` corresponds to the column vector $\vec{\mathbf{b}}^{(l)}$ with dimensions $[n^{(l)}\times 1]$.
Data Types
We previously agreed that lower indices denote a column vector $x_i.$ Formally, this is a matrix with $n$ rows and one column $[n\times 1]$.
Upper indices $x^i$ denote the covector, or row vector. Formally, this is a matrix with dimensions $[1\times n]$.
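This convention is easy to check directly in NumPy — a minimal sketch (the values are arbitrary illustrations):

```python
import numpy as np

# A column vector x_i is a matrix with n rows and one column: shape (n, 1)
x = np.array([[0.1], [0.5], [0.9]])
# The corresponding row vector x^i is its transpose: shape (1, n)
x_row = x.T

print(x.shape)      # (3, 1)
print(x_row.shape)  # (1, 3)
```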
Any matrices filled with decimal numbers in the NumPy library are objects of type NDArray[np.float64], so to distinguish their dimensions we will introduce explicit type annotations for the data we will operate on in the code:
import numpy as np
from numpy.typing import NDArray
# Geometric types
# ===================
# [n × 1] — column vector
Vector = NDArray[np.float64]
# [1 × n] — row vector
CoVector = NDArray[np.float64]
# [m × n] — weight matrix
Matrix = NDArray[np.float64]
# Semantic types
# ==================
# [n × 1] — single example (set of inputs x⃗ or intermediate h⃗)
Sample = NDArray[np.float64]
# [n × N] — N examples (N sets of inputs x⃗ or intermediate h⃗ for the entire batch)
Batch = NDArray[np.float64]
# [n × N_total] — entire dataset of features x⃗ from AstroDataBank
Dataset = NDArray[np.float64]
This does not affect functionality — we introduce these annotations solely for code readability.
Activation Functions for the Neural Network
First, we will implement the activation functions in the neural network that we will need later.
Output Layer
Recall that the activation function for the output layer is the sigmoid $\sigma(z^i) = \dfrac{1}{1 + e^{-z^i}}$ — it is applied for each $i$-th example in the batch. Instead of applying the sigmoid to each $z^i$, we can delegate this element-wise computation to the NumPy library, which performs this task in compiled C code, yielding a significant speed advantage. To do this, it is enough to pass the entire set of $N$ examples to the function, i.e., the row vector $\vec{\mathbf{z}}$ with dimensions $[1\times N]$:
def sigmoid(z: Batch) -> Batch:
return 1.0 / (1.0 + np.exp(-z))
Hidden Layers
The activation function for hidden layers $\text{ReLU}(z^i) = \max(0, z^i)$ for each $i$-th example in the batch is implemented on the same principle — we delegate the element-wise function application to NumPy by passing it the entire set of $N$ examples:
def relu(z: Batch) -> Batch:
return np.maximum(0, z)
The derivative of the ReLU function is 1 for $z^i>0$ and 0 in all other cases. We can represent this as an array of boolean values $\{z^i>0\}=(\text{True} \mid \text{False})^i$, in which we then element-wise convert each $\text{True}$ value to 1.0 and each $\text{False}$ to 0.0 using NumPy's `.astype()` method:
def d_relu(z: Batch) -> Batch:
result: Batch = (z > 0).astype(np.float64)
return result
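A tiny sketch of both functions applied element-wise to a batch (the input values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return (z > 0).astype(np.float64)

# One row (a single neuron), three examples in the batch
z = np.array([[-2.0, 0.0, 3.0]])
print(relu(z))    # [[0. 0. 3.]]
print(d_relu(z))  # [[0. 0. 1.]]
```

Note that at $z=0$ the derivative is taken to be 0, which is the usual convention for ReLU.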
Gradient Descent for the Neural Network
Loss Function
Recall that the loss function for all examples in the batch is described by the formula:
$$ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y^i \ln \hat{p}^{\,i} + \left(1 - y^i\right)\ln\left(1 - \hat{p}^{\,i}\right)\right] $$
where $\hat{p}^i$ is the predicted probability of a specific event for the $i$-th horoscope example, and $y^i$ is the label of the actual event (1 if the event occurred and 0 otherwise). The NumPy library has a built-in averaging function `mean`, which is equivalent to $\frac{1}{N}\sum_1^N$, so the loss function can be implemented as:
def loss(self, p_hat: Batch, y: Batch) -> float:
"""
Binary cross-entropy (Bernoulli log-likelihood)
"""
return float(
-np.mean(
y * np.log(p_hat + 1e-9)
+ (1 - y) * np.log(1 - p_hat + 1e-9)
)
)
Here `p_hat` and `y` are row vectors with dimensions $[1\times N]$.
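A quick sanity check of this implementation on a hand-computable two-example batch (the numbers are purely illustrative):

```python
import numpy as np

def loss(p_hat, y):
    # Binary cross-entropy averaged over the batch; 1e-9 guards against log(0)
    return float(-np.mean(y * np.log(p_hat + 1e-9)
                          + (1 - y) * np.log(1 - p_hat + 1e-9)))

# Two examples: a confident correct prediction and a maximally uncertain one
p_hat = np.array([[0.9, 0.5]])
y = np.array([[1.0, 0.0]])
print(round(loss(p_hat, y), 4))  # 0.3993, i.e. (-ln 0.9 - ln 0.5) / 2
```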
Weight and Bias Initialization
We mentioned in the first part of the article that for stable training, the weights in each layer $l$ should be initialized with random numbers having zero mean and variance $2/n^{(l-1)}$, where $n^{(l-1)}$ is the number of neurons in the previous layer.
For a Single Layer
In the NumPy library, initializing the weight matrix $W^{(l)}$ with dimensions $[n^{(l)}\times n^{(l-1)}]$ for each $l$-th layer can be implemented as follows:
W_layer: Matrix = np.random.normal( # Normal distribution
loc=0.0, # mean value
scale=np.sqrt(2.0 / n_prev), # Standard deviation
size=(n_curr, n_prev), # Matrix size
).astype(np.float64) # decimal numbers in matrix cells
We also need to initialize the bias column vector $b^{(l)}$ with dimensions $[n^{(l)}\times 1]$ for each $l$-th layer to zeros. In NumPy notation, this can be implemented as:
b_layer: Vector = np.zeros( # Array of zeros
(n_curr, 1), # as a column
dtype=np.float64 # in decimal form, i.e., 0.0
)
For All Layers
To initialize the parameters in all layers of the neural network, we simply apply these formulas sequentially at the junctions of adjacent layers:
for layer in range(len(dims) - 1):
n_prev = dims[layer] # number of neurons in the previous layer
n_curr = dims[layer + 1] # number of neurons in the current layer
# Next, initialize weight matrices and bias vectors
After that, we add these initial values to the resulting list of weights and biases for all layers of the neural network self.W[0], self.W[1], self.W[2] and self.b[0], self.b[1], self.b[2], which correspond to $W^{(1)}, W^{(2)}, W^{(3)}$ and $b^{(1)}, b^{(2)}, b^{(3)}$ in our formulas.
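Putting the loop and the per-layer initialization together, a standalone sketch for the example `dims = [8, 16, 9, 1]` confirms the expected shapes:

```python
import numpy as np

dims = [8, 16, 9, 1]   # input_dim = 8, hidden_dims = [16, 9], output layer = 1
W_list, b_list = [], []
for layer in range(len(dims) - 1):
    n_prev, n_curr = dims[layer], dims[layer + 1]
    # He initialization: N(0, 2/n_prev)
    W_list.append(np.random.normal(0.0, np.sqrt(2.0 / n_prev),
                                   size=(n_curr, n_prev)))
    # Biases start at zero, as a column vector
    b_list.append(np.zeros((n_curr, 1)))

print([W.shape for W in W_list])  # [(16, 8), (9, 16), (1, 9)]
print([b.shape for b in b_list])  # [(16, 1), (9, 1), (1, 1)]
```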
Forward Pass
As you remember from the previous article, after initializing the weights and biases, we first perform a forward pass, applying for each $i$-th example $(\vec{\mathbf{x}}, y)^i$ in the batch the formulas:
$$ \vec{\mathbf{h}}^{(l)\,i} = \text{ReLU}\left(W^{(l)} \vec{\mathbf{h}}^{(l-1)\,i} + \vec{\mathbf{b}}^{(l)}\right), \qquad \hat{p}^{\,i} = \sigma\left(W^{(L)} \vec{\mathbf{h}}^{(L-1)\,i} + b^{(L)}\right) $$
and then we store these values in a cache to use them later in the backward pass.
As usual, instead of applying these formulas cyclically to each example individually, from a speed perspective it is more efficient to pass the entire array of examples to the NumPy library — its matrix computations are significantly faster than a regular Python loop.
Layer Zero
In layer zero, for each $i$-th example in the batch $\vec{\mathbf{h}}^{(0)\,i} = \vec{\mathbf{x}}^i$ — a column vector with dimensions $[\text{input_dim} \times 1]$. And for all examples in the batch, this vector turns into a matrix $H^{(0)}=X$ with dimensions $[\text{input_dim}\times N]$.
In NumPy notation, this matrix can be defined as:
H: Batch = X
H_cache: list[Batch] = [H]
In the first line, we simply bind the variable `H` to the input matrix `X`. In the second line, we initialize the temporary storage `H_cache` for matrices $H^{(0)}, H^{(1)}, ... H^{(L-1)}$, which we will need during the backward pass. For initialization, we simply put the value $H^{(0)}$ into the `H_cache` array. In summary:
- `H_cache[0]`, `H_cache[1]`, `H_cache[2]` correspond to matrices $H^{(0)}, H^{(1)}, H^{(2)}$
- The dimension of each matrix $H^{(l)}$ is $[n^{(l)}\times N]$.
Hidden Layers
Recall that the output result $\vec{\mathbf{h}}^{(l)} = \text{ReLU}\left(W^{(l)} \vec{\mathbf{h}}^{(l-1)} + \vec{\mathbf{b}}^{(l)}\right)$ of each hidden layer $l$ is a column vector with dimensions $[n^{(l)}\times 1]$ for a single example. But for all $N$ examples in the batch, it is a matrix $H^{(l)}$ of size $[n^{(l)} \times N]$. In essence, it is the result of activation of the $l$-th hidden layer for the entire set of $N$ examples in the batch.
In NumPy notation, these matrices can be implemented as:
for W, b in zip(self.W[:-1], self.b[:-1]):
H = relu(W @ H + b)
H_cache.append(H)
Here, inside the loop in the first line:
- The variable `H` is the output of the previous layer $H^{(l-1)}$ with dimensions $[n^{(l-1)}\times N]$.
- The variable `W` is `self.W[layer-1]`, i.e., the matrix $W^{(l)}$ with dimensions $[n^{(l)}\times n^{(l-1)}]$.
- `b` is the vector `self.b[layer-1]` with dimensions $[n^{(l)}\times 1]$, which NumPy automatically expands to a matrix $[n^{(l)}\times N]$ by simple column repetition to enable the addition operation.
- `W @ H + b` is a matrix of size $[n^{(l)}\times N]$, to which ReLU is applied element-wise.
- The result of the element-wise activation function application is stored back in the variable `H`, which corresponds to $H^{(l)}$ in our formula and will be used in the next iteration of the loop.
In the second line of the loop, we simply store this resulting value $H^{(l)}$ in the cache for use in the backward pass.
The regular Python loop iterates through all layers. We take the set (self.W[layer], self.b[layer]) excluding the last layer, which corresponds to $(W^{(l)},\vec{\mathbf{b}}^{(l)}), l=1...L-1$ in our formulas.
Here, the zip function combines the arrays $W^{(l)}$ and $\vec{\mathbf{b}}^{(l)}$ into an array of pairs $(W^{(l)},\vec{\mathbf{b}}^{(l)})$, and the notation [:-1] means we iterate over all values self.W[0], self.W[1], ... and self.b[0], self.b[1], ..., except the last one, i.e., over $W^{(1)}, W^{(2)}, ... W^{(L-1)}$ and $b^{(1)}, b^{(2)}, ..., b^{(L-1)}$.
Output Layer
For the output layer, we apply the formula $\hat{p}^{\,i} = \sigma\left(\left(\vec{\mathbf{w}}^{(L)}\right)^{\text{T}}\;\vec{\mathbf{h}}^{(L-1)\,i} + b^{(L)}\right)$ for each $i$-th example. For a single example, this is a scalar, i.e., a $[1\times 1]$ matrix, and for the entire batch of examples, it is a row vector of size $[1\times N]$. As usual, we pass the entire set of examples from the batch to the NumPy library for fast element-wise sigmoid application:
p_hat: Batch = sigmoid(
self.W[-1] @ H + self.b[-1]
)
Here, the notation [-1] means we take the last element of the self.W array, i.e., the output layer matrix $W^{(L)}$ with dimensions $[n^{(L)}\times n^{(L-1)}] = [1\times n^{(L-1)}]$, multiply it by the output of the previous layer $H^{(L-1)}$ of size $[n^{(L-1)}\times N]$, which gives us a row vector of size $[1\times N]$, and then we add the last element of the self.b list with dimensions $[1\times 1]$, which the NumPy library converts into a row vector of size $[1\times N]$ by simply repeating the number b $N$ times.
As a result, we obtain a row vector $\vec{\mathbf{\hat{p}}}$ with numbers between 0 and 1 — the probabilities of the event occurring for each example in the batch.
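A self-contained sketch of the whole forward pass on a toy network (random weights and a random batch, purely illustrative) confirms the expected dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims = 32, [8, 16, 9, 1]
# He-initialized weights and zero biases for each pair of adjacent layers
W = [rng.normal(0, np.sqrt(2 / dims[i]), (dims[i + 1], dims[i]))
     for i in range(len(dims) - 1)]
b = [np.zeros((dims[i + 1], 1)) for i in range(len(dims) - 1)]

H = rng.random((8, N))                  # H⁽⁰⁾ = X, shape [input_dim × N]
H_cache = [H]
for Wl, bl in zip(W[:-1], b[:-1]):      # hidden layers
    H = np.maximum(0, Wl @ H + bl)
    H_cache.append(H)
p_hat = 1.0 / (1.0 + np.exp(-(W[-1] @ H + b[-1])))   # output layer

print(p_hat.shape)                      # (1, 32) — one probability per example
print([h.shape for h in H_cache])       # [(8, 32), (16, 32), (9, 32)]
```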
Backward Pass
For the backward pass, we first need to compute the delta for the output layer, and then, using the recurrent formula, the deltas for the remaining layers.
Output Layer Delta
Recall that the output layer delta for each $i$-th example is a scalar $\delta^{(L)\,i} = \frac{1}{N}\left(\hat{p}^{\,i} - y^i\right).$ But for all examples in the batch, it is a row vector of size $[1\times N]$.
In NumPy library notation, this vector is implemented as:
delta: Batch = (p_hat - y) / N
Here, the subtraction operation of two arrays (p_hat - y) computes the difference for each element of the batch vector element-wise, and the division by the scalar N automatically performs the averaging.
Hidden Layers Deltas
As you remember, for hidden layers, the recurrent formula for the delta of each layer for each $i$-th example looks like: $$ \vec{\boldsymbol{\delta}}^{(l)\,i} = \left(W^{(l+1)}\right)^T \vec{\boldsymbol{\delta}}^{(l+1)\,i} \odot f'(\vec{\mathbf{z}}^{(l)\,i}) $$
Step 1
First, we implement the computation of $\vec{\mathbf{z}}^{(l)\,i}.$
For each $i$-th example, $\vec{\mathbf{z}}^{(l)\,i} = W^{(l)} \vec{\mathbf{h}}^{(l-1)\,i} + \vec{\mathbf{b}}^{(l)}$ is a column vector with dimensions $[n^{(l)}\times 1]$, where $n^{(l)}$ is the number of neurons in the $l$-th layer. And for the entire batch of examples, it is a matrix $Z^{(l)}$ with dimensions $[n^{(l)}\times N].$
In NumPy notation, we can write this as:
z_cache[layer] = self.W[layer] @ H_cache[layer] + self.b[layer]
Recall that H_cache is a temporary storage (cache) for matrices $H^{(0)}, H^{(1)}, ... H^{(L-1)}$, which we created during the forward pass. Each of the matrices has dimensions $[n^{(l)}\times N]$, and self.W[layer] has dimensions $[n^{(l+1)}\times n^{(l)}]$, so the matrix product self.W[layer] @ H_cache[layer] has dimensions $[n^{(l+1)}\times N]$.
The individual element self.b[layer] has dimensions $[n^{(l+1)}\times 1]$, but NumPy automatically expands it to a matrix $[n^{(l+1)}\times N]$ by simple column repetition.
In summary:
- The array elements `z_cache[0]`, `z_cache[1]`, ... correspond to $Z^{(1)}, Z^{(2)}, ...$
- Each matrix $Z^{(l)}$ has dimensions $[n^{(l)}\times N].$
Step 2
Now that we know $Z^{(l)}$ for all hidden layers $l=1, 2, …, L-1$, we can proceed to implement the delta for hidden layers. For each hidden layer, for the $i$-th example, $\vec{\boldsymbol{\delta}}^{(l)\,i} = \left(W^{(l+1)}\right)^T \vec{\boldsymbol{\delta}}^{(l+1)\,i} \odot f'(\vec{\mathbf{z}}^{(l)\,i})$ is a column vector with dimensions $[n^{(l)}\times 1]$, and for the entire batch of $N$ examples, it is already a matrix with dimensions $[n^{(l)}\times N]$: $$ \Delta^{(l)} = \left(W^{(l+1)}\right)^T \Delta^{(l+1)} \odot f'(Z^{(l)}) $$
where $l$ takes the values $L-1, L-2, …, 1$.
In NumPy notation, this matrix for each layer can be implemented like this:
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
- Here, we iterate in reverse order over the indices `layer` from the last one down to 1, which corresponds to the hidden layers $l=L-1, L-2, \dots, 1$ (recall that `self.W[layer]` holds $W^{(l+1)}$).
- Then, in each layer, we perform the matrix multiplication `self.W[layer].T @ delta`, which corresponds to multiplying the matrix $\left(W^{(l+1)}\right)^\text{T}$ of size $[n^{(l)}\times n^{(l+1)}]$ by $\Delta^{(l+1)}$ with dimensions $[n^{(l+1)}\times N]$ for the entire batch. The resulting matrix has dimensions $[n^{(l)}\times N].$
- Finally, we element-wise multiply this matrix of size $[n^{(l)}\times N]$ by the derivative $f'(Z^{(l)})$ of the same dimensions, which yields $\Delta^{(l)}$.
Weight Gradients
Now that we know the delta vector $\vec{\boldsymbol{\delta}}^{(L)}$ of the output layer with dimensions $[1\times N]$ for the entire batch and each subsequent delta matrix $\Delta^{(l)}$ of the hidden layer, we can implement the gradient formulas.
Recall that for the output layer:
- For a single example $(\nabla_{\vec{\mathbf{w}}^{(L)}} \mathcal{L}) = \delta^{(L)} \cdot \left(\vec{\mathbf{h}}^{(L-1)}\right)^\text{T}$ is a vector of size $[1\times n^{(L-1)}]$
- For the entire batch of examples — this is the product of the row vector $\vec{\boldsymbol{\delta}}^{(L)}$ with dimensions $[1\times N]$ and the transposed matrix $\left(H^{(L-1)}\right)^\text{T}$ with dimensions $[N\times n^{(L-1)}]$. That is, the result of the product is a matrix of size $[1\times n^{(L-1)}]$
$$ \nabla_{W^{(L)}} \mathcal{L} = \vec{ \boldsymbol{\delta}}^{(L)} \cdot \left(H^{(L-1)}\right)^\text{T} $$
For hidden layers:
- For a single example $\nabla_{W^{(l)}} \mathcal{L} = \vec{ \boldsymbol{\delta}}^{(l)} \cdot \left(\vec{\mathbf{h}}^{(l-1)}\right)^\text{T}$ is a matrix of size $[n^{(l)}\times n^{(l-1)}]$
- For the entire batch of examples — this is the product of the matrix $\Delta^{(l)}$ with dimensions $[n^{(l)}\times N]$ and the transposed matrix $\left(H^{(l-1)}\right)^\text{T}$ with dimensions $[N\times n^{(l-1)}]$. That is, the result of the product is a matrix of size $[n^{(l)}\times n^{(l-1)}]$:
$$ \nabla_{W^{(l)}} \mathcal{L} = \Delta^{(l)} \cdot \left(H^{(l-1)}\right)^\text{T} $$
These formulas are implemented in NumPy with a single line of matrix multiplication:
delta @ H_cache[layer - 1].T
Here `delta` is the matrix $\Delta^{(l)}$ of size $[n^{(l)}\times N]$. For the last layer it has dimensions $[1\times N]$, since in our network there is only 1 neuron in the output layer: $n^{(L)} = 1.$
`H_cache[layer - 1]` is an element of the array $H^{(0)}, H^{(1)}, ... H^{(L-1)}$ saved during the forward pass, which corresponds to the previous layer.
Since we go through all layers from the last to the first in the backward pass, we save these gradients in reverse order into an array:
# Array of gradients for each layer
dW: list[Matrix] = []
# Start with the output layer
delta: Batch = (p_hat - y) / N
dW.insert(0, delta @ H_cache[-1].T)
# Then recursively move to the first layer
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
dW.insert(0, delta @ H_cache[layer - 1].T)
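A standalone sketch of this backward loop on a toy network (random data, with `d_relu` inlined) shows that each gradient matrix matches the shape of its weight matrix, as the update step requires:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dims = 32, [8, 16, 9, 1]
W = [rng.normal(0, np.sqrt(2 / dims[i]), (dims[i + 1], dims[i]))
     for i in range(len(dims) - 1)]
b = [np.zeros((dims[i + 1], 1)) for i in range(len(dims) - 1)]

# Forward pass with activation cache
H_cache = [rng.random((8, N))]
for Wl, bl in zip(W[:-1], b[:-1]):
    H_cache.append(np.maximum(0, Wl @ H_cache[-1] + bl))
p_hat = 1.0 / (1.0 + np.exp(-(W[-1] @ H_cache[-1] + b[-1])))
y = rng.integers(0, 2, (1, N)).astype(np.float64)
z_cache = [W[l] @ H_cache[l] + b[l] for l in range(len(W) - 1)]

# Backward pass, collecting weight gradients in reverse order
dW = []
delta = (p_hat - y) / N
dW.insert(0, delta @ H_cache[-1].T)
for layer in range(len(W) - 1, 0, -1):
    delta = (W[layer].T @ delta) * (z_cache[layer - 1] > 0).astype(np.float64)
    dW.insert(0, delta @ H_cache[layer - 1].T)

# Every gradient has exactly the shape of its weight matrix
print([dw.shape for dw in dW])  # [(16, 8), (9, 16), (1, 9)]
```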
Bias Gradients
Similarly, we can now implement the formulas for the bias gradients.
Recall that for the output layer:
- For a single example $\nabla_{b^{(L)}} \mathcal{L} = \dfrac{\partial \mathcal{L}}{\partial z^{(L)}} \dfrac{\partial z^{(L)}}{\partial b^{(L)}} = \delta^{(L)} \cdot 1$ is a scalar with dimensions $[1\times 1]$
- For $N$ examples — this is the product of the row vector $\vec{\boldsymbol{\delta}}^{(L)}$ with dimensions $[1\times N]$ for the entire batch and the all-ones column vector $\vec{1}$ with dimensions $[N\times 1]$, i.e., the bias gradient is a scalar with dimensions $[1\times 1]$
For hidden layers:
- For a single example $\nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)}\cdot 1$ is a vector of size $[n^{(l)}\times 1]$
- For all examples in the batch — this is the product of the matrix $\Delta^{(l)}$ of size $[n^{(l)}\times N]$ and the all-ones column vector $\vec{1}$ with dimensions $[N\times 1]$, i.e., the resulting gradient is a column vector of size $[n^{(l)}\times 1]$.
We can implement this gradient via matrix multiplication in NumPy:
delta @ np.ones((N, 1))
But even faster is to simply sum the deltas for all examples in the batch:
delta.sum(axis=1, keepdims=True)
Here delta is the matrix $\Delta^{(l)}$ of size $[n^{(l)}\times N]$. In particular, for the output layer it has dimensions $[1\times N]$.
sum(axis=1) means we sum over the columns, i.e., over N. The keepdims=True parameter means we preserve the matrix dimension after summation.
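The two variants can be checked against each other directly on toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
delta = rng.normal(size=(3, N))          # Δ⁽ˡ⁾: one column per example

via_matmul = delta @ np.ones((N, 1))     # matrix-product form
via_sum = delta.sum(axis=1, keepdims=True)  # direct summation over columns

print(bool(np.allclose(via_matmul, via_sum)))  # True
print(via_sum.shape)                           # (3, 1)
```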
Since we go through all layers from the last to the first in the backward pass, we save these gradients in reverse order into an array:
# Array of gradients for each layer
db: list[Vector] = []
# Start with the output layer
delta: Batch = (p_hat - y) / N
db.insert(0, delta.sum(axis=1, keepdims=True))
# Then recursively move to the first layer
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
db.insert(0, delta.sum(axis=1, keepdims=True))
General Gradient Descent Algorithm
Step 1. Shuffle the dataset at the start of the epoch
For each epoch, we take the entire dataset and shuffle it. This can be implemented as:
N_total: int = X.shape[1]
for epoch in range(1, self.epochs + 1):
idx = np.random.permutation(N_total)
X, y = X[:, idx], y[:, idx]
Here `X` is the matrix of astrological features of all examples taken from AstroDataBank, i.e., a matrix of size $[\text{input_dim}\times N_\text{total}]$.
First, we take the indices $i$ of all examples and shuffle them, then we reassemble the examples $(\vec{\mathbf{x}}, y)^i$ in the order of the new (shuffled) indices.
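A toy check that shuffling by a common column permutation keeps each example paired with its label (the data here is synthetic, constructed so that row 0 of `X` equals `y`):

```python
import numpy as np

np.random.seed(0)
# 6 toy examples: row 0 of X deliberately equals the label row y
X = np.vstack([np.arange(6.0), np.arange(6.0, 12.0)])  # [2 × 6]
y = np.arange(6.0).reshape(1, 6)                       # [1 × 6]

idx = np.random.permutation(X.shape[1])
Xs, ys = X[:, idx], y[:, idx]

# The same permutation is applied to the columns of both matrices,
# so every example still carries its own label after shuffling:
print(bool(np.array_equal(Xs[0], ys[0])))  # True
```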
Step 2. Split the dataset into batches
Next, we split the dataset into batches and start training on each individual batch of N examples. This split can be implemented as:
for s in range(0, N_total, self.batch_size):
Xb: Batch = X[:, s: s + self.batch_size]
yb: Batch = y[:, s: s + self.batch_size]
Here, we iterate over the start indices $s = 0, N, 2N, \dots$ of the dataset elements, up to $N_\text{total}$, with a step of $N$ (the batch size).
Then, we extract two submatrices from the dataset — $X_\text{batch}$ of size $[\text{input_dim}\times N]$ and $y_\text{batch}$ of size $[1\times N]$ — by simply selecting columns in the range from index $s$ to $s+N$. These two submatrices together form a batch of $N$ examples $(\vec{\mathbf{x}}, y)^i$.
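Note that when $N_\text{total}$ is not a multiple of the batch size, NumPy simply clips the last slice, so the final batch comes out smaller — a toy illustration:

```python
import numpy as np

N_total, batch_size = 10, 4
X = np.random.rand(8, N_total)
y = np.random.rand(1, N_total)

sizes = []
for s in range(0, N_total, batch_size):
    Xb = X[:, s: s + batch_size]   # slicing past the end is clipped, not an error
    yb = y[:, s: s + batch_size]
    sizes.append(Xb.shape[1])

print(sizes)  # [4, 4, 2] — the final batch holds the remaining 2 examples
```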
Step 3. Forward and Backward Passes
Now we must run a forward pass on our batch and store the results of the layer activations — the row vector $\vec{\mathbf{\hat{p}}}$ for the output layer and the matrices $H^{(l)}$ for all intermediate layers.
Then, using these stored quantities, we perform the backward pass and store the gradients for weights and biases for all layers.
This step can be implemented as:
p_hat, H_cache = self.forward(Xb)
dW, db = self.backward(p_hat, yb, H_cache)
Step 4. Update Parameters
Gradients indicate the direction of increase of the loss function, so we need to take a small step in the opposite direction for every layer:
$$ W^{(l)} \leftarrow W^{(l)} - \lambda \, \nabla_{W^{(l)}} \mathcal{L}, \qquad \vec{\mathbf{b}}^{(l)} \leftarrow \vec{\mathbf{b}}^{(l)} - \lambda \, \nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} $$
This parameter update can be implemented via a regular Python loop over all layers:
self.W = [W - self.lr * dw for W, dw in zip(self.W, dW)]
self.b = [b - self.lr * db_ for b, db_ in zip(self.b, db)]
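A minimal sketch of this update rule on toy weights and gradients (values chosen only to make the arithmetic easy to check):

```python
import numpy as np

lr = 0.05
W = [np.ones((2, 3)), np.ones((1, 2))]             # toy weights for two layers
dW = [np.full((2, 3), 0.5), np.full((1, 2), 0.5)]  # toy gradients

# Step against the gradient in every layer
W = [w - lr * dw for w, dw in zip(W, dW)]

print(W[0][0, 0])            # 0.975 = 1 - 0.05 * 0.5
print([w.shape for w in W])  # the update preserves all shapes
```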
Complete Code
Now we can put everything together into a single code. This code emulates an example of astrological features and event labels and performs training on a small neural network based on them.
Then the code evaluates the prediction accuracy on the same dataset.
import numpy as np
from numpy.typing import NDArray
from typing import cast
# pyright: reportConstantRedefinition=false
# Geometric types
Vector = NDArray[np.float64] # [n × 1] — column
CoVector = NDArray[np.float64] # [1 × n] — row
Matrix = NDArray[np.float64] # [m × n] — weight matrix
# Semantic types
Sample = NDArray[np.float64] # [n × 1] — single example x⃗
Batch = NDArray[np.float64] # [n × N] — N examples
Dataset = NDArray[np.float64] # [n × N_total] — entire dataset
# Activation functions and their derivatives
def sigmoid(z: Batch) -> Batch:
return 1.0 / (1.0 + np.exp(-z))
def relu(z: Batch) -> Batch:
return np.maximum(0, z)
def d_relu(z: Batch) -> Batch:
result: Batch = (z > 0).astype(np.float64)
return result
class BernoulliNet:
"""
Neural network for binary classification (Bernoulli distribution).
Parameters
----------
input_dim : size of the input vector x
hidden_dims : list of hidden layer sizes, e.g., [64, 32]
lr : gradient descent step λ
epochs : number of epochs
batch_size : mini-batch size
"""
W: list[Matrix]
b: list[Vector]
lr: float
epochs: int
batch_size: int
def __init__(self,
input_dim: int,
hidden_dims: list[int],
lr: float = 0.01,
epochs: int = 100,
batch_size: int = 32,
) -> None:
dims: list[int] = [input_dim] + hidden_dims + [1]
# Weight initialization: W ~ N(0, 2/n_prev)
self.W, self.b = self._init_weights(dims)
self.lr, self.epochs, self.batch_size = lr, epochs, batch_size
def _init_weights(self, dims: list[int]) -> tuple[list[Matrix], list[Vector]]:
"""
Initializes weights W and biases b for all layers:
W ~ N(μ=0, σ²=2/n_prev).
"""
W_list: list[Matrix] = []
b_list: list[Vector] = []
for layer in range(len(dims) - 1):
n_prev = dims[layer] # number of neurons in the previous layer
n_curr = dims[layer + 1] # number of neurons in the current layer
# Random matrix with normal distribution N(μ=0, σ²=2/n_prev),
W_layer: Matrix = np.random.normal(
loc=0.0,
scale=np.sqrt(2.0 / n_prev),
size=(n_curr, n_prev),
).astype(np.float64)
W_list.append(W_layer)
# Bias vector — column vector of zeros
b_layer: Vector = np.zeros((n_curr, 1), dtype=np.float64)
b_list.append(b_layer)
return W_list, b_list
def forward(self, X: Batch) -> tuple[Batch, list[Batch]]:
"""
Forward pass through the network: returns predictions p̂ for all examples in the batch
and the activation cache Hᵢ for the backward pass (for all examples at each layer).
Parameters
----------
X : Batch matrix of astrological features with dimensions [input_dim × N]
"""
# Forward pass through layer zero:
H: Batch = X
H_cache: list[Batch] = [H]
# Forward pass through hidden layers:
for W, b in zip(self.W[:-1], self.b[:-1]):
H = relu(W @ H + b) # size [n_current × N]
H_cache.append(H)
# Forward pass through the output layer:
p_hat: Batch = sigmoid(
self.W[-1] @ H + self.b[-1]) # size [1 × N]
return p_hat, H_cache
def loss(self, p_hat: Batch, y: Batch) -> float:
"""
Binary cross-entropy (Bernoulli log-likelihood)
"""
return float(
-np.mean(
y * np.log(p_hat + 1e-9)
+ (1 - y) * np.log(1 - p_hat + 1e-9)
)
)
def backward(self,
p_hat: Batch,
y: Batch,
H_cache: list[Batch]
) -> tuple[list[Matrix], list[Vector]]:
"""
Backward pass through the network: returns gradients dW and db
for all layers.
Parameters
----------
p_hat : Batch size [1 × N] — predictions for all examples in the batch
y : Batch size [1 × N] — event labels for all examples in the batch
cache : list[Batch] — stored activations H⁽ˡ⁾ for each layer, size [n⁽ˡ⁾ × N]
"""
N: int = y.shape[1]
dW: list[Matrix] = []
db: list[Vector] = []
# Output layer:
delta: Batch = (p_hat - y) / N
dW.insert(0, delta @ H_cache[-1].T)
db.insert(0, delta.sum(axis=1, keepdims=True))
# Hidden layers:
z_cache: list[Batch] = [
self.W[layer] @ H_cache[layer] + self.b[layer]
for layer in range(len(self.W) - 1)
]
# Backward pass through hidden layers
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
dW.insert(0, delta @ H_cache[layer - 1].T)
db.insert(0, delta.sum(axis=1, keepdims=True))
return dW, db
def train(self, X: Dataset, y: Dataset, verbose: bool = True) -> None:
"""
X : Dataset shape [input_dim × N_total] — features of all examples in the dataset
y : Dataset shape [1 × N_total] — labels of actual events {0, 1}
"""
N_total: int = X.shape[1]
for epoch in range(1, self.epochs + 1):
idx = np.random.permutation(N_total)
X, y = X[:, idx], y[:, idx]
for s in range(0, N_total, self.batch_size):
Xb: Batch = X[:, s: s + self.batch_size]
yb: Batch = y[:, s: s + self.batch_size]
p_hat, H_cache = self.forward(Xb)
dW, db = self.backward(p_hat, yb, H_cache)
self.W = [W - self.lr * dw for W, dw in zip(self.W, dW)]
self.b = [b - self.lr * db_ for b, db_ in zip(self.b, db)]
if verbose and epoch % 10 == 0:
p_hat, _ = self.forward(X)
print(f"Epoch {epoch:4d} | loss = {self.loss(p_hat, y):.4f}")
def predict_probability(self, X: Dataset) -> Dataset:
"""Returns p̂ ∈ (0,1) of the event for each column of the batch."""
p_hat, _ = self.forward(X)
return p_hat
def predict_label(self, X: Dataset, threshold: float = 0.5) -> Dataset:
"""Returns the observed binary event labels y = {0, 1}."""
result: Batch = (self.predict_probability(X) >=
threshold).astype(np.float64)
return result
# Example usage
if __name__ == "__main__":
np.random.seed(42)
N = 500
# Emulate astrological features (normalized to [0, 1])
X: Dataset = np.random.rand(8, N).astype(np.float64)
# Emulate binary event labels (e.g., "came true" or "did not come true")
y: Dataset = (X[0] + X[3] > 1.0).astype(np.float64).reshape(1, N)
# Model initialization and training
model = BernoulliNet(
input_dim=8,
hidden_dims=[16, 9],
lr=0.05,
epochs=100,
batch_size=32,
)
model.train(X, y)
# Evaluate accuracy on the training set
accuracy: float = cast(float, np.mean(model.predict_label(X) == y))
print(f"\nTraining accuracy: {accuracy:.2%}")
In a real scenario, accuracy is first evaluated on validation data to tune the model's hyperparameters, and then a final accuracy evaluation is performed on held-out test data. The trained weights and biases should be saved to a file for future use.
Also, in a real scenario, instead of the NumPy library you can use its GPU counterpart, CuPy, for fast computation on the graphics card.