In the previous article we described a machine learning algorithm using a neural network. In this article, we will build Python code to implement this algorithm. We will use the terms and notations from the previous article.
To work with vectors and matrices, we will use the NumPy library, whose core is implemented in C, so matrix operations run much faster than equivalent pure-Python loops.
[toc]
Neural Network Structure
Our task is to design a neural network that can be easily initialized with global parameters, namely:
- The dimensionality of the input astrological feature vector $x_i$ — `input_dim`
- The dimensionality of each hidden layer, set via an array of numbers `hidden_dims`; for example, `hidden_dims=[16, 9]` would mean we have 2 hidden layers with 16 and 9 neurons respectively
- The gradient descent step $\lambda$ (learning rate, `lr`)
- The number of epochs for training the model — `epochs`
- The number of examples $N$ in a single subsample from the training data — `batch_size`
For this, we will create a class BernoulliNet that can be initialized with these parameters, for example:
model = BernoulliNet(
input_dim=8,
hidden_dims=[16, 9],
lr=0.05,
epochs=100,
batch_size=32,
)
Here, the parameters `input_dim` and `hidden_dims` determine the overall structure of the neural network:
dims = [input_dim, 16, 9, 1]
# ↑ ↑ ↑ ↑
# layer 0 1 2 3 (output)
The weight matrices will have the following dimensions $[\text{rows}\times \text{cols}]$ in this example:
- $W^{(1)}$: $[16\times \text{input\_dim}]$
- $W^{(2)}$: $[9\times 16]$
- $W^{(3)}$: $[1\times 9]$
We will make the weights and biases internal properties of the model, i.e.:
class BernoulliNet:
def __init__(self,
input_dim: int,
hidden_dims: list[int],
lr: float = 0.01,
epochs: int = 100,
batch_size: int = 32,
) -> None:
# Dimension of the entire neural network
dims: list[int] = [input_dim] + hidden_dims + [1]
# Weight initialization
self.W, self.b = self._init_weights(dims)
Since we will iterate over the values $W^{(1)}, W^{(2)}, W^{(3)}$, in computer notation these will be array elements with indices self.W[0], self.W[1], self.W[2]. Similarly, the biases $b^{(1)}, b^{(2)}, b^{(3)}$ will be stored in the array self.b[0], self.b[1], self.b[2].
In the general case:
- The weight matrix `self.W[layer-1]` corresponds to $W^{(l)}$ with dimensions $[n^{(l)}\times n^{(l-1)}]$.
- The bias column vector `self.b[layer-1]` corresponds to the column vector $\vec{\mathbf{b}}^{(l)}$ with dimensions $[n^{(l)}\times 1]$.
Data Types
We previously agreed that lower indices denote a column vector $x_i.$ Formally, this is a matrix with $n$ rows and one column $[n\times 1]$.
Upper indices $x^i$ denote the covector, or row vector. Formally, this is a matrix with dimensions $[1\times n]$.
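This convention is easy to check directly in NumPy — a minimal sketch (the values are arbitrary illustrations):

```python
import numpy as np

# A column vector x_i is a matrix with n rows and one column: shape (n, 1)
x = np.array([[0.1], [0.5], [0.9]])
# The corresponding row vector x^i is its transpose: shape (1, n)
x_row = x.T

print(x.shape)      # (3, 1)
print(x_row.shape)  # (1, 3)
```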
Any matrices filled with decimal numbers in the NumPy library are objects of type NDArray[np.float64], so to distinguish their dimensions we will introduce explicit type annotations for the data we will operate on in the code:
import numpy as np
from numpy.typing import NDArray
# Geometric types
# ===================
# [n × 1] — column vector
Vector = NDArray[np.float64]
# [1 × n] — row vector
CoVector = NDArray[np.float64]
# [m × n] — weight matrix
Matrix = NDArray[np.float64]
# Semantic types
# ==================
# [n × 1] — single example (set of inputs x⃗ or intermediate h⃗)
Sample = NDArray[np.float64]
# [n × N] — N examples (N sets of inputs x⃗ or intermediate h⃗ for the entire batch)
Batch = NDArray[np.float64]
# [n × N_total] — entire dataset of features x⃗ from AstroDataBank
Dataset = NDArray[np.float64]
This does not affect functionality — we introduce these annotations solely for code readability.
Activation Functions for the Neural Network
First, we will implement the activation functions in the neural network that we will need later.
Output Layer
Recall that the activation function for the output layer is the sigmoid $\sigma(z^i) = \dfrac{1}{1 + e^{-z^i}}$ — it is applied for each $i$-th example in the batch. Instead of applying the sigmoid to each $z^i$, we can delegate this element-wise computation to the NumPy library, which performs this task in compiled C code, yielding a significant speed advantage. To do this, it is enough to pass the entire set of $N$ examples to the function, i.e., the row vector $\vec{\mathbf{z}}$ with dimensions $[1\times N]$:
def sigmoid(z: Batch) -> Batch:
return 1.0 / (1.0 + np.exp(-z))
Hidden Layers
The activation function for hidden layers $\text{ReLU}(z^i) = \max(0, z^i)$ for each $i$-th example in the batch is implemented on the same principle — we delegate the element-wise function application to NumPy by passing it the entire set of $N$ examples:
def relu(z: Batch) -> Batch:
return np.maximum(0, z)
The derivative of the ReLU function is 1 for $z^i>0$ and 0 in all other cases. We can represent this as an array of boolean values $\{z^i>0\}=(\text{True} \mid \text{False})^i$, in which we then element-wise convert each $\text{True}$ value to 1.0 and each $\text{False}$ to 0.0 using NumPy's `.astype()` method:
def d_relu(z: Batch) -> Batch:
result: Batch = (z > 0).astype(np.float64)
return result
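A tiny sketch of both functions applied element-wise to a batch (the input values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return (z > 0).astype(np.float64)

# One row (a single neuron), three examples in the batch
z = np.array([[-2.0, 0.0, 3.0]])
print(relu(z))    # [[0. 0. 3.]]
print(d_relu(z))  # [[0. 0. 1.]]
```

Note that at $z=0$ the derivative is taken to be 0, which is the usual convention for ReLU.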
Gradient Descent for the Neural Network
Loss Function
Recall that the loss function for all examples in the batch is described by the formula:
$$ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y^i \ln \hat{p}^{\,i} + \left(1 - y^i\right)\ln\left(1 - \hat{p}^{\,i}\right)\right] $$
where $\hat{p}^i$ is the predicted probability of a specific event for the $i$-th horoscope example, and $y^i$ is the label of the actual event (1 if the event occurred and 0 otherwise). The NumPy library has a built-in averaging function `mean`, which is equivalent to $\frac{1}{N}\sum_1^N$, so the loss function can be implemented as:
def loss(self, p_hat: Batch, y: Batch) -> float:
"""
Binary cross-entropy (Bernoulli log-likelihood)
"""
return float(
-np.mean(
y * np.log(p_hat + 1e-9)
+ (1 - y) * np.log(1 - p_hat + 1e-9)
)
)
Here `p_hat` and `y` are row vectors with dimensions $[1\times N]$.
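A quick sanity check of this implementation on a hand-computable two-example batch (the numbers are purely illustrative):

```python
import numpy as np

def loss(p_hat, y):
    # Binary cross-entropy averaged over the batch; 1e-9 guards against log(0)
    return float(-np.mean(y * np.log(p_hat + 1e-9)
                          + (1 - y) * np.log(1 - p_hat + 1e-9)))

# Two examples: a confident correct prediction and a maximally uncertain one
p_hat = np.array([[0.9, 0.5]])
y = np.array([[1.0, 0.0]])
print(round(loss(p_hat, y), 4))  # 0.3993, i.e. (-ln 0.9 - ln 0.5) / 2
```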
Weight and Bias Initialization
We mentioned in the first part of the article that for stable training, the weights in each layer $l$ should be initialized with random numbers having zero mean and variance $2/n^{(l-1)}$, where $n^{(l-1)}$ is the number of neurons in the previous layer.
For a Single Layer
In the NumPy library, initializing the weight matrix $W^{(l)}$ with dimensions $[n^{(l)}\times n^{(l-1)}]$ for each $l$-th layer can be implemented as follows:
W_layer: Matrix = np.random.normal( # Normal distribution
loc=0.0, # mean value
scale=np.sqrt(2.0 / n_prev), # Standard deviation
size=(n_curr, n_prev), # Matrix size
).astype(np.float64) # decimal numbers in matrix cells
We also need to initialize the bias column vector $b^{(l)}$ with dimensions $[n^{(l)}\times 1]$ for each $l$-th layer to zeros. In NumPy notation, this can be implemented as:
b_layer: Vector = np.zeros( # Array of zeros
(n_curr, 1), # as a column
dtype=np.float64 # in decimal form, i.e., 0.0
)
For All Layers
To initialize the parameters in all layers of the neural network, we simply apply these formulas sequentially at the junctions of adjacent layers:
for layer in range(len(dims) - 1):
n_prev = dims[layer] # number of neurons in the previous layer
n_curr = dims[layer + 1] # number of neurons in the current layer
# Next, initialize weight matrices and bias vectors
After that, we add these initial values to the resulting list of weights and biases for all layers of the neural network self.W[0], self.W[1], self.W[2] and self.b[0], self.b[1], self.b[2], which correspond to $W^{(1)}, W^{(2)}, W^{(3)}$ and $b^{(1)}, b^{(2)}, b^{(3)}$ in our formulas.
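Putting the loop and the per-layer initialization together, a standalone sketch for the example `dims = [8, 16, 9, 1]` confirms the expected shapes:

```python
import numpy as np

dims = [8, 16, 9, 1]   # input_dim = 8, hidden_dims = [16, 9], output layer = 1
W_list, b_list = [], []
for layer in range(len(dims) - 1):
    n_prev, n_curr = dims[layer], dims[layer + 1]
    # He initialization: N(0, 2/n_prev)
    W_list.append(np.random.normal(0.0, np.sqrt(2.0 / n_prev),
                                   size=(n_curr, n_prev)))
    # Biases start at zero, as a column vector
    b_list.append(np.zeros((n_curr, 1)))

print([W.shape for W in W_list])  # [(16, 8), (9, 16), (1, 9)]
print([b.shape for b in b_list])  # [(16, 1), (9, 1), (1, 1)]
```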
Forward Pass
As you remember from the previous article, after initializing the weights and biases, we first perform a forward pass, applying for each $i$-th example $(\vec{\mathbf{x}}, y)^i$ in the batch the formulas:
$$ \vec{\mathbf{h}}^{(l)\,i} = \text{ReLU}\left(W^{(l)} \vec{\mathbf{h}}^{(l-1)\,i} + \vec{\mathbf{b}}^{(l)}\right), \qquad \hat{p}^{\,i} = \sigma\left(W^{(L)} \vec{\mathbf{h}}^{(L-1)\,i} + b^{(L)}\right) $$
and then we store these values in a cache to use them later in the backward pass.
As usual, instead of applying these formulas cyclically to each example individually, from a speed perspective it is more efficient to pass the entire array of examples to the NumPy library — its matrix computations are significantly faster than a regular Python loop.
Layer Zero
In layer zero, for each $i$-th example in the batch $\vec{\mathbf{h}}^{(0)\,i} = \vec{\mathbf{x}}^i$ — a column vector with dimensions $[\text{input_dim} \times 1]$. And for all examples in the batch, this vector turns into a matrix $H^{(0)}=X$ with dimensions $[\text{input_dim}\times N]$.
In NumPy notation, this matrix can be defined as:
H: Batch = X
H_cache: list[Batch] = [H]
In the first line, we simply bind the variable `H` to the input matrix `X`. In the second line, we initialize the temporary storage `H_cache` for matrices $H^{(0)}, H^{(1)}, ... H^{(L-1)}$, which we will need during the backward pass. For initialization, we simply put the value $H^{(0)}$ into the `H_cache` array. In summary:
- `H_cache[0]`, `H_cache[1]`, `H_cache[2]` correspond to matrices $H^{(0)}, H^{(1)}, H^{(2)}$
- The dimension of each matrix $H^{(l)}$ is $[n^{(l)}\times N]$.
Hidden Layers
Recall that the output result $\vec{\mathbf{h}}^{(l)} = \text{ReLU}\left(W^{(l)} \vec{\mathbf{h}}^{(l-1)} + \vec{\mathbf{b}}^{(l)}\right)$ of each hidden layer $l$ is a column vector with dimensions $[n^{(l)}\times 1]$ for a single example. But for all $N$ examples in the batch, it is a matrix $H^{(l)}$ of size $[n^{(l)} \times N]$. In essence, it is the result of activation of the $l$-th hidden layer for the entire set of $N$ examples in the batch.
In NumPy notation, these matrices can be implemented as:
for W, b in zip(self.W[:-1], self.b[:-1]):
H = relu(W @ H + b)
H_cache.append(H)
Here, inside the loop in the first line:
- The variable `H` is the output of the previous layer $H^{(l-1)}$ with dimensions $[n^{(l-1)}\times N]$.
- The variable `W` is `self.W[layer-1]`, i.e., the matrix $W^{(l)}$ with dimensions $[n^{(l)}\times n^{(l-1)}]$.
- `b` is the vector `self.b[layer-1]` with dimensions $[n^{(l)}\times 1]$, which NumPy automatically expands to a matrix $[n^{(l)}\times N]$ by simple column repetition to enable the addition operation.
- `W @ H + b` is a matrix of size $[n^{(l)}\times N]$, to which ReLU is applied element-wise.
- The result of the element-wise activation function application is stored back in the variable `H`, which corresponds to $H^{(l)}$ in our formula and will be used in the next iteration of the loop.
In the second line of the loop, we simply store this resulting value $H^{(l)}$ in the cache for use in the backward pass.
The regular Python loop iterates through all layers. We take the set (self.W[layer], self.b[layer]) excluding the last layer, which corresponds to $(W^{(l)},\vec{\mathbf{b}}^{(l)}), l=1...L-1$ in our formulas.
Here, the zip function combines the arrays $W^{(l)}$ and $\vec{\mathbf{b}}^{(l)}$ into an array of pairs $(W^{(l)},\vec{\mathbf{b}}^{(l)})$, and the notation [:-1] means we iterate over all values self.W[0], self.W[1], ... and self.b[0], self.b[1], ..., except the last one, i.e., over $W^{(1)}, W^{(2)}, ... W^{(L-1)}$ and $b^{(1)}, b^{(2)}, ..., b^{(L-1)}$.
Output Layer
For the output layer, we apply the formula $\hat{p}^{\,i} = \sigma\left(\left(\vec{\mathbf{w}}^{(L)}\right)^{\text{T}}\;\vec{\mathbf{h}}^{(L-1)\,i} + b^{(L)}\right)$ for each $i$-th example. For a single example, this is a scalar, i.e., a $[1\times 1]$ matrix, and for the entire batch of examples, it is a row vector of size $[1\times N]$. As usual, we pass the entire set of examples from the batch to the NumPy library for fast element-wise sigmoid application:
p_hat: Batch = sigmoid(
self.W[-1] @ H + self.b[-1]
)
Here, the notation [-1] means we take the last element of the self.W array, i.e., the output layer matrix $W^{(L)}$ with dimensions $[n^{(L)}\times n^{(L-1)}] = [1\times n^{(L-1)}]$, multiply it by the output of the previous layer $H^{(L-1)}$ of size $[n^{(L-1)}\times N]$, which gives us a row vector of size $[1\times N]$, and then we add the last element of the self.b list with dimensions $[1\times 1]$, which the NumPy library converts into a row vector of size $[1\times N]$ by simply repeating the number b $N$ times.
As a result, we obtain a row vector $\vec{\mathbf{\hat{p}}}$ with numbers between 0 and 1 — the probabilities of the event occurring for each example in the batch.
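A self-contained sketch of the whole forward pass on a toy network (random weights and a random batch, purely illustrative) confirms the expected dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dims = 32, [8, 16, 9, 1]
# He-initialized weights and zero biases for each pair of adjacent layers
W = [rng.normal(0, np.sqrt(2 / dims[i]), (dims[i + 1], dims[i]))
     for i in range(len(dims) - 1)]
b = [np.zeros((dims[i + 1], 1)) for i in range(len(dims) - 1)]

H = rng.random((8, N))                  # H⁽⁰⁾ = X, shape [input_dim × N]
H_cache = [H]
for Wl, bl in zip(W[:-1], b[:-1]):      # hidden layers
    H = np.maximum(0, Wl @ H + bl)
    H_cache.append(H)
p_hat = 1.0 / (1.0 + np.exp(-(W[-1] @ H + b[-1])))   # output layer

print(p_hat.shape)                      # (1, 32) — one probability per example
print([h.shape for h in H_cache])       # [(8, 32), (16, 32), (9, 32)]
```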
Backward Pass
For the backward pass, we first need to compute the delta for the output layer, and then, using the recurrent formula, the deltas for the remaining layers.
Output Layer Delta
Recall that the output layer delta for each $i$-th example is a scalar $\delta^{(L)\,i} = \frac{1}{N}\left(\hat{p}^{\,i} - y^i\right).$ But for all examples in the batch, it is a row vector of size $[1\times N]$.
In NumPy library notation, this vector is implemented as:
delta: Batch = (p_hat - y) / N
Here, the subtraction operation of two arrays (p_hat - y) computes the difference for each element of the batch vector element-wise, and the division by the scalar N automatically performs the averaging.
Hidden Layers Deltas
As you remember, for hidden layers, the recurrent formula for the delta of each layer for each $i$-th example looks like: $$ \vec{\boldsymbol{\delta}}^{(l)\,i} = \left(W^{(l+1)}\right)^T \vec{\boldsymbol{\delta}}^{(l+1)\,i} \odot f'(\vec{\mathbf{z}}^{(l)\,i}) $$
Step 1
First, we implement the computation of $\vec{\mathbf{z}}^{(l)\,i}.$
For each $i$-th example, $\vec{\mathbf{z}}^{(l)\,i} = W^{(l)} \vec{\mathbf{h}}^{(l-1)\,i} + \vec{\mathbf{b}}^{(l)}$ is a column vector with dimensions $[n^{(l)}\times 1]$, where $n^{(l)}$ is the number of neurons in the $l$-th layer. And for the entire batch of examples, it is a matrix $Z^{(l)}$ with dimensions $[n^{(l)}\times N].$
In NumPy notation, we can write this as:
z_cache[layer] = self.W[layer] @ H_cache[layer] + self.b[layer]
Recall that H_cache is a temporary storage (cache) for matrices $H^{(0)}, H^{(1)}, ... H^{(L-1)}$, which we created during the forward pass. Each of the matrices has dimensions $[n^{(l)}\times N]$, and self.W[layer] has dimensions $[n^{(l+1)}\times n^{(l)}]$, so the matrix product self.W[layer] @ H_cache[layer] has dimensions $[n^{(l+1)}\times N]$.
The individual element self.b[layer] has dimensions $[n^{(l+1)}\times 1]$, but NumPy automatically expands it to a matrix $[n^{(l+1)}\times N]$ by simple column repetition.
In summary:
- The array elements `z_cache[0]`, `z_cache[1]`, ... correspond to $Z^{(1)}, Z^{(2)}, ...$
- Each matrix $Z^{(l)}$ has dimensions $[n^{(l)}\times N].$
Step 2
Now that we know $Z^{(l)}$ for all hidden layers $l=1, 2, …, L-1$, we can proceed to implement the delta for hidden layers. For each hidden layer, for the $i$-th example, $\vec{\boldsymbol{\delta}}^{(l)\,i} = \left(W^{(l+1)}\right)^T \vec{\boldsymbol{\delta}}^{(l+1)\,i} \odot f'(\vec{\mathbf{z}}^{(l)\,i})$ is a column vector with dimensions $[n^{(l)}\times 1]$, and for the entire batch of $N$ examples, it is already a matrix with dimensions $[n^{(l)}\times N]$: $$ \Delta^{(l)} = \left(W^{(l+1)}\right)^T \Delta^{(l+1)} \odot f'(Z^{(l)}) $$
where $l$ takes the values $L-1, L-2, …, 1$.
In NumPy notation, this matrix for each layer can be implemented like this:
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
- Here, we iterate in reverse order over the indices `layer` from the last one down to 1, which corresponds to the hidden layers $l=L-1, L-2, \dots, 1$ (recall that `self.W[layer]` holds $W^{(l+1)}$).
- Then, in each layer, we perform the matrix multiplication `self.W[layer].T @ delta`, which corresponds to multiplying the matrix $\left(W^{(l+1)}\right)^\text{T}$ of size $[n^{(l)}\times n^{(l+1)}]$ by $\Delta^{(l+1)}$ with dimensions $[n^{(l+1)}\times N]$ for the entire batch. The resulting matrix has dimensions $[n^{(l)}\times N].$
- Finally, we element-wise multiply this matrix of size $[n^{(l)}\times N]$ by the derivative $f'(Z^{(l)})$ of the same dimensions, which yields $\Delta^{(l)}$.
Weight Gradients
Now that we know the delta vector $\vec{\boldsymbol{\delta}}^{(L)}$ of the output layer with dimensions $[1\times N]$ for the entire batch and each subsequent delta matrix $\Delta^{(l)}$ of the hidden layer, we can implement the gradient formulas.
Recall that for the output layer:
- For a single example $(\nabla_{\vec{\mathbf{w}}^{(L)}} \mathcal{L}) = \delta^{(L)} \cdot \left(\vec{\mathbf{h}}^{(L-1)}\right)^\text{T}$ is a vector of size $[1\times n^{(L-1)}]$
- For the entire batch of examples — this is the product of the row vector $\vec{\boldsymbol{\delta}}^{(L)}$ with dimensions $[1\times N]$ and the transposed matrix $\left(H^{(L-1)}\right)^\text{T}$ with dimensions $[N\times n^{(L-1)}]$. That is, the result of the product is a matrix of size $[1\times n^{(L-1)}]$
$$ \nabla_{W^{(L)}} \mathcal{L} = \vec{ \boldsymbol{\delta}}^{(L)} \cdot \left(H^{(L-1)}\right)^\text{T} $$
For hidden layers:
- For a single example $\nabla_{W^{(l)}} \mathcal{L} = \vec{ \boldsymbol{\delta}}^{(l)} \cdot \left(\vec{\mathbf{h}}^{(l-1)}\right)^\text{T}$ is a matrix of size $[n^{(l)}\times n^{(l-1)}]$
- For the entire batch of examples — this is the product of the matrix $\Delta^{(l)}$ with dimensions $[n^{(l)}\times N]$ and the transposed matrix $\left(H^{(l-1)}\right)^\text{T}$ with dimensions $[N\times n^{(l-1)}]$. That is, the result of the product is a matrix of size $[n^{(l)}\times n^{(l-1)}]$:
$$ \nabla_{W^{(l)}} \mathcal{L} = \Delta^{(l)} \cdot \left(H^{(l-1)}\right)^\text{T} $$
These formulas are implemented in NumPy with a single line of matrix multiplication:
delta @ H_cache[layer - 1].T
Here `delta` is the matrix $\Delta^{(l)}$ of size $[n^{(l)}\times N]$. For the last layer it has dimensions $[1\times N]$, since in our network there is only 1 neuron in the output layer: $n^{(L)} = 1.$
`H_cache[layer - 1]` is an element of the array $H^{(0)}, H^{(1)}, ... H^{(L-1)}$ saved during the forward pass, which corresponds to the previous layer.
Since we go through all layers from the last to the first in the backward pass, we save these gradients in reverse order into an array:
# Array of gradients for each layer
dW: list[Matrix] = []
# Start with the output layer
delta: Batch = (p_hat - y) / N
dW.insert(0, delta @ H_cache[-1].T)
# Then recursively move to the first layer
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
dW.insert(0, delta @ H_cache[layer - 1].T)
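A standalone sketch of this backward loop on a toy network (random data, with `d_relu` inlined) shows that each gradient matrix matches the shape of its weight matrix, as the update step requires:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dims = 32, [8, 16, 9, 1]
W = [rng.normal(0, np.sqrt(2 / dims[i]), (dims[i + 1], dims[i]))
     for i in range(len(dims) - 1)]
b = [np.zeros((dims[i + 1], 1)) for i in range(len(dims) - 1)]

# Forward pass with activation cache
H_cache = [rng.random((8, N))]
for Wl, bl in zip(W[:-1], b[:-1]):
    H_cache.append(np.maximum(0, Wl @ H_cache[-1] + bl))
p_hat = 1.0 / (1.0 + np.exp(-(W[-1] @ H_cache[-1] + b[-1])))
y = rng.integers(0, 2, (1, N)).astype(np.float64)
z_cache = [W[l] @ H_cache[l] + b[l] for l in range(len(W) - 1)]

# Backward pass, collecting weight gradients in reverse order
dW = []
delta = (p_hat - y) / N
dW.insert(0, delta @ H_cache[-1].T)
for layer in range(len(W) - 1, 0, -1):
    delta = (W[layer].T @ delta) * (z_cache[layer - 1] > 0).astype(np.float64)
    dW.insert(0, delta @ H_cache[layer - 1].T)

# Every gradient has exactly the shape of its weight matrix
print([dw.shape for dw in dW])  # [(16, 8), (9, 16), (1, 9)]
```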
Bias Gradients
Similarly, we can now implement the formulas for the bias gradients.
Recall that for the output layer:
- For a single example $\nabla_{b^{(L)}} \mathcal{L} = \dfrac{\partial \mathcal{L}}{\partial z^{(L)}} \dfrac{\partial z^{(L)}}{\partial b^{(L)}} = \delta^{(L)} \cdot 1$ is a scalar with dimensions $[1\times 1]$
- For $N$ examples — this is the product of the row vector $\vec{\boldsymbol{\delta}}^{(L)}$ with dimensions $[1\times N]$ for the entire batch and the all-ones column vector $\vec{1}$ with dimensions $[N\times 1]$, i.e., the bias gradient is a scalar with dimensions $[1\times 1]$
For hidden layers:
- For a single example $\nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} = \vec{\boldsymbol{\delta}}^{(l)}\cdot 1$ is a vector of size $[n^{(l)}\times 1]$
- For all examples in the batch — this is the product of the matrix $\Delta^{(l)}$ of size $[n^{(l)}\times N]$ and the all-ones column vector $\vec{1}$ with dimensions $[N\times 1]$, i.e., the resulting gradient is a column vector of size $[n^{(l)}\times 1]$.
We can implement this gradient via matrix multiplication in NumPy:
delta @ np.ones((N, 1))
But even faster is to simply sum the deltas for all examples in the batch:
delta.sum(axis=1, keepdims=True)
Here delta is the matrix $\Delta^{(l)}$ of size $[n^{(l)}\times N]$. In particular, for the output layer it has dimensions $[1\times N]$.
sum(axis=1) means we sum over the columns, i.e., over N. The keepdims=True parameter means we preserve the matrix dimension after summation.
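The two variants can be checked against each other directly on toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
delta = rng.normal(size=(3, N))          # Δ⁽ˡ⁾: one column per example

via_matmul = delta @ np.ones((N, 1))     # matrix-product form
via_sum = delta.sum(axis=1, keepdims=True)  # direct summation over columns

print(bool(np.allclose(via_matmul, via_sum)))  # True
print(via_sum.shape)                           # (3, 1)
```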
Since we go through all layers from the last to the first in the backward pass, we save these gradients in reverse order into an array:
# Array of gradients for each layer
db: list[Vector] = []
# Start with the output layer
delta: Batch = (p_hat - y) / N
db.insert(0, delta.sum(axis=1, keepdims=True))
# Then recursively move to the first layer
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
db.insert(0, delta.sum(axis=1, keepdims=True))
General Gradient Descent Algorithm
Step 1. Shuffle the dataset at the start of the epoch
For each epoch, we take the entire dataset and shuffle it. This can be implemented as:
N_total: int = X.shape[1]
for epoch in range(1, self.epochs + 1):
idx = np.random.permutation(N_total)
X, y = X[:, idx], y[:, idx]
Here `X` is the matrix of astrological features of all examples taken from AstroDataBank, i.e., a matrix of size $[\text{input_dim}\times N_\text{total}]$.
First, we take the indices $i$ of all examples and shuffle them, then we reassemble the examples $(\vec{\mathbf{x}}, y)^i$ in the order of the new (shuffled) indices.
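A toy check that shuffling by a common column permutation keeps each example paired with its label (the data here is synthetic, constructed so that row 0 of `X` equals `y`):

```python
import numpy as np

np.random.seed(0)
# 6 toy examples: row 0 of X deliberately equals the label row y
X = np.vstack([np.arange(6.0), np.arange(6.0, 12.0)])  # [2 × 6]
y = np.arange(6.0).reshape(1, 6)                       # [1 × 6]

idx = np.random.permutation(X.shape[1])
Xs, ys = X[:, idx], y[:, idx]

# The same permutation is applied to the columns of both matrices,
# so every example still carries its own label after shuffling:
print(bool(np.array_equal(Xs[0], ys[0])))  # True
```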
Step 2. Split the dataset into batches
Next, we split the dataset into batches and start training on each individual batch of N examples. This split can be implemented as:
for s in range(0, N_total, self.batch_size):
Xb: Batch = X[:, s: s + self.batch_size]
yb: Batch = y[:, s: s + self.batch_size]
Here, we iterate over the start indices $s = 0, N, 2N, \dots$ of the dataset elements, up to $N_\text{total}$, with a step of $N$ (the batch size).
Then, we extract two submatrices from the dataset — $X_\text{batch}$ of size $[\text{input_dim}\times N]$ and $y_\text{batch}$ of size $[1\times N]$ — by simply selecting columns in the range from index $s$ to $s+N$. These two submatrices together form a batch of $N$ examples $(\vec{\mathbf{x}}, y)^i$.
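Note that when $N_\text{total}$ is not a multiple of the batch size, NumPy simply clips the last slice, so the final batch comes out smaller — a toy illustration:

```python
import numpy as np

N_total, batch_size = 10, 4
X = np.random.rand(8, N_total)
y = np.random.rand(1, N_total)

sizes = []
for s in range(0, N_total, batch_size):
    Xb = X[:, s: s + batch_size]   # slicing past the end is clipped, not an error
    yb = y[:, s: s + batch_size]
    sizes.append(Xb.shape[1])

print(sizes)  # [4, 4, 2] — the final batch holds the remaining 2 examples
```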
Step 3. Forward and Backward Passes
Now we must run a forward pass on our batch and store the results of the layer activations — the row vector $\vec{\mathbf{\hat{p}}}$ for the output layer and the matrices $H^{(l)}$ for all intermediate layers.
Then, using these stored quantities, we perform the backward pass and store the gradients for weights and biases for all layers.
This step can be implemented as:
p_hat, H_cache = self.forward(Xb)
dW, db = self.backward(p_hat, yb, H_cache)
Step 4. Update Parameters
Gradients indicate the direction of increase of the loss function, so we need to take a small step in the opposite direction for every layer:
$$ W^{(l)} \leftarrow W^{(l)} - \lambda \, \nabla_{W^{(l)}} \mathcal{L}, \qquad \vec{\mathbf{b}}^{(l)} \leftarrow \vec{\mathbf{b}}^{(l)} - \lambda \, \nabla_{\vec{\mathbf{b}}^{(l)}} \mathcal{L} $$
This parameter update can be implemented via a regular Python loop over all layers:
self.W = [W - self.lr * dw for W, dw in zip(self.W, dW)]
self.b = [b - self.lr * db_ for b, db_ in zip(self.b, db)]
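A minimal sketch of this update rule on toy weights and gradients (values chosen only to make the arithmetic easy to check):

```python
import numpy as np

lr = 0.05
W = [np.ones((2, 3)), np.ones((1, 2))]             # toy weights for two layers
dW = [np.full((2, 3), 0.5), np.full((1, 2), 0.5)]  # toy gradients

# Step against the gradient in every layer
W = [w - lr * dw for w, dw in zip(W, dW)]

print(W[0][0, 0])            # 0.975 = 1 - 0.05 * 0.5
print([w.shape for w in W])  # the update preserves all shapes
```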
Complete Code
Now we can put everything together into a single code. This code emulates an example of astrological features and event labels and performs training on a small neural network based on them.
Then the code evaluates the prediction accuracy on the same dataset.
import numpy as np
from numpy.typing import NDArray
from typing import cast
# pyright: reportConstantRedefinition=false
# Geometric types
Vector = NDArray[np.float64] # [n × 1] — column
CoVector = NDArray[np.float64] # [1 × n] — row
Matrix = NDArray[np.float64] # [m × n] — weight matrix
# Semantic types
Sample = NDArray[np.float64] # [n × 1] — single example x⃗
Batch = NDArray[np.float64] # [n × N] — N examples
Dataset = NDArray[np.float64] # [n × N_total] — entire dataset
# Activation functions and their derivatives
def sigmoid(z: Batch) -> Batch:
return 1.0 / (1.0 + np.exp(-z))
def relu(z: Batch) -> Batch:
return np.maximum(0, z)
def d_relu(z: Batch) -> Batch:
result: Batch = (z > 0).astype(np.float64)
return result
class BernoulliNet:
"""
Neural network for binary classification (Bernoulli distribution).
Parameters
----------
input_dim : size of the input vector x
hidden_dims : list of hidden layer sizes, e.g., [64, 32]
lr : gradient descent step λ
epochs : number of epochs
batch_size : mini-batch size
"""
W: list[Matrix]
b: list[Vector]
lr: float
epochs: int
batch_size: int
def __init__(self,
input_dim: int,
hidden_dims: list[int],
lr: float = 0.01,
epochs: int = 100,
batch_size: int = 32,
) -> None:
dims: list[int] = [input_dim] + hidden_dims + [1]
# Weight initialization: W ~ N(0, 2/n_prev)
self.W, self.b = self._init_weights(dims)
self.lr, self.epochs, self.batch_size = lr, epochs, batch_size
def _init_weights(self, dims: list[int]) -> tuple[list[Matrix], list[Vector]]:
"""
Initializes weights W and biases b for all layers:
W ~ N(μ=0, σ²=2/n_prev).
"""
W_list: list[Matrix] = []
b_list: list[Vector] = []
for layer in range(len(dims) - 1):
n_prev = dims[layer] # number of neurons in the previous layer
n_curr = dims[layer + 1] # number of neurons in the current layer
# Random matrix with normal distribution N(μ=0, σ²=2/n_prev),
W_layer: Matrix = np.random.normal(
loc=0.0,
scale=np.sqrt(2.0 / n_prev),
size=(n_curr, n_prev),
).astype(np.float64)
W_list.append(W_layer)
# Bias vector — column vector of zeros
b_layer: Vector = np.zeros((n_curr, 1), dtype=np.float64)
b_list.append(b_layer)
return W_list, b_list
def forward(self, X: Batch) -> tuple[Batch, list[Batch]]:
"""
Forward pass through the network: returns predictions p̂ for all examples in the batch
and the activation cache Hᵢ for the backward pass (for all examples at each layer).
Parameters
----------
X : Batch matrix of astrological features with dimensions [input_dim × N]
"""
# Forward pass through layer zero:
H: Batch = X
H_cache: list[Batch] = [H]
# Forward pass through hidden layers:
for W, b in zip(self.W[:-1], self.b[:-1]):
H = relu(W @ H + b) # size [n_current × N]
H_cache.append(H)
# Forward pass through the output layer:
p_hat: Batch = sigmoid(
self.W[-1] @ H + self.b[-1]) # size [1 × N]
return p_hat, H_cache
def loss(self, p_hat: Batch, y: Batch) -> float:
"""
Binary cross-entropy (Bernoulli log-likelihood)
"""
return float(
-np.mean(
y * np.log(p_hat + 1e-9)
+ (1 - y) * np.log(1 - p_hat + 1e-9)
)
)
def backward(self,
p_hat: Batch,
y: Batch,
H_cache: list[Batch]
) -> tuple[list[Matrix], list[Vector]]:
"""
Backward pass through the network: returns gradients dW and db
for all layers.
Parameters
----------
p_hat : Batch size [1 × N] — predictions for all examples in the batch
y : Batch size [1 × N] — event labels for all examples in the batch
cache : list[Batch] — stored activations H⁽ˡ⁾ for each layer, size [n⁽ˡ⁾ × N]
"""
N: int = y.shape[1]
dW: list[Matrix] = []
db: list[Vector] = []
# Output layer:
delta: Batch = (p_hat - y) / N
dW.insert(0, delta @ H_cache[-1].T)
db.insert(0, delta.sum(axis=1, keepdims=True))
# Hidden layers:
z_cache: list[Batch] = [
self.W[layer] @ H_cache[layer] + self.b[layer]
for layer in range(len(self.W) - 1)
]
# Backward pass through hidden layers
for layer in range(len(self.W) - 1, 0, -1):
delta = (self.W[layer].T @ delta) * d_relu(z_cache[layer - 1])
dW.insert(0, delta @ H_cache[layer - 1].T)
db.insert(0, delta.sum(axis=1, keepdims=True))
return dW, db
def train(self, X: Dataset, y: Dataset, verbose: bool = True) -> None:
"""
X : Dataset shape [input_dim × N_total] — features of all examples in the dataset
y : Dataset shape [1 × N_total] — labels of actual events {0, 1}
"""
N_total: int = X.shape[1]
for epoch in range(1, self.epochs + 1):
idx = np.random.permutation(N_total)
X, y = X[:, idx], y[:, idx]
for s in range(0, N_total, self.batch_size):
Xb: Batch = X[:, s: s + self.batch_size]
yb: Batch = y[:, s: s + self.batch_size]
p_hat, H_cache = self.forward(Xb)
dW, db = self.backward(p_hat, yb, H_cache)
self.W = [W - self.lr * dw for W, dw in zip(self.W, dW)]
self.b = [b - self.lr * db_ for b, db_ in zip(self.b, db)]
if verbose and epoch % 10 == 0:
p_hat, _ = self.forward(X)
print(f"Epoch {epoch:4d} | loss = {self.loss(p_hat, y):.4f}")
def predict_probability(self, X: Dataset) -> Dataset:
"""Returns p̂ ∈ (0,1) of the event for each column of the batch."""
p_hat, _ = self.forward(X)
return p_hat
def predict_label(self, X: Dataset, threshold: float = 0.5) -> Dataset:
"""Returns the observed binary event labels y = {0, 1}."""
result: Batch = (self.predict_probability(X) >=
threshold).astype(np.float64)
return result
# Example usage
if __name__ == "__main__":
np.random.seed(42)
N = 500
# Emulate astrological features (normalized to [0, 1])
X: Dataset = np.random.rand(8, N).astype(np.float64)
# Emulate binary event labels (e.g., "came true" or "did not come true")
y: Dataset = (X[0] + X[3] > 1.0).astype(np.float64).reshape(1, N)
# Model initialization and training
model = BernoulliNet(
input_dim=8,
hidden_dims=[16, 9],
lr=0.05,
epochs=100,
batch_size=32,
)
model.train(X, y)
# Evaluate accuracy on the training set
accuracy: float = cast(float, np.mean(model.predict_label(X) == y))
print(f"\nTraining accuracy: {accuracy:.2%}")
In a real scenario, accuracy is first evaluated on validation data to tune the model's hyperparameters, and then a final accuracy evaluation is performed on held-out test data. The trained weights and biases should be saved to a file for future use.
Also, in a real scenario, instead of the NumPy library you can use its GPU counterpart, CuPy, for fast computation on the graphics card.