Historically, mathematics and astronomy were used in 16th–17th century European universities to forecast future events. Over many years of observation, a large body of predictive techniques has been accumulated, which can now be used to train AI systems.
Together, these techniques form the scientific paradigm of Renaissance predictive astrology, which encompasses three core components:
- Primary directions — for predicting the year of an event
- Revolution charts — for narrowing down the month and details of an event
- Transits — for pinpointing the day of an event
In this article we formalize the mathematical apparatus required to build computer models capable of making predictions about the future. The article is structured as a textbook, deriving all formulas from first principles. The end result is a ready-to-use algorithm for training an AI.
[toc]
Event Prediction and the Bernoulli Formula
Let us start with the simplest case. A predictive technique forecasts whether a clearly identifiable event will occur within a given time period (usually a day or a month) — for example, the birth of a child, a legal marriage, or emigration to another country.
The expected event has two possible outcomes: it either occurs (event label $y = 1$) or it does not (event label $y = 0$).
Suppose the probability that the event occurs in the specified period is $p.$ Then the probability that it does not occur in that same period is $1 - p.$
We can express both outcomes in a single formula: $$ P(y) = p^y \cdot (1-p)^{1-y} $$ We read this as: the formula gives the probability of occurrence (label $y=1$) or non-occurrence (label $y=0$) of a specific event.
Indeed, if the event occurs, the probability of that outcome is $P(y = 1) = p^1 \cdot (1-p)^{0} = p,$ and if the event does not occur, the probability is $P(y=0) = p^0 \cdot (1-p)^{1 - 0} = 1 - p.$
This is the Bernoulli distribution — a probabilistic model for a binary outcome.
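The formula is easy to sanity-check in a few lines of Python (a minimal sketch; the function name `bernoulli_pmf` is ours):

```python
def bernoulli_pmf(y: int, p: float) -> float:
    """P(y) = p^y * (1 - p)^(1 - y): probability of outcome y under Bernoulli(p)."""
    return p ** y * (1 - p) ** (1 - y)

# For p = 0.3: P(y=1) = 0.3 and P(y=0) = 0.7, and the two outcomes sum to 1.
```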
Conditional Probability
Now suppose the observed event depends on a number of astronomical factors. For example, the probability of a violent death on a given day is higher the closer the angle $x$ between Mars and the degree of the zodiacal circle that was rising at the moment of a person's birth is to 90°.
We can then write the probability of violent death as $P(y = 1 \mid x)$. Here $P$ denotes probability, the label $y=1$ denotes the positive outcome of the event (in our example — a violent death in a given period), and $\mid x$ is read as "given that we know the parameter $x,$ which influences the outcome." In our example, $x$ is the angular distance between Mars and the degree of the ecliptic rising at birth.
Let us denote the conditional probability of the positive outcome as $\hat{p} = P(y = 1 \mid x)$. The Bernoulli formula then becomes:
$$ P(y \mid x) = \hat{p}^y \cdot (1 - \hat{p})^{1-y} $$ Usually an event depends not on one, but on several astronomical factors or features. For example, a violent death might depend on $x_1$ — the angular distance between Mars and the rising degree of the ecliptic at birth, on $x_2$ — the sector of the sky (house) occupied by the ruler of the ascendant, and so on.
We thus obtain a set of astronomical features ${x_1, x_2, \ldots x_d}$ on which the probability $P$ of the future event depends.
It is important to understand that the feature set ${x_1, x_2, \ldots x_d}$ is simply a collection of numbers. These numbers are often normalized to lie in the range from 0 to 1. For example:
- The ecliptic angular distance between planets, known as an aspect, can range from 0° to 180°. So an aspect is expressed as a fraction of 180°. For example, a square aspect (90°) equals 0.5, and we write $x_1 = 0.5$.
- A planet's celestial condition in astrology can range from −5 (detriment) to +5. To normalize, we add 5 to this value and express the result as a fraction of 10 (scale from −5 to +5). For example, a planet in exaltation (corresponding to 4 points on the −5 to +5 scale) has astrological feature $x_2 = 0.9$.
- A planet's position in a sector of the sky (house) can be expressed as the house number divided by the total number of sectors.
- And so on. Each feature $x_i$ is simply a number between 0 and 1.
We can represent the feature set $x_i$ as a vector with coordinates: $$ \vec{\mathbf{x}} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} $$
The general Bernoulli formula then takes the form: $$ P(y \mid \vec{\mathbf{x}}) = \hat{p}^y \cdot (1 - \hat{p})^{1-y} $$ which is read as the probability of occurrence ($y=1$) or non-occurrence ($y=0$) of an event given the astrological features $\vec{\mathbf{x}}.$
Probability of a Single Prediction
For a computer to make a successful prediction, it must be able to compute the function $\hat{p}^\text{pred} = f(\vec{\mathbf{x}})$, which returns a number between 0 and 1 (the probability of a single event occurring): $$ \hat{p}^\text{pred} = f(\vec{\mathbf{x}}) \in (0, 1) $$ Here the notation $\in (0,1)$ means that the result of the function lies in the interval from 0 to 1.
Computing this function reduces to two steps:
- Transform the collection of astrological features $\vec{\mathbf{x}}$ into a single real number $z$.
- Express $f(z)$ as a simple mathematical construction that maps any number $z \in (-\infty, +\infty)$ to a value (a probability) $\in (0, 1)$.
Step 1. Affine Transformation
To transform the vector $\vec{\mathbf{x}}$ into a single real number $z$, we take a weighted sum of the values $x_i$ (i.e., $w_1x_1 + w_2x_2 + \ldots$) and add an extra number $b.$
We can write this as: $$ z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b $$
This type of transformation of the vector $\vec{\mathbf{x}}$ into a number $z$ is called an affine transformation. The values ${w_1, w_2, \ldots}$ are called weights, and the parameter $b$ is called the bias.
If the weights ${w_1, w_2, \ldots}$ are viewed as coordinates of a vector $\vec{\mathbf{w}},$ the transformation can be written in vector form (as the dot product of two vectors): $$ z = \vec{\mathbf{w}}^\text{T}\, \vec{\mathbf{x}} + b $$
Here the letter T denotes the transposed vector — a column of values $w_i$ rotated 90° to form a row.
Einstein Index Notation
For future convenience we use Einstein index notation: values $a_i$ (lower index) denote numbers arranged vertically (an ordinary column vector), while $a^i$ (upper index) denotes numbers arranged horizontally (a transposed row vector): $$ a_i: \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}, \qquad a^i: \begin{bmatrix} a_1 & a_2 & \ldots & a_n \end{bmatrix} $$
The product $\vec{\mathbf{w}}^\text{T}\; \vec{\mathbf{x}} = \sum_{i=1}^n w_i x_i$ can then be written in the abbreviated notation $w^i x_i,$ where summation over paired upper and lower indices is implied automatically, and $i$ runs from 1 to $n.$ From this point on we use Einstein indices alongside vector representations. Our affine transformation in Einstein notation reads: $$ z = w^i x_i + b $$
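The affine transformation can be sketched in NumPy (the feature and weight values below are invented for illustration):

```python
import numpy as np

# Invented normalized astrological features and parameters:
x = np.array([0.5, 0.9, 0.25])   # e.g. aspect, dignity, house position
w = np.array([1.2, -0.4, 2.0])   # weights
b = 0.1                          # bias

# Affine transformation z = w^T x + b = w^i x_i + b
z = w @ x + b                    # a single real number
```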
Step 2. The Sigmoid
Recall that our goal is to express the prediction $\hat{p}^\text{pred} = f(\vec{\mathbf{x}}) \in (0, 1).$
In the previous step we expressed the collection of features $x_i$ as a real number $z = w^i x_i + b.$ Now we need to construct a simple mathematical function that maps any number $z \in (-\infty, +\infty)$ to a value in the interval $(0, 1)$ — i.e., to a probability.
A linear function $f(z) = z$ will not work — it can be negative or greater than 1. We need to "squeeze" the output into the interval $(0, 1)$.
Let us build such a function. Consider the odds ratio of the event occurring versus not occurring: $$ O = \frac{P(y=1 \mid \vec{\mathbf{x}})}{P(y=0 \mid \vec{\mathbf{x}})} = \frac{\hat{p}}{1-\hat{p}} $$ For any $\hat{p}\in(0,1),$ the function $O(\hat{p})$ lies in the interval from 0 (as $\hat{p} \to 0$) to $+\infty$ (as $\hat{p} \to 1$).
Taking the natural logarithm of this function "stretches" it from $-\infty$ to $+\infty.$ $$ z =\ln O(\hat{p}) = \ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) \in (-\infty, +\infty) $$
Visually, for any value of $z$ the graph is bounded along the $\hat{p}$ axis between 0 and 1.
If we simply swap the vertical and horizontal axes, we get the graph we are looking for — for any value of $z$ the quantity $\hat{p}$ stays within $(0, 1)$. This is the mathematical construct that maps any real number to a value $\in (0, 1),$ i.e., to a probability.
We therefore need to express $\hat{p}$ as a function of $z.$
Taking the exponential of the previous formula: $$ e^z = \frac{\hat{p}}{1 - \hat{p}} \quad\Longrightarrow\quad \hat{p} = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}} $$
We immediately find the desired function: $\hat{p} = f(z) = \frac{1}{1 + e^{-z}}.$ This function is called the sigmoid and is denoted by the Greek letter sigma: $$ \boxed{\sigma(z) = \frac{1}{1 + e^{-z}}} $$
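The sigmoid is a one-liner in Python; the check at the end confirms it inverts the log-odds, as in the derivation above:

```python
import math

def sigmoid(z: float) -> float:
    """Squeeze any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(0) = 0.5; the output approaches 1 for large z and 0 for very negative z.
# Inverse check: ln(p / (1 - p)) recovers z.
```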
Intermediate Summary
To summarize: for a computer to make a successful prediction, it must compute the function $\hat{p}^\text{pred} = f(\vec{\mathbf{x}})$ from the given astrological features $x_i$, returning a number between 0 and 1 (the probability that a single event occurs).
We can now write this as: $$ \hat{p}^\text{pred} = \sigma(w^ix_i + b) \in (0, 1) $$ In this form the task is easy to formalize. We need to find numerical values of the weights $w^i$ and the bias $b$ such that for any astrological features $x_i$, the predicted probability $\hat{p}^\text{pred}$ is as close as possible to the actually observed probability. The weights $w^i$ and bias $b$ are the same for any feature vector $\vec{\mathbf{x}}$.
The numerical values $(w^i, b)$ are called the parameters of the model, and the process of computing them is called machine learning.
Total Probability over Multiple Predictions
For a single prediction with astrological features $\vec{\mathbf{x}}$ and event label $y$, the Bernoulli formula is: $$ P(y \mid \vec{\mathbf{x}}, \vec{\mathbf{w}}, b) = \hat{p}^{y} \cdot (1 - \hat{p})^{1 - y} $$ where $\hat{p} = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}} + b).$
We read this as: the model with parameters $\vec{\mathbf{w}}$ and $b$ predicts the probability of occurrence ($y=1$) or non-occurrence ($y=0$) of a single event given the astrological features $\vec{\mathbf{x}}.$
Training a model on a single astrological example makes no sense. The occurrence of a single event under given features $x_i$ could be a pure coincidence. Such a model would be unable to predict the occurrence or non-occurrence of the same event on a different set of astrological features.
A good model therefore trains on many examples of both successful and failed predictions, in order to find optimal weights and bias such that for any astrological features $x_i$ the predicted probability is as close as possible to the real one.
The Likelihood Function
Suppose we have selected $M$ arbitrary horoscopes from the AstroDataBank database with documented relocation dates. For each horoscope, an astrologer computes a set of astrological feature vectors $\vec{\mathbf{x}}_1, \vec{\mathbf{x}}_2, \ldots, \vec{\mathbf{x}}_M$ at the time of relocation. Such features might include:
- Normalized angular distance between the ruler of the ascending sign and the cusp of the 9th house
- Normalized ecliptic degree of the ruler of the 9th house
- Normalized house number occupied by the ascendant ruler
- Other numerical values from 0 to 1, collectively forming the feature vector $\vec{\mathbf{x_i}}$
The astrologer also selects an equal number of negative examples — random horoscopes and dates when no relocation occurred — and computes the corresponding feature vectors with event label $y = 0$. In total we have $M$ positive examples $(\vec{\mathbf{x}}_1, y=1), (\vec{\mathbf{x}}_2, y=1), \ldots (\vec{\mathbf{x}}_M, y=1)$ and $M$ negative examples $(\vec{\mathbf{x}}_{M+1}, y=0), (\vec{\mathbf{x}}_{M+2}, y=0), \ldots (\vec{\mathbf{x}}_{2M}, y=0)$, for a total of $N$ examples.
In the general case we can say that we have a dataset of $N = 2M$ examples: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N),$ on which the model will be trained to predict relocations from astrological indicators.
We assume the examples are independent of one another (a standard assumption). Then the probability of observing the entire dataset at once is the product of the individual probabilities: $$ \mathcal{P}(\vec{\mathbf{w}}, b) = \prod_{i=1}^{N} \hat{p}_i^{\,y_i} \cdot (1 - \hat{p}_i)^{1 - y_i} $$
where $\hat{p}_i = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}}_i + b).$
This product is called the likelihood function $\mathcal{P}(\vec{\mathbf{w}}, b)$. Its meaning: it shows at which parameter values $\vec{\mathbf{w}}, b$ the observed data are most plausible.
The machine learning task: find $(\vec{\mathbf{w}}, b)$ under which the total probability $\mathcal{P}(\vec{\mathbf{w}}, b)$ of observing the entire dataset exactly as it is comes as close to 1 as possible.
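As a sketch, the likelihood of a toy dataset can be computed directly from this product (the probabilities, labels, and the helper name `likelihood` are ours):

```python
import numpy as np

def likelihood(p_hat: np.ndarray, y: np.ndarray) -> float:
    """Product of per-example Bernoulli probabilities over the dataset."""
    return float(np.prod(p_hat ** y * (1 - p_hat) ** (1 - y)))

# Three toy examples: two events that occurred, one that did not.
p_hat = np.array([0.9, 0.8, 0.3])
y = np.array([1.0, 1.0, 0.0])
L = likelihood(p_hat, y)   # 0.9 * 0.8 * (1 - 0.3) = 0.504
```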
Log-Likelihood
A product of a million numbers between 0 and 1 rapidly approaches $10^{-1000000}$ — a computer quickly hits the limits of numerical precision. We therefore work with the logarithm of the likelihood function.
Why the logarithm?
- First, the logarithm is a monotone function: the maximum of $\ln \mathcal{P}$ is achieved at the same $(\vec{\mathbf{w}}, b)$ as the maximum of $\mathcal{P}$.
- Second, the logarithm of a product becomes a sum. And a sum of a million logarithms stays within a numerically manageable range, unlike their product: $$ \ln \mathcal{P}(\vec{\mathbf{w}}, b) = \sum_{i=1}^{N} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i) \right] $$
where $\hat{p}_i = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}}_i + b).$
This is the log-likelihood. Our task is to find $(\vec{\mathbf{w}}, b)$ that maximizes this function.
By convention in machine learning we minimize a loss function, so we negate the log-likelihood and normalize by the number $N$ of examples so that the loss does not grow with the size of the dataset: $$ \mathcal{L}(\vec{\mathbf{w}}, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i) \right] $$
where $\hat{p}_i = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}}_i + b).$
This is Binary Cross-Entropy (BCE) — the loss function for the entire dataset. The smaller this function, the smaller the total deviation of the predicted probabilities $\hat{p}_i$ from the actually observed event labels ${y_1, y_2, \ldots y_N}$.
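A minimal NumPy implementation of BCE (numerically naive: a production version would clip $\hat{p}$ away from 0 and 1; the helper name `bce_loss` is ours):

```python
import numpy as np

def bce_loss(p_hat: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy: the negative mean log-likelihood of the dataset."""
    return float(-np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)))

# Confident correct predictions give a small loss,
# confident wrong predictions a large one.
good = bce_loss(np.array([0.99, 0.01]), np.array([1.0, 0.0]))
bad = bce_loss(np.array([0.01, 0.99]), np.array([1.0, 0.0]))
```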
Gradient Descent
To better understand the machine learning process, let us suppose we have only one astrological feature $x.$ Then $\hat{p}_i = \sigma(w x + b),$ so the model has just two parameters — one weight $w$ and one bias $b.$ In this case the loss function is a function of two variables $\mathcal{L}(w, b),$ i.e., a two-dimensional surface. This surface has a depression: for some values of $w$ and $b$ the function has a minimum. Our task is to find the coordinates $(w_0, b_0)$ of that minimum.
Let us visualize this function and the process of finding its minimum.
We start the search from the point $(0, 0).$ We construct tangents to the surface along the $w$ and $b$ axes.
Since the slopes of these tangents are given by the partial derivatives $\partial \mathcal{L}(w, b) / \partial w$ and $\partial \mathcal{L}(w, b) / \partial b$ (the slope of the surface along $w$ and $b$), the vector composed of them is the gradient of the surface, i.e., the direction of steepest ascent. To "roll downhill" toward the minimum, we take a small step in the opposite direction (against the gradient).
At the new point we again construct two tangents and move downward along the resulting vector. We repeat this procedure until the tangents become nearly parallel to the $(w, b)$ plane — this signals that we have arrived at the minimum, i.e., we have found the coordinates $(w_0, b_0).$
This procedure is known as gradient descent.
In the general case with $n$ weights ${w_1, w_2, \ldots, w_n}$ and one bias $b$, visualizing the surface in $(n+2)$-dimensional space is harder. However, the algorithm remains the same — we construct tangents along $w_1, w_2, \ldots, w_n, b$ and, step by step against the gradient, converge toward the minimum $(\vec{\mathbf{w}}_0, b_0).$ The coordinates of that point are the result of machine learning: the internal parameters of a model that can predict future events.
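The whole procedure for the one-feature model can be sketched in a few lines of NumPy (synthetic data; the learning rate and iteration count are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one normalized feature x; the event becomes more
# likely as x grows (P(y=1|x) = x by construction).
x = rng.uniform(0, 1, 200)
y = (rng.uniform(0, 1, 200) < x).astype(float)

w, b = 0.0, 0.0   # start the descent from the point (0, 0)
lr = 1.0          # step size

for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * x + b)))   # predicted probabilities
    grad_w = np.mean((p - y) * x)        # dL/dw for BCE + sigmoid
    grad_b = np.mean(p - y)              # dL/db
    w -= lr * grad_w                     # small step against the gradient
    b -= lr * grad_b
```

After training, `w` should come out positive: the model has learned that the event is more likely for larger `x`.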
Introduction to Neural Networks
However, the model described above has a fundamental limitation. A single affine transformation $z = w^ix_i + b$ is essentially a weighted sum of astrological features. Such a model assumes that each feature influences the probability of an event linearly and independently of the others.
For example, if feature $x_1$ is the angular distance (aspect) between two planets, then the closer this aspect is to 180°, the more strongly that feature pushes the model toward predicting the event — regardless of all other factors.
But astrological tradition says otherwise. The same aspect between planets works very differently depending on the house the planet occupies, its speed, and the other aspects it simultaneously forms with other planets and the observer's horizon. In other words, the features interact nonlinearly.
A linear model cannot capture this — no matter how precise the weights $w^i$, it always sums the features into a single total, ignoring their mutual influence.
The solution is to add intermediate affine transformations (layers) to the model: these first form new composite features from the originals (e.g., "Mars in an angular house while retrograde") and then combine and sum those composite features. Each such layer is another affine transformation $z = w^ix^\text{prev}_i + b,$ applied to the output of the previous layer rather than to the raw input.
After the intermediate result $z$ of an affine transformation, a nonlinear function $f(z)$ is applied — the sigmoid or an analogue — to prevent composite features from collapsing back into a single linear sum. Two consecutive affine transformations without a nonlinearity between them are mathematically equivalent to one, so the whole depth of the network collapses to a single layer.
This multi-layer mathematical structure is called a neural network.
For example, neurons in the first hidden layer may learn to respond to the combination of an aspect and a planet's speed simultaneously; neurons in the second layer may respond to combinations of these composite features with house position. Which combinations turn out to be significant — the model determines on its own during training, without explicit instructions from the astrologer.
The more layers, the more sensitive the model becomes to nonlinear dependencies among features.
Neural Network Architecture
The network has $L$ layers (stages of affine transformation), numbered from $1$ to $L$. Each layer contains not one, but several independent transformations. Each transforming unit within a layer is called a neuron.
Each neuron in layer $l$ receives inputs ${h_1^{(l-1)}, h_2^{(l-1)}, \ldots, h^{(l-1)}_k}$ from neurons 1, 2, … k of the previous layer $(l-1)$, transforms them via the affine transformation $z = w^ih^{(l-1)}_i + b,$ with its own individual weights $w^i$ and bias $b$.
The resulting value $z$ is then passed through a nonlinear activation function $f(z)$ and the result is forwarded to the next layer $(l+1)$.
By analogy with the human brain, the inputs to a neuron are called signals, and $f(z)$ is called the activation function. We say the neuron activates the weighted sum of its incoming signals.
The figure below shows a simple neural network with two hidden layers containing 2 and 3 neurons respectively, as well as the input and output layers.
Let us trace how a signal passes through the neural network.
Input (zeroth) "layer" — simply takes the input data, i.e., the set of astrological features, and passes them on without any modification: $$ \vec{\mathbf{h}}^{(0)} = \vec{\mathbf{x}} \in \mathbb{R}^{d} $$ Here $\in \mathbb{R}^{d}$ means that the dimension of the column vector $\vec{\mathbf{h}}^{(0)}$ of the zeroth layer equals the number $d$ of astrological features. This is equivalent to writing: $$ h^{(0)}_i = x_i, \quad i = 1,2, \ldots d. $$
Hidden layers $l = 1, \ldots, L-1$:
Each subsequent layer $l$ has $n^{(l)}$ neurons. Each neuron takes the entire vector $\vec{\mathbf{h}}^{(l-1)}$ from the previous layer, forms the linear combination $z = w^{(l)\,i}\, h^{(l-1)}_i + b$ of its coordinates, and applies the activation function $f(z)$.
The full set of neurons in layer $l$ first produces a set of linear combinations $z_k = w^{(l)\,i}_k h^{(l-1)}_i + b_k,$ and then applies the activation function to these values, forming the layer's total output: $h^{(l)}_k = f(z_k),$ where $k = 1, 2, \ldots n^{(l)}.$
The resulting action of the entire layer $l$ in matrix form is: $$ \vec{\mathbf{z}}^{(l)} = W^{(l)} \vec{\mathbf{h}}^{(l-1)} + \vec{\mathbf{b}}^{(l)}, \qquad \vec{\mathbf{h}}^{(l)} = f(\vec{\mathbf{z}}^{(l)}) $$ where $W^{(l)} \in \mathbb{R}^{n^{(l)} \times n^{(l-1)}}$ is the weight matrix of layer $l$ with dimensions $n^{(l)} \times n^{(l-1)},$ $\vec{\mathbf{b}}^{(l)} \in \mathbb{R}^{n^{(l)}}$ is the bias vector of layer $l$, and $f$ is the activation function applied element-wise.
Activation function in hidden layers:
For hidden layers it is convenient to use the Rectified Linear Unit (ReLU): $$ f(z) = \max(0, z) $$ Why is this more convenient than the sigmoid?
- For hidden layers (unlike the final output layer) there is no requirement that they return values between 0 and 1, i.e., probabilities. Hidden layers serve a different purpose — they simply capture nonlinear interactions among astrological features.
- Unlike the sigmoid, ReLU does not approach 1 for large $z,$ so ReLU gradients do not vanish as $z$ grows.
- It is fast to compute.
- Despite its simplicity it is nonlinear (it clips all values $z < 0$) — all of this makes training fast and stable.
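A single hidden layer (affine transform followed by ReLU) can be sketched as follows; the numbers and the helper name `layer_forward` are invented for illustration:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit, applied element-wise."""
    return np.maximum(0.0, z)

def layer_forward(W, b, h_prev, f=relu):
    """One layer: affine transform z = W h_prev + b, then activation h = f(z)."""
    z = W @ h_prev + b
    return z, f(z)

# A layer with 3 neurons receiving a 2-dimensional input:
W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.5],
              [-2.0,  1.0]])        # shape (3, 2) = n^(l) x n^(l-1)
b = np.array([0.0, 0.1, 0.0])
h_prev = np.array([0.4, 0.6])
z, h = layer_forward(W, b, h_prev)  # negative z's are clipped to 0 in h
```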
Output layer $l = L$:
The output layer has a single neuron, and its activation function is the sigmoid — because we want the network's output to be the predicted probability $P(y=1 \mid \vec{\mathbf{x}})$ of a single future event given the astrological features $x_i$: $$ \hat{p} = \sigma\left(z^{(L)}\right) = \sigma\left(\left(\vec{\mathbf{w}}^{(L)}\right)^\text{T} \vec{\mathbf{h}}^{(L-1)} + b^{(L)}\right) $$
The sigmoid maps any real number to a probability.
Gradient Descent for the Neural Network
We established earlier that the loss function for the full dataset is: $$ \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln \hat{p}_i + (1 - y_i)\ln(1 - \hat{p}_i)\right] $$ We also discussed that to apply gradient descent we must sequentially compute gradients — i.e., partial derivatives of $\mathcal{L}(\vec{\mathbf{w}}, b)$ along the axes $w_1, w_2, \ldots, w_d, b$ — which will lead us to the minimum of the function at the point $(\vec{\mathbf{w}}_0, b_0).$ The coordinates of that point are the parameters of the event-predicting model.
In the case of a neural network the loss function formula remains unchanged. However, whereas before the probability $\hat{p} = \sigma(\vec{\mathbf{w}}^\text{T} \vec{\mathbf{x}} + b)$ of an event on a single example depended directly on the input features $x_i$, in a neural network the probability $\hat{p} = \sigma\left(\left(\vec{\mathbf{w}}^{(L)}\right)^\text{T} \cdot \vec{\mathbf{h}}^{(L-1)} + b^{(L)}\right)$ now depends on the output of the last hidden layer $L-1$, whose output in turn depends on the output of layer $L-2$, and so on.
In essence, before we introduced the neural network we worked with a single neuron — we immediately formed a linear combination of astrological features and fed it to the sigmoid to obtain the output. The loss function $\mathcal{L}(\vec{\mathbf{w}}, b)$ depended on one weight vector $\vec{\mathbf{w}}$ and one bias $b.$
In a neural network, each layer has many neurons, each with its own weight vector and bias. The collection of all weight vectors $\vec{\mathbf{w}}_i$ in each layer $l$ forms the weight matrix $W^{(l)}$ of the layer, and the collection of all biases $b_i$ forms the bias vector $\vec{\mathbf{b}}^{(l)}.$ The loss function now depends on these parameters: $\mathcal{L}(W^{(l)}, \vec{\mathbf{b}}^{(l)})$ for each layer $l$.
For the neural network we therefore seek partial derivatives of $\mathcal{L}$ with respect to these parameters for each layer: $\partial \mathcal{L} / \partial W^{i\;(l)}_j$ and $\partial \mathcal{L} / \partial b^{(l)}_j,$ where index $j$ (column) runs over all neurons in layer $l$, and index $i$ (row) runs over all neurons in the previous layer. The matrix element $W^{i\;(l)}_j$ is the weight coefficient of the connection from the $i$-th neuron of the previous layer to the $j$-th neuron of the current layer.
Layer Delta
To compute these gradients, let us introduce for convenience the quantity: $$ \delta^{(l)} = \frac{\partial \mathcal{L}}{\partial z^{(l)}} $$ Recall that for the output layer $z^{(L)} = \left(\vec{\mathbf{w}}^{(L)}\right)^\text{T}\cdot \vec{\mathbf{h}}^{(L-1)} + b^{(L)},$ and for each $k$-th neuron of a hidden layer $l$: $z^{(l)}_k = \left(\vec{\mathbf{w}}^{(l)}_k\right)^\text{T} \cdot \vec{\mathbf{h}}^{(l-1)} + b^{(l)}_k.$
Output layer
First we compute $\delta^{(L)}$ for the output layer. By the chain rule: $$ \delta^{(L)} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} = \frac{\partial \mathcal{L}}{\partial \hat{p}} \cdot \frac{\partial \hat{p}}{\partial z^{(L)}} $$ We compute each factor.
- Derivative of the BCE loss with respect to $\hat{p}$ (for one example, without the sum): $$ \frac{\partial \mathcal{L}}{\partial \hat{p}} = -\left(\frac{y}{\hat{p}} - \frac{1 - y}{1 - \hat{p}}\right) = \frac{\hat{p} - y}{\hat{p}(1 - \hat{p})} $$
- Derivative of the sigmoid: $$ \frac{\partial \hat{p}}{\partial z^{(L)}} = \sigma'(z^{(L)}) = \hat{p}(1 - \hat{p}) $$
- Multiplying these two derivatives — everything cancels beautifully: $$ \delta^{(L)} = \frac{\hat{p} - y}{\hat{p}(1 - \hat{p})} \cdot \hat{p}(1 - \hat{p}) = \hat{p} - y $$
Averaging over a sample of $N$ examples (i.e., for the full dataset):
$$ \boxed{\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)} $$
The sigmoid derivative and the BCE logarithm cancel each other elegantly — this is exactly why sigmoid and BCE form a canonical pair.
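The cancellation $\delta^{(L)} = \hat{p} - y$ can be verified numerically with a finite-difference check (a sketch; the values of $z$ and $y$ are arbitrary):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def bce(z, y):
    """BCE loss of a single example as a function of the pre-activation z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

z, y = 0.7, 1.0
delta = sigmoid(z) - y   # analytic delta of the output layer

# A central finite difference approximates the same derivative dL/dz.
eps = 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
# delta and numeric agree to high precision
```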
Hidden layers — the recurrence formula
We compute the delta for the $i$-th neuron of layer $(L-1)$. By the chain rule: $$ \delta^{i\;(L-1)} = \frac{\partial \mathcal{L}}{\partial z^{(L-1)}_i} = \frac{\partial \mathcal{L}}{\partial z^{(L)}} \cdot \frac{\partial z^{(L)}}{\partial h^{(L-1)}_i} \cdot \frac{\partial h^{(L-1)}_i}{\partial z^{(L-1)}_i} $$
Expanding each factor:
- $\dfrac{\partial \mathcal{L}}{\partial z^{(L)}} = \delta^{(L)}$ — the already-known delta of the output layer
- $\dfrac{\partial z^{(L)}}{\partial h^{(L-1)}_i} = w^{i\;(L)}$ — the $i$-th weight in the weight row from the formula $z^{(L)} = \left(\vec{\mathbf{w}}^{(L)}\right)^\text{T}\cdot \vec{\mathbf{h}}^{(L-1)} + b^{(L)}$
- $\dfrac{\partial h^{(L-1)}_i}{\partial z^{(L-1)}_i} = f'(z^{(L-1)}_i)$ — the derivative of the ReLU activation function: one if $z^{(L-1)}_i > 0,$ zero otherwise.
Substituting:
$$ \boxed{\delta^{i\;(L-1)} = \left(w^{i\;(L)}\, \delta^{(L)}\right) \cdot f'(z^{(L-1)}_i)} $$
or in matrix form for all neurons of the layer:
$$ \boxed{\vec{\boldsymbol{\delta}}^{(L-1)} = \left(\vec{\mathbf{w}}^{(L)}\right)^{\text{T}}\delta^{(L)} \odot f'(\vec{\mathbf{z}}^{(L - 1)})} $$
where
- $\left(\vec{\mathbf{w}^{(L)}}\right)^{\text{T}}$ — the row vector of weights of the output neuron
- $\delta^{(L)}$ — a scalar
- $f'(\vec{\mathbf{z}}^{(L-1)})$ for ReLU: a vector of zeros and ones — ones where $z^{(L-1)}_i > 0$, zeros otherwise.
- $\odot$ — element-wise product of two vectors.
By analogy we compute the delta for the $i$-th neuron of layer $(L-2)$. By the chain rule: $$ \delta^{i\;(L-2)} = \frac{\partial \mathcal{L}}{\partial z^{(L-2)}_i} = \sum_j \frac{\partial \mathcal{L}}{\partial z^{(L-1)}_j} \cdot \frac{\partial z^{(L-1)}_j}{\partial h^{(L-2)}_i} \cdot \frac{\partial h^{(L-2)}_i}{\partial z^{(L-2)}_i} $$
The summation here runs over all neurons of layer $(L-1)$. Expanding each factor:
- $\dfrac{\partial \mathcal{L}}{\partial z^{(L-1)}_j} = \delta^{j\;(L-1)}$ — the already-known delta of the pre-output layer
- $\partial z^{(L-1)}_j / \partial h^{(L-2)}_i = W^{i\;(L-1)}_j$ — from the formula $z^{(L-1)}_j = W^{i\;(L-1)}_j h^{(L-2)}_i + b_j$
- $\dfrac{\partial h^{(L-2)}_i}{\partial z^{(L-2)}_i} = f'(z^{(L-2)}_i)$ — the ReLU activation derivative, applied element-wise for each $i$-th neuron.
Substituting:
$$ \delta^{i\;(L-2)} = \left(W^{i\;(L-1)}_{j}\, \delta^{j\;(L-1)}\right) \cdot f'(z^{(L-2)}_i) $$
Here the summation runs over index $j$, which runs over the coordinates of the vector $\vec{\boldsymbol{\delta}}^{(L-1)}$. This formula shows the connection between two adjacent layers — if we know $\vec{\boldsymbol{\delta}}^{(L-1)},$ we now know how to find $\vec{\boldsymbol{\delta}}^{(L-2)},$ and from that the next one, all the way back to layer 1. In the general case this recurrence can be written as:
$$ \boxed{\delta^{i\;(l-1)} = \left(W^{i\;(l)}_{j}\, \delta^{j\;(l)}\right) \cdot f'(z^{(l-1)}_i)} $$
where index $j$ (column of matrix $W^{(l)}$) runs over all neurons of layer $l$, and index $i$ (row of the matrix) refers to the $i$-th neuron of the previous layer. The matrix element $W^{i\;(l)}_j$ is the weight coefficient of the connection from the $i$-th neuron of the previous layer to the $j$-th neuron of the current layer.
In matrix form for the full layer, this recurrence reads:
$$ \boxed{\vec{\boldsymbol{\delta}}^{(l-1)} = \left(W^{(l)}\right)^T \vec{\boldsymbol{\delta}}^{(l)} \odot f'(\vec{\mathbf{z}}^{(l-1)})} $$
where
- $\left(W^{(l)}\right)^T$ — the transposed (rotated 90°) weight matrix of layer $l$.
- $f'(\vec{\mathbf{z}}^{(l-1)})$ for ReLU: a vector of zeros and ones — ones where $z^{(l-1)}_i > 0$, zeros otherwise.
- $\odot$ — element-wise product.
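The recurrence in matrix form translates directly into NumPy (values and the helper name `backprop_delta` are invented for illustration):

```python
import numpy as np

def backprop_delta(W_next, delta_next, z_prev):
    """delta^(l-1) = (W^(l))^T delta^(l), masked by the ReLU derivative f'(z^(l-1))."""
    relu_grad = (z_prev > 0).astype(float)   # 1 where z > 0, 0 otherwise
    return (W_next.T @ delta_next) * relu_grad

# Toy values: both layer l and layer l-1 have 2 neurons.
W_l = np.array([[1.0, -2.0],
                [0.5,  1.0]])      # weight matrix of layer l
delta_l = np.array([0.3, -0.1])    # delta of layer l
z_prev = np.array([0.4, -0.2])     # pre-activations of layer l-1
delta_prev = backprop_delta(W_l, delta_l, z_prev)
```

The second coordinate is zeroed out by the ReLU mask because its pre-activation was negative.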
Gradients with Respect to Layer Parameters
Having the formula for the delta $\vec{\boldsymbol{\delta}}^{(l)}$ of any layer, we can now compute the gradients of the loss with respect to the weight matrix of each layer:
$$ \frac{\partial \mathcal{L}}{\partial W^{i\;(l)}_j} = \frac{\partial \mathcal{L}}{\partial z^{(l)}_j} \cdot \frac{\partial z^{(l)}_j}{\partial W^{i\;(l)}_j} = \delta^{j\;(l)} \cdot h^{(l-1)}_i $$
Or in matrix form:
$$ \boxed{\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \vec{\boldsymbol{\delta}}^{(l)} \left(\vec{\mathbf{h}}^{(l-1)}\right)^{\text{T}} \in \mathbb{R}^{n^{(l)} \times n^{(l-1)}}} $$
The dimension of the gradient matches the dimension of $W^{(l)}.$
Likewise, the gradient with respect to the bias vector of each layer:
$$ \frac{\partial\mathcal{L}}{\partial b^{(l)}_i} = \frac{\partial\mathcal{L}}{\partial z^{(l)}_i} \cdot \frac{\partial z^{(l)}_i}{\partial b^{(l)}_i} = \delta^{i\;(l)} $$
or in matrix form:
$$ \boxed{\frac{\partial \mathcal{L}}{\partial \vec{\mathbf{b}}^{(l)}} = \vec{\boldsymbol{\delta}}^{(l)} \in \mathbb{R}^{n^{(l)}}} $$
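Both gradients can be sketched as an outer product and a copy of the delta (values invented for illustration):

```python
import numpy as np

delta = np.array([0.25, -0.7])      # delta of layer l (2 neurons)
h_prev = np.array([0.4, 0.6, 0.1])  # output of layer l-1 (3 neurons)

# Gradient w.r.t. the weight matrix: the outer product of delta and h_prev,
# whose shape matches W^(l): n^(l) x n^(l-1).
grad_W = np.outer(delta, h_prev)
# Gradient w.r.t. the bias vector is just the delta itself.
grad_b = delta
```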
Step-by-Step Gradient Descent Algorithm
We can now formalize the gradient descent algorithm for minimizing the loss function of the full neural network.
- For each layer $l$ we initialize the weight matrix $W^{(l)}$ and the bias vector $\vec{\mathbf{b}}^{(l)}$ with random values.
- We then traverse all layers — from 1 to $L$ — computing $\vec{\mathbf{z}}^{(l)}$ and $\vec{\mathbf{h}}^{(l)}.$ We save these values — they will be needed later. This step is known as the forward pass.
- We then compute:
- $\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)$ — for the output layer
- $\vec{\boldsymbol{\delta}}^{(L-1)} = \left(\vec{\mathbf{w}}^{(L)}\right)^{\text{T}}\delta^{(L)} \odot f'(\vec{\mathbf{z}}^{(L - 1)})$ — for the pre-output layer
- $\vec{\boldsymbol{\delta}}^{(l-1)} = \left(W^{(l)}\right)^T \vec{\boldsymbol{\delta}}^{(l)} \odot f'(\vec{\mathbf{z}}^{(l-1)})$ — for all remaining layers (moving toward the first)
- In parallel, for each layer we compute gradients with respect to weights $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \vec{\boldsymbol{\delta}}^{(l)} \left(\vec{\mathbf{h}}^{(l-1)}\right)^{\text{T}}$ and with respect to biases $\frac{\partial \mathcal{L}}{\partial \vec{\mathbf{b}}^{(l)}} = \vec{\boldsymbol{\delta}}^{(l)}.$ This step is known as the backward pass — computation from the last layer to the first.
- The gradients point in the direction of growth of the loss function, so we take a small step of size $\lambda$ in the opposite direction: $$ W^{(l)} \leftarrow W^{(l)} - \lambda \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad \vec{\mathbf{b}}^{(l)} \leftarrow \vec{\mathbf{b}}^{(l)} - \lambda \frac{\partial \mathcal{L}}{\partial \vec{\mathbf{b}}^{(l)}} $$
In the last step we effectively update/correct the weights and biases of all neurons in every layer. This corresponds to a small step of size $\lambda$ along the surface of $\mathcal{L}$ toward the coordinates of its minimum.
Through successive iterations the loss function begins to approach its minimum, and the weights and biases of all neurons converge to values at which the model predicts the probabilities of future events as close as possible to what we observe in the training examples.
Practical Steps
Below are the practical steps for training an artificial intelligence to make astrological predictions.
Step 1. Data Preparation
A group of astrologers begins by selecting a set of predictive techniques from a single astrological school/tradition whose combined application allows a forecast to be made. These may be directions, revolution charts, and transits, or progressions, profections, and solar returns — this choice is left to the astrologers' discretion.
The astrological features are then converted to numerical values. For example, the ruler of a house can be expressed as a number where each number corresponds to a planet; the current firdaria can be expressed as a sequential index in the firdaria list; an aspect in degrees; a planet's house position as the house number; dignities and debilities as numerical scores; the conjunction (or lack thereof) of house cusps in overlaid charts as 0 or 1.
Each astrological feature is then normalized so that its value lies in the interval from 0 to 1. For example, house and sign numbers are divided by 12, zodiacal degrees by 360. The result is dozens of features whose combinations lead to the predicted event.
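The encoding and normalization described above might be sketched as follows. The feature names and divisors here are purely illustrative assumptions, not a prescribed feature set:

```python
import numpy as np

# Hypothetical raw features for one chart (names are illustrative only)
raw = {
    "house_of_ruler": 7,      # house number, 1..12
    "mars_sign": 10,          # zodiac sign number, 1..12
    "aspect_degrees": 90.0,   # angular separation, 0..360
    "cusp_conjunction": 1,    # already binary: 0 or 1
}

# Divisors that map each feature into the interval [0, 1]
scale = {
    "house_of_ruler": 12,
    "mars_sign": 12,
    "aspect_degrees": 360,
    "cusp_conjunction": 1,
}

# Normalized feature vector: every coordinate lies in [0, 1]
x = np.array([raw[k] / scale[k] for k in raw])
```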
Once the feature list is ready, an event is chosen as the training target — relocation, marriage, birth of a child, and so on. This is what the model will learn to predict.
From the AstroDataBank database, all available horoscopes with documented times of the chosen event (e.g., year, day, and month of a child's birth) are selected. Astrological techniques are applied to each such event and the astrological feature vector $\vec{\mathbf{x}}_i$ is filled in with event label $y = 1$. These features might include, for example, planetary positions in houses and signs (expressed as numbers from 0 to 1) in the solar return chart, mutual aspects when overlaying the lunar return chart on the natal chart (also expressed as numbers from 0 to 1), planetary dignities and debilities expressed as numbers (from 0 to 1), and so on — according to the feature list.
An equal number of negative examples must then be selected — random horoscopes and dates when the event did not occur — and for each such example the feature vector $\vec{\mathbf{x}}_i$ is constructed with event label $y = 0$.
Step 2. Batch Size Selection
Since the dataset may be very large, training is best performed in small groups of $N$ examples called mini-batches. The number of examples processed in a single training step is the batch size.
Step 3. Data Splitting
A large portion of the data (70%) is set aside for training, 15% for validation, and 15% for testing.
Training data (the main dataset) — the astrological feature vectors and event labels $(\vec{\mathbf{x}}_1, y_1), (\vec{\mathbf{x}}_2, y_2), \ldots$ fed to the neural network so that it can find the optimal weights and biases for making good predictions.
Validation data — a control set used to check how well the model predicts events. It may turn out that the model predicts events well only on the training data but misses badly when presented with unfamiliar input features and labels.
In that case the model's global parameters (hyperparameters) must be tuned so that the prediction accuracy on the training set is similar to that on the validation set. These parameters include:
- Number of layers
- Number of neurons per layer
- Batch size
- Gradient descent step size $\lambda$ (learning rate)
- Number of passes over the full training dataset (number of epochs)
Test data is used last, once optimal hyperparameters have been determined. The model makes predictions on the test data. If successful, the model is considered trained.
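The 70/15/15 split can be sketched as a small helper. This is a minimal illustration; the function name and the use of a fixed random seed are assumptions for reproducibility:

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle the examples and split them into
    70% training / 15% validation / 15% test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                # shuffle before splitting
    n_train = int(0.70 * len(y))
    n_val = int(0.15 * len(y))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```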
Step 4. Gradient Descent
After the data are split, the training dataset is shuffled thoroughly and divided into mini-batches of $N$ examples each.
Before the first pass, the weight matrix $W^{(l)}$ is initialized with random numbers (see the Appendix), and the bias vector $\vec{\mathbf{b}}^{(l)}$ is set to zero for each layer $l$.
One gradient descent step (forward and backward pass) is then performed on the first batch. Then the next batch is taken and gradient descent is repeated. This continues until all batches are exhausted.
A complete pass over all batches is called a training epoch. When one epoch ends, the training dataset is shuffled again, split into new batches, and one gradient descent step is performed per batch.
Training epochs are repeated many times (typically dozens or hundreds). Throughout training we continuously monitor the loss function, making sure it steadily decreases toward its minimum — this behavior is called loss convergence.
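The epoch loop of step 4 can be sketched as follows. This is a schematic under stated assumptions: `gd_step` is a hypothetical callback that performs one forward/backward pass on a mini-batch and updates the model's weights in place.

```python
import numpy as np

def train(X, y, gd_step, batch_size=32, epochs=50, seed=0):
    """Epoch loop: reshuffle the training data each epoch,
    then perform one gradient-descent step per mini-batch.
    `gd_step(batch_X, batch_y)` is assumed to update the model in place."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for epoch in range(epochs):
        idx = rng.permutation(n)                  # fresh shuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            gd_step(X[batch], y[batch])           # one gradient-descent step
```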
Step 5. Data Validation
After the model has completed its first training run, we take the validation data and compare the model's predictions with the actual distribution of event labels. If the discrepancies are significant, the model's global parameters must be adjusted:
- Batch size
- Gradient descent step size (learning rate, $\lambda$)
- Number of neurons per layer
- Number of layers
- Number of epochs
We adjust these parameters and retrain the model until it passes validation.
Step 6. Model Testing
The final step is to give the trained model the test data — astrological features $\vec{\mathbf{x}}_i$ — and compare the model's predictions with the actual distribution of event labels $y_i$. If steps 1–5 were carried out correctly, the test almost always passes.
If the test fails:
- Check for data leakage — identical examples appearing in both training and test/validation sets. If found, remove them and retrain from scratch.
- Check data quality — are all features included, correctly computed, and properly normalized? If errors are found, correct them and retrain.
- If no errors are found, shuffle all examples thoroughly, re-split the data into training, validation, and test sets, and repeat training with the same global parameters.
Applying the Model
Once the model passes testing it is considered trained and can be applied across various domains of human life. To use it, one compiles a set of astrological features $\vec{\mathbf{x}}_t$ for each year, month, and day of a person's life, and the model — trained to predict a specific event — outputs a probability for each day. On the days when the probability is close to 1, the event in question can be expected.
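The application loop can be sketched as follows. This is an illustrative wrapper, not a prescribed interface: `model` is assumed to be any trained callable mapping a normalized feature vector to a probability in $[0, 1]$, and the threshold value is an assumption.

```python
import numpy as np

def daily_probabilities(model, features_by_day, threshold=0.9):
    """Run a trained model over per-day feature vectors and flag the days
    whose predicted probability is close to 1.
    `features_by_day` maps a date string to its normalized feature vector."""
    probs = {day: float(model(x)) for day, x in features_by_day.items()}
    flagged = [day for day, p in probs.items() if p >= threshold]
    return probs, flagged
```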
Appendix. Parameter Initialization Before the First Pass
We mentioned that before applying the gradient descent algorithm we initialize the weight matrices and bias vectors for each neural layer with arbitrary values. Two negative scenarios must be avoided.
Scenario 1. Very Large Weights
If neurons in the network start with very large (and diverse) weights, the vector $\vec{\mathbf{z}}^{(l)}$ (the result of the affine transformation at layer $l$) will have wildly different coordinates — for example, −1000, 549, −3261. In other words, the variance of the coordinates will be very large. Large values in one layer are amplified by large weights in the next, so by the final neuron the signals $\vec{\mathbf{h}}^{(L-1)}$ are enormous. Their linear combination $z = w^{i\;(L)}h_i^{(L-1)}$ produces large positive and negative values of $z$.
As you may recall from the graphs, the sigmoid $\sigma(z)$ saturates quickly — for large positive $z$ it approaches 1 and barely grows, while for large negative $z$ it approaches zero.
As a consequence, each term $(\hat{p}_i - y_i)$ of the output-layer delta $\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)$ is close to $-1$, $0$, or $1.$
Since $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \left(\vec{\boldsymbol{\delta}}^{(l)}\right)^{\text{T}} \cdot \vec{\mathbf{h}}^{(l-1)}$, the weight gradients become the product of potentially large $\vec{\mathbf{h}}^{(l-1)}$ values and nearly discrete $\vec{\boldsymbol{\delta}}^{(l)}$ values. The gradients jump erratically — the loss surface is alternately nearly flat and nearly vertical. It is almost impossible to descend smoothly to the minimum over such a "jagged" surface. This negative scenario is called exploding variance.
Scenario 2. Very Small Weights
If we initialize the weights and biases to zero or very small values, $\vec{\mathbf{z}}^{(l)}$ will have very similar coordinates — for example, −0.00005, 0.00001, 0.00002. In other words, the variance of the coordinates will be very small. Small values in one layer are further reduced by small weights in the next, so by the final neuron the signals $\vec{\mathbf{h}}^{(L-1)}$ are tiny. Their linear combination $z = w^{i\;(L)}h_i^{(L-1)}$ produces near-zero values of $z$.
Near zero, the sigmoid behaves almost linearly: $$ \sigma(z) \approx 0.5 + \frac{z}{4} $$ So the model's predictions $\hat{p}_i$ for all examples are close to 0.5, barely distinguishable from each other.
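The linear approximation above is easy to verify numerically — a small check of $\sigma(z) \approx 0.5 + z/4$ on a few points near zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Near zero the sigmoid is almost linear: sigma(z) ~ 0.5 + z/4
z = np.linspace(-0.1, 0.1, 5)
approx = 0.5 + z / 4
err = np.abs(sigmoid(z) - approx)   # deviation from the linear approximation
```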
As a consequence, the output-layer delta $\delta^{(L)} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)$ is small in magnitude, since $\hat{p}_i \approx 0.5$ and the difference $(\hat{p}_i - y_i)$ is bounded and does not produce a strong error signal.
During backpropagation the deltas $\vec{\boldsymbol{\delta}}^{(l)}$ shrink further at each layer, since they are sequentially multiplied by small weights. By the time the error reaches the early layers it is extremely weak.
Since $$ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \left(\vec{\boldsymbol{\delta}}^{(l)}\right)^{\text{T}} \cdot \vec{\mathbf{h}}^{(l-1)}, $$ the weight gradients are the product of small $\vec{\boldsymbol{\delta}}^{(l)}$ and small $\vec{\mathbf{h}}^{(l-1)},$ resulting in near-zero gradients throughout. The loss surface is nearly flat: changing the weights barely changes the loss. Gradient descent "cannot sense" the direction of movement and makes infinitesimally small steps.
On such an almost horizontal surface, movement toward the minimum is extremely slow or practically halts. This negative scenario is called vanishing gradients.
Solution
We need initial weight values such that the variance $\text{Var}(\vec{\mathbf{h}}^{(l)})$ of the signal is preserved when passing from layer to layer — without explosive accumulation or decay: $$ \text{Var}(\vec{\mathbf{h}}^{(l)})\approx \text{Var}(\vec{\mathbf{h}}^{(l-1)}) $$ Because the ReLU activation clips approximately half the values of $\vec{\mathbf{z}}^{(l)}$ (since $\vec{\mathbf{h}}^{(l)} = \max(0, \vec{\mathbf{z}}^{(l)})$, roughly half the coordinates of $\vec{\mathbf{z}}^{(l)}$ are zeroed out): $$ \mathrm{Var}\left(h^{(l)}_j\right) \approx \frac{1}{2} \mathrm{Var}\left(z^{(l)}_j\right) $$ Since $\text{Var}(z_j^{(l)}) = \text{Var}\left(W^{i\;(l)}_j h_i^{(l-1)}\right)$, and using the rules that the variance of a sum equals the sum of variances, and the variance of a product (of independent zero-mean factors) equals the product of variances: $$ \mathrm{Var}\left(z^{(l)}_j\right) \approx n^{(l-1)}\, \mathrm{Var}\left(W^{(l)}\right) \mathrm{Var}\left(h^{(l-1)}_i\right) $$
Combining: $$ \mathrm{Var}\left(h^{(l)}_j\right) \approx \frac{1}{2}\, n^{(l-1)}\, \mathrm{Var}\left(W^{(l)}\right) \mathrm{Var}\left(h^{(l-1)}_i\right) $$
Our variance-preservation requirement can now be expressed as: $$ \frac{1}{2}\, n^{(l-1)}\, \mathrm{Var}\left(W^{(l)}\right) = 1 \quad\Longrightarrow\quad \mathrm{Var}\left(W^{(l)}\right) = \frac{2}{n^{(l-1)}} $$
This means that the weights in layer $l$ should be random numbers with zero mean and variance $2/n^{(l-1)}$, where $n^{(l-1)}$ is the number of neurons in the previous layer.
This is easily achieved with a normal distribution: $$ W^{(l)} \sim \mathcal{N}\left(0, \frac{2}{n^{(l-1)}}\right) $$ This means we draw from a Gaussian distribution centered at 0 with variance $2/n^{(l-1)}$ (i.e., standard deviation $\sqrt{2/n^{(l-1)}}$). Such an initialization makes training stable.
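This initialization scheme can be sketched in a few lines. Note that $2/n^{(l-1)}$ is the variance, so the standard deviation passed to the random generator is its square root; the helper name `he_init` is illustrative:

```python
import numpy as np

def he_init(layer_sizes, seed=0):
    """Initialize each layer's weights as N(0, 2/n_prev) and biases as zero.
    layer_sizes: neuron counts per layer, input layer first."""
    rng = np.random.default_rng(seed)
    Ws, bs = [], []
    for n_prev, n_cur in zip(layer_sizes, layer_sizes[1:]):
        std = np.sqrt(2.0 / n_prev)              # sqrt of the variance 2/n_prev
        Ws.append(rng.normal(0.0, std, (n_cur, n_prev)))
        bs.append(np.zeros((n_cur, 1)))          # biases start at zero
    return Ws, bs
```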