[{"content":"In this post we present an introduction to how Spatial Graph Neural Networks (GNNs), also known as Graph Convolutional Neural Networks (GCNs), work. First, we define graph data structures. Then, we explain the mechanism behind GNNs. Finally, we explain how to incorporate an attention mechanism into the network.\nNotation of GNNs\nThroughout the text, we use GNN to mean Spatial Graph Neural Network, although GCN or Graph Convolutional Neural Network is another name for it. There are other types of GNNs, such as Spectral Graph Neural Networks, but this post focuses on the spatial ones.\nWhat are graphs? Graphs are data structures with two main elements: the nodes and the relationships between them, called edges. The core strength of graphs is that they make it easy to define how entities interact. Here, the links are key to the power of relational data. Graphs are used in several applications, for instance, in social networks, distribution networks, and state machines.\nFirst steps We have already seen three clear concepts:\nGraph. A data type consisting of nodes and edges. Nodes\/Vertices. The endpoints in a graph; they are also called vertices or points. Edges. The links or relationships between nodes. To describe a specific graph, we could always draw its elements and the links between them. Although a drawing makes some properties easy to see, we cannot calculate anything from it. One of the most common mathematical formalisms for describing relations in graphs is the adjacency matrix.\nFig. 1. Adjacency matrix of a graph.\nGraphs can be classified as directed or undirected. In undirected graphs, the adjacency matrix is symmetric about its main diagonal. Each column and row corresponds to a node, and its values are the links from that node to the other nodes of the graph. 
For example, in column \\(k\\) we can see the links of node \\(v_k\\), and in row \\(k\\) we can see the same links, but transposed.\nGiven two nodes \\(v_j\\) and \\(v_i\\), if the values of \\(e_{i, j}\\) and \\(e_{j, i}\\) are both zero, no link exists between these nodes. Otherwise, there is a link between them.\nIf the adjacency matrix is not symmetric, we have a directed graph. The links of directed graphs have directions, which means that a node \\(v_j\\) can link to \\(v_i\\) without \\(v_i\\) linking back to \\(v_j\\). Directed edges are usually represented by an arrow, denoting a one-way relationship.\n\\[ G = \\begin{pmatrix} 1 & 0 & 0 & 1\\\\ 0 & 0 & \\textcolor{blue}{1} & \\textcolor{blue}{1}\\\\ 0 & \\textcolor{red}{0} & 0 & 1\\\\ 1 & \\textcolor{red}{0} & 1 & 0 \\end{pmatrix} \\qquad \\text{This is a directed graph.} \\]If not all the non-zero values of the adjacency matrix are one, we are talking about weighted graphs, in which each link carries a weight.\nOther key terms Other key terms about edges and nodes are:\nSelf-loop. An edge that connects a node to itself. Parallel edges. Multiple edges that connect the same two nodes. Joint nodes or neighbours are those nodes that are directly connected via an edge. A node \\(v_i\\) is adjacent to another node \\(v_j\\) if there is an edge between them. The set of all the neighbours of a node is called its neighbourhood. \\[ \\mathcal{N}(v_i) = \\{ v_j \\in V \\mid (v_i, v_j) \\in E \\} \\] Fig. 2. A graph that has a self-loop and parallel edges. The diagram shows the neighbourhood of a node.\nHaving defined the different kinds of graphs in terms of their nodes and edges, we can look at other properties by checking the structure around a specific node.\nSize is the number of edges. Order is the number of nodes. The degree of a node is the number of edges incident to it. In a graph without self-loops or parallel edges, it is the count of its adjacent nodes. The degree distribution is the distribution of the degrees of all nodes in a graph. 
In directed graphs, there are two types of degrees: the in-degree, for edges directed into the node, and the out-degree, for edges directed outward from the node. The diameter of a graph is the length of the longest shortest path between any pair of nodes. Fig. 3. Degrees and diameter of the graph. The diameter is 4 given the walk (0,1,2,5,6).\nA connected graph is a graph where every pair of nodes is joined by some path. Otherwise, it is a disconnected graph. For a disconnected graph, each disconnected piece is called a component. A directed graph is strongly connected when it is always possible to reach any node from any other node following the edge directions. In directed graphs, we can have strongly connected components and weakly connected components. Strongly connected components are the subgraphs in which any node can reach any other node of the subgraph. Weakly connected components are subgraphs that are connected when edge directions are ignored, even though not all the nodes can reach each other. Graph Traversals\nWhen inspecting a graph we might ask ourselves a lot of questions. For example, in a social network, how many connections do I have to hop through to meet Meryl Streep? In a distribution system, what is the smallest number of hops from one node to another? What is the largest number of hops without repeating a node in a graph?\nThe trip that we have to make to travel from a given node to a second node is called a traversal or walk. A walk is open when the ending node is different from the starting node. If we start and end at the same node, we call it a closed walk.\nA path is a walk in which no node is repeated. When the path is closed, we call it a cycle. The diameter of a graph is the largest number of steps among the shortest paths between all pairs of nodes, which is why it is also called the longest shortest path.\nA trail is a walk that does not repeat edges. A circuit is a closed trail.\nHow do Graph Neural Networks work? 
A Graph Neural Network (GNN) is a deep learning model that allows you to represent and learn from graphs (Scarselli et al., 2009). GNNs were invented because traditional machine learning and deep learning models, such as MLPs, lack inductive biases suited to graphs. In machine learning involving tabular data, images, or text, our data is organized in an expected way, with implicit and explicit rules. When dealing with tabular data, rows are treated as observations and columns as features. But in graph data, relations between &ldquo;rows&rdquo; or instances are meaningful. In GNNs, we have the information of each node, called its embedding, and the links that connect each node to its neighbourhood.\nMain idea of GNNs\nThe essential idea of graph neural networks is to iteratively update the node representations by combining the representations of their neighbours and their own representations.\nThe main objective of a Graph Neural Network is to encode the data structure in order to make predictions. But predictions of what? There are several tasks that we could come up with, but they can be classified as:\nNode-level tasks. Given a graph, classify the nodes, run regression on them, or detect anomalies. Edge-level tasks. Similar to the node-level tasks, but for edges. Graph-level tasks. Graph classification or regression. Graph AutoEncoders (GAE) could also fall into this category. In the first section we saw the importance of the links when defining a graph. GNNs exploit this property in the network architecture: they use a mechanism to encode and exchange information across the graph structure during inference, the graph convolutional layers (GCL). 
The GCN layer \\(l\\) is mathematically defined as\n\\[ \\mathbf{x}_i^{(l)} = \\sum_{j \\in \\mathcal{N}(i) \\cup \\{ i \\}} \\frac{1}{\\sqrt{\\deg(i)} \\cdot \\sqrt{\\deg(j)}} \\cdot \\left(\\mathbf{W}^{\\top} \\cdot \\mathbf{x}_j^{(l-1)} \\right) + \\mathbf{b}, \\]where neighboring node features are first transformed by a learnable weight matrix \\(\\mathbf{W}^{\\top}\\), normalized by their degree, and finally summed up. Lastly, we add the bias vector to the aggregated output.\nHowever, when we talk informally about a GCL, we mean a bundle of sequential layers:\nMessage Passing layer: where information is aggregated from neighbours and updated for each node. Activation Layer: where a non-linearity is applied before the information is passed to the next layer. Dropout layer: randomly switching off some neurons during training to improve generalization. Normalization Layer: normally Batch Normalization, which normalizes the activated outputs to zero mean and a variance of 1. The message-passing method Message passing is designed specifically to work with the graph data structure. For each node in the graph, each message-passing step represents a communication that spans nodes one hop away. If we want our node representations to take into account nodes up to 5 hops away from each node, we need 5 message-passing layers.\nMessage passing can be understood as a form of convolution, but applied to the neighbourhood of a node instead of a neighbourhood of pixels. Through convolution operators, we encode neighbour states into the current node and gather more global information about the graph.\nInductive biases of GNNs\nHence, Graph Neural Networks can be considered a generalization of Convolutional Neural Networks.\nA popular way to introduce GCLs is to break down the filter into two operations:\nAGGREGATE-NODES. 
Given a node \\(v_i\\) and its linked nodes \\(\\mathcal{N}(v_i)\\), we aggregate the information of each neighbour node: \\[ m_i^{(l)} = \\operatorname{AGGREGATE} \\left( \\left\\{ h_j^{(l)} \\mid v_j \\in \\mathcal{N}(v_i) \\right\\} \\right) \\]The aggregation function \\(AGGREGATE\\) is commonly a sum operation, although it can be the mean, minimum, maximum, or multiplication.\nFig. 4. Aggregate function of node \\(6\\). This step aggregates the embeddings of the neighbours of the node by multiplying or summing. The resulting value is the message \\(m_6^{(l)}\\).\nUPDATE-EMBEDDING. Then, we update the embedding of the node \\(v_i\\) with the aggregated information by applying a linear transformation followed by a non-linearity \\(\\sigma\\): \\[ h_i^{(l+1)} = \\sigma \\left(W^{(l)} \\cdot \\operatorname{CONCAT} \\left(h_i^{(l)}, m_i^{(l)} \\right) + b^{(l)} \\right) \\] Fig. 5. We concatenate the current embedding of the node \\(6\\) with the aggregated message of its neighbours. The resulting value is multiplied by the kernel matrix \\(W^{(l)}\\) and added to the bias vector \\(b^{(l)}\\).\nSince we are working with convolutions, a key item must be introduced: the kernel \\(W^{(l)}\\). The kernel is a matrix that we use to transform the input data (from the neighbours) and highlight specific features in it. The kernel is the learnable weight matrix that is optimized by minimizing our loss function, along with the bias \\(b^{(l)}\\).\nInductive biases of GNNs\nGraph-based learning techniques focus on approaches that are permutation invariant or equivariant. This means that the model is not influenced by the ordering of the graph representation. Therefore, if we apply the same shuffle to the rows and the columns of the adjacency matrix, our results should not change.\nAlthough this example helps to understand how GNNs work at first glance, it is not the most widely used formulation. 
The generic message-passing equation is\n\\[ \\mathbf{x}_i^{(l)} = \\gamma^{(l)} \\left( \\mathbf{x}_i^{(l-1)}, \\bigoplus_{j \\in \\mathcal{N}(i)} \\phi^{(l)} \\left( \\mathbf{x}_i^{(l-1)}, \\mathbf{x}_j^{(l-1)}, \\mathbf{e}_{j,i} \\right) \\right) \\]where \\(\\bigoplus\\) is the aggregator function, \\(\\phi^{(l)}\\) is the message function at layer \\(l\\) and \\(\\gamma^{(l)}\\) is the update function. At first, it might look a bit messy, since the order of the steps is not as explicit as in the example above. So let&rsquo;s break it down into parts again.\nThe message function \\(\\phi^{(l)}\\) defines what information each neighbor sends to node \\(i\\). In the first example, the message function consists solely of sending the embedding \\(h_j^{(l)}\\) to node \\(i\\). Nevertheless, differentiable functions such as MLPs (Multi Layer Perceptrons) can be applied. In addition, the message function can also take as input the node embedding \\(h_i^{(l)}\\) as well as the link between the neighbours (useful for weighted graphs). In our example, \\[ \\phi^{(l)} \\left(\\mathbf{x}_i^{(l-1)}, \\mathbf{x}_j^{(l-1)}, \\mathbf{e}_{j,i} \\right) = x_j^{(l-1)}\\] The aggregator function \\(\\bigoplus\\) is a permutation-invariant function: sum, max, mean&hellip; The update function \\(\\gamma^{(l)}\\) is the function that updates the node embedding using the old embedding plus the aggregated message. As you can see, the generic message-passing equation folds the concat step into the update function, and the combination does not need to be a concatenation. Normally, the update functions are also differentiable functions such as MLPs. 
The following equation is the update function for the first example: \\[ \\gamma^{(l)} \\left(\\mathbf{x}_i^{(l-1)}, \\bigoplus_{j \\in \\mathcal{N}(i)} x_j^{(l-1)} \\right) = \\sigma \\left(W^{(l)} \\cdot \\operatorname{CONCAT} \\left(x_i^{(l-1)}, \\bigoplus_{j \\in \\mathcal{N}(i)} x_j^{(l-1)} \\right) + b^{(l)} \\right)\\] How to apply GNNs to downstream tasks? So far, we have introduced how to process the data of the nodes with the information of their neighbours. However, we have not yet applied the information of the nodes to downstream tasks. In a lot of cases, GNNs are used only as encoders, while decoders or head networks, such as MLPs, are used to apply the learned graph representation to downstream tasks such as node classification, edge prediction, or graph classification.\nSince the GNN outputs node embeddings, we need an extra step to solve downstream tasks. For example, we can use an MLP to classify each node given its output embedding (treating the nodes as instances or rows). We could also apply a function to the embeddings of two nodes to get the distance between them and then predict whether they are connected or not. Regarding graph classification, we could apply global pooling to the embeddings of the nodes to get the final representation of the graph and then use an MLP to classify it.\nFig. 6. Given the output of the GNN, we can apply downstream tasks such as node, edge, or graph prediction (regression or classification).\nAttention mechanism in GNNs A GNN with a lot of layers makes the node embeddings more similar to each other, since deeper layers carry more global information than the first layers. Although this can be great for some tasks, the local information of each node vanishes as the embeddings converge. This problem is called over-smoothing. 
A way to check if our GNN over-smooths the embeddings of the nodes is to calculate the similarity (for instance, the cosine similarity) between each embedding and the average embedding of the nodes. If the similarity between the node and the average is close to 1, it means that the GNN is clearly over-smoothing.\nTo over-smooth or not to over-smooth?\nDepending on the case, we do not care if some nodes are &ldquo;over-smoothed&rdquo;, especially those that are central or have high closeness centrality. Furthermore, in some downstream tasks, these nodes might be important. There are different techniques to avoid over-smoothing, such as applying skip connections, similar to U-Net (Ronneberger et al., 2015) or ResNets (He et al., 2016). As you may have noticed, nodes that have more links tend to be more similar to the average embedding. One of the techniques that helps to prevent over-smoothing in these cases is to include attention in the message passing. The attention mechanism allows the network to learn which nodes it has to put emphasis on.\nSee also Introduction to Attention Mechanism and Transformers \u2192 There are two different attention mechanisms that became popular in the literature: GAT (Veli\u010dkovi\u0107 et al., 2018) and GATv2 (Brody et al., 2022). The difference between them is where they apply the non-linearity within the attention mechanism.\nGraph Attention Network (GAT) GAT computes the attention weights from individual node and neighbourhood features, and the resulting ranking of the neighbours is static, as explained below.\nIn standard GAT, the unnormalized attention score is usually written as:\n\\[ e_{i, j} = \\text{LeakyReLU} (a^T [W \\cdot h_i || W \\cdot h_j]) \\]where \\(a\\) is the learnable attention vector, and \\(W\\) is a learnable weight matrix. We can also have two different learnable matrices \\(W_s\\) (for the source \\(h_i\\)) and \\(W_t\\) (for the target \\(h_j\\)).\nFig. 7. 
Unnormalised Attention Score mechanism in GAT.\nThen, softmax is applied to normalize the attention scores.\n\\[ \\alpha_{i, j} = \\text{softmax}_{j \\in \\mathcal{N}(i)} (e_{i, j}) \\]We can express GAT and GATv2 as specific choices of \\(\\phi^{(l)}\\). For example:\n\\[ \\phi^{(l)}_{\\text{GAT}} (x_i^{(l-1)}, x_j^{(l-1)}, e_{i,j}) = \\alpha_{i, j} \\cdot x_j^{(l-1)} \\]We can also add a learnable parameter that transforms the neighbour node embedding before it is weighted by the attention.\n\\[ \\phi^{(l)}_{\\text{GAT}} (x_i^{(l-1)}, x_j^{(l-1)}, e_{i,j}) = \\alpha_{i, j} \\cdot W^{(l)} \\cdot x_j^{(l-1)} \\]The problem found in GAT is subtle: writing the attention vector as \\(a = [a_s || a_t]\\), the score inside the LeakyReLU can be decomposed as:\n\\[ a^T [W h_i || W h_j] = a_s^T \\; W \\; h_i + a_t^T \\; W \\; h_j \\]For a fixed source node \\(i\\), the term \\(a_s^T \\; W \\; h_i \\) is constant across all its neighbours \\(j\\). Since LeakyReLU is monotonic, the ranking is therefore determined by \\(a_t^T \\; W \\; h_j \\). This kind of attention mechanism is called static, since the ranking of the neighbours does not depend on the source node but only on the target node. That means that the attention could be calculated before knowing the link between the nodes. In other words, the attention can be calculated and cached for the whole graph.\nGATv2 GATv2 changes the order of operations in the attention mechanism.\n\\[ e_{i, j} = a^T \\text{LeakyReLU} ( W [ h_i || h_j]) = a^T \\text{LeakyReLU} ( W_s h_i + W_t h_j) \\]As in GAT, we can have two learnable parameters \\(W_s\\) and \\(W_t\\) for the attention scores, one for the source node and one for the target node, plus the attention vector \\(a\\). Then, we apply softmax to normalize the attention scores.\n\\[ \\alpha_{i,j} = \\text{softmax}_j \\left( e_{i, j} \\right) \\]The main difference is that the non-linearity is applied before the final attention projection. This allows the source node \\(i\\) to influence the ranking of the neighbours \\(j\\). 
This small change makes GATv2 a dynamic attention mechanism, more expressive than GAT while keeping the same parametric cost. Since this mechanism implies calculating the attention for each link, the attention cannot be cached once for the whole graph; it has to be computed for each node and its neighbours. Hence, the number of links of each node (its degree) increases the cost.\nUsing the generic message-passing equation, we can rewrite the GATv2 attention function as we did with GAT:\n\\[ \\phi^{(l)}_{\\text{GATv2}} (x_i^{(l-1)}, x_j^{(l-1)}, e_{i,j}) = \\alpha_{i,j} \\; \\text{LeakyReLU} ( W_s h_i + W_t h_j), \\]but the difference is in the attention score already explained. We can keep \\(\\gamma^{(l)} \\) and \\(\\bigoplus\\) the same.\nIt is important to note that the attention mechanism adds new learnable parameters to the model, which can significantly increase the total number of parameters. However, the attention mechanism can help not only to prevent over-smoothing but also to detect specific important links in the graph.\nConclusion Graph Neural Networks extend deep learning to data where the relationships between entities are as important as the entities themselves. Instead of processing nodes independently, GNNs exploit the graph structure through message passing, allowing each node to update its representation by combining its own features with information coming from its neighbourhood.\nAlthough GNNs add a new dimension to deep learning and graph understanding, stacking many message-passing layers can lead to over-smoothing. Attention mechanisms help address this issue by allowing the model to learn which neighbours should contribute more strongly to each node update.\nReferences Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., &amp; Monfardini, G. (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61\u201380. https:\/\/doi.org\/10.1109\/TNN.2008.2005605\nHe, K., Zhang, X., Ren, S., &amp; Sun, J. (2016). 
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770\u2013778). IEEE. https:\/\/doi.org\/10.1109\/CVPR.2016.90\nRonneberger, O., Fischer, P., &amp; Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, &amp; A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention \u2013 MICCAI 2015 (pp. 234\u2013241). Springer. https:\/\/doi.org\/10.1007\/978-3-319-24574-4_28\nVeli\u010dkovi\u0107, P., Cucurull, G., Casanova, A., Romero, A., Li\u00f2, P., &amp; Bengio, Y. (2018). Graph attention networks. International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rJXMpikCZ\nBrody, S., Alon, U., &amp; Yahav, E. (2022). How attentive are graph attention networks? International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=F72ximsx7C1\nCitation @article{alas2026, title = &#34;Under the Hood of Graph Neural Networks: Message Passing, Over-Smoothing and Attention&#34;, author = &#34;Al\u00e0s Cerc\u00f3s, Oriol&#34;, journal = &#34;oriolac.github.io&#34;, year = &#34;2026&#34;, month = &#34;June&#34;, url = &#34;https:\/\/oriolac.github.io\/posts\/20260624-gnns\/&#34; } ","permalink":"https:\/\/oriolac.github.io\/posts\/20260624-gnns\/","summary":"<p>In this post we will present an introduction of how <strong>Spatial Graph Neural Networks (GNNs)<\/strong> or <strong>Graph Convolutional\nNeural\nNetworks (GCNs)<\/strong> work. First, we are going to define graph data structures. Then, we are going to explain the mechanism\non GNNs. 
And finally, we will explain how to incorporate an attention mechanism in the network.<\/p>\n<div class=\"callout callout-info\" role=\"note\">\n  <div class=\"callout-body\">\n    <p class=\"callout-title\">\n    <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\"><circle cx=\"12\" cy=\"12\" r=\"10\"\/><path d=\"M12 16v-4\"\/><path d=\"M12 8h.01\"\/><\/svg>\n\n      Notation of GNNs<\/p>\n    <div class=\"callout-content\"><blockquote>\n<p>During the whole text, we will use the notation of GNN as Spatial Graph Neural Network, although GCN or Graph\nConvolutional Neural Network is another notation to say it. There are other types of GNNs like Spectral Graph Neural\nNetworks, but in this post we will focus on the first mentioned ones.<\/p>","title":"Under the Hood of Graph Neural Networks: Message Passing, Over-Smoothing and Attention"},{"content":"Object detection is one of the most popular tasks in computer vision, since it can be applied to a wide range of applications: robotics, autonomous driving or fault detection. In this post, we will try to give a brief overview of the YOLO algorithm and the components that make it work.\nTo do that, I have classified the main components of the algorithm into three categories:\nCharacteristics based on the model architecture, how YOLO-based models improved the performance by using a new architecture and which are the improvements made. Strategies based on the model training, such as the function loss or data augmentation. Methods for post-processing the output of the model, such as the non-maximum suppression (NMS) and the confidence threshold. Two-stage vs One-stage Detectors Before YOLO, SoTA detectors were based on a two-stage detector: the first stage is used to detect the bounding boxes, and the second stage is used to classify the bounding boxes. 
These models are called region-based detectors, because they need the region first in order to run the classification.\nFig. 1. RCNN architecture. (Bhalla, 2022)\nIn contrast, YOLO is a one-stage detector: YOLO models skip the first stage, run directly over a dense sampling of possible locations, and give the bounding boxes and the classification all at once. The first idea behind YOLO was to reduce the computational cost of the region-based models (increasing the FPS) while maintaining the accuracy or decreasing it only slightly. The Single Shot MultiBox Detector (SSD), introduced in 2016, is another well-known one-stage detector built on the same idea.\nFrames per second (FPS)\nFPS measures how many frames a model can process per second, which is the key metric for real-time inference. A detector running at 7 FPS cannot drive a car or track objects in live video; one running at 45 FPS can. The architecture: Backbone, neck and head A YOLO model usually has three main parts:\nThe backbone network, which extracts features from the input image. The backbone progressively reduces spatial resolution while increasing semantic richness. The neck, which combines the feature maps from the backbone with convolutional layers. In this way, the neck helps the detector handle objects of different sizes. The head, which produces the final prediction. Fig. 2. YOLO-based network architecture. Check image source\nThese parts are strongly connected to the idea of multi-scale detection. YOLO does not predict bounding boxes from only one feature map. Instead, it predicts objects at different scale resolutions. For example, if the input image has size 640 \u00d7 640, a YOLO model may produce three detection scales:\nP3 -&gt; 80 x 80 -&gt; small objects P4 -&gt; 40 x 40 -&gt; medium objects P5 -&gt; 20 x 20 -&gt; large objects At the beginning of the backbone, the feature maps have high spatial resolution and contain low-level information, such as edges, corners, textures, and small visual patterns. 
As the image passes through deeper layers, the spatial resolution decreases, but the semantic meaning of the features increases. The backbone network gives intermediate feature maps to the neck according to the idea of multi-scale detection. Usually the backbone is already pre-trained on a large dataset.\nSince the backbone is meant to extract features from the input image at different scales, it is important to design a network that does not overuse stride operations or pooling layers, as they reduce the spatial resolution and can discard the fine detail that small objects depend on.\nThe neck takes the feature maps from the backbone and mixes these features so that each detection scale benefits from both high-resolution detail and high-level semantics.\nThe detection heads are the final prediction layers. Usually, YOLO has one detection head per scale. Each head predicts bounding boxes (4 values), confidence score (1 value), and class probabilities (C values) for its corresponding feature map. The final prediction output of each head has the shape \\(S \\times S \\times (5B + C) \\). If there are 3 different scales, then the output will have three tensors with the previous shape, each with its own grid size \\(S\\).\nFig. 3. YOLOv4 Architecture. (Terven &amp; Cordova-Esparza, 2023)\nB is the number of bounding boxes per grid cell. It is a positive integer: 1, 2 or more. If B is 1, then the model predicts only one bounding box per grid cell.\nThe stride The stride (s) tells you how much the input image has been downsampled: with a 640 \u00d7 640 input, a stride of 8 gives a grid of 80 \u00d7 80 cells. Therefore, each cell in the feature map corresponds to a region of the original image. The \\(5B + C\\) part of the output holds, for each of the B boxes, the bounding box (x, y, width and height) and its confidence score, plus the C class probabilities.\nThis is important because YOLO predicts object centers relative to grid cells. Small objects need high-resolution feature maps, so they are usually predicted at a lower stride, such as stride 8.\nFig. 4. 
Example of strides using different scales, where the centroid of the bounding box determines which grid cell of the image is responsible for the prediction.\nThe confidence score A YOLO head usually predicts something like: tx, ty, tw, th, objectness, class probabilities While tx and ty are the predicted center offsets of the bounding box and tw and th are the predicted width and height of the bounding box, the objectness value is the probability that an object exists in this prediction. The confidence score is commonly calculated by multiplying the objectness value by the class probabilities.\n\\[\\text{Confidence score} = \\text{Objectness} \\times \\text{Class probability}\\]The confidence score is used to remove weak predictions from the output of the model, which reduces the number of low-quality detections.\nThe anchor boxes YOLO models can be divided into two families:\nAnchor-free models, such as YOLOX (Ge et al., 2021). Anchor-based models, such as YOLOv3 (Redmon &amp; Farhadi, 2018). Anchors are predefined box shapes. YOLO models use anchors to predict offsets relative to these boxes. In an anchor-based model with three scales, we can have the following anchors:\np3 = [(62, 66), (45, 213), (105, 104)] p4 = [(196, 76), (153, 143), (96, 316)] p5 = [(266, 266), (350, 465), (420, 500)] Anchors are usually defined by clustering the box sizes of the objects in the training set with an unsupervised clustering algorithm, such as k-means.\nWhen using anchors, the output of the model is a tensor of shape \\(S \\times S \\times A \\times (5B + C) \\), where A is the number of anchors per grid cell. Suppose we are at grid cell (i, j) on a feature map with stride s. The model predicts raw values: tx, ty, tw, th. Hence, a classical YOLO-style decoding example with an anchor of width 40 and height 60 is:\nbx = (sigmoid(tx) + j) * stride # e.g. (0.55 + 30) \u00d7 8 = 244.4 by = (sigmoid(ty) + i) * stride # e.g. (0.40 + 20) \u00d7 8 = 163.2 bw = anchor_w * exp(tw) # e.g. 40 \u00d7 1.10 = 44.0 bh = anchor_h * exp(th) # e.g. 
60 \u00d7 0.82 = 49.2 In modern anchor-free YOLO variants such as YOLOX, anchors are not used explicitly. Instead, the model directly predicts box distances or center-based boxes.\nDecoupled head YOLOX uses a decoupled head, a significant departure from the single-head design of the previous YOLO models.\nIn traditional YOLO models, the head predicts object classes and bounding box coordinates using the same set of features. This approach simplified the architecture back in 2015, but it had a drawback. It can lead to suboptimal performance, since classification and localization of the object are performed using the same set of extracted features, and the two tasks can conflict. Therefore, YOLOX introduced a decoupled head.\nFig. 5. YOLOX decoupled head architecture. (Ge et al., 2021)\nThe decoupled head consists of two separate branches:\nClassification Branch. Focuses on predicting the class probabilities for each object in the image. Regression Branch. Concentrates on predicting the bounding box coordinates and dimensions for the detected objects. See also Loss Functions and Activation Functions \u2192 Model training Unlike image classification, where the model predicts one label for the whole image, object detection requires solving several problems at the same time: deciding whether an object exists in a given location, estimating the coordinates of its bounding box, and assigning the correct class. For this reason, YOLO training is usually based on a multi-part loss function that combines localization, objectness, and classification terms.\nIntersection over Union (IoU) Intersection over Union (IoU) is a measure of the similarity between two bounding boxes. It is the ratio between the area of the intersection and the area of the union.\n\\[IoU = \\frac{\\text{Area of Intersection}}{\\text{Area of union}}\\]It is used in several steps of the training process:\nTraining loss. Anchor assignment. 
Post-processing techniques such as non-maximum suppression (NMS). Evaluation metrics such as mAP. Fig. 6. Intersection over Union (IoU) formula.\nmAP \u2014 Mean Average Precision\nmAP is the standard metric for comparing object detectors across classes and IoU thresholds. For each class, predictions are sorted by confidence score and a precision-recall curve is computed. The area under that curve is the Average Precision (AP) for that class. mAP averages AP over all \\(C\\) classes:\n\\[mAP = \\frac{1}{C}\\sum_{c=1}^{C}AP_c, \\qquad AP_c = \\sum_{n} (R_n - R_{n-1})\\, P_n\\]where \\(R_n\\) is the recall and \\(P_n\\) the best precision at each confidence threshold.\nLoss function The loss function tells the model how wrong its predictions are during training. The YOLO loss trains the model to solve three tasks at the same time:\nThe box loss: localize the object correctly. The objectness loss: predict whether an object exists. The class loss: classify the object correctly. A simplified YOLO loss can be written as:\n\\[ L = \\lambda_{\\text{box}}L_{\\text{box}} + \\lambda_{\\text{obj}}L_{\\text{obj}} + \\lambda_{\\text{cls}}L_{\\text{cls}} \\]where \\(\\lambda_{\\text{box}}\\), \\(\\lambda_{\\text{obj}}\\), and \\(\\lambda_{\\text{cls}}\\) are weighting factors used to balance the contribution of each term.\nThe box loss The box loss measures how well the predicted bounding box matches the ground-truth box. Older YOLO versions used mean squared error (MSE) over the box coordinates, but modern YOLO models usually use IoU-based losses because they are more directly aligned with the object detection objective. A simple IoU loss can be defined as:\n\\[ L_{\\text{box}} = 1 - IoU(b, \\hat{b}) \\]where \\(b\\) is the ground-truth box and \\(\\hat{b}\\) is the predicted box.\nHowever, modern detectors often use more advanced variants such as GIoU, DIoU, or CIoU. 
For example, CIoU includes not only the overlap between boxes, but also the distance between their centers and their aspect-ratio consistency:\n\\[ L_{\\text{CIoU}} = 1 - IoU + \\frac{\\rho^2(b, \\hat{b})}{c^2} + \\alpha v \\]where \\(\\rho^2(b, \\hat{b})\\) is the squared distance between the centers of the predicted and ground-truth boxes, \\(c^2\\) is the squared diagonal length of the smallest enclosing box, and \\(\\alpha v\\) penalizes differences in aspect ratio.\nThe objectness loss The objectness loss teaches the model whether a prediction contains an object. YOLO makes thousands of predictions per image. Therefore, it is important to balance the false positives. How? By adding weights when the model predicts a positive object. For example, a training implementation may use two different weights:\nneg_obj_weight_with_pos: the weight applied to negative predictions in a scale where at least one positive object exists. neg_obj_weight_no_pos: the weight applied to negative predictions in a scale where no positive object exists. This distinction is useful in multi-scale YOLO training. Suppose that an image contains a small object assigned to the P3 scale, but no objects are assigned to P4 or P5. In that case, P3 contains both positive and negative samples, while P4 and P5 contain only negative samples. If the loss gives too much weight to all negative predictions, the model may learn to predict background everywhere and become too conservative. These weights help balance the objectness loss so that negative examples are useful but do not dominate the training signal.\nThe class loss The class loss teaches the model which class is present in a positive prediction. There are two common ways to compute it. 
If each object belongs to exactly one class, the model can use a softmax activation followed by categorical cross-entropy:\n\\[ L_{\\text{cls}} = - \\sum_{c=1}^{C} y_c \\log(\\hat{p}_c) \\]where \\(y_c\\) is the ground-truth class indicator and \\(\\hat{p}_c\\) is the predicted probability for class \\(c\\).\nHowever, many YOLO implementations use binary cross-entropy independently for each class:\n\\[ L_{\\text{cls}} = - \\sum_{c=1}^{C} \\left[ y_c \\log(\\hat{p}_c) + (1-y_c)\\log(1-\\hat{p}_c) \\right] \\]This formulation treats class prediction as \\(C\\) independent binary classification problems. It is especially useful when multi-label classification is possible, although it is also commonly used in single-label YOLO detectors.\nCheck that each stride \\((S, S)\\) has only one positive prediction. Therefore, the model cannot learn to predict two objects in the same location.\nData Augmentation Data augmentation is another important part of YOLO training. Its goal is to expose the model to more visual variation without manually collecting more data. Common augmentations include random scaling, cropping, horizontal flipping, color jittering, mosaic augmentation, and MixUp.\nMixUp combines two images and their labels into a single training example. The resulting image is a weighted combination of both images:\n\\[ \\tilde{x} = \\lambda x_1 + (1-\\lambda)x_2 \\]where \\(x_1\\) and \\(x_2\\) are two training images, and \\(\\lambda\\) controls how much each image contributes to the final mixed image.\nFig. 5. MixUp Data Augmentation. Check source\nMixUp is used for example in YOLOX models and it has been found to be more effective in larger models.\nPost-processing After the model produces its raw predictions, these outputs still need to be converted into final detections. A YOLO model usually predicts many candidate boxes for the same object, many low-confidence boxes, and sometimes overlapping detections from different scales. 
Post-processing transforms these dense predictions into a clean final set of bounding boxes by applying confidence filtering, decoding the predicted coordinates, and removing duplicated detections with non-maximum suppression.\nNon-maximum suppression (NMS) Non-Maximum Suppression (NMS) is used to remove duplicate detections in bounding box prediction. YOLO often predicts many boxes around the same object. NMS keeps the strongest one and removes highly overlapping boxes.\nFig. 6. Non-Maximum Suppression. (Terven &amp; Cordova-Esparza, 2023)\nThe process is as follows:\nSort the predictions by their confidence score. For each class: Keep the box with the highest confidence score. Remove the other boxes whose IoU overlap with the kept box is high. Repeat with the remaining boxes until all boxes have been processed. Class-agnostic vs class-aware NMS\nClass-agnostic NMS suppresses overlapping boxes regardless of predicted class, which is simpler and faster. Class-aware NMS applies suppression separately per class, which matters when two objects of different classes genuinely overlap in the same region (e.g., a person holding a tennis racket). Bhalla, D. (2022, June). Region Proposal Network (RPN): A complete guide. ListenData. https:\/\/www.listendata.com\/2022\/06\/region-proposal-network.html\nGe, Z., Liu, S., Wang, F., Li, Z., &amp; Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2107.08430\nRedmon, J., &amp; Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv. https:\/\/doi.org\/10.48550\/arXiv.1804.02767\nTerven, J., &amp; Cordova-Esparza, D. (2023). A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Machine Learning and Knowledge Extraction, 5, 1680\u20131716. 
https:\/\/doi.org\/10.3390\/make5040083\n@article{alas2026, title = &#34;Reviewing YOLO: You Only Look Once&#34;, author = &#34;Al\u00e0s Cerc\u00f3s, Oriol&#34;, journal = &#34;oriolac.github.io&#34;, year = &#34;2026&#34;, month = &#34;April&#34;, url = &#34;https:\/\/oriolac.github.io\/posts\/20260501-yolo\/&#34; } ","permalink":"https:\/\/oriolac.github.io\/posts\/20260501-yolo\/","summary":"<p>Object detection is one of the most popular tasks in computer vision, since it can be applied to a wide range of\napplications: robotics, autonomous driving or fault detection. In this post, we will try to give a brief overview of\nthe YOLO algorithm and the components that make it work.<\/p>\n<p>To do that, I have classified the main components of the algorithm into three categories:<\/p>\n<ul>\n<li>Characteristics based on the <strong>model architecture<\/strong>, how YOLO-based models improved the performance by using a new\narchitecture and which are the improvements made.<\/li>\n<li>Strategies based on the <strong>model training<\/strong>, such as the function loss or data augmentation.<\/li>\n<li>Methods for <strong>post-processing the output<\/strong> of the model, such as the non-maximum suppression (NMS) and the\nconfidence threshold.<\/li>\n<\/ul>\n<h2 id=\"two-stage-vs-one-stage-detectors\">Two-stage vs One-stage Detectors<\/h2>\n<p>Before YOLO, SoTA detectors were based on a <strong>two-stage detector<\/strong>: the first stage is used to detect the bounding\nboxes,\nand the second stage is used to classify the bounding boxes. This kind of model is called region-based detectors,\nbecause they need the region to then run the classification.<\/p>","title":"Reviewing YOLO: You Only Look Once"},{"content":"When making the first steps with deep learning, we grasp the idea of using a neural network to learn a function that maps data to other data. 
We are often told that neural networks are a powerful tool in machine learning because of their non-linearity and their ability to learn complex functions from data, which results in minimizing some loss function. In this post, we will explore how the final-layer activations depend on the loss function of our problem.\nBefore diving into each loss function, here is a quick reference of which activation and loss function to use depending on your classification case:\nFig. 1. Recommended activation and loss function per classification case.\nActivation functions The activation function is a function that maps the output of a layer to another value. These functions are used to introduce non-linearity into the network, allowing it to learn more complex relationships between inputs and outputs. They are typically applied element-wise to the output of a layer before passing it to the next layer.\nIn this post, we will focus on the most common activation functions used in deep learning. Of course, there are many others! I encourage you to explore them and find the one that best suits your problem.\n1. ReLU function The Rectified Linear Unit activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise.\n\\[\\text{ReLU}(x) = \\text{max}(0, x)\\]Although ReLU looks like a linear function, it is a nonlinear function, which allows complex relationships to be learned and lets gradients flow through all the hidden layers.\nFig. 2. ReLU function.\nThere are a lot of variants of the ReLU function, such as Leaky ReLU, Parametric ReLU, and Exponential Linear Unit (ELU), used for GANs, smoother loss landscapes and faster model performance respectively.\n2. Sigmoid function The sigmoid function is a smooth, continuous function that maps real-valued inputs to the range \\((0, 1)\\). That means that the output of the sigmoid function is always strictly between 0 and 1. 
Large negative numbers will become close to 0, while large positive numbers will become close to 1.\n\\[ \\sigma(x) = \\frac{1}{1 + e^{-x}} \\]As its range is between 0 and 1, it is ideal for predicting probabilities of an event.\nWe can understand classification as predicting a probability and then applying a threshold to decide whether the prediction is positive or negative.\nFig. 3. Sigmoid function.\nHowever, let&rsquo;s take a look at its derivative: \\[ \\frac{\\partial \\sigma(x)}{\\partial x} = \\sigma(x) (1 - \\sigma(x)) \\] Fig. 4. Derivative of Sigmoid function. We can see the value of the derivative of the sigmoid evaluated at x.\nWe can see that the gradients of the sigmoid function are really small when \\(x \\in (-\\infty, -3] \\cup [3, +\\infty)\\). This means that when the inputs of the neurons are large in magnitude, the gradients are tiny and the neurons are not able to learn. That is why this activation function is mainly suitable for final layers.\n3. Tanh function The tanh function is really similar to the sigmoid function, but its output has a range of \\((-1, 1)\\). Hence, tanh outputs are zero-centered, which leads to better convergence compared to sigmoid.\n\\[ \\tanh(x) = \\frac{e^x - e^{-x}}{e^x + e^{-x}} \\] Fig. 5. Derivative of Tanh function.\nThe derivatives of the tanh are larger than the derivatives of the sigmoid, which helps us minimize the cost function faster for inputs within \\([-3, 3]\\). However, like sigmoid, the gradient values become close to zero for a wide range of values. Thus, the network either stops learning or learns at a very slow rate.\n\\[ \\frac{\\partial \\tanh(x)}{\\partial x} = 1 - \\tanh^2(x) \\] Fig. 6. Derivative of Tanh function. We can see the value of the derivative of the tanh evaluated at x.\nThe famous problem of having really small gradient values is known as the vanishing gradient problem, and it has been a problem for a long time.\n4. 
Softmax function We have seen functions like sigmoid and tanh, which are used to map the output of a neuron to a range of values. However, we can also use a function called softmax to map the output of a layer to a probability distribution.\n\\[ \\text{Softmax}(x)_i = \\frac{e^{x_i}}{\\sum_j e^{x_j}} \\]The softmax function can be imagined as a combination of multiple sigmoids which returns the probability of a datapoint belonging to each individual class in a multiclass classification problem. The sum of all the output probabilities is always 1, since they are normalized. This function is widely used in deep learning to map the output of a layer to a probability distribution and for multi-class one-label classification.\nTo know which class the neural network thinks the input belongs to, we can use argmax to get the class with the highest probability.\nLoss functions As you may already know, the goal of a neural network is to learn a function that maps data to other data. In order to make the network understand how far or near it is from the desired output, we need to define a loss function. Therefore, loss functions are used to measure the distance between the output of the network and the desired output, and they are the function to optimize.\nIn this post, we will focus on the most common loss functions used in deep learning. Of course, there are many others! In fact, there are many different loss functions for different types of problems. In physics simulations, for example, physics formulas are used to define loss functions.\n1. Mean-squared error (MSE) or L2 loss Mean-squared error (MSE) or L2 loss is a loss function that measures the average squared difference between the predicted values and the actual values. It is commonly used for regression problems, where the goal is to predict a continuous value. 
The formula for MSE is:\n\\[ \\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 \\]When the expected output of the network lies within a bounded range, for example, \\(y \\in [0, 1]\\), the MSE loss function can work well with final-activation layers such as sigmoid or tanh. If, for example, the expected output is between \\([0, 1000]\\), we can simply scale the output accordingly, multiplying the last neuron output by 1000. However, when the expected output range is not clear or we suffer badly from the vanishing gradient problem, we can consider using ReLU instead.\nOne of the problems of MSE is that it is not robust to outliers in the data and penalizes high and low predictions quadratically.\n2. Mean-absolute error (MAE) or L1 loss Mean-absolute error (MAE) or L1 loss is a loss function that measures the average absolute difference between the predicted values and the actual values.\n\\[ \\text{MAE} = \\frac{1}{n} \\sum_{i=1}^{n} |y_i - \\hat{y}_i| \\]It is used when the expected output is a continuous value and we do not want the model to be dominated by outliers. What does that mean? It does not imply that outliers are unimportant; rather, when we use MSE, the error grows quadratically, whereas MAE grows linearly. This means that a single extreme outlier can pull the model much more strongly when using MSE than when using MAE. In contrast, with MAE, outliers have a more limited influence on the training process, allowing the model to focus more on the bulk of the data.\nThe choice between MAE (Mean Absolute Error) and MSE (Mean Squared Error) is fundamentally about how you want to treat errors and how that impacts optimization.\n3. Binary cross-entropy loss BCE loss is the default loss function used for binary classification tasks. It is a loss function that measures the probability of the predicted class versus the actual class. For that, it uses the logarithm of the probability of the predicted class. 
The formula for BCE loss is:\n\\[ \\text{BCE} = -y \\log(\\hat{y}) - (1 - y) \\log(1 - \\hat{y}) \\]where \\(y\\) is the actual class and \\(\\hat{y}\\) is the predicted probability of the positive class.\nBCE loss only requires one output neuron to classify the data into two classes. The range of this neuron is between 0 and 1. Therefore, the appropriate activation function is sigmoid. A drawback is that, in this single-output form, BCE loss can only be used for binary classification.\n4. Categorical cross-entropy loss Categorical cross-entropy loss is a loss function that compares the predicted class distribution with the actual class distribution. It is used for multi-class classification problems. The formula for Categorical cross-entropy loss is:\n\\[ \\text{CCE} = -\\sum_i y_i \\log(\\hat{y}_i) \\]The main idea here is that we are not only considering one neuron, but the whole resulting output vector of probabilities of the network. Hence, each output neuron of the neural network must be between 0 and 1. But not only that! The sum of the output neurons must be equal to 1. If we have paid attention before, we have seen that the softmax function is used to map the output of a layer to a probability distribution. So, the best way to use this function is to use it as the final-activation layer of the network for this kind of problem.\nThis loss function is useful when we have multiple classes and we want to measure the probability of each class. For instance, if we want to do a multi-class classification having \\(K\\) classes but only one accepted class for each sample, we can use a softmax function to map the output of the network to this exact probability distribution.\n5. 
Sparse Categorical cross-entropy loss Sparse Categorical Cross-Entropy (SCCE) loss is a variant of categorical cross-entropy used for multi-class classification problems where each sample belongs to exactly one class, but the ground-truth labels are provided as integer indices instead of one-hot encoded vectors.\nFor example, if we have 4 classes, instead of representing the target as:\n\\[ [0, 0, 1, 0] \\]we can simply represent it as:\n\\[ y = 2 \\]where 2 is the index of the correct class.\nThe loss is defined as:\n\\[ \\text{SCCE} = - \\log(\\hat{y}_y) \\]where \\(y\\) is the true class index and \\(\\hat{y}_y\\) is the predicted probability assigned to that correct class.\nThis loss is mathematically equivalent to categorical cross-entropy, but it is more convenient when the labels are already encoded as integers, since we do not need to transform them into one-hot vectors. This can also reduce memory usage when dealing with a large number of classes.\nAs in categorical cross-entropy, the network output must represent a valid probability distribution. Therefore, the most common final activation function is softmax, which ensures that:\neach output value is between 0 and 1, the sum of all output values is equal to 1. Sparse categorical cross-entropy is commonly used in practice because many datasets already store labels as integers. In frameworks such as PyTorch, this behavior is the default for multi-class classification losses such as CrossEntropyLoss.\nIn short, if we have a multi-class classification problem with one valid class per sample:\nuse categorical cross-entropy when the labels are one-hot encoded, use sparse categorical cross-entropy when the labels are integer class indices. 6. Kullback-Leibler divergence Kullback-Leibler divergence, also known as KL divergence, is a measure of how different one probability distribution is from another. 
Instead of comparing a single predicted value against a target value, KL divergence compares two full probability distributions.\nIt is defined as:\n\\[ D_{KL}(P \\parallel Q) = \\sum_i P(i)\\log\\left(\\frac{P(i)}{Q(i)}\\right) \\]where:\n\\(P\\) is the true or reference probability distribution, \\(Q\\) is the predicted or approximated probability distribution. The intuition behind this formula is that it measures how much information is lost when we use \\(Q\\) to approximate \\(P\\). If both distributions are identical, the KL divergence is equal to 0. The more different they are, the higher the divergence becomes.\nKL divergence is not symmetric, which means that: \\[ D_{KL}(P \\parallel Q) \\neq D_{KL}(Q \\parallel P) \\] Therefore, changing the order of the distributions changes the result.\nKL divergence is strongly related to cross-entropy. In fact, cross-entropy can be decomposed as:\n\\[ H(P, Q) = H(P) + D_{KL}(P \\parallel Q) \\]where \\(H(P)\\) is the entropy of the true distribution. Since \\(H(P)\\) is constant with respect to the model, minimizing the cross-entropy is equivalent to minimizing the KL divergence between the true and predicted distributions.\nThis is why cross-entropy is such a natural choice for classification: it encourages the model to make its predicted distribution as close as possible to the target distribution.\nKL divergence is especially useful when the target is not just a single correct class, but a full distribution. Some common use cases are:\nVariational Autoencoders (VAEs), where the latent distribution is forced to be close to a prior distribution, Knowledge distillation, where a smaller model learns to imitate the soft probability outputs of a larger model, Probabilistic modeling, where comparing distributions is more meaningful than comparing scalar values. 
In practice, we can think of the difference as follows:\nCross-entropy losses focus on predicting the correct class, KL divergence focuses on matching the full probability distribution. Therefore, KL divergence is particularly useful when we care not only about the final decision, but also about the structure of the predicted probabilities.\n","permalink":"https:\/\/oriolac.github.io\/posts\/20260410-loss-functions-activations\/","summary":"<p>When making the first steps with deep learning, we grasp the idea of using a neural network to learn a function that\nmaps data to other data. We are often told that neural networks are a powerful tool in machine learning because of their\nnon-linearity and their ability to learn complex functions from data, which results in minimizing some loss function. In\nthis post, we will explore how the final-layer activations depend on the loss function of our problem.<\/p>","title":"Loss functions and their final-layer activations"},{"content":"Overview The InnWater Water Tariff Dashboard is an AI-augmented decision-support system designed to simulate, analyze, and optimize water pricing structures. 
Developed within the Horizon Europe InnWater project, it supports sustainable, equitable, and economically efficient tariff design in multi-level water governance contexts.\nI contributed as AI Engineer\/System Architect at Eurecat, integrating the AI Assistant layer and supporting backend architecture for economic simulation workflows.\n\ud83d\udca7 Water Tariff Simulation &amp; Assessment Water Tariff Dashboard.\nThe dashboard enables structured tariff analysis through:\nSimulation of alternative pricing schemes: Flat rates Increasing block tariffs Progressive consumption models Comparison between sanitation and non-sanitation subscribers Comparison between poor and non-poor groups Environmental and resource cost internalization The tool converts tariff design into a reproducible, scenario-based analytical workflow.\n\ud83e\udd16 AI Assistant \u2013 RAG-based Economic Interpretation Innwater AI Assistant Diagram\nThe Water Tariff Dashboard integrates a Retrieval-Augmented Generation (RAG) Agent architecture, shared across the InnWater platform.\nArchitecture Query treatment module Hierarchical retrieval over indexed project deliverables Embedding-based semantic search LLM-grounded response generation Logging &amp; evaluation via golden dataset benchmarking Capabilities Interprets tariff simulation outputs Explains affordability indicators in policy terms Highlights trade-offs between equity and cost recovery Suggests scenario adjustments Connects tariff outcomes with governance gaps and CGE model results Multilingual interaction The AI layer transforms the dashboard from a numerical simulator into an AI-assisted policy analysis tool.\n\ud83c\udfd7 System Architecture Frontend Angular 15 Bootstrap 5 Chart.js &amp; D3.js Interactive scenario comparison Backend FastAPI (Python) PostgreSQL SQLAlchemy ORM Pandas &amp; NumPy for economic modeling AI Layer RAG pipeline Semantic embeddings LLM response generation Bias and robustness evaluation Deployment Docker-based 
containerization Integrated within the InnWater Governance Platform Impact Operationalized water tariff theory into an AI-supported economic decision tool Enabled evidence-based pricing design for water utilities Bridged tariff modeling with governance and macroeconomic simulation Demonstrated responsible AI deployment in public-sector economic policy ","permalink":"https:\/\/oriolac.github.io\/projects\/innwater-tariff\/","summary":"AI-augmented economic simulation platform for sustainable and equitable water tariff design within the WEFE nexus.","title":"InnWater \u2013 Water Tariff Dashboard"},{"content":"The post of today is going to be a bit different. We have already talked about Variational Autoencoders (VAE) in the past, but today we are going to see how to implement it from scratch, train it on a dataset and see how it behaves with tabular data. Yes, VAEs can be used for tabular data as well. To do so, we will use the CRISP-DM framework to guide us through the process.\nBusiness Understanding The dataset used in this project is obtained from a public GitHub repository and contains multivariate measurements collected by a low-cost IoT air-quality station. These stations typically combine semiconductor gas sensors (e.g., MQ-series) with environmental sensors to capture methane concentration, humidity, temperature, and moisture, along with raw sensor resistance readings.\nFig. 1. Timeline of embedding models described in this post. Methane (CH\u2084) is a key atmospheric gas with strong implications for environmental monitoring, industrial safety, and public health. Sudden increases in methane concentration can indicate gas leaks, malfunctioning equipment, incomplete combustion, or abnormal environmental conditions.\nMethane is influenced by several environmental and sensor-derived variables, and therefore offers a realistic dependency structure. 
However, the goal of this project is not to build a production-ready predictive model, but rather to create a controlled scenario that allows us to understand how Variational AutoEncoders (VAE) behave when trained on multivariate IoT sensor data and how other regression models can improve their performance using synthetic data. To do this we will generate synthetic datasets that mimic real sensor readings and then define a lightweight regression task focused on methane concentration.\nData Understanding The dataset consists of eight environmental and sensor-derived variables together with a target variable representing methane concentration (in ppb). All variables are continuous and numeric but the dataset does not include timestamps. Therefore, we will assume (for this post) that the data is not a time-series dataset and the model cannot leverage temporal patterns, lagged dependencies, or sequence-based correlations. Maybe in future posts I will use this data treating consecutive rows as consecutive timestamps.\nimport pandas as pd df = pd.read_csv( &#34;https:\/\/raw.githubusercontent.com\/gungunpandey\/Synthetic-Data-Generation\/refs\/heads\/main\/Dataset%201.csv&#34;) Correlation matrix If we create the correlation matrix between all the variables, the first thing we see is how the two variable families are also separated by the correlations between them:\nEnvironmental and context variables: Moisture, temperature and humidity Gas-sensor electrical signals: R2611E, R2600, R2602, R2611C, RMQ4 Target variable: Methane (ppb) import matplotlib.pyplot as plt import seaborn as sns plt.figure(figsize=(10, 8)) corr = df.corr() sns.heatmap( corr, annot=True, fmt=&#34;.2f&#34;, cmap=&#34;coolwarm&#34;, vmin=-1, vmax=1, square=True, linewidths=0.5, cbar_kws={&#34;shrink&#34;: 0.8} ) plt.title(&#34;Correlation Heatmap&#34;, fontsize=16) plt.tight_layout() plt.show() The matrix shows strong structure: the resistance channels move together, and they move opposite to 
temperature\/moisture. While temperature and moisture have strong positive correlation between them and strong negative correlation with gas-sensor electrical signals, humidity does not have such strong relationship with the variables.\nAlthough it seems a difficulty added to the data, it can be helpful since we are adding a new layer of complexity to understand how methane can behave. If humidity does not exist, might be better to use a PCA rather than VAE!\nFig. 2. Correlation matrix of the input variables. My interpretation of most of the resistance channels is that they are responding to a common driver (often a mix of gas exposure + environment). R2611C and RMQ4 are near-duplicates; R2602 may be the only one adding a meaningfully different dimension.\nRegarding our target variable, we can see that methane has moderate negative correlation with several resistance channels and almost independent from humidity and really weakly to moisture and temperature.\nVariable distributions I have used KDE plots to show the data distribution on each of the variables.\nimport matplotlib.ticker as ticker fig, axs = plt.subplots(3, 3, figsize=(16, 16)) fig.tight_layout(h_pad=3.5) for col, ax in zip(df.columns, axs.flatten()): sns.kdeplot(df[col], ax=ax) ax.yaxis.set_major_formatter(ticker.ScalarFormatter(useMathText=True)) ax.ticklabel_format(style=&#39;scientific&#39;, axis=&#39;y&#39;, scilimits=(0, 0)) ax.set_title(col) The KDE is strongly right-skewed: a sharp mode around the lower end (roughly ~2k ppb in your plot) followed by a long tail reaching much higher concentrations.\nThis is typical of a process with a baseline level plus episodic events (short bursts, leaks, plumes, or operational episodes).\nModeling implication: methane is unlikely to be well-behaved under Gaussian assumptions.\nFig. 3. Kernel Density Estimation plots of the variables. 
All resistance KDEs show multiple modes (some strongly), which is exactly what you expect when sensor resistance is responding to a combination of environmental and gas exposure regimes.\nScatter plots Here I only look at the bivariate structure of the environmental variables. In this section, I show scatter plots of the three context variables, their relationships and my interpretation of them.\ndef compare_variables(x, y, df, label=&#34;&#34;, title=&#34;&#34;): plt.scatter(df[x], df[y], alpha=0.3, label=label) plt.xlabel(x) plt.ylabel(y) plt.title(title) compare_variables(&#34;Temperature&#34;, &#34;Moisture&#34;, df, title=&#34;Scatter plot of temperature and moisture&#34;) plt.show() compare_variables(&#34;Temperature&#34;, &#34;Humidity&#34;, df, title=&#34;Scatter plot of temperature and humidity&#34;) plt.show() Regarding the relationship between temperature and moisture, the structure is much closer to a strong monotonic trend (consistent with the ~0.95 correlation), but we can still see clustered &ldquo;blocks&rdquo; at certain temperature ranges. This might be due to measurement coupling, or to the sensors reporting only at specific hours. We will see how AE and VAE adapt to this relationship.\nFig. 4. Bivariate structure between temperature and moisture.\nRegarding the relationship between temperature and humidity, we can see that the point cloud is not a single curve. Instead, it forms loops\/arcs and vertical bands at certain temperatures. This is classic hysteresis behavior you see when plotting two variables that evolve over time (e.g., diurnal cycles): for the same temperature, humidity can take different values depending on whether the system is heating up or cooling down, or depending on the prevailing weather mass. That explains why the correlation is weak: the relationship is non-functional (one-to-many), not simply noisy.\nFig. 5. 
Bivariate structure between temperature and humidity.\nData Preparation In the end, I use StandardScaler for all the features since RobustScaler, although it could be a good fit given the long-tailed distribution of methane, did not give better results for AE and VAE.\nfrom sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split scaler = StandardScaler() X_scaled = scaler.fit_transform(df.values) X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=42) Our training code will be based on PyTorch since it provides a flexible and expressive API to develop the models and the training loop. Therefore, we will need to use Dataset and DataLoader to turn the data into tensors and load it in batches for the training loop.\nIn this case, we are using a custom Dataset since TensorDataset could lead to unnecessary memory overhead and reduced flexibility when handling large arrays or more complex data-loading logic. The dataset stores a reference to the original NumPy array rather than eagerly converting it to a tensor, which avoids duplicating the data in memory.\nimport numpy as np import torch from torch.utils.data import Dataset class MemoryEfficientDataset(Dataset): &#34;&#34;&#34;Custom Dataset that loads data on-the-fly to avoid memory issues&#34;&#34;&#34; def __init__(self, data_source): if isinstance(data_source, np.ndarray): self.data = data_source self.length = len(data_source) else: raise ValueError(&#34;data_source must be a numpy array&#34;) def __len__(self): return self.length def __getitem__(self, idx): sample = self.data[idx] sample = torch.tensor(sample, dtype=torch.float32) return sample Modeling In the modeling stage, the objective is to analyze the differences between a standard autoencoder (AE) and a variational autoencoder (VAE), both from an architectural and a behavioral perspective. 
To ensure a fair and controlled comparison, we will first define and implement the autoencoder architecture and then extend it to the variational autoencoder.\nBoth models share two key hyperparameters: the input dimensionality and the dimensionality of the latent space. Defining these parameters upfront allows us to keep the architectures comparable and to isolate the effect of the variational formulation.\ninput_dim = X_train.shape[1] latent_dim = 8 Here, input_dim corresponds to the number of features in each input sample, while latent_dim controls the level of compression and the expressive capacity of the learned latent representation.\nAutoEncoder We start by defining a standard autoencoder architecture tailored for tabular data. The model follows the classical encoder-latent-decoder structure, where the encoder progressively compresses the input features into a low-dimensional latent representation, and the decoder attempts to reconstruct the original input from this compressed space. 
The objective is to learn a latent embedding that preserves as much information as possible while enforcing dimensionality reduction.\nimport torch import torch.nn as nn import torch.nn.functional as F from torch.utils.data import DataLoader from tqdm import tqdm import numpy as np class TabularAE(nn.Module): def __init__(self, input_dim, latent_dim=8, hidden_dims=(64, 32)): super().__init__() # Encoder encoder_layers = [] prev = input_dim for h in hidden_dims: encoder_layers += [nn.Linear(prev, h), nn.ReLU()] prev = h self.encoder = nn.Sequential(*encoder_layers) self.latent = nn.Linear(prev, latent_dim) # Decoder decoder_layers = [] prev = latent_dim for h in reversed(hidden_dims): decoder_layers += [nn.Linear(prev, h), nn.ReLU()] prev = h decoder_layers += [nn.Linear(prev, input_dim)] self.decoder = nn.Sequential(*decoder_layers) def encode(self, x): h = self.encoder(x) z = self.latent(h) return z def decode(self, z): return self.decoder(z) def forward(self, x): z = self.encode(x) x_rec = self.decode(z) return x_rec, z This autoencoder is deterministic: each input sample is mapped to a single point in the latent space, and reconstruction quality is optimized solely through a reconstruction loss. The forward method returns both the reconstructed input and the latent vector, which is useful for downstream analysis such as visualization, clustering, or anomaly detection.\nFig. 6. AutoEncoder architecture .\nOnce the architecture is defined, we implement a training routine that is reusable and consistent across experiments. The training loop handles data loading, optimization, validation, and early stopping.\nEarly stopping is a regularization technique used during training to prevent overfitting. It monitors performance on a validation set and stops training when the validation loss no longer improves for a predefined number of epochs. 
This ensures the model retains the best generalization performance while avoiding unnecessary training iterations.\ndef train_ae( X_train, X_test, input_dim, latent_dim=10, hidden_dims=(64, 32), batch_size=64, lr=1e-3, epochs=50, device=&#34;cpu&#34;, early_stopping=5, loss_fn=&#34;mse&#34;, ): model = TabularAE(input_dim, latent_dim, hidden_dims).to(device) optimizer = torch.optim.Adam(model.parameters(), lr=lr) print(&#34;Loading data to tensor...&#34;) train_dataset = MemoryEfficientDataset(X_train) print(&#34;Loading validation data to dataset...&#34;) val_dataset = MemoryEfficientDataset(X_test) print(&#34;Creating data loaders...&#34;) train_loader = DataLoader( train_dataset, batch_size=batch_size, shuffle=True, ) val_loader = DataLoader( val_dataset, batch_size=batch_size, shuffle=False, ) if loss_fn == &#34;mse&#34;: recon_criterion = lambda x_hat, x: F.mse_loss(x_hat, x, reduction=&#34;mean&#34;) elif loss_fn == &#34;mae&#34;: recon_criterion = lambda x_hat, x: F.l1_loss(x_hat, x, reduction=&#34;mean&#34;) else: raise ValueError(&#34;loss_fn must be &#39;mse&#39; or &#39;mae&#39;&#34;) train_losses = [] val_losses = [] best_val_loss = float(&#34;inf&#34;) best_epoch = 0 early_stopping_counter = 0 for epoch in range(epochs): # Training model.train() train_epoch_loss = 0.0 train_samples = 0 for xb in tqdm(train_loader, desc=f&#34;Training Epoch {epoch + 1}\/{epochs}&#34;): xb = xb.to(device) x_rec, _ = model(xb) loss = recon_criterion(x_rec, xb) optimizer.zero_grad() loss.backward() optimizer.step() train_epoch_loss += loss.item() * xb.size(0) train_samples += xb.size(0) # Validation model.eval() val_epoch_loss = 0.0 val_samples = 0 with torch.no_grad(): for xb in tqdm(val_loader, desc=f&#34;Validation Epoch {epoch + 1}\/{epochs}&#34;): xb = xb.to(device) x_rec, _ = model(xb) loss = recon_criterion(x_rec, xb) val_epoch_loss += loss.item() * xb.size(0) val_samples += xb.size(0) avg_train_loss = train_epoch_loss \/ train_samples avg_val_loss = val_epoch_loss 
\/ val_samples train_losses.append(avg_train_loss) val_losses.append(avg_val_loss) if avg_val_loss &lt; best_val_loss: best_val_loss = avg_val_loss best_epoch = epoch + 1 early_stopping_counter = 0 else: early_stopping_counter += 1 if early_stopping_counter &gt;= early_stopping: print(f&#34;Early stopping at epoch {epoch + 1}&#34;) break print(f&#34;Epoch {epoch + 1}\/{epochs}:&#34;) print(f&#34; Train Loss: {avg_train_loss:.4f}&#34;) print(f&#34; Val Loss: {avg_val_loss:.4f}&#34;) print(f&#34; Best Val: {best_val_loss:.4f} (Epoch {best_epoch})&#34;) print(&#34;-&#34; * 50) print(&#34;\\nTraining completed!&#34;) print(f&#34;Best validation loss: {best_val_loss:.4f} at epoch {best_epoch}&#34;) return { &#34;model&#34;: model, &#34;train_losses&#34;: train_losses, &#34;val_losses&#34;: val_losses, &#34;best_val_loss&#34;: best_val_loss, &#34;best_epoch&#34;: best_epoch, } results = train_ae(X_train, X_test, input_dim, latent_dim=latent_dim) Fig. 7. AutoEncoder training and validation losses through epochs.\nEvaluation To evaluate how well the AutoEncoder can create new data, I wrote a function that takes the embedding vectors of a sample set and slightly perturbs them to generate the new dataset.\ndef get_latent_vectors_ae( model, X_minority, device=&#34;cpu&#34;, ): model.eval() X_tensor = torch.tensor(X_minority, dtype=torch.float32).to(device) with torch.no_grad(): z = model.encode(X_tensor) z_np = z.cpu().numpy() return z_np latent_space = get_latent_vectors_ae(model=results[&#39;model&#39;], X_minority=X_test) With the latent vectors in hand, we can check the distribution of each latent dimension with the same KDE plot we used for the input variables.\nimport matplotlib.ticker as ticker fig, axs = plt.subplots(3, 3, figsize=(16, 16)) fig.tight_layout(h_pad=3.5) for col, ax in zip(range(latent_space.shape[1]), axs.flatten()): sns.kdeplot(latent_space[:, col], ax=ax) ax.yaxis.set_major_formatter(ticker.ScalarFormatter(useMathText=True)) ax.ticklabel_format(style=&#39;scientific&#39;, axis=&#39;y&#39;, 
scilimits=(0, 0)) ax.set_title(f&#34;Embedding dimension: {col}&#34;) fig.suptitle(&#34;KDE plots of the embedding vector&#34;, y=1.03) Surprisingly, it seems that all the latent dimensions follow a Gaussian distribution, although they are not normalized. This will help us when creating new data samples.\nFig. 8. Kernel Density Estimation plots of the latent space dimensions of the AutoEncoder.\nJust out of curiosity, we can also check whether we need fewer dimensions in our latent space by checking the correlation matrix or doing a PCA.\nKeep in mind that AEs and VAEs are not linear models, so a PCA could lead us to wrong conclusions!\nFig. 9. Correlation matrix of the latent space dimensions of the AutoEncoder.\nWith 9 input variables, most of them correlated with each other, it is clear that we could drop at least two or three of the 8 latent dimensions. Even so, it is a great opportunity to see how the model behaves with this latent dimension.\nTaking advantage of the fact that they follow a Gaussian distribution, we can generate new samples by taking the standard deviation of each latent dimension and slightly perturbing the latent vectors.\ndef generate_synthetic_from_ae( model, X_minority, n_samples, std=None, device=&#34;cpu&#34;, ): model.eval() X_tensor = torch.tensor(X_minority, dtype=torch.float32).to(device) with torch.no_grad(): z = model.encode(X_tensor) z_np = z.cpu().numpy() # Resample minority latents (with replacement) for oversampling idx = np.random.randint(0, z_np.shape[0], size=n_samples) z_base = z_np[idx] if std is None: std = z_np.std(axis=0) \/ 2 z_base = z_base + np.random.normal(0, std, size=z_base.shape) z_tensor = torch.tensor(z_base, dtype=torch.float32).to(device) with torch.no_grad(): X_synth = model.decode(z_tensor).cpu().numpy() return X_synth The function generate_synthetic_from_ae generates new synthetic samples by operating directly in the latent space learned by an autoencoder. 
It perturbs latent representations by adding controlled noise to them before decoding.\nFirst, the samples are encoded into latent vectors using the encoder of the trained autoencoder. To perform oversampling, a set of latent vectors is randomly resampled (with replacement) from the minority latent space. This ensures that the synthetic samples remain close to the true minority manifold.\nThe std argument controls the amount of stochastic perturbation applied to these latent vectors. If not explicitly provided, it is automatically estimated as half of the empirical standard deviation of each latent dimension. Gaussian noise is then added independently to each dimension, introducing variability while preserving the overall structure of the latent space.\nWith the new synthetic data, we can inspect its distribution with a KDE plot.\nFig. 10. Kernel Density Estimation plots of the synthetic data of AE and real data.\nOverall, most variables show good first-order alignment (range and main modes), with some smoothing effects in the synthetic data, which is expected when decoding from a noisy latent space. Methane is the most critical variable to inspect, as its distribution is highly skewed. The synthetic distribution has a lower and broader peak, a sign of the extra spread introduced by the latent perturbation. The shapes of all the other variables are reproduced more closely.\nChecking the correlation matrix, we do not see any critical changes.\nFig. 11. Correlation matrix comparison of the real data and the synthetic data from AutoEncoder.\nMoreover, we can check the bivariate relationships of the context variables when using the automatic std.\nFig. 12. Bivariate structures between humidity, temperature and moisture.\nHere we can see how the main correlations are preserved, although they are quite noisy. 
This noise is especially visible between temperature and humidity.\nOne of the main drawbacks of the AutoEncoder is that we do not know the distribution of the latent space until we visualize it. Other challenges of the current model are that it is not able to generate skewed distributions and that the generated data, although well correlated, carries a bit of noise. In the next section, we will see how VAEs can tackle these challenges.\nVariational AutoEncoder While the standard AutoEncoder provides a compact and useful latent representation, it remains a deterministic model: each input sample is mapped to a single fixed point in latent space. This design is effective for reconstruction and representation learning, but it imposes limitations when the goal is robust data generation, especially under distributional uncertainty or class imbalance.\nTo address these limitations, we extend the AutoEncoder into a Variational AutoEncoder (VAE). The key conceptual change is that, instead of learning a single latent vector per input, the VAE learns a probabilistic latent representation. Each input is mapped to a distribution in latent space rather than a point estimate.\nFig. 13. Variational AutoEncoder model architecture.\nThe architecture closely follows that of the AutoEncoder, with a critical modification in the encoder head. The encoder does not output a latent vector directly. 
Instead, it predicts the parameters of a Gaussian distribution.\nSee also The Generative Trilemma: A quick overview \u2192 The transition from an AutoEncoder to a Variational AutoEncoder is motivated by data generation quality and robustness:\nAutoEncoder Variational AutoEncoder Deterministic latent space Probabilistic latent space Good reconstruction Explicit distributional assumptions Latent perturbations are heuristic Principled sampling mechanism import torch import torch.nn as nn import torch.nn.functional as F from torch.utils.data import DataLoader from tqdm import tqdm import numpy as np class TabularVAE(nn.Module): def __init__(self, input_dim, latent_dim=8, hidden_dims=(64, 32)): super().__init__() # Encoder encoder_layers = [] prev = input_dim for h in hidden_dims: encoder_layers += [nn.Linear(prev, h), nn.ReLU()] prev = h self.encoder = nn.Sequential(*encoder_layers) self.fc_mu = nn.Linear(prev, latent_dim) self.fc_logvar = nn.Linear(prev, latent_dim) # Decoder decoder_layers = [] prev = latent_dim for h in reversed(hidden_dims): decoder_layers += [nn.Linear(prev, h), nn.ReLU()] prev = h decoder_layers += [nn.Linear(prev, input_dim)] self.decoder = nn.Sequential(*decoder_layers) def encode(self, x): h = self.encoder(x) mu = self.fc_mu(h) logvar = self.fc_logvar(h) return mu, logvar def reparameterize(self, mu, logvar): std = torch.exp(0.5 * logvar) eps = torch.randn_like(std) return mu + eps * std def decode(self, z): return self.decoder(z) def forward(self, x): mu, logvar = self.encode(x) z = self.reparameterize(mu, logvar) x_rec = self.decode(z) return x_rec, mu, logvar, z Here we can see the reparameterize function that samples the point given the distribution given from the encoder. 
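To make the role of reparameterize concrete, here is a minimal, framework-free sketch of the same formula for a single latent dimension. The values of mu and logvar are hypothetical encoder outputs chosen for illustration, and the finite-difference check only stands in for what autograd does in the PyTorch model above:

```python
import math
import random


def reparameterize(mu, logvar, eps):
    # z = mu + eps * std, with std = exp(0.5 * logvar); eps carries all the randomness
    std = math.exp(0.5 * logvar)
    return mu + eps * std


random.seed(0)
mu, logvar = 1.5, math.log(0.25)  # hypothetical encoder outputs: mean 1.5, variance 0.25

# 1) Sampling: z follows N(mu, std^2) even though eps is drawn from N(0, 1).
samples = [reparameterize(mu, logvar, random.gauss(0.0, 1.0)) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

# 2) Differentiability: with eps held fixed, z is a smooth function of mu and logvar.
eps, h = 0.7, 1e-6
dz_dmu = (reparameterize(mu + h, logvar, eps) - reparameterize(mu - h, logvar, eps)) / (2 * h)
dz_dlogvar = (reparameterize(mu, logvar + h, eps) - reparameterize(mu, logvar - h, eps)) / (2 * h)
analytic_dlogvar = 0.5 * eps * math.exp(0.5 * logvar)  # derivative of eps * exp(0.5 * logvar)
```

Because the randomness lives entirely in eps, z remains an ordinary differentiable function of mu and logvar, so gradients can flow back into the encoder.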
This formulation introduces controlled stochasticity while maintaining end-to-end differentiability.\nThe training loop for the VAE closely mirrors that of the AutoEncoder, with the addition of the KL divergence loss.\ndef train_vae( X_train, X_test, input_dim, latent_dim=10, hidden_dims=(64, 32), batch_size=64, lr=1e-3, epochs=50, device=&#34;cpu&#34;, early_stopping=5, beta=1.0, ): model = TabularVAE(input_dim, latent_dim, hidden_dims).to(device) optimizer = torch.optim.Adam(model.parameters(), lr=lr) print(&#34;Loading data to tensor...&#34;) train_dataset = MemoryEfficientDataset(X_train) print(&#34;Loading validation data to dataset...&#34;) val_dataset = MemoryEfficientDataset(X_test) print(&#34;Creating data loaders...&#34;) train_loader = DataLoader( train_dataset, batch_size=batch_size, shuffle=True, ) val_loader = DataLoader( val_dataset, batch_size=batch_size, shuffle=False, ) train_losses = [] val_losses = [] best_val_loss = float(&#34;inf&#34;) best_epoch = 0 early_stopping_counter = 0 for epoch in range(epochs): # Training model.train() train_epoch_loss = 0.0 train_samples = 0 for xb in tqdm(train_loader, desc=f&#34;Training Epoch {epoch + 1}\/{epochs}&#34;): xb = xb.to(device) x_rec, mu, logvar, _ = model(xb) # Reconstruction loss (MSE by default) recon_loss = F.mse_loss(x_rec, xb, reduction=&#34;mean&#34;) # KL divergence (averaged over batch and latent dimensions) kl_loss = -0.5 * torch.mean( 1 + logvar - mu.pow(2) - logvar.exp() ) loss = recon_loss + beta * kl_loss optimizer.zero_grad() loss.backward() optimizer.step() train_epoch_loss += loss.item() * xb.size(0) train_samples += xb.size(0) # Validation model.eval() val_epoch_loss = 0.0 val_samples = 0 with torch.no_grad(): for xb in tqdm(val_loader, desc=f&#34;Validation Epoch {epoch + 1}\/{epochs}&#34;): xb = xb.to(device) x_rec, mu, logvar, _ = model(xb) recon_loss = F.mse_loss(x_rec, xb, reduction=&#34;mean&#34;) kl_loss = -0.5 * torch.mean( 1 + logvar - mu.pow(2) 
- logvar.exp() ) loss = recon_loss + beta * kl_loss val_epoch_loss += loss.item() * xb.size(0) val_samples += xb.size(0) avg_train_loss = train_epoch_loss \/ train_samples avg_val_loss = val_epoch_loss \/ val_samples train_losses.append(avg_train_loss) val_losses.append(avg_val_loss) if avg_val_loss &lt; best_val_loss: best_val_loss = avg_val_loss best_epoch = epoch + 1 early_stopping_counter = 0 else: early_stopping_counter += 1 if early_stopping_counter &gt;= early_stopping: break print(f&#34;Epoch {epoch + 1}\/{epochs}:&#34;) print(f&#34; Train Loss: {avg_train_loss:.4f}&#34;) print(f&#34; Val Loss: {avg_val_loss:.4f}&#34;) print(f&#34; Best Val: {best_val_loss:.4f} (Epoch {best_epoch})&#34;) print(&#34;-&#34; * 50) print(&#34;\\nTraining completed!&#34;) print(f&#34;Best validation loss: {best_val_loss:.4f} at epoch {best_epoch}&#34;) return { &#34;model&#34;: model, &#34;train_losses&#34;: train_losses, &#34;val_losses&#34;: val_losses, &#34;best_val_loss&#34;: best_val_loss, &#34;best_epoch&#34;: best_epoch, } Fig. 14. VAE training and validation losses through epochs.\nAdding the KL divergence loss adds some noise in the training that was not seen in the AutoEncoder training.\nEvaluation To get the latent space, we need to use the encoder and call the reparameterize function:\ndef get_latent_vectors_vae( model, X_minority, device=&#34;cpu&#34;, ): model.eval() X_tensor = torch.tensor(X_minority, dtype=torch.float32).to(device) with torch.no_grad(): mu, logvar = model.encode(X_tensor) z_post = model.reparameterize(mu, logvar) z_post = z_post.cpu().numpy() return z_post With the latent space from the VAE, we can see the difference of the latent distributions from the AutoEncoder.\nFig. 15. Kernel Density Estimation plots of the latent dimensions of VAE.\nWe can see that, although there are some skewed distributions, all of them share the same standard deviation due to the KL divergence loss. 
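As a rough illustration of why the KL term pulls every latent dimension towards the same scale, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian and the standard normal prior in plain Python. It sums over latent dimensions, whereas the training loop above averages the same expression over batch and latent dimensions, which only rescales it; the function name and the example values are ours:

```python
import math


def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))


kl_matched = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])            # already N(0, 1)
kl_shifted = kl_to_standard_normal([2.0, 0.0], [0.0, 0.0])            # mean far from 0
kl_narrow = kl_to_standard_normal([0.0, 0.0], [math.log(0.01), 0.0])  # variance 0.01
```

The penalty is zero only when a dimension already matches N(0, 1), so both shifted means and shrunken variances are pushed back towards the prior.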
The last dimension shows exactly this improvement over the AE.\nThe correlation matrix suggests that latent dimensions are largely uncorrelated at the linear level. While this does not rule out higher-order or non-linear dependencies, it indicates the absence of strong linear relationships between latent variables, with remaining correlations likely dominated by noise.\nFig. 16. Correlation matrix of the latent dimensions of the VAE.\nFrom a modeling and representation-learning perspective, this VAE latent-space correlation matrix is exactly what you would expect from a well-behaved VAE.\nA common misinterpretation is to equate \u201cuncorrelated\u201d with \u201cuninformative.\u201d That is not the case here.\nWhat this result actually tells us is:\nEach latent dimension captures distinct aspects of variation in the data. Information is not redundantly encoded across multiple latent axes. Sampling each dimension independently is meaningful and safe. To generate the synthetic data, we will use a function similar to the one for the autoencoder, adding a call to reparameterize.\ndef generate_synthetic_from_vae( model, X_minority, n_samples, noise_std=0.1, device=&#34;cpu&#34;, ): model.eval() X_tensor = torch.tensor(X_minority, dtype=torch.float32).to(device) with torch.no_grad(): mu, logvar = model.encode(X_tensor) z_post = model.reparameterize(mu, logvar) z_post = z_post.cpu().numpy() idx = np.random.randint(0, z_post.shape[0], size=n_samples) z_base = z_post[idx] if noise_std &gt; 0: z_base = z_base + np.random.normal(0, noise_std, size=z_base.shape) z_tensor = torch.tensor(z_base, dtype=torch.float32).to(device) with torch.no_grad(): X_synth = model.decode(z_tensor).cpu().numpy() return X_synth We can see how VAE outputs can generate more realistic skewed distributions by looking at the KDE plots in the following figure. 
While the AE cannot reproduce the skewed distribution of Methane, the VAE not only learns a good representation but also generates more faithful skewed distributions.\nFig. 17. Kernel Density Estimation comparison between real, AE and VAE data.\nNote that varying the noise can affect the KDE plots, although the standard deviation is more or less the same in the AE and the VAE. A deeper evaluation across all datasets could be carried out.\nIf we compare the correlation matrices of the real data and the synthetic data, we see no significant changes in the correlations between variables. There is a slight decrease in correlations for the VAE, but nothing significant.\nFig. 18. Correlation matrix comparison.\nRegarding the bivariate structure, we clearly see that the VAE produces less widely spread synthetic samples. Nevertheless, its main weakness is over-smoothing: the synthetic cloud is more &ldquo;linear \/ averaged&rdquo; and does not reproduce some sharper curved structures and distinct bands visible in the real points.\nFig. 19. Bivariate structure of temperature, humidity and moisture from real and VAE synthetic data.\nRegarding the standard deviation used when generating new samples, we clearly see an improvement with the VAE: its samples still show the bivariate relationships while staying close to the real data. When the standard deviation of the AE is small, the new set of synthetic data barely differs from the original samples. When generating new data, we want samples that follow the same pattern but with slightly different values. Therefore, generating new samples from the AE does not overcome this challenge.\nWith greater standard deviations, which might create off-manifold samples or implausible values, the over-smoothing of the VAE is clear, but at least its samples are not widely spread, which would add noise to the dataset.\nFig. 20. 
Comparison of bivariate structures of temperature, humidity and moisture depending on the standard deviation (Auto, 0.1, 0.5 and 1 in columns) and model (AE and VAE in rows).\nConclusion Overall, we can see that Variational Auto-Encoders provide a principled generative framework better suited for synthetic tabular data, showing more realistic generation of skewed variables, particularly methane, without requiring heuristic scaling of latent noise. While VAEs clearly improved robustness and distributional fidelity, they also exhibited over-smoothing effects\nIn future work, this setup can be extended by introducing temporal structure, conditioning the generative process on methane regimes or evaluating the impact of synthetic data on downstream regression and anomaly-detection tasks.\n","permalink":"https:\/\/oriolac.github.io\/posts\/20251210-vae-tabular\/","summary":"<p>The post of today is going to be a bit different. We have already talked about <strong>Variational Autoencoders (VAE)<\/strong>\n<a href=\"http:\/\/oriolac.github.io\/posts\/20250710-starting-diffusion\/\" target=\"_blank\" rel=\"noopener\">in the past<\/a>, but today we are going to see how to\nimplement it from scratch, train it on a dataset and see how it behaves with <strong>tabular data<\/strong>. Yes, VAEs can be used for\ntabular data as well. To do so, we will use the <strong>CRISP-DM framework<\/strong> to guide us through the process.<\/p>","title":"Variational AutoEncoders (VAE) for Tabular Data"},{"content":"Embedding models are foundational in modern NLP, turning raw text into numerical vectors that preserve semantic significance. These representations power everything from semantic search to Retrieval-Augmented Generation or Prompt Engineering for LLM Agents. 
With growing demand for domain-specific applications, understanding which model is the best fit for your system is more important than ever.\nIntroduction In modern NLP, a text embedding is a vector that represents a piece of text in a mathematical space. The magic of embeddings is that they encode semantic meaning: texts with similar meaning end up with vectors that are close together. For example, an embedding model might place &ldquo;How to change a tire&rdquo; near &ldquo;Steps to fix a flat tire&rdquo; in its vector space, even though the wording is different. This property makes embedding models incredibly useful for tasks like search, clustering or recommendation, where we care about semantic similarity rather than exact keyword matches. By converting text into vectors, embedding models allow computers to measure meaning and relevance via distances in vector space.\nHowever, not all embeddings are created equal, and using a generic embedding model for every task can be limiting. Many pre-trained embedding models are trained on broad internet text or general knowledge. If your application works in a specific domain (finance, medical, legal, etc.), those models might not capture the nuances or terminology that matter for your context. This is where fine-tuning comes in. By fine-tuning an embedding model on data from your domain, you can make it a specialist rather than a generalist, aligning the vector space with what&rsquo;s actually important for your documents and queries.\nThis post explores the landscape of embedding models, from their historical evolution (Word2Vec, GloVe) to modern transformer-based architectures (BERT, SBERT, E5, ColBERT). You\u2019ll learn how they differ, how to choose between them and how to improve your RAG-based system.\nThe importance of vectors that understand At a high level, an embedding model is a neural network that encodes text into a high-dimensional vector. 
The goal is to represent text in a numerical form that captures linguistic and semantic characteristics. Early examples include Word2Vec and GloVe, which learn static word embeddings (each word type in the vocabulary gets a fixed vector representation). Modern examples include transformers like BERT, RoBERTa or Sentence-BERT, which can produce embeddings for entire sentences or paragraphs and whose outputs depend on the attention mechanism applied between tokens.\nSimilarity between texts can be measured by vector distance metrics such as cosine similarity or Euclidean distance. In an embedding space, similar texts are meant to lie close together, while dissimilar texts are far apart. This enables semantic search: we can take a user query, embed it into a vector, and quickly find which documents have embeddings nearest to that query&rsquo;s embedding. This approach goes beyond keyword matching; it can retrieve information that uses different wording but conveys the same idea.\nFig. 1. Timeline of embedding models described in this post. Embeddings also shine in other tasks like clustering (grouping similar documents), classification (feeding embeddings into machine learning models) or anomaly detection. They condense the essential information of text into a numeric form that algorithms can easily work with. Additionally, embedding models are not just for text but also for images (for example, vehicle re-identification), so the range of applications is far wider than is often assumed. Overall, embedding models are key to understanding text or images in ML systems, offering a vector-based representation that preserves meaning.\nTaxonomy of Text Embeddings Not all embedding models work the same way. We can categorize them along a few dimensions: how they treat context (static vs. contextual embeddings), what textual unit they embed (word-level vs. 
sentence-level), how they are trained (unsupervised or supervised), the nature of the vector representations (dense vs. sparse embeddings), the interaction during scoring (single vector comparison vs. token-level max similarity, also called late interaction) and their intended use (symmetric vs. asymmetric). Understanding this taxonomy will clarify the landscape of embedding techniques and when to use each.\nStatic vs Contextual word embeddings Fig. 2. Side-by-side comparison of static (left) and contextual (right) embeddings. Static embeddings were the early wave of embedding models exemplified by Word2Vec, GloVe, and FastText. These models learn one vector per word in a fixed vocabulary by training on large corpora to capture general semantic relationships. The key characteristic is that each word has a single embedding no matter where it appears. The word \u201cpoint\u201d will have the same vector representation whether we\u2019re talking about a \u201cpoint of reference\u201d or a \u201cpoint on a graph\u201d. Static embeddings thus ignore context: they can&rsquo;t distinguish between different meanings of the same word in different sentences.\nContextual embeddings, on the other hand, produce vectors that depend on the surrounding context of each word. Transformer-based models output a contextualized embedding for each token (and often for entire sentences) by considering the whole sentence or paragraph. In a contextual model, the word \u201cpoint\u201d will have a different embedding in \u201cmake a point\u201d vs. \u201cpoint guard\u201d vs. \u201cpoint of intersection,\u201d because the model understands these are different usages based on context. This is achieved through mechanisms like self-attention that let every token influence the representation of others. 
Contextual embeddings were a major breakthrough because they capture nuances and resolve ambiguity that static embeddings cannot.\nIn practice, contextual models (like BERT) yield much stronger performance on language understanding tasks than static embeddings, since meaning often depends on context. The trade-off is that they are heavier to compute; generating contextual embeddings requires running a full transformer over the text, whereas static embeddings can be looked up from a table. Nonetheless, for most NLP applications today contextual embeddings are the default choice due to their accuracy and expressiveness. Static embeddings are still useful for quick, lightweight needs or as features in simpler models, but they lack the fidelity that modern tasks demand.\nWord-Level vs Sentence-Level embeddings Another way to classify embeddings is by the unit of text they represent. Traditional word embeddings (static or contextual) give you a vector for each word (or token) in a sequence. But we often need a single embedding for a whole sentence, paragraph, or document: for example, to compare a user&rsquo;s query with a candidate answer passage, it&rsquo;s handy to represent each passage as one vector.\nHow do we get a vector for an entire sentence or document? One approach is to take a contextual model like BERT and pool its token embeddings into a single vector (e.g. use the [CLS] token output or average the token vectors). However, out-of-the-box BERT wasn\u2019t trained to produce a single \u201csentence meaning\u201d vector, and in fact a naive use of BERT\u2019s [CLS] embedding can give subpar results for similarity tasks. Recognizing this, researchers developed models and training techniques specifically for sentence embeddings. A notable example is Sentence-BERT (SBERT), which fine-tunes BERT (or similar architectures) on sentence pair tasks so that it directly produces a meaningful sentence-level embedding. 
SBERT uses a siamese network setup and is trained so that similar sentences map to nearby vectors. The result is a model that can encode an entire sentence or paragraph into a single vector that is excellent for semantic similarity comparisons, clustering, etc.\nSo, we distinguish word-level vs. sentence-level embedding models. Word-level models (like the original BERT, although it now often serves as a sentence-level model) give flexibility: you can derive embeddings for any granularity (subword, word, sentence) but might need task-specific tuning to get good sentence representations. Sentence-level models are explicitly optimized to output one vector for a whole input text that captures its meaning. In RAG pipelines for QA, we typically need to compare questions and passages. Using a sentence-level embedding model (or more generally, a model that produces one vector per query or document) is thus a natural choice. Indeed, Sentence-BERT and similar models are popular for dense retrievers because they strike a balance: they leverage deep context understanding from transformers but output compact vectors for entire texts, making them directly usable in a retrieval setting.\nUnsupervised (self-supervised) Pre-Training vs Supervised Contrastive Learning Beyond architecture, embedding models are distinguished by how they are trained: the learning objective and data used greatly influence the resulting vector representations.\nMany embedding models are first trained without any human-labeled data, using self-generated signals from raw text (or other modalities) to learn representations. A common self-supervised objective is language modeling. For instance, BERT was trained with Masked Language Modeling (MLM) (randomly masking words in a sentence and training the model to predict them, thereby forcing it to absorb contextual semantics) and a secondary next-sentence prediction task. 
This kind of unsupervised training on large corpora (Wikipedia, Books, web text, etc.) yields a general-purpose language understanding capability. However, the resulting embeddings are not explicitly tuned for similarity or retrieval out-of-the-box. Indeed, a vanilla BERT\u2019s sentence embeddings needed further fine-tuning to be effective for semantic similarity. Unsupervised training uses massive unlabeled corpora. For BERT, this was billions of words of BooksCorpus and Wikipedia. For unsupervised SimCSE, it was a large generic text corpus (natural sentences). These models are often evaluated intrinsically by language modeling perplexity or downstream transfer performance.\nSupervised training for embeddings leverages labeled pairs or tuples of texts that indicate which items should be similar or dissimilar in the vector space. A prevalent approach is contrastive learning with positive and negative pairs: the model is trained to produce embeddings that are close (in cosine or dot-product space) for semantically related pairs, and far apart for unrelated pairs. This can be implemented via a contrastive loss or as a softmax cross-entropy over a batch where each input should be closest to its true match among the batch negatives. For example, Sentence-BERT (SBERT) was fine-tuned on Natural Language Inference (NLI) data where sentence pairs have labels Entailment\/Neutral\/Contradiction. It took sentence pairs as input to a Siamese network and trained with a classification loss that implicitly makes entailment pairs come closer in embedding space and contradiction pairs repel. Another example is Dense Passage Retrieval (DPR) for question answering: DPR used question\u2013answer pairs from Wikipedia; it trained dual BERT encoders for question and document, with a loss that pushes the question embedding near its correct passage\u2019s embedding and away from other passages.\nDense vs Sparse embeddings A third key distinction is dense versus sparse embeddings. 
This refers to the nature of the vector itself.\nDense embeddings are the kind we\u2019ve been mostly discussing (continuous vectors in a fixed-dimensional space). Every value in the vector is typically a non-zero real number (after training), and the information is distributed across the dimensions. Neural network models naturally produce dense vectors. For example, a BERT-based sentence embedding might be a 768-dimensional dense vector with values like [0.5, -0.8764, ..., 0.065]. Dense vectors excel at capturing semantic similarity through all those learned dimensions; two pieces of text about the same topic will have correspondingly similar coordinates along those axes.\nSparse embeddings are high-dimensional vectors where most entries are zero. Traditional bag-of-words representations, like TF-IDF vectors, are classic sparse representations: you might have a 100,000-dimensional vector where each dimension corresponds to a vocabulary term, and you set a value (e.g. a TF-IDF weight) in those dimensions where the term is present in the document. These vectors are sparse because any given document uses only a few words out of the whole vocabulary, so it might have non-zero values in, say, 100 out of 100,000 dimensions (the rest being zero). Sparse embeddings thus align closely with lexical representations: they emphasize exact term matching and frequency.\nIn the context of retrieval, dense and sparse methods have different strengths. Dense embeddings (a.k.a. &ldquo;semantic embeddings&rdquo;) capture meaning, even if wording differs. They can retrieve relevant texts that don\u2019t share any exact keywords with the query, by focusing on semantic closeness. Sparse methods (lexical retrieval like BM25) excel at precision for the exact query terms (if a query contains a rare term, a sparse method will almost certainly find documents containing that term). 
But sparse methods won\u2019t generalize to synonyms or rephrasings (if you search for \u201cvehicle tire\u201d, a pure BM25 search might miss a document that only says \u201cautomobile wheel\u201d). Dense methods might catch that thanks to semantic training, but dense methods can sometimes retrieve things that are topically related yet not exact, which could be irrelevant if not carefully filtered.\nModern best practice often uses a hybrid approach: combine dense and sparse retrieval to get the best of both worlds. For example, you can rank documents by a weighted sum of semantic similarity (dense vector dot product) and lexical similarity (BM25 score). This can improve accuracy because some queries are best served by semantic matching, and some by precise keyword matches. There are also neural models like SPLADE that try to bridge the gap by producing sparse vectors in a learned way. Essentially, SPLADE tries to predict which vocabulary terms should be given weight for a document, combining the idea of expansion terms with a neural model. Other approaches like BM42, use IDF and attention from dense embedding models to calculate their similarity score, trying to improve the query inference speed from SPLADE and large documents accuracy. However, many teams stick to joining both techniques separately and then using a re-ranker to join both results. In summary, dense embeddings vs. sparse embeddings is not a competition with a single winner; they are complementary tools. Knowing their differences helps in choosing an appropriate retrieval strategy for your application.\nSingle vector comparison vs Late Interaction models Late interaction computes relevance by comparing fine-grained token-level embeddings, rather than comparing two global sentence\/document embeddings. Sentence-level models (e.g., SBERT, E5) are designed for global semantic similarity, not fine-grained token alignment. 
Late interaction models like ColBERT, on the other hand, retain token-level embeddings and compute relevance scores by aggregating interactions between individual tokens of the query and document.\nFig 2. Side-by-side comparison of single vector comparison (left) and late interaction (right) models. Late interaction defeats the purpose of sentence-level embeddings, which aim to summarize a whole text span into a single dense vector for fast retrieval (e.g., via approximate nearest neighbor). ColBERT doesn\u2019t pool token embeddings into a single vector; it keeps all token-level BERT embeddings and, during retrieval, scores each query token by its maximum cosine similarity over all document tokens and sums those maxima (the MaxSim operator). Late interaction is slower and more computationally expensive due to per-token comparison. This cost makes little sense for short texts like single sentences.\nOverall, what is important to understand is that the key difference between BERT and late interaction embedding models lies in how they use BERT\u2019s outputs, not in the architecture itself. Some hybrid approaches may use sentence-level embeddings for fast coarse filtering, then perform token-level re-ranking (but that\u2019s post-retrieval, not part of the embedding model). This token-level re-ranking is especially useful in RAG, where precise grounding improves generation.\nIn this post from Qdrant, you can see how you can turn single-vector dense embedding models into late interaction models.\nSymmetric vs. Asymmetric embedding models Symmetric vs. asymmetric embedding models represent two different approaches for encoding queries and documents. In symmetric embedding architectures, both inputs (e.g. a query and a document) are processed using the same model or encoder pipeline. In other words, the query and document are handled identically. For example, Sentence-BERT (SBERT) encodes two sentences using the same Transformer network and produces embeddings that can be directly compared. 
This works well when the inputs are homogeneous. Therefore, symmetric models are natural for tasks like finding duplicate questions, matching similar product descriptions, or clustering semantically similar texts.\nBy contrast, asymmetric embedding architectures use different encoders or processing for the query vs. the document. This design is typical when the inputs differ in format, length, or role. The motivation for asymmetric models arises when queries and documents have inherently different distributions or functions. A user\u2019s query is usually short, may omit context, and represents an information need or intent, whereas a document passage is longer, detailed, and represents content that might satisfy that need. Here an asymmetric model might use a lightweight query encoder and a heavier document encoder optimized for content, projecting both into a shared vector space. The key idea is role specialization: each encoder can focus on encoding its input (query or doc) in the most effective way, rather than one model trying to serve both roles.\nConcretely, using a symmetric embedding for a QA task can lead to suboptimal results. Such models tend to emphasize semantic similarity between query and document text, rather than relevance of a document as an answer. For example, a symmetric model trained for general sentence similarity might, given the query \u201cWhat is Python?\u201d, rank another question like \u201cWhat is Python used for?\u201d highly, because those two questions are lexically and semantically similar. An experiment comparing models illustrates this: a paraphrase-trained (symmetric) MiniLM model was tested vs. 
an MS MARCO-trained (asymmetric) MiniLM model on the query \u201cWhat is Python?\u201d The symmetric model ranked similar questions highest, whereas the MS MARCO model (fine-tuned on question\u2013answer pairs) gave a much higher score to the actual answer passage.\nThe reason is that asymmetric training can decouple \u201cintent\u201d from \u201ccontent.\u201d An asymmetric embedder can learn a specialized query representation that encodes the information need (e.g. focusing on the key question terms), and a document representation that encodes content (the factual answer, even if paraphrased). The two embeddings might not be extremely similar in raw semantics (a question and its answer have different wording and meaning), but the model learns to make them compatible in the shared vector space.\nAnother motivation is distributional differences like length and vocabulary. Queries are often just a few words, while documents have many. A single encoder might have trouble handling both extremes. An asymmetric approach can use a specialized query encoder (perhaps one that is simpler and emphasizes keywords) and a separate doc encoder (that fully encodes the description). In general, allowing asymmetry lets each side play to its strengths: the query encoder can be optimized for short, context-poor inputs, and the document encoder for long, rich text.\nAsymmetric models intentionally introduce differences in the encoders or encoding process between the query side and the document side. The simplest form of this is to have two different neural encoders \u2013 one dedicated to queries and another to documents. Each may have its own parameters, or even a different architecture, though they are usually coordinated to output comparable vectors (often of the same dimension) for similarity scoring. 
The output embeddings reside in a shared vector space, but how they are produced can differ.\nA clear example is E5 (Embeddings from bidirectional Encoder representations, 5 stands for five E\u2019s in the acronym). On the surface, E5 uses a single Transformer encoder for both query and passage text, which sounds symmetric. However, it introduces an ingenious asymmetry: a special token prefix on each input to indicate its type (e.g. prepend &quot;query: &quot; to questions and &quot;passage: &quot; to passages). Thus, although the weights are shared, the model learns to handle the two roles differently based on the prefix cue. During pretraining on a huge weakly-supervised dataset, E5 sees query\u2013passage pairs (like search query and relevant result, or question and its answer) and uses a contrastive dual-encoder objective to bring the query embedding close to its paired passage embedding.\nThe use of role-specific prefixes is not strictly required, but \u201coften helps in IR settings\u201d, and users are advised to include them for best performance. In effect, E5 behaves as an asymmetric model: if you encode some text as a \u201cquery,\u201d the embedding lives in the same vector space but on a slightly different manifold than if you encode it as a \u201cpassage.\u201d This helps the model capture that, for example, the word \u201cPython\u201d in a query might mean the user is asking about Python (intent), whereas \u201cPython\u201d in a passage likely indicates the content (definition or usage of Python).\nA case in point is DPR (already discussed under Supervised Contrastive Learning), a popular model for open-domain QA. 
DPR consists of two BERT-base encoders (one for questions and one for passages) which are initialized with the same architecture but are independently learned (no weight sharing). Some people consider these models as symmetric even if they don&rsquo;t share weights, since DPR still projects questions and passages into a single common vector space, enabling similarity search. This means it can be seen as an un-tied symmetric model (the encoding function form is the same for queries and docs, though optimized separately). However, what DPR models try to solve is the same as asymmetrical embedding models: match the intent with the content. So, it seems more like a philosophical consideration rather than a pragmatical point of view.\nIn summary, asymmetric architectures can be implemented by: (1) distinct encoder networks for each side, (2) augmenting a shared encoder with input-specific signals or just (3) train or fine-tune the model with query - passage dataset without considering paraphrased pairs.\nKey Takeaways and Future Directions Embedding models have transformed how machines represent and reason about meaning. From early static vectors like Word2Vec to modern transformer-based systems like Sentence-BERT and E5, embeddings now serve as the semantic interface between human language and machine reasoning.\nUnderstanding their taxonomy (static vs. contextual, word-level vs. sentence-level, dense vs. sparse, symmetric vs. asymmetric) is crucial for choosing the right model for your task. While dense embeddings capture deep semantics, sparse ones preserve interpretability and precision. Similarly, symmetric models work best for homogeneous text comparisons, while asymmetric architectures excel in information retrieval and RAG.\nIn practice, the future of embeddings lies in hybrid systems: combining dense and sparse methods, using domain-adapted fine-tuning, and integrating late-interaction models for fine-grained relevance. 
These advances are not just technical upgrades: they reshape how AI systems understand, search, and generate knowledge.\nWhether you are building a semantic search engine, a retrieval-enhanced chatbot, or a domain-specific RAG pipeline, mastering embedding models means mastering the core bridge between text and meaning.\n@article{alas2025, title = &#34;From Words to Vectors: A Dive into Embedding Model Taxonomy.&#34;, author = &#34;Al\u00e0s Cerc\u00f3s, Oriol&#34;, journal = &#34;oriolac.github.io&#34;, year = &#34;2025&#34;, month = &#34;October&#34;, url = &#34;https:\/\/oriolac.github.io\/posts\/20251025-embedding-models\/&#34; } ","permalink":"https:\/\/oriolac.github.io\/posts\/20251025-embedding-models\/","summary":"<p>Embedding models are foundational in modern NLP, turning raw text into numerical vectors that preserve semantic\nsignificance. These representations power everything from semantic search to Retrieval-Augmented Generation or Prompt\nEngineering for LLM Agents. With growing demand for domain-specific applications, understanding which is the best fit\nfor your system is more important than ever.<\/p>\n<h1 id=\"introduction\">Introduction<\/h1>\n<p>In modern NLP, a <em>text embedding<\/em> is a vector that represents a piece of text in a mathematical space. The magic of\nembeddings is that they encode semantic meaning: texts with similar meaning end up with vectors that are close together.\nFor example, an embedding model might place &ldquo;How to change a tire&rdquo; near &ldquo;Steps to fix a flat tire&rdquo; in its vector space,\neven though the wording is different. This property makes embedding models incredibly useful for tasks like search,\nclustering or recommendation, where we care about <em>semantic similarity<\/em> rather than exact keyword matches. By converting
By converting\ntext into vectors, embedding models allow computers to measure meaning and relevance via distances in vector space.<\/p>","title":"From Words to Vectors: A Dive into Embedding Model Taxonomy"},{"content":"Overview InnWater is a Horizon Europe project promoting social innovation in multi-level water governance within the WEFE nexus (Water\u2013Energy\u2013Food\u2013Ecosystem). I contributed as AI Engineer\/System Architect at Eurecat, designing the Water Governance Diagnosis Tool and AI Assistant integrated into the platform.\n\ud83d\udd0d Water Governance Diagnosis Tool Water Governance Diagnosis Results and assessment.\nDigitized the OECD Water Governance Principles into a structured analytics pipeline:\nInteractive questionnaire for governance assessment Quantitative scoring engine with classification: Governance Gap (&lt; 1.75) Moderate Governance (&lt; 2.70) Strong Governance (&gt; 2.70) Results dashboard with interpretability layer Integration with CGE economic simulation model The tool converts qualitative governance inputs into reproducible, policy-oriented outputs.\n\ud83e\udd16 AI Assistant \u2013 RAG-based Support Water Governance Assistant.\nDesigned and implemented a Retrieval-Augmented Generation (RAG) architecture to support governance navigation:\nArchitecture:\nQuery treatment module Hierarchical retrieval over indexed project deliverables Embedding-based semantic search LLM response generation grounded in validated documents Logging &amp; evaluation (golden dataset benchmarking) Capabilities:\nExplains OECD governance principles Interprets governance scores Suggests policy improvements Connects governance gaps with economic scenarios Multilingual interaction \u2696 Ethical &amp; Trustworthy AI Implemented safeguards aligned with EU AI Act, GDPR, and ALTAI framework:\nCitation-grounded generation Bias validation datasets Toxicity classifier Multilingual fairness evaluation Transparent AI disclaimers Impact Operationalized governance 
theory into AI-supported digital infrastructure Enabled non-expert stakeholders to interpret complex water governance data Bridged governance assessment with economic modeling Demonstrated responsible AI deployment in public-sector decision support ","permalink":"https:\/\/oriolac.github.io\/projects\/innwater-wg\/","summary":"AI-powered decision-support system for water governance assessment and policy optimization within the WEFE nexus.","title":"InnWater \u2013 AI-Augmented Water Governance Platform"},{"content":"Generative models are a class of machine learning models that learn a representation of the data they are trained on and model the data distribution itself.\nIdeally, generative models should satisfy the following key requirements in a real environment:\nHigh-quality samples are those that capture the underlying patterns and structures present in the data, making them indistinguishable from real data to human observers. Fast Sampling is about the efficiency of image generation and the computational overhead that generative models can cause. Mode Coverage\/Diversity points out how well the model is able to generate the full range of modes and diverse patterns present in the training data. Fig. 1. The Generative Learning Trilemma In this post we will see that Generative Adversarial Networks (GAN) have serious problems with mode collapse, which affects the diversity of synthetic data. While Denoising Diffusion Models (DDM) can cover a wide spectrum of possibilities, they are slow to sample because generation can require thousands of network evaluations. Finally, Variational Auto-Encoders (VAE) and likelihood-based models are fast samplers but are limited when creating diverse patterns not present in the training data.\nMost of the current deep generative learning models focus on high-quality samples, although all the requirements are highly important and key factors in a real environment. All of them have their advantages and drawbacks, and this is called the generative learning trilemma.\nWarning! 
This post is a rewritten part of my master&rsquo;s thesis that you can also find here.\nGenerative Adversarial Network A GAN is an unsupervised model made up of two neural networks: the generator and the discriminator. The idea is based on a game theoretic scenario in which the generator network must compete against an adversary. While the generator network produces samples, the aim of the discriminator is to distinguish between the real samples and those drawn by the generator. The discriminator is a binary classifier trying not to be fooled.\nFig. 2. Generative Adversarial Network Diagram The generator loss is calculated using the discriminator as a reference of how far its samples are from real images, while the discriminator loss is calculated by how accurately it discerns between the synthesized data and the real data. The standard objective is known as the min-max loss:\n\\[Loss_D(D) = E_x[\\log(D(x))]\\] \\[Loss_G(G) = E_y[\\log(1 - D(G(y)))]\\] \\[Loss_{GAN}(G, D) = Loss_D(D) + Loss_G(G)\\]During training, both networks constantly try to outsmart each other, in a zero-sum game. At some point of the training, the game may end up in a state that game theorists call a Nash equilibrium, when no player would be better off changing their own strategy, assuming the other players do not change theirs. GANs can only reach one possible Nash equilibrium: when the generator produces images so realistic that the discriminator is forced to guess, assigning a 50% probability to the image being real.\nNevertheless, the training process cannot always converge to that equilibrium. There are several factors that make it hard for the training to reach the desired state. For instance, there is a possibility that the discriminator always outsmarts the generator so that it can clearly distinguish between fake and real images. 
As it never fails, the generator is stuck trying to produce better images as it cannot learn from the errors of the discriminator.\nPossible solutions can be carried out such as making the discriminator less powerful, decreasing the learning rate or adding noise to the discriminator target. Another big obstacle is when the generator becomes less diverse, and it learns only to perfectly generate realistic images of a single class, so it forgets about the others. This is called mode collapse.\nAt some point, the discriminator can learn how to beat the generator, but then, the latter is forced to do the same but in another class, cycling between classes never becoming good at any of them. A popular technique to avoid is experience replay, which consists in storing synthetic images at each iteration in a replay buffer.\nThere is a lot of literature of obstacles and solutions to improve GAN training and it is still very active, as it is in its applications too. The tuning of hyper-parameters and the design of the model will be a key to pursue the Nash equilibrium.\nFig. 3. Conditional Generative Adversarial Network Diagram For instance, there is a variant called cGAN (conditional GAN). Traditionally, the generative network only produces the image from a random vector as an input, which is also called latent vector since it cannot be manipulated or with prior convictions of how will be. Unfortunately, this only allows to generate a random image from the domain of the latent space, which is hard to map to the generated images. However, cGAN can be trained so that both generator and discriminator models can be conditioned to some class labels or multi-dimensional vectors and produce synthetic images from a specific domain. 
In the following figure, it can be seen a cGAN diagram where the generator is conditioned by some inputs as well as the real data have the condition vector in each sample.\nVariational Auto-Encoders Likelihood-based models are alternatives to GANs that can cover a wide range of possibilities since they focus on estimating the likelihood or probability distribution of the data while training. There are several types of likelihood-based models although we will focus on VAE. The architecture of autoencoders are quite simple, they consist of an encoder, a smaller feature vector also called latent vector \\(z\\) and a decoder , as it can be seen in Figure 5. Therefore, the main goal of the encoder is to comprise the input vector \\(x\\) while the decoder attempts to perform the conversion from lower to higher dimensional data, being the output vector \\(\\bar{x}\\). Hence, the best purpose of autoencoders is dimensionality reduction given its architecture. Then, the main purpose is to find the best set of encoder\/decoder that keeps the maximum information with the less reconstruction error while decoding. In fact, one of the most popular usages of autoencoders in computer vision is the image reconstruction due to their architecture.\nFig. 5. AutoEncoder Diagram Let&rsquo;s assume an autoencoder that both encoder and decoder have only one layer without non-linearity. In that sense, we can see clearly a link with PCA since we are looking for the best linear subspace to project data on with as few information loss as possible.\nHowever, deep neural networks comes with also non-linear layers (non-linear layers allow to learn complex relationships. Some of them are used as activation functions). The more complex of autoencoders architecture, the more they can proceed to a high dimensionality reduction while keeping the loss low. In an ideal case, if the encoder and the decoder have enough degrees of freedom and infinite power, the latent vector could be reduced to one. 
Nevertheless, the dimensionality reduction comes with a price.\nFirst, we will lack of interpretable structures in the latent space. Secondly, the major part of the data structure information will not be in a reduced representation but in arbitrary one without any context of the patterns that the autoencoder could infer. Therefore, it is important to control and adjust the depth and the latent space dimensions depending on the final purpose.\nAt this point, we have learned the surface from autoencoders, but how do they fit in image generation? Once the autoencoder has been trained, we have both encoder and decoder with the weights to reconstruct the data. At first, we might think that changing the latent vector we could take a point randomly from the space and decode it to get a new content. Although that could work, the regularity of the latent space for autoencoders is hard since it depends on several factors such as the data distribution, the latent space dimension and the architecture of the encoder. To sum up, there will be several biases that we are not able to control.\nIn order to make the latent vector regular and continuous, VAEs try to solve this problem by mapping the inputs to a normal probability distribution, so they introduce explicit regularization during the training process. Hence, the latent vector will be sampled from that distribution, being the decoder more robust at decoding latent vectors as a result. A VAE is an architecture composed of both encoder and decoder too. However, instead of encoding a latent vector, they encode it as a distribution over the latent space. The following enumeration details the process:\nThe input is encoded as a distribution over the latent space. A point is generated given the distribution encoded. The sampled point is decoded so the reconstruction and regularization error can be computed. Fig. 5. 
Variational AutoEncoder Diagram The encoded distributions are chosen to be normal so that the encoder can be trained to return the mean and covariance matrix. Controlling the covariance regularizes the latent space locally, while controlling the mean regularizes it globally. Hence, the loss is not only about the reconstruction of the data in the last layer, called the reconstruction term, but also about how the latent space is organized, by keeping the encoded distributions close to a standard normal distribution, called the regularization term. The latter is calculated as the distance from the standard Gaussian distribution, measured with the Kullback-Leibler (KL) divergence, a measure of how one probability distribution differs from a second, reference probability distribution. It&rsquo;s commonly used in various fields, including information theory, statistics, and machine learning. Using the KL divergence, which is expressed in terms of the means and the covariance matrices of the two distributions, the variety of outputs is better represented in the latent space of VAEs than in that of traditional autoencoders:\n\\[L_{\\text{VAE}} = \\text{Reconstruction term} + \\text{Regularization term}\\] \\[\\text{Reconstruction term} = \\frac{1}{N} \\sum_{i=1}^{N} \\| x_i - \\bar{x}_i \\|^2\\] \\[ \\text{Regularization term} = \\frac{1}{2} \\sum_{j=1}^{J} \\left(\\mu_j^2 + \\sigma_j^2 - \\log(\\sigma_j^2) - 1 \\right)\\]Minimizing this loss corresponds to maximizing the variational lower bound, or ELBO, a tractable bound on the likelihood of the observed data given the model&rsquo;s parameters and latent variables. To demonstrate the differences we can compare an autoencoder and a VAE trained on MNIST data, although it is a really balanced dataset. We can see that the range of values in the latent vector of the latter is much smaller and more centered, representing every class better, whereas the autoencoder has the images more sparsely spread across the latent domain.\nFig. 6. Comparison between non-probabilistic autoencoder and VAE. 
Moreover, one particularity that makes VAE architectures good in business and industry related problems or even in processing images (one problem that VAE solve really well is in image reconstruction since it gets the main and reduced idea of an image) is that they are really fast when sampling due to its bottleneck and the dimensional reduction feature. Nevertheless, the reduction of the bottleneck also makes that the quality of the samples produced by VAE are lower than other generative models. In particular, when they are compared to DDMs.\nDenoising Diffusion Models Denoising Diffusion Models (DDMs) are a type of generative model that operates through a different framework than VAEs, although their basis is also probabilistic. While VAEs focus on encoding data into a latent space and then decoding it to generate reconstructions, DDMs revolves around modeling the diffusion process of data. This process involves the gradual transition from a noisy version of the data to the actual clean data. The process of training a DDM involves optimizing the parameters of the diffusion process so that it can effectively recreate the observed data distribution. But, what is this process about? We can assume that all data comes from a distribution. Basically, any dataset is sampled from the real distribution. The goal of generative models is to learn that distribution so they could sample from it and get another data point that looks like is from the dataset trained. The way that DDM learns that distribution is by trying to convert well-known and simple base distribution (like gaussian) to the target data iteratively, with small steps, using a Markov chain, treating the output of the markov chain as the model&rsquo;s approximation for the learned distribution. Hence, the diffusion process can be split into two parts: forward and reverse diffusion processes.\nFig. 7. Forward diffusion process mapping. 
In the forward process \\(q(x_{t}|x_{t-1})\\), the model slowly and iteratively adds noise to the images so that they move out of their existing subspace.\nFig. 7. Forward diffusion process. Since it has been proved that repeating this indefinitely eventually turns the image into a point from a normal distribution, \\(q(x_T|x_0) \\thickapprox \\mathcal{N}(0,I)\\), the final corrupted image loses all information from the original sample. So, the aim is to convert the unknown and complex distribution of the dataset into one that is easier to understand and to sample a point from. The forward process takes the form of a Markov chain, where the distribution at a particular time step only depends on the sample of the immediately previous step. Therefore, it is easy to write out the distribution of corrupted samples conditioned on the initial data point \\(x_0\\) as the product of successive single-step conditionals:\n\\[q(x_{1:T} | x_0) = \\prod_{t=1}^{T} q(x_t|x_{t-1})\\]In the case of continuous data, each transition is parameterized as a diagonal Gaussian distribution. That is why, as \\(t\\) approaches \\(T\\), the sample ends up following a Gaussian distribution centered at 0: the parameter \\(\\beta_t\\) grows towards \\(1\\) and, then, \\(\\sqrt{1 - \\beta_T}\\) theoretically approaches 0: \\[q(x_{t}|x_{t-1}) = \\mathcal{N}(x_t; \\sqrt{1 - \\beta_t} x_{t-1}, \\beta_t I)\\] \\[ \\beta_1 < \\beta_2 < ... < \\beta_T \\quad ; \\quad \\beta_t \\in (0,1) \\] Although taking small steps has a cost, it makes learning to undo the steps of the forward process less difficult. When little noise is added at each step, there is less ambiguity in determining the probability density function of the previous step in the reverse process.\nThe goal of the reverse process is to undo the forward process and to learn the denoising process in order to generate new data from random noise. 
Unfortunately, infinitely many paths can be taken starting from the corrupted image, but only a few of them will turn the noisy image into a data point from the desired subspace. Hence, DDMs take small iterative steps during the forward diffusion process, so that the corrupted image differs only slightly from one step to the next.\nFig. 8. Forward and reverse diffusion processes. Like the forward process, the reverse process is set up as a Markov chain, starting from the pure noise distribution \\(p(x_T) = \\mathcal{N}(x_T; 0, I)\\): \\[p_{\\theta}(x_{0:T}) = p(x_T) \\prod_{t=1}^{T} p_{\\theta}(x_{t-1}| x_t)\\] Feller (1950) shows that, theoretically, the true reverse process has the same functional form as the forward process. Therefore, each learned reverse step is also a diagonal Gaussian distribution:\n\\[p_{\\theta}(x_{t-1}|x_t) = \\mathcal{N}(x_{t-1}; \\mu_{\\theta}(x_t, t), \\sigma_{\\theta}(x_t, t))\\]The forward process pushes a sample off the data manifold, turning it into noise, and the reverse process is trained to produce a trajectory back to the data manifold, resulting in a reasonable sample. We do not optimize the maximum likelihood objective directly, because \\(p_{\\theta}(x_0) = \\int p_{\\theta}(x_{0:T}) dx_{1:T}\\) includes all the possible trajectories, that is, all the ways a noisy point could have arrived at \\(x_0\\). If we compare a VAE with a DDM, the encoder would be the forward process while the decoder would be the reverse process. Hence, the latent variables would be \\(\\{x_i | i \\in \\{1, 2, ..., T\\}\\}\\) and the observed variable would be \\(x_0\\). Unlike in a VAE, the encoder part is typically fixed and only the reverse process is learned, so a single network needs to be trained. 
When we have a model with observations and latent variables, we can use the ELBO to maximize the expected density assigned to the data while the KL divergence encourages the approximate posterior \\(q(z|x)\\) to be similar to the prior on the latent variable \\(p_{\\theta}(z)\\).\nTherefore, each forward and reverse step contributes a loss term given by the divergence between both distributions. Nevertheless, although at training time any term of this objective can be obtained without having to simulate an entire chain, different trajectories may visit different samples at time \\(t-1\\) on the way to hitting \\(x_t\\), so the setup can have high variance, limiting training efficiency. To help with this, the objective is rearranged as follows.\n\\[E_q[-D_{KL}(q(x_{T}|x_0) \\; || \\; p(x_T)) - \\sum_{t > 1} D_{KL}(q(x_{t-1} | x_t, x_0) \\; || \\; p_{\\theta}(x_{t-1} | x_t)) + \\log \\, p_{\\theta}(x_0| x_1)]\\]The first term compares the noise distribution \\(p(x_T)\\) (the start of the reverse process) with the forward process \\(q(x_T|x_0)\\), both of which are fixed. The second term is a sum of divergences, each between a reverse step and a forward process posterior conditioned on \\(x_0\\). When \\(x_0\\) is treated as known, as it is during training, all the \\(q\\) terms are Gaussians. Each divergence then compares two Gaussians, which helps reduce variance during training.\nTo add conditional inputs to the model, the conditioning variable \\(y\\) is fed as an additional input during training, \\(p_{\\theta}(x_0 | y)\\). Moreover, adding a separately trained classifier can help guide the diffusion process in the direction of the gradient of the target label probability with respect to the current noisy image.\nComparing DDMs and VAEs, DDMs can capture complex dependencies and distributions in the data due to their inherent modeling of the diffusion process. 
Estimating and analyzing small step sizes is more tractable than describing a single non-normalizable step from random noise to the learned distributions, which is what VAEs and GANs do. However, they can be computationally more intensive to train due to the intricacies involved in estimating the diffusion process parameters accurately.\nConclusion The exploration of Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs), and Denoising Diffusion Models (DDMs) illustrates the inherent trade-offs within the Generative Learning Trilemma: the balance between sample quality, sampling speed, and mode coverage\/diversity.\nEach model type occupies a distinct position within this trilemma:\nGANs excel at generating visually realistic, high-quality samples but struggle with mode collapse, often failing to represent the full diversity of the data. Their adversarial training is also notoriously unstable, requiring careful balancing between generator and discriminator dynamics. VAEs prioritize efficient sampling and structured latent spaces, offering fast generation and interpretability. However, their reconstructions tend to be overly smooth or blurry, reflecting the limitations of their probabilistic assumptions and the imposed regularization on the latent space. DDMs, in contrast, achieve exceptional diversity and fidelity by explicitly modeling the data generation process as a gradual denoising sequence. Their main drawback lies in computational cost, as thousands of iterative steps are needed for both training and sampling. In essence, these three paradigms represent different compromises among realism, efficiency, and coverage.\nThe future of generative modeling likely resides in hybrid architectures that integrate the strengths of each approach. 
As computational power and architectural innovations continue to evolve, these generative models will converge toward systems capable of high-quality, diverse, and efficient generation, moving closer to resolving the generative trilemma that defines this fascinating field.\n","permalink":"https:\/\/oriolac.github.io\/posts\/20250710-starting-diffusion\/","summary":"<p>Generative models are a class of machine learning that learn a representation of the data trained on and they model the\ndata itself.<\/p>\n<p>Ideally, generative models should satisfy the following key requirements in a real environment:<\/p>\n<ul>\n<li><strong>High quality samples<\/strong> refers to those samples that captures the underlying patterns and\nstructures present in the data making them indistinguishable from human observers.<\/li>\n<li><strong>Fast Sampling<\/strong> is about the efficiency of image generation and the computational overhead\nthat can cause generative models.<\/li>\n<li><strong>Mode Coverage\/Diversity<\/strong> points out how the model is able to generate a full range of\nmods and diverse patterns present in the training data<\/li>\n<\/ul>\n<p>\n<figure>\n  <img loading=\"lazy\" src=\"\/posts\/2025\/gen_tril\/gen_tril.png#center\" alt=\"alt text\"  title=\"Fig. 1. The Generative Learning Trilemma\"  \/> \n  <figcaption\n    style=\"\n      font-size: 15px;\n      color: #7a7a7a;\n      margin-top: 0.5em;\n      text-align: center;\n      font-weight: 100;\n    \"\n  >\n  Fig. 1. The Generative Learning Trilemma\n\n  <\/figcaption>\n  \n<\/figure>\n<\/p>","title":"The Generative Trilemma: A quick overview"},{"content":"Overview Presentation on vehicle recognition and traffic reconstruction using the NebulOus cloud platform. 
The talk covers the application of computer vision techniques for traffic analysis and the deployment of ML models in cloud infrastructure.\nKey Topics Vehicle detection and tracking Traffic pattern reconstruction NebulOus cloud platform architecture Real-time processing challenges Event TechMeeting is a technical meetup in Lleida focused on emerging technologies and their practical applications.\n","permalink":"https:\/\/oriolac.github.io\/talks\/techmeeting-nebulous\/","summary":"<h2 id=\"overview\">Overview<\/h2>\n<p>Presentation on vehicle recognition and traffic reconstruction using the NebulOus cloud platform. The talk covers the application of computer vision techniques for traffic analysis and the deployment of ML models in cloud infrastructure.<\/p>\n<h2 id=\"key-topics\">Key Topics<\/h2>\n<ul>\n<li>Vehicle detection and tracking<\/li>\n<li>Traffic pattern reconstruction<\/li>\n<li>NebulOus cloud platform architecture<\/li>\n<li>Real-time processing challenges<\/li>\n<\/ul>\n<h2 id=\"event\">Event<\/h2>\n<p>TechMeeting is a technical meetup in Lleida focused on emerging technologies and their practical applications.<\/p>","title":"Reconeixement de Vehicles i Reconstrucci\u00f3 de Tr\u00e0nsit amb NebulOus"},{"content":"Transformers have demonstrated excellent capabilities and they overcome challenges such NLP, Text-To-Image Generation or Image Completion with large datasets, great model size and enough compute. Talking about transformers nowadays is as casual as talking about CNNs, MLPs or Linear Regressions. Why not take a glance through this state-of-the-art architecture?\nIn this post, we\u2019ll introduce the Sequence-to-Sequence (Seq2Seq) paradigm, explore the attention mechanism, and provide a detailed, step-by-step explanation of the components that make up transformer architectures.\nSequence-to-sequence paradigm Seq2Seq was initially introduced in Recurrent Neural Networks (RNN) and later enhaced by Long Short-Term Memory (LSTM) networks. 
This architecture splits the task into two primary components:\nEncoder. It processes and compresses the input sequence into a fixed-length vector, commonly referred to as the context vector or hidden state. Decoder. It sequentially generates the target output using the information encoded in the context vector. In essence, we encode the input language and decode into the target language of the translation. For example, English uses a Subject-Verb-Object (SVO) order, while Japanese goes with Subject-Object-Verb (SOV) and often skips the subject altogether. This flexibility lets Seq2Seq models adapt to these quirks and do a great job of capturing the meaning and flow of translations.\nFig. 1. The encoder model processes each token of the input sentence (How are you?), updating its hidden state with each step. Upon encountering the End of Sequence (&lt;EOS&gt;) token, the final hidden state is passed to the decoder model. The decoder then generates the output sequence (\u304a\u5143\u6c17\u3067\u3059\u304b) token by token, starting with the Start of Sequence (&lt;SOS&gt;) token and continuing until the End of Sequence (EOS) token is reached. Sequence-to-sequence with attention What are the next steps before talking about transformers? You may already know that transformers&rsquo; basis is the attention mechanism. But what is attention?\nOne of the main challenges with Seq2Seq models is the fixed-length context vector passed from the encoder to the decoder. Since it is fixed-length, the resulting context vector might hold more information about the last tokens than about the first ones. Hence, the decoder cannot focus on specific parts of the input sentence. For longer sentences, this bottleneck can result in loss of information, making translations less accurate or meaningful.\nAttention tries to sort out this issue by allowing the decoder to focus on specific parts of the input sequence at each step of the generation process. 
Instead of relying only on the resulting context vector, the attention mechanism calculates a weighted combination of all encoder hidden states. This ensures that the decoder has access to the most relevant information.\nAttention is a mechanism that allows a model to focus on the most relevant parts of an input when making a prediction. We can say that it calculates, from a token, the weights (importance) of the other tokens on the fly.\nFig. 2. The encoder model processes each token of the input sentence (How are you?), updating its hidden state with each step. The hidden states are stored at each encoding step until encountering the End of Sequence token. The decoder then generates the output sequence (\u304a\u5143\u6c17\u3067\u3059\u304b) token by token, starting with the Start of Sequence (&lt;SOS&gt;) token and continuing until the End of Sequence (EOS) token is reached. The hidden state passed to the decoder is built using the attention mechanism at each step using the hidden states of the encoder model and the previous hidden state of the decoder model. Alignment. At each decoding step, the attention mechanism calculates a score to determine the relevance of each encoder hidden state to the current decoder state. There are many alignment functions, but the most popular are Bahdanau (Additive) Attention and Scaled Dot-Product Attention. Weighting. These scores are normalized using softmax to generate a set of attention weights. Context vector. The attention weights are used to compute a weighted sum of the encoder hidden states, producing a context vector specific to the current output generation. Output Generation. The context vector is then combined with the decoder&rsquo;s state to generate the next token. The attention mechanism is useful in tasks like translation, where alignment between input and output sequences is important. 
Also, the selective focus makes the model more interpretable, since it can provide insights into which parts of the input the model considers relevant, offering a form of explainability.\nNevertheless, the attention mechanism requires computing attention scores between each decoder step and all encoder outputs. For long input sentences, this results in a large number of computations, increasing computation time and memory consumption.\nFinally, it&rsquo;s important to note that the decoder in the Seq2Seq architecture operates in an autoregressive manner, generating tokens one at a time. This sequential process limits parallelization during decoding, resulting in slower inference times compared to non-autoregressive models, which can generate multiple tokens simultaneously.\nYou may find more information in the tag seq2seq.\nTransformer architecture Transformers [1] emerged as a way to build encoder-decoder architectures to solve machine translation problems. While RNNs and LSTMs use recurrent steps and can suffer more from vanishing gradients and limited parallelization, transformers bypass this by processing sequences in parallel.\nThe transformer neural network is composed of an encoder-decoder architecture, much like an RNN-based Seq2Seq model. However, the difference is that the input sequence can be processed in parallel, as long as a positional encoding is passed along with it, since a token might have a different meaning depending on its position.\nFig. 3. Transformer model architecture Attention is all you need.\nThe input and the positional encoding are passed into the encoder block. The job of the encoder is to map the whole input sequence into an abstract continuous representation that holds the learned information for that entire sequence. The encoder block has \\(N\\) identical encoder layers. The main objective of the encoder is to capture the attention between tokens in both directions, also called self-attention or bidirectional attention. 
This means that in this part we attempt to capture, for each token, the relevant parts of all the tokens of the sentence (even those that come after the token). Hence, the encoder part is non-autoregressive.\nRegarding the decoder block, it has several similarities with the encoder block. They both have \\(N\\) identical layers and a positional encoding at the very beginning. However, the multi-head attention layers of the decoder block have a different job compared to those of the encoder. The decoder is autoregressive: it takes its own previous outputs and the encoder output vector as inputs. This is because the encoder can use all the elements of the input sentence, but the decoder can only use the previous elements of the output sentence. The attention captured in decoder blocks is called causal attention.\nPositional Encoding Positional encoding is the process of producing a vector that gives context based on the position of an element in a sentence, so we end up with a matrix of encoded positions. We could simply use a one-dimensional vector of natural numbers like \\([1, 2, ..., n]\\). But one of the reasons we want positional encoding is to feed not only the positions but also their relationships. 
Therefore, they came up with a way of capturing both absolute and relative positions with a smooth representation of the position information (taking into account that the difference between 1000 and 1001 is &ldquo;smaller&rdquo; than the difference between 1 and 2) and providing better high-dimensional contextual information.\n\\[ PE_{(pos, 2i)} = \\sin\\left(\\frac{pos}{n^{\\frac{2i}{d_{model}}}}\\right)\\] \\[PE_{(pos, 2i+1)} = \\cos\\left(\\frac{pos}{n^{\\frac{2i}{d_{model}}}}\\right)\\]where\n\\(pos\\) is the position index of the token \\(i\\) is the index of the encoding dimension \\(d_{model}\\) is the dimensionality of the model&rsquo;s embedding space \\(n\\) is the base of the frequency scaling factor, set to \\(10000\\) For every odd dimension index they use the cosine function, while for every even dimension index they use the sine function. These functions have linear properties that the model can easily learn to attend to. The resulting vectors are added to their corresponding input embedding vectors.\nThe most difficult part of this formula to understand might be the denominator. It ensures different frequency scales across dimensions. While lower dimensions of the positional encoding capture higher frequency variations, higher dimensions capture lower frequency variations, allowing them to encode larger positional distances smoothly. The following figure might help to understand the magic of it.\nFig. 4. Example of positional encoding values of the sine in some encoding dimensions. 
import numpy as np import matplotlib.pyplot as plt def plot_sinus(k, ax, d=512, n=10000): x = np.arange(0, 200, 1) denominator = np.power(n, 2*x\/d) y = np.sin(k\/denominator) ax.plot(x, y) ax.set_title(&#39;k = &#39; + str(k)) ax.set_xlabel(&#34;Dimension&#34;) ax.set_ylim([-1, 1]) ax.set_xlim([0, 200]) fig, axs = plt.subplots(1, 4, figsize=(16, 4)) fig.tight_layout() for i, ax in enumerate(axs.flat): plot_sinus(i*4, ax) Scaled Dot-Product Attention There are several attention mechanisms: additive, content-based, Bahdanau [2]&hellip; Transformers introduced a new mechanism called Scaled Dot-Product Attention. An attention function operates on queries (Q), keys (K) and values (V).\nFig. 5. Scaled Dot-Product Attention Attention is all you need.\nThe query is a vector related to what we encode, the key is a vector related to what we use as input to output and the value is the learned vector as a result of calculations but related to the input. In other words, the query represents what we are looking for, the key represents possible matches in the input and the value represents the actual information associated with each key.\n\\[Attention(Q,K,V) = softmax(\\frac{QK^T}{\\sqrt{d_k}}) \\cdot V \\]where \\(d_k\\) is the dimensionality of the key vectors. The idea is practically the same: the result is a weighted sum of values, where more relevant elements contribute more to the output.\nMulti-Head Attention Layer To give the encoder model more representation power of the self-attention, they created the Multi-Head Attention Layer. Instead of computing a single attention function, the Scaled Dot-Product Attention is split into several blocks called heads, each running in parallel. Each head independently computes attention, and the head outputs are then concatenated to form the final output.\nFig. 6. Multi-Head Attention Layer Attention is all you need.\nThe input of each head is first fed into three distinct fully connected layers to create the query, key and value vectors. 
These transformations allow the network to learn different types of relationships between tokens. The idea is that the attention block matches the query against a set of keys and then uses the resulting attention weights to combine the values.\nFig. 7. Example of Multi-Head Attention Layer mechanism.\nAfter computing attention in each head, the results are concatenated and passed through a linear projection layer. This ensures that the output has the same dimensionality as the input, allowing for seamless integration with subsequent layers.\nMasked Multi-Head Attention block The encoder block has two sub-layers: a multi-head attention layer and a feed-forward layer. Both sub-layers have a residual connection followed by layer normalization on their output vector. The residual connections help the network train by allowing gradients to flow directly through the network, while the normalization is used to stabilize training.\nThe decoder block consists of three sub-layers: two multi-head attention layers and one feed-forward layer. The first multi-head attention layer in the decoder is masked to prevent it from attending to future tokens. This is achieved by applying a mask to the attention score matrix before computing softmax, ensuring that predictions for a given token do not depend on future tokens.\nIn the second multi-head attention layer, the queries are derived from the output of the first attention layer in the decoder, while the keys and values come from the encoder\u2019s output. This mechanism enables the decoder to integrate information from the encoder while maintaining the structure of previously generated tokens. The final output is then processed by a feed-forward layer before being passed to a linear layer and a softmax function, which converts it into a probability distribution over possible output tokens.\nCross-Attention The interaction between the encoder and decoder is facilitated by cross-attention. 
In the second multi-head attention layer of the decoder, the queries originate from the decoder\u2019s previous output, while the keys and values come from the encoder\u2019s output. This allows the decoder to focus on relevant parts of the input sequence when generating each token in the output. Cross-attention is essential for tasks such as machine translation, where the output sequence depends heavily on the input sequence.\nReferences [1] Ashish Vaswani, et al. Attention is all you need, NIPS 2017\n[2] Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015\nCitation Al\u00e0s Cerc\u00f3s, Oriol. (Feb 2025). Introduction to Attention Mechanism and Transformers. https:\/\/oriolac.github.io\/posts\/2024-10-29-attention\/.\n@article{alas2025, title = &#34;Introduction to Attention Mechanism and Transformers.&#34;, author = &#34;Al\u00e0s Cerc\u00f3s, Oriol&#34;, journal = &#34;oriolac.github.io&#34;, year = &#34;2025&#34;, month = &#34;February&#34;, url = &#34;https:\/\/oriolac.github.io\/posts\/2024-10-29-attention\/&#34; } ","permalink":"https:\/\/oriolac.github.io\/posts\/20241029-attention\/","summary":"<p>Transformers have demonstrated excellent capabilities and they overcome challenges such <em>NLP<\/em>, <em>Text-To-Image Generation<\/em> or <em>Image Completion<\/em>\nwith large datasets, great model size and enough compute.\nTalking about transformers nowadays is as casual as talking about <em>CNNs<\/em>, <em>MLPs<\/em> or <em>Linear Regressions<\/em>. 
Why not take a glance through this state-of-the-art architecture?<\/p>\n<p>In this post, we\u2019ll introduce the Sequence-to-Sequence (Seq2Seq) paradigm, explore the attention mechanism, and provide a detailed,\nstep-by-step explanation of the components that make up transformer architectures.<\/p>","title":"Introduction to Attention Mechanism and Transformers"},{"content":"Problem The local Nativity Scene Contest (Concurs de Pessebres) in Artesa de Segre needed stronger collaboration with local shops to increase visibility and participation. The baseline was 14 participating shops the previous year, and the goal was to grow that number by creating a more compelling, town-wide experience that benefits both the association and the local economy.\nConstraints Low-friction participation: the experience had to work on any phone, with no app-store install required. Real-world deployment: QR codes and physical dioramas needed to be placed in shop windows and remain usable throughout the campaign. Fairness and maintainability: shops needed a consistent way to participate while keeping the initiative scalable year-over-year. Cultural authenticity: the content had to reflect Artesa\u2019s identity (local traits, references, songs), not a generic Christmas storyline. Multi-stakeholder coordination: align an association, shop owners, and the local high school\u2014each with different incentives and constraints. My role I acted as the main organizer and product owner of the initiative, defining the concept, rules, and rollout strategy. I was also responsible for the technical design and development of the platform, including system architecture and implementation. 
In addition, I designed the user experience and visual identity of the challenge and led its promotion through appearances on local television and in local newspapers to explain the project and encourage participation.\nArchitecture At a high level, the system combines a QR-driven user journey with a lightweight web platform:\nUser journey (mobile-first) Visitors walk through the town and scan QR codes displayed in the windows of participating shops. Each scan unlocks a fragment of an original story that narrates the Nativity of Artesa de Segre, incorporating local cultural traits and traditional songs. The experience is progressive and encourages movement across different areas of the town, transforming the contest into an exploratory, town-wide activity rather than a single-location event.\nBeyond the digital story, a key component of the project was the creation of unique physical nativity scenes (diorames) for each participating shop. These were developed in collaboration with the local high school, which strengthened ties between the association and younger participants, some of whom are between 8 and 16 years old. This collaboration not only increased the number and diversity of dioramas, but also positioned students as active contributors to a cultural initiative within their own town.\nCore components Frontend: Angular web app (mobile-friendly) for scanning, progress tracking, and story reading. Backend: Django REST API for users, progress, story chapters, and shop content management. Auth: OAuth-based login to reduce friction and support secure administration. Shop portal: a dedicated panel where each shop can upload and maintain its description (and associated content), keeping operations decentralized and sustainable. 
Results \/ metrics Local business engagement doubled: we achieved a 2\u00d7 increase in the number of local shops interested in promoting the contest (from the prior baseline of 14 shops to significantly more, 30). Stronger town-wide visibility: by embedding the challenge into the Christmas program and shop windows, the contest shifted from a \u201csingle-event\u201d dynamic to a distributed, repeatable experience across the holiday period. Lessons learned Gamification works best when it is aligned with real incentives: tying story completion to visiting (and supporting) shops created a clearer value exchange than \u201cscan just for points.\u201d Operational tooling matters as much as the app: the shop panel reduced bottlenecks and made participation maintainable without constant central coordination. Physical + digital (\u201cphygital\u201d) beats digital-only for local culture: the dioramas turned QR scanning into a meaningful on-street experience, not just a link to content. Youth partnerships are leverage: involving the high school improved both production capacity (more dioramas) and community buy-in. Blending physical experiences with lightweight digital interactions proved far more effective than a purely online solution for a local cultural event. Giving shops autonomy through simple management tools was essential for scalability and long-term sustainability. Finally, involving younger participants as co-creators, rather than passive attendees, significantly improved community engagement and strengthened the social impact of the project.\n","permalink":"https:\/\/oriolac.github.io\/projects\/repte-pessebre\/","summary":"Gamified QR-based Christmas challenge for local commerce.","title":"QR Nativity Challenge"},{"content":"Traditional computer vision techniques involve methods and algorithms that do not rely on deep learning or neural networks. 
Instead, these approaches are not data-driven: they rely on classical methods to process and analyze images. So, in this post, we&rsquo;ll explore three thresholding techniques!\nThresholding When the task is to distinguish the background from the foreground, thresholding provides a straightforward solution. We will use this image as an example.\nThis technique segments an image by assigning one value (typically white) to all pixels above a specified threshold and another value (usually black) to the remaining pixels. Thresholding is a simple yet effective method for separating objects from the background, especially when the background is not complicated.\nCode of the histogram:\nimport cv2 import matplotlib.pyplot as plt img = cv2.imread(&#34;images\/document.PNG&#34;) img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) plt.hist(img.reshape(-1), bins=256, range=(0, 256), zorder=3, color=&#34;teal&#34;) plt.xlim([0,255]) plt.ylim(plt.ylim()) plt.vlines(130, *plt.ylim(), label=&#39;Threshold&#39;, linestyles=&#39;dotted&#39;, color=&#39;red&#39;, zorder=3) plt.grid(alpha=0.3, zorder=1) plt.title(&#34;Histogram of the image&#34;) Binary Thresholding Binary thresholding is the simplest method of thresholding. Each pixel in the image is compared to a threshold value: if the pixel value is higher than the threshold, it is set to the maximum value (white); otherwise, it is set to the minimum value (black).\nIn OpenCV, we almost always use the function cv2.threshold to apply thresholding to an image. In this case, we are setting the threshold to 130 and the maximum value of the image to 255. cv2.THRESH_BINARY is the flag to indicate which kind of thresholding method we want.\nthreshold = 130 threshold, binary_image = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY) # The first returned value is the threshold that was used, which we already know here. In our case, the image should turn into:\nOtsu Thresholding But&hellip; how to find the best threshold? Well, the Japanese researcher Nobuyuki Otsu did a great job on that. 
Otsu&rsquo;s thresholding is an automatic method that calculates the optimal threshold value by minimizing the intra-class variance. This technique is particularly useful for images with bimodal histograms.\notsu_threshold, otsued_img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU) In our case, the image should turn into:\nAdaptive Thresholding What if we have different lighting conditions? Adaptive thresholding is particularly useful in cases where the background intensity varies across the image. Instead of a single global threshold, this technique calculates different threshold values for different regions of the image.\nAdaptive thresholding typically uses either the mean or the Gaussian-weighted sum of the neighborhood of each pixel (typical, right?). In OpenCV, we can specify the type of adaptive method with cv2.ADAPTIVE_THRESH_MEAN_C or cv2.ADAPTIVE_THRESH_GAUSSIAN_C. Additionally, a constant C is subtracted from the calculated threshold value to further adjust the results.\nadaptive_threshold_img = cv2.adaptiveThreshold( img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2 ) In this case, adaptive thresholding succeeds because it compensates for background variations.\nfig, axs = plt.subplots(3, 1, figsize=(6, 10)) for ax in axs: ax.axis(&#39;off&#39;) axs[0].imshow(img, cmap=&#39;gray&#39;); axs[0].set_title(&#34;Original image&#34;) ret, thresh = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY) axs[1].imshow(thresh, cmap=&#39;gray&#39;); axs[1].set_title(&#34;Binary thresholding&#34;) thresh2 = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 20) axs[2].imshow(thresh2, cmap=&#39;gray&#39;); axs[2].set_title(&#34;Adaptive thresholding&#34;); ","permalink":"https:\/\/oriolac.github.io\/posts\/cv-techniques\/20240615-cv-techniques\/","summary":"<p><strong>Traditional computer vision techniques<\/strong> involve methods and algorithms that do not rely on deep learning or neural networks. 
Instead, these approaches are not data-driven and they use classical approaches to process and analyze images. So, in this post, we&rsquo;ll explore <strong>three thresholding techniques!<\/strong><\/p>\n<h1 id=\"thresholding\">Thresholding<\/h1>\n<p>When the task is to distinguish the background from the foreground, thresholding provides a straightforward solution. We will use this image as an example.<\/p>\n<p>\n<figure>\n  <img loading=\"lazy\" src=\"https:\/\/raw.githubusercontent.com\/Oriolac\/oriolac.github.io\/refs\/heads\/main\/content\/posts\/cv-techniques\/imgs\/text_image.png?raw=true#center\" alt=\"Image content\"  \/> \n<\/figure>\n<\/p>","title":"Thresholding, filtering and morphological operations"},{"content":"Overview Comparative analysis of Large Language Models (LLMs) and Small Language Models (SLMs) in the context of generative AI. Explanation of different capabilities to customize and fine-tune models and limitations of deployment.\nKey Topics Architecture differences between LLMs and SLMs Capabilities for customization and fine-tuning Deployment considerations and resource requirements Real-world applications and use case selection Event The AI &amp; Big Data Congress is one of Spain&rsquo;s conferences on artificial intelligence and data science, bringing together researchers, practitioners, and industry leaders.\n","permalink":"https:\/\/oriolac.github.io\/talks\/ai-big-data-congress\/","summary":"<h2 id=\"overview\">Overview<\/h2>\n<p>Comparative analysis of Large Language Models (LLMs) and Small Language Models (SLMs) in the context of generative AI.\nExplanation of different capabilities to customize and fine-tune models and limitations of deployment.<\/p>\n<h2 id=\"key-topics\">Key Topics<\/h2>\n<ul>\n<li>Architecture differences between LLMs and SLMs<\/li>\n<li>Capabilities for customization and fine-tuning<\/li>\n<li>Deployment considerations and resource requirements<\/li>\n<li>Real-world applications and use case 
selection<\/li>\n<\/ul>\n<h2 id=\"event\">Event<\/h2>\n<p>The AI &amp; Big Data Congress is one of Spain&rsquo;s conferences on artificial intelligence and data science, bringing\ntogether researchers, practitioners, and industry leaders.<\/p>","title":"IA Generativa: LLMs vs SLMs"},{"content":"Overview Comprehensive masterclass on Retrieval Augmented Generation (RAG) systems, covering the fundamentals of combining retrieval mechanisms with large language models to improve accuracy and reduce hallucinations.\nKey Topics RAG architecture and design patterns Embedding models and vector databases Chunking strategies and retrieval optimization Evaluation metrics and best practices Hands-on implementation examples Event CIDAI (Centre d&rsquo;Innovaci\u00f3 Digital i Artificial Intelligence) masterclass series focused on advanced AI techniques for professionals and researchers.\n","permalink":"https:\/\/oriolac.github.io\/talks\/cidai-masterclass-rag\/","summary":"<h2 id=\"overview\">Overview<\/h2>\n<p>Comprehensive masterclass on Retrieval Augmented Generation (RAG) systems, covering the fundamentals of combining retrieval mechanisms with large language models to improve accuracy and reduce hallucinations.<\/p>\n<h2 id=\"key-topics\">Key Topics<\/h2>\n<ul>\n<li>RAG architecture and design patterns<\/li>\n<li>Embedding models and vector databases<\/li>\n<li>Chunking strategies and retrieval optimization<\/li>\n<li>Evaluation metrics and best practices<\/li>\n<li>Hands-on implementation examples<\/li>\n<\/ul>\n<h2 id=\"event\">Event<\/h2>\n<p>CIDAI (Centre d&rsquo;Innovaci\u00f3 Digital i Artificial Intelligence) masterclass series focused on advanced AI techniques for professionals and researchers.<\/p>","title":"Generative AI: Retrieval Augmented Generation (RAG) systems"},{"content":"Overview Introductory workshop on Machine Learning and Deep Learning fundamentals designed for hackathon participants. 
The session covers core concepts, practical algorithms, and hands-on implementation using popular frameworks.\nKey Topics Supervised vs unsupervised learning Neural network fundamentals Common ML\/DL architectures Practical tools and frameworks (scikit-learn, PyTorch, TensorFlow) Tips for rapid prototyping in hackathons Event HackEPS is the largest hackathon in Catalonia, organized by students at the University of Lleida, bringing together hundreds of participants to build innovative projects in 24 hours.\n","permalink":"https:\/\/oriolac.github.io\/talks\/hackeps-ml-dl\/","summary":"<h2 id=\"overview\">Overview<\/h2>\n<p>Introductory workshop on Machine Learning and Deep Learning fundamentals designed for hackathon participants. The session covers core concepts, practical algorithms, and hands-on implementation using popular frameworks.<\/p>\n<h2 id=\"key-topics\">Key Topics<\/h2>\n<ul>\n<li>Supervised vs unsupervised learning<\/li>\n<li>Neural network fundamentals<\/li>\n<li>Common ML\/DL architectures<\/li>\n<li>Practical tools and frameworks (scikit-learn, PyTorch, TensorFlow)<\/li>\n<li>Tips for rapid prototyping in hackathons<\/li>\n<\/ul>\n<h2 id=\"event\">Event<\/h2>\n<p>HackEPS is the largest hackathon in Catalonia, organized by students at the University of Lleida, bringing together hundreds of participants to build innovative projects in 24 hours.<\/p>","title":"Introduction to ML & DL"},{"content":"During HackUPC 2022, we worked on a time-series forecasting challenge (McKinsey case) to predict sales for multiple product groups. The goal was to support stock planning and cost reduction by generating accurate short-term forecasts from historical sales signals and product information.\nProblem Retail sales exhibit trend shifts, volatility, and category-specific dynamics. In the provided data, several product groups show long periods of low activity followed by abrupt regime changes and sustained growth. 
This makes purely linear\/statistical modeling brittle unless carefully tuned per category.\nWe framed the task as: predict sales for a target date given product group, price, and previous sales history, and benchmarked classical forecasting against deep sequence modeling.\nConstraints Hackathon timebox: we needed a working, validated pipeline quickly. Heterogeneous categories: each product group behaves differently (scale, spikes, growth rate). Windowing decisions: defining lookback length and supervision setup was a key difficulty (sequence-to-one forecasting). Approach Pipeline \/ Architecture Data understanding &amp; EDA\nVisualized sales trajectories per product group to detect scale differences and regime changes. Preprocessing\nAggregation by product group. We simplified the problem, since many fine-grained patterns lay underneath the data. Train\/test split consistent with time-series forecasting (no random shuffling). Normalization to stabilize training across categories. Sales in each group of products.\nModeling\nARIMA per group (baseline) RNN\/LSTM with configurable lookback window (deep model) Evaluation\nMetric: MSE Qualitative inspection via predicted-vs-real plots per category. Export\nGenerated predictions and saved outputs for submission (e.g., response.csv in the reference implementation). Modelling We implemented two complementary forecasting tracks:\n1) ARIMA We used ARIMA as a fast, interpretable baseline to establish \u201cminimum viable\u201d performance and to expose failure modes (e.g., sensitivity to abrupt changes). In our internal evaluation, the ARIMA approach achieved MSE \u2248 109.3 on the test setting we reported. 
ARIMA provided a simple baseline but behaved like a &ldquo;recent-pattern replicator&rdquo;, struggling to generalize when the underlying dynamics shifted.\nARIMA forecasting on the test set.\n2) RNN\/LSTM forecaster We trained an RNN\/LSTM-based model using sliding windows over historical sales (and available signals such as price\/category), optimized for MSE. This model handled non-linear dynamics and regime shifts better in our experiments, reaching MSE \u2248 0.02 in the reported test evaluation.\nRNN forecasting on the test set.\nResults The ARIMA baseline struggled to fully track abrupt changes and complex dynamics in some categories (reported MSE 109.3). The RNN\/LSTM produced substantially tighter fits in our reported evaluation (reported MSE 0.02) and visually tracked the series more closely across time. What I\u2019d improve next Walk-forward validation (rolling-origin evaluation) to reduce the risk of optimistic splits. A single model that learns across all groups. Instead of training separate small models (or heavily category-tuned configurations), train a single global RNN that learns shared temporal patterns across categories and uses category embeddings (and other metadata) to specialize per group. Probabilistic forecasts (prediction intervals) for stock decisions, not just point estimates. Exogenous drivers (promotions, holidays, weather, store signals) if available. Hierarchical forecasting: enforce coherence between product-level and group-level totals. Modern sequence models (Temporal CNNs or Transformers) as an upgrade path beyond RNNs. Try architectures that are typically stronger and easier to scale for forecasting, or modern forecasting libraries\/models designed for heterogeneous series. 
","permalink":"https:\/\/oriolac.github.io\/projects\/sales-prediction-mckinsey\/","summary":"Time-series sales forecasting project combining statistical baselines (ARIMA) with an RNN\/LSTM model to predict sales by product category","title":"Sales Forecasting with ARIMA + RNN"},{"content":"Overview Introduction to homomorphic encryption and its applications in privacy-preserving computation. The talk explores the mathematical foundations, practical implementations, and real-world use cases for computing on encrypted data.\nKey Topics Fundamentals of homomorphic encryption Different schemes (partial, somewhat, fully homomorphic) Performance considerations and limitations Use cases in finance, healthcare, and cloud computing Event TechMeeting is a technical meetup in Lleida focused on emerging technologies and their practical applications.\n","permalink":"https:\/\/oriolac.github.io\/talks\/techmeeting-homomorphic\/","summary":"<h2 id=\"overview\">Overview<\/h2>\n<p>Introduction to homomorphic encryption and its applications in privacy-preserving computation. 
The talk explores the mathematical foundations, practical implementations, and real-world use cases for computing on encrypted data.<\/p>\n<h2 id=\"key-topics\">Key Topics<\/h2>\n<ul>\n<li>Fundamentals of homomorphic encryption<\/li>\n<li>Different schemes (partial, somewhat, fully homomorphic)<\/li>\n<li>Performance considerations and limitations<\/li>\n<li>Use cases in finance, healthcare, and cloud computing<\/li>\n<\/ul>\n<h2 id=\"event\">Event<\/h2>\n<p>TechMeeting is a technical meetup in Lleida focused on emerging technologies and their practical applications.<\/p>","title":"Homomorphic Encryption"},{"content":"Whether it&rsquo;s about a post, a project collaboration, a research question, or just to say hi; the best way to reach me is by email:\noriolalascercos@gmail.com\nElsewhere You can also find me on:\nGitHub \u2014 github.com\/oriolac LinkedIn \u2014 linkedin.com\/in\/oriolac Google Scholar \u2014 scholar.google.com Substack \u2014 oriolac.substack.com I&rsquo;ll do my best to reply within a few days.\n","permalink":"https:\/\/oriolac.github.io\/contact\/","summary":"<p>Whether it&rsquo;s about a post, a project collaboration, a research question, or just to say hi; the best way to reach me is\nby email:<\/p>\n<p><strong><a href=\"mailto:oriolalascercos@gmail.com\">oriolalascercos@gmail.com<\/a><\/strong><\/p>\n<h2 id=\"elsewhere\">Elsewhere<\/h2>\n<p>You can also find me on:<\/p>\n<ul>\n<li><strong>GitHub<\/strong> \u2014 <a href=\"https:\/\/github.com\/oriolac\/\" target=\"_blank\" rel=\"noopener\">github.com\/oriolac<\/a><\/li>\n<li><strong>LinkedIn<\/strong> \u2014 <a href=\"https:\/\/www.linkedin.com\/in\/oriolac\/\" target=\"_blank\" rel=\"noopener\">linkedin.com\/in\/oriolac<\/a><\/li>\n<li><strong>Google Scholar<\/strong> \u2014 <a href=\"https:\/\/scholar.google.com\/citations?user=UeUC0gEAAAAJ\" target=\"_blank\" rel=\"noopener\">scholar.google.com<\/a><\/li>\n<li><strong>Substack<\/strong> \u2014 <a 
href=\"https:\/\/oriolac.substack.com\/subscribe\" target=\"_blank\" rel=\"noopener\">oriolac.substack.com<\/a><\/li>\n<\/ul>\n<p>I&rsquo;ll do my best to reply within a few days.<\/p>","title":"Contact"},{"content":"Overview This site is a personal blog. It uses analytics tools to understand how content is being read, and cookies to remember your preferences. This page explains what data is collected, why, and how you can control it.\nCookies A cookie is a small text file stored in your browser. This site uses the following cookies:\nCookie Purpose Duration cookie-consent Stores your analytics consent choice (accepted\/rejected) 365 days _ga, _ga_* Google Analytics \u2014 only set if you accept analytics cookies 2 years No cookies are set before you make a choice in the banner, except for cookie-consent itself once you click Accept or Decline.\nAnalytics Plausible Analytics (always active) This site uses Plausible Analytics, a privacy-friendly, cookieless analytics tool. Plausible:\nDoes not use cookies or any persistent identifiers. Does not collect personal data or IP addresses. Is GDPR, CCPA, and PECR compliant by design. Plausible is loaded on every page visit regardless of your cookie consent choice because it does not require it.\nGoogle Analytics (consent-gated) If you click Accept in the cookie banner, Google Analytics (via Google Tag Manager) is also loaded. It collects anonymous usage data (pages visited, session duration, approximate location) to help understand site traffic. IP anonymisation is enabled (anonymize_ip: true).\nIf you click Decline, Google Analytics is never loaded during your visit and no tracking cookies are set.\nYou can change your choice at any time by clearing your browser cookies for this site, which will show the consent banner again on your next visit.\nThird-party embeds Some posts may include embedded content (e.g., YouTube videos, GitHub Gists). 
These embeds may set their own cookies when you interact with them, subject to the respective third party&rsquo;s privacy policy.\nYour rights Under GDPR and equivalent regulations, you have the right to:\nKnow what data is collected about you. Withdraw consent for optional analytics at any time (clear site cookies to reset your choice). Contact me with any questions. Contact If you have any questions about this policy, you can reach me via the social links on the home page.\nLast updated: May 2026\n","permalink":"https:\/\/oriolac.github.io\/privacy-policy\/","summary":"<h2 id=\"overview\">Overview<\/h2>\n<p>This site is a personal blog. It uses analytics tools to understand how content is being read, and cookies to remember your preferences. This page explains what data is collected, why, and how you can control it.<\/p>\n<h2 id=\"cookies\">Cookies<\/h2>\n<p>A cookie is a small text file stored in your browser. This site uses the following cookies:<\/p>\n<table>\n  <thead>\n      <tr>\n          <th>Cookie<\/th>\n          <th>Purpose<\/th>\n          <th>Duration<\/th>\n      <\/tr>\n  <\/thead>\n  <tbody>\n      <tr>\n          <td><code>cookie-consent<\/code><\/td>\n          <td>Stores your analytics consent choice (accepted\/rejected)<\/td>\n          <td>365 days<\/td>\n      <\/tr>\n      <tr>\n          <td><code>_ga<\/code>, <code>_ga_*<\/code><\/td>\n          <td>Google Analytics \u2014 only set if you accept analytics cookies<\/td>\n          <td>2 years<\/td>\n      <\/tr>\n  <\/tbody>\n<\/table>\n<p>No cookies are set before you make a choice in the banner, except for <code>cookie-consent<\/code> itself once you click Accept or Decline.<\/p>","title":"Privacy & Cookie Policy"}]