{"version":"https:\/\/jsonfeed.org\/version\/1","title":"eujing.github.io","home_page_url":"https:\/\/eujing.github.io\/","feed_url":"https:\/\/eujing.github.io\/feed.json","description":"My blog mainly about data science projects I work on","icon":"https:\/\/eujing.github.io\/apple-touch-icon.png","favicon":"https:\/\/eujing.github.io\/favicon.ico","expired":false,"author":{"name":"Eu Jing Chua","url":null,"avatar":null},"items":[{"id":"https:\/\/eujing.github.io\/2020\/11\/08\/graph-layout","title":"Graph Layouts with Gradient Descent","summary":"Optimizing graph layouts with PyTorch, and visualizations of the process","content_text":"Graphs, Embeddings, LayoutsRecently, I got asked a pretty interesting modeling question.Given an undirected graph \\(G = (V, E)\\) with \\(|V| = N\\) vertices, can I produce embeddings of dimension \\(d\\) for each vertex such that:  Adjacent vertices have more similar embeddings  Non-adjacent vertices have more different embeddingsI was quite thrown off by the mentions of graphs and embeddings at first, thinking of something like a Graph Convolutional Network.However in this case, things are much simpler and there are no features associated with each vertex.Embeddings are points in some d-dimensional space, but vertices and edges are abstract relationships with no inherent ties to coordinates.This is actually a case of a graph layout algorithm!What we really want is to find a nice assignment of coordinates to each vertex that satisfies the above conditions.For example if \\(d = 2\\), then this would be like a task of trying to draw out the graph on a flat surface.I think force-directed graph drawing is a really cool approach to generating graph layouts.The graph is modeled with spring-like forces of attraction and charge-like forces of repulsion depending on adjacency.It is then run in a physical simulation to let the forces come to an equilibrium.These usually produce good-looking layouts, but could get stuck in sub-optimal 
layouts depending on the random initialization of positions.In this particular case, the challenge was to formulate this as an optimization problem and solve it with something like PyTorch.The same autograd feature that makes training of deep networks declarative also makes it easy to use many flavors of gradient descent to try solve any differentiable optimization problem.SetupIn order to use PyTorch to solve this, we need to define 3 main aspects of the optimization problem:  Parameters  Target quantity  LossParametersParameters refer to the set of unknown variables we are trying to estimate, and search for through optimization like gradient descent.In this case, the unknown variables we are concerned with are the coordinates for each vertex in the graph.We shall denote the \\(N\\) unknown d-dimensional coordinates as the matrix \\(\\mathbf{X} \\in \\mathbb{R}^{N \\times d}\\).With our gradient descent approach, we need an initial guess to the parameters to start optimizing from.A reasonable way would be to just sample some standard normals as the initial guess:X = torch.randn((N, d), requires_grad=True)The reason why we need requires_grad is this tells PyTorch to keep track of the computations starting from \\(X\\) so it can perform autograd and back-propagate to our parameters \\(X\\).Target QuantityI think there is a lot of flexibility in the target quantity, depending on how we want to solve this.I really like the physical-based approach of attraction and repulsion, and so went with that.The central quantity to this would then be the pairwise relative distance between each vertex in the graph.Once again, there could be various measures of distance like Manhattan distance, Cosine similarity, etc.But sticking to the theme of physical approaches, I chose Euclidean distance (squared).We can calculate pairwise (squared) Euclidean distances as a \\(D \\in \\mathbb{R}^{N \\times N}\\) matrix as so:diffs = X.view(1, N, d) - X.view(N, 1, d)D = (diffs**2).sum(dim=2)The 
above uses broadcasting to create an intermediate \\(N \\times N \\times d\\) tensor of the differences, \\(\\text{diffs}_{ijk} = x_{j,k} - x_{i,k}\\).Then the pairwise squared Euclidean distance is \\(D_{ij} = \\sum_{k=1}^d \\text{diffs}_{ijk}^2\\).LossThe loss is the quantity we want to minimize in our optimization.It is the one number we use to judge how good the current parameter estimates are, where lower is better.Sticking to the theme, I chose to break down the loss into 3 components:  Attraction: \\(L_{att} \\propto \\sum_{i, j \\in Adj} D_{ij}\\)  Repulsion: \\(L_{rep} \\propto \\sum_{i , j \\not \\in Adj} \\frac{1}{D_{ij}}\\)  Centralization: \\(L_{cent} \\propto \\sum_{i} \\|x_i\\|_2^2 = \\|X\\|_F^2\\)The first two describe the forces of attraction and repulsion, which we can say are proportional to squared distances and inverse-squared distances respectively.The third acts as a prior for our parameters, in that it reflects how I am going to prefer coordinates that are closer to the origin and penalize coordinates that are further away.This helps to address the translation invariance in possible solutions.The final loss is then \\(L = \\lambda_{att} L_{att} + \\lambda_{rep} L_{rep} + \\lambda_{cent} L_{cent}\\), which lets us weight the different components separately.eps = 1e-6attract = (adj * D).sum()repel = ((1 - adj) * (1 \/ (D + eps))).sum()cent = (X**2).sum()loss = lamb_att * attract + lamb_rep * repel + lamb_cent * centResultsPutting all these together, we now have defined our optimization problem in PyTorch!We can now sit back and let the autograd do the rest of the work for us:X = torch.randn((N, d), requires_grad=True)opt = torch.optim.Adam([X], lr=0.001)for i in range(10000):    opt.zero_grad()    # Calculate squared distances    D = ...    # Calculate loss    loss = ...    
# Back-propagate and optimize    loss.backward()    opt.step()Taking a step back, let\u2019s analyze what we are doing at a high level.In a force-directed approach, we are simulating a physical system of springs and charges over small timesteps and then seeing where it comes to rest.In this setup, we are instead directly searching over the parameter space for configurations of least potential energy.Our method of searching happens to be gradient descent with some learning rates.Maybe some analogies could be drawn between how we either take derivatives wrt time or space.However, the extra features of momentum and adaptive learning rates with optimizers like Adam make it less like a true simulation and more of an optimization.When visualized, it does happen to look like a physical simulation!However as usual with gradient descent in non-convex optimization, we are not guaranteed to find the global optimal solution.Here is an example of the same graph with different random initializations, converging to different solutions.
As you can see, the solution on the right has a higher loss at the end, is kind of weird, and is just not a result that we would expect from this graph.ConclusionI thought it was quite interesting to use autograd in PyTorch to solve optimization problems outside of machine learning.Also, half of the time was actually spent generating these visualizations in matplotlib, recording them and figuring out how to make GIFs.But also more could definitely be done to improve the aesthetic results of these embeddings.For example, running this on complete graphs would just collapse all the embeddings to the origin as every vertex is every other vertex\u2019s neighbor.Maybe this could be solved by adding minimum distance constraints between vertices to our optimization, then converting it back into an unconstrained problem using Lagrange Multipliers.","content_html":"<h2 id=\"graphs-embeddings-layouts\">Graphs, Embeddings, Layouts<\/h2><p>Recently, I got asked a pretty interesting modeling question.Given an undirected graph \\(G = (V, E)\\) with \\(|V| = N\\) vertices, can I produce embeddings of dimension \\(d\\) for each vertex such that:<\/p><ol>  <li>Adjacent vertices have more similar embeddings<\/li>  <li>Non-adjacent vertices have more different embeddings<\/li><\/ol><p>I was quite thrown off by the mentions of graphs and embeddings at first, thinking of something like a Graph Convolutional Network.However in this case, things are much simpler and there are no features associated with each vertex.<\/p><p>Embeddings are points in some d-dimensional space, but vertices and edges are abstract relationships with no inherent ties to coordinates.This is actually a case of a graph layout algorithm!What we really want is to find a nice assignment of coordinates to each vertex that satisfies the above conditions.For example if \\(d = 2\\), then this would be like a task of trying to draw out the graph on a flat surface.<\/p><p><img src=\"\/assets\/graph-layout-final.png\" 
alt=\"Example of a graph layout\" \/><\/p><p>I think <a href=\"https:\/\/en.wikipedia.org\/wiki\/Force-directed_graph_drawing\">force-directed graph drawing<\/a> is a really cool approach to generating graph layouts.The graph is modeled with spring-like forces of attraction and charge-like forces of repulsion depending on adjacency.It is then run in a physical simulation to let the forces come to an equilibrium.These usually produce good-looking layouts, but could get stuck in sub-optimal layouts depending on the random initialization of positions.<\/p><p>In this particular case, the challenge was to formulate this as an optimization problem and solve it with something like PyTorch.The same autograd feature that makes training of deep networks declarative also makes it easy to use many flavors of gradient descent to try solve any differentiable optimization problem.<\/p><h2 id=\"setup\">Setup<\/h2><p>In order to use PyTorch to solve this, we need to define 3 main aspects of the optimization problem:<\/p><ol>  <li>Parameters<\/li>  <li>Target quantity<\/li>  <li>Loss<\/li><\/ol><h3 id=\"parameters\">Parameters<\/h3><p>Parameters refer to the set of unknown variables we are trying to estimate, and search for through optimization like gradient descent.In this case, the unknown variables we are concerned with are the <strong>coordinates<\/strong> for each vertex in the graph.We shall denote the \\(N\\) unknown d-dimensional coordinates as the matrix \\(\\mathbf{X} \\in \\mathbb{R}^{N \\times d}\\).<\/p><p>With our gradient descent approach, we need an initial guess to the parameters to start optimizing from.A reasonable way would be to just sample some standard normals as the initial guess:<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">torch<\/span><span class=\"p\">.<\/span><span class=\"n\">randn<\/span><span 
class=\"p\">((<\/span><span class=\"n\">N<\/span><span class=\"p\">,<\/span> <span class=\"n\">d<\/span><span class=\"p\">),<\/span> <span class=\"n\">requires_grad<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/div><\/div><p>The reason why we need <code class=\"language-plaintext highlighter-rouge\">requires_grad<\/code> is this tells PyTorch to keep track of the computations starting from \\(X\\) so it can perform autograd and back-propagate to our parameters \\(X\\).<\/p><h3 id=\"target-quantity\">Target Quantity<\/h3><p>I think there is a lot of flexibility in the target quantity, depending on how we want to solve this.I really like the physical-based approach of attraction and repulsion, and so went with that.The central quantity to this would then be the <strong>pairwise relative distance<\/strong> between each vertex in the graph.Once again, there could be various measures of distance like Manhattan distance, Cosine similarity, etc.But sticking to the theme of physical approaches, I chose Euclidean distance (squared).We can calculate pairwise (squared) Euclidean distances as a \\(D \\in \\mathbb{R}^{N \\times N}\\) matrix as so:<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">diffs<\/span> <span class=\"o\">=<\/span> <span class=\"n\">X<\/span><span class=\"p\">.<\/span><span class=\"n\">view<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">N<\/span><span class=\"p\">,<\/span> <span class=\"n\">d<\/span><span class=\"p\">)<\/span> <span class=\"o\">-<\/span> <span class=\"n\">X<\/span><span class=\"p\">.<\/span><span class=\"n\">view<\/span><span class=\"p\">(<\/span><span class=\"n\">N<\/span><span class=\"p\">,<\/span> <span class=\"mi\">1<\/span><span class=\"p\">,<\/span> <span class=\"n\">d<\/span><span class=\"p\">)<\/span><span class=\"n\">D<\/span> 
<span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">diffs<\/span><span class=\"o\">**<\/span><span class=\"mi\">2<\/span><span class=\"p\">).<\/span><span class=\"nb\">sum<\/span><span class=\"p\">(<\/span><span class=\"n\">dim<\/span><span class=\"o\">=<\/span><span class=\"mi\">2<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/div><\/div><p>The above uses broadcasting to create an intermediate \\(N \\times N \\times d\\) tensor of the differences, \\(\\text{diffs}_{ijk} = x_{j,k} - x_{i,k}\\).Then the pairwise squared Euclidean distance is \\(D_{ij} = \\sum_{k=1}^d \\text{diffs}_{ijk}^2\\).<\/p><h3 id=\"loss\">Loss<\/h3><p>The loss is the quantity we want to minimize in our optimization.It is the one number we use to judge how good the current parameter estimates are, where lower is better.Sticking to the theme, I chose to break down the loss into 3 components:<\/p><ol>  <li><strong>Attraction<\/strong>: \\(L_{att} \\propto \\sum_{i, j \\in Adj} D_{ij}\\)<\/li>  <li><strong>Repulsion<\/strong>: \\(L_{rep} \\propto \\sum_{i , j \\not \\in Adj} \\frac{1}{D_{ij}}\\)<\/li>  <li><strong>Centralization<\/strong>: \\(L_{cent} \\propto \\sum_{i} \\|x_i\\|_2^2 = \\|X\\|_F^2\\)<\/li><\/ol><p>The first two describe the forces of attraction and repulsion, which we can say are proportional to squared distances and inverse-squared distances respectively.The third acts as a prior for our parameters, in that it reflects how I am going to prefer coordinates that are closer to the origin and penalize coordinates that are further away.This helps to address the translation invariance in possible solutions.The final loss is then \\(L = \\lambda_{att} L_{att} + \\lambda_{rep} L_{rep} + \\lambda_{cent} L_{cent}\\), which lets us weight the different components separately.<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">eps<\/span> <span class=\"o\">=<\/span> <span 
class=\"mf\">1e-6<\/span><span class=\"n\">attract<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">adj<\/span> <span class=\"o\">*<\/span> <span class=\"n\">D<\/span><span class=\"p\">).<\/span><span class=\"nb\">sum<\/span><span class=\"p\">()<\/span><span class=\"n\">repel<\/span> <span class=\"o\">=<\/span> <span class=\"p\">((<\/span><span class=\"mi\">1<\/span> <span class=\"o\">-<\/span> <span class=\"n\">adj<\/span><span class=\"p\">)<\/span> <span class=\"o\">*<\/span> <span class=\"p\">(<\/span><span class=\"mi\">1<\/span> <span class=\"o\">\/<\/span> <span class=\"p\">(<\/span><span class=\"n\">D<\/span> <span class=\"o\">+<\/span> <span class=\"n\">eps<\/span><span class=\"p\">))).<\/span><span class=\"nb\">sum<\/span><span class=\"p\">()<\/span><span class=\"n\">cent<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">X<\/span><span class=\"o\">**<\/span><span class=\"mi\">2<\/span><span class=\"p\">).<\/span><span class=\"nb\">sum<\/span><span class=\"p\">()<\/span><span class=\"n\">loss<\/span> <span class=\"o\">=<\/span> <span class=\"n\">lamb_att<\/span> <span class=\"o\">*<\/span> <span class=\"n\">attract<\/span> <span class=\"o\">+<\/span> <span class=\"n\">lamb_rep<\/span> <span class=\"o\">*<\/span> <span class=\"n\">repel<\/span> <span class=\"o\">+<\/span> <span class=\"n\">lamb_cent<\/span> <span class=\"o\">*<\/span> <span class=\"n\">cent<\/span><\/code><\/pre><\/div><\/div><h2 id=\"results\">Results<\/h2><p>Putting all these together, we now have defined our optimization problem in PyTorch!We can now sit back and let the autograd do the rest of the work for us:<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">X<\/span> <span class=\"o\">=<\/span> <span class=\"n\">torch<\/span><span class=\"p\">.<\/span><span class=\"n\">randn<\/span><span class=\"p\">((<\/span><span class=\"n\">N<\/span><span 
class=\"p\">,<\/span> <span class=\"n\">d<\/span><span class=\"p\">),<\/span> <span class=\"n\">requires_grad<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><span class=\"n\">opt<\/span> <span class=\"o\">=<\/span> <span class=\"n\">torch<\/span><span class=\"p\">.<\/span><span class=\"n\">optim<\/span><span class=\"p\">.<\/span><span class=\"n\">Adam<\/span><span class=\"p\">([<\/span><span class=\"n\">X<\/span><span class=\"p\">],<\/span> <span class=\"n\">lr<\/span><span class=\"o\">=<\/span><span class=\"mf\">0.001<\/span><span class=\"p\">)<\/span><span class=\"k\">for<\/span> <span class=\"n\">i<\/span> <span class=\"ow\">in<\/span> <span class=\"nb\">range<\/span><span class=\"p\">(<\/span><span class=\"mi\">10000<\/span><span class=\"p\">):<\/span>    <span class=\"n\">opt<\/span><span class=\"p\">.<\/span><span class=\"n\">zero_grad<\/span><span class=\"p\">()<\/span>    <span class=\"c1\"># Calculate squared distances<\/span>    <span class=\"n\">D<\/span> <span class=\"o\">=<\/span> <span class=\"p\">...<\/span>    <span class=\"c1\"># Calculate loss<\/span>    <span class=\"n\">loss<\/span> <span class=\"o\">=<\/span> <span class=\"p\">...<\/span>    <span class=\"c1\"># Back-propagate and optimize<\/span>    <span class=\"n\">loss<\/span><span class=\"p\">.<\/span><span class=\"n\">backward<\/span><span class=\"p\">()<\/span>    <span class=\"n\">opt<\/span><span class=\"p\">.<\/span><span class=\"n\">step<\/span><span class=\"p\">()<\/span><\/code><\/pre><\/div><\/div><p>Taking a step back, let\u2019s analyze what we are doing at a high level.In a force-directed approach, we are simulating a physical system of springs and charges over small timesteps and then seeing where it comes to rest.In this setup, we are instead directly searching over the parameter space for configurations of least potential energy.Our method of searching happens to be gradient descent with some learning rates.<\/p><p>Maybe some analogies 
could be drawn between how we either take derivatives wrt time or space.However, the extra features of momentum and adaptive learning rates with optimizers like Adam make it less like a true simulation and more of an optimization.<\/p><p>When visualized, it does happen to look like a physical simulation!<\/p><p><img src=\"\/assets\/graph-layout-1.gif\" alt=\"\" \/><img src=\"\/assets\/graph-layout-2.gif\" alt=\"\" \/><\/p><p>However as usual with gradient descent in non-convex optimization, we are not guaranteed to find the global optimal solution.Here is an example of the same graph with different random initializations, converging to different solutions.<\/p><figure>  <div style=\"display:flex\">              <div style=\"flex:1.5; padding:0 3% 0 0\">              <img src=\"\/assets\/graph-layout-2.gif\" alt=\"Larger\" \/>          <\/div>                    <div style=\"flex:1.5\">              <img src=\"\/assets\/graph-layout-bad.gif\" alt=\"Bad Larger\" \/>          <\/div>          <\/div>      <div style=\"display:flex\">            <\/div>  <\/figure><p>As you can see, the solution on the right has a higher loss at the end, is kind of weird, and is just not a result that we would expect from this graph.<\/p><h2 id=\"conclusion\">Conclusion<\/h2><p>I thought it was quite interesting to use autograd in PyTorch to solve optimization problems outside of machine learning.Also, half of the time was actually spent generating these visualizations in matplotlib, recording them and figuring out how to make GIFs.<\/p><p>But also more could definitely be done to improve the aesthetic results of these embeddings.For example, running this on complete graphs would just collapse all the embeddings to the origin as every vertex is every other vertex\u2019s neighbor.Maybe this could be solved by adding minimum distance constraints between vertices to our optimization, then converting it back into an unconstrained problem using Lagrange 
Multipliers.<\/p>","url":"https:\/\/eujing.github.io\/2020\/11\/08\/graph-layout","tags":["pytorch","optimization","visualization","graphs"],"date_published":"2020-11-08T00:00:00+00:00","date_modified":"2020-11-08T00:00:00+00:00","author":{"name":"Eu Jing Chua","url":null,"avatar":null}},{"id":"https:\/\/eujing.github.io\/2020\/09\/28\/memory-issues","title":"Issue Generation & Memory Utilization","summary":"Notes about the hurdles faced in developing a migration tool and how it evolved","content_text":"What are Issues?In the previous post, I wrote about issues, what they represented and their potential impact on modeling.To recap, a timeseries signal with time \\(t\\) for location \\(s\\) typically has values \\(x_{s,t}\\).However when we account for phenomena like back-fill, each of these values can have multiple issues \/ versions, which would then be more like \\(x_{s,t,u}, u \\ge t\\) where \\(u\\) is another dimension of time that describes the modification date.By keeping track of issues, we can have nice features like as-of views of the data.So tracking issues is great, but what if we did not do that from the start?We could start tracking issues for newly-ingested data, but existing data would require some sort of migration.This post aims to document some of the (mainly memory) hurdles I came across when trying to develop a tool to assist in such a migration.Context: What Do We Have?It was fortunate that at Delphi, daily SQL database backups were stored which allowed us to approximately reconstruct issues for existing data.The main idea is:  For some timeseries \\(x_{s,t}\\), we have backups of it at times \\(u-1\\) and \\(u\\).  Find the differences from backup \\(u-1\\) to \\(u\\).  If \\(x_{s,t}\\) is identical between both backups, then there probably was no new issue of it.  
If \\(x_{s,t}\\) differs between both backups, then there probably was an issue \\(x_{s,t,u}\\) with value from the latter backup.We can keep repeating this process between adjacent days of backups to produce our issues!The problem was that each of these backups were large, compressed text files (in SQL INSERT INTO format) that got bigger the more recent they got.There were several months of backup files to churn through, and no readily available clusters to run Hadoop or Spark on (at that time).The migration tool eventually managed to process all the backup files within a day on a single remote server with a few cores and 64 GB of RAM, but not without some extensive tuning!Here are some of the things I learnt while working on this tool.Stick to IteratorsOne of the steps in the tool\u2019s pipeline was to partition large CSV files into smaller subsets based on a particular categorical column, signal source.Such partitioning allows us to process the data at a more granular level, controlling how many subsets we want to process in parallel or which subsets to omit totally.A simplified version of this task involves just filtering down a CSV by a value for a specific column.A common list-based approach might look like this:def filter_csv_list(    csv_in: str, csv_out: str,    col_idx: int, col_val: Any):    with open(csv_in, \"r\") as f_in:        with open(csv_out, \"w\") as f_out:             lines = f_in.readlines()            for line in lines:                value = line.split(\",\")[col_idx]                if value == col_val:                    f_out.write(line)If we profile the memory usage on a ~200 MB CSV file, we get:%memit filter_csv_list(\"data\/cc-est2018-alldata.csv\", \"data\/cc_list.csv\", 1, \"Pennsylvania\")&gt;&gt;&gt; peak memory: 254.37 MiB, increment: 205.34 MiBWhat we are focusing on is the increment, or how much memory this particular line used.It seems like filtering this ~200 MB CSV file required ~200 MB of memory, but can we do better?def 
filter_csv_iter(    csv_in: str, csv_out: str,    col_idx: int, col_val: Any):    with open(csv_in, \"r\") as f_in:        with open(csv_out, \"w\") as f_out:             for line in f_in:                value = line.split(\",\")[col_idx]                if value == col_val:                    f_out.write(line)The change is minor. We omit using readlines(), a commonly used function that returns a list of lines, and directly use the file iterator f_in instead, which returns an iterator of lines. Profiling this similarly, we get:%memit filter_csv_iter(\"data\/cc-est2018-alldata.csv\", \"data\/cc_list.csv\", 1, \"Pennsylvania\")&gt;&gt;&gt; peak memory: 49.89 MiB, increment: 0.00 MiBNow increment is ~0 MB, a great difference!This sounds more like it, as we really do not need the whole file in memory to filter it down.We can immediately decide for each row whether to keep it or not.The change is also really subtle, as Python makes using either iterators or lists really seamless, which could be a good or bad thing.In the spirit of keeping memory utilization down, this meant really carefully sticking to using iterators as much as possible and really only loading data as it is needed.The same ideas were used for CSV partitioning here in the tool.Save memory by preferring generators, using the built-in itertools, and verifying you are using iterator-based functions!CSV DifferencingcsvdiffThe core of this tool performs differencing between CSV files (reformatted SQL backup files).Being the step with the bulk of the processing and memory utilization, optimizing this is of particular importance.For prototyping, I initially looked at existing CSV differencing tools to not re-invent the wheel if possible.csvdiff was great, and I ended up using it quite frequently in other areas when dealing with CSVs.It finds the indexed differences quickly and returns JSON describing adds, changes, and deletes like: (credit: csvdiff docs){  \"_index\": [ \"id\" ],  \"added\": [    { \"amount\": 
\"81\", \"id\": \"5\", \"name\": \"mira\" },    ...  ],  \"changed\": [    {      \"fields\": {        \"amount\": { \"from\": \"20\", \"to\": \"23\" }      },      \"key\": [\"1\"]    },    ...  ],  \"removed\": [    { \"amount\": \"63\", \"id\": \"2\", \"name\": \"eva\" },    ...  ]}This is a very detailed report and is great for parsing with tools like jq.However, we are really only interested in the adds and changes, and in particular the to value of the changes only.It was fast to create a prototype tool with csvdiff and it worked well on small partitions.However, it quickly exploded in memory utilization on larger partitions and frequently became a victim of the OOM killer.Why?Probably because the internal data structures and output JSON format of csvdiff are mainly Python ones like List and Dict.When the instance count of structures like List and Dict scale with the data set, we incur increasing memory overhead for the growing number of Python objects.PandasPandas is a great library for working with indexed data, and manages memory utilization well.I ended up implementing my own CSV differ in Pandas, which allowed me to perform further memory-usage optimizations elaborated on further on.To recap, values are roughly indexed in each backup by the signal name \\(x\\), time \\(t\\) and location \\(s\\) to give rise to values \\(x_{s,t}\\).Between the backups for times \\(u-1\\) and \\(u\\), we are only interested in added values (new times or locations) or changed values (back-filled values) in backup \\(u\\), which gives rise to issue values \\(x_{s,t,u}\\).In order to just find the added and changed values, we can do an index-aligned data-frame comparison to select such rows with something like:def pd_csvdiff(before_csv, after_csv, index_cols):    # Load data and set indices    df_before = pd.read_csv(before_csv, ...)    df_after = pd.read_csv(after_csv, ...)    
df_before.set_index(index_cols, inplace=True)    df_after.set_index(index_cols, inplace=True)    ...    # Align and compare    same_mask = (df_before.reindex(df_after.index) == df_after)    is_diff = ~(same_mask.all(axis=1))    # Extract the different rows only    return df_after.loc[is_diff, :]Now why did we bother recreating this functionality in Pandas?Because Pandas provides good control over how we want to store the data in memory, in the form of column dtypes.Specifically, huge memory savings came from representing categorical columns as the Pandas categorical type instead of individual Python objects (strings in this case).For example, this toy example shows ~60x less memory usage just by using dtype=category:colors = np.random.choice([\"red\", \"blue\", \"yellow\"], 1000, replace=True)s1 = pd.Series(colors)s2 = pd.Series(colors, dtype=\"category\")print(f\"Bytes used: {s1.memory_usage(deep=True)} B\")print(f\"Bytes used: {s2.memory_usage(deep=True)} B\")&gt;&gt;&gt; Bytes used: 613560 B    Bytes used: 10392 BWhen using categoricals, Pandas only has to store small integers for each entry, along with a mapping going from these small integers to the original object or string.Instead of storing many copies of duplicate strings, we only have to store the unique ones in this mapping.Thus we save on the number of Python objects we have in memory.This was especially useful for representing columns like signal name, state, county, etc.Edge ConditionsFirst, missing values are present in the data as NaN values.However, a naive comparison of NaN == NaN is actually false-y by convention.In our case, this actually represents a missing value before that is still missing, so we do not want these to flag up as changed values.Secondly, there could be a mis-match of category values.The category values are inferred from the individual CSVs, and new values may appear across time.Pandas has to know about the complete set of values when doing comparisons between categoricals, 
otherwise we are effectively comparing \u201cdifferent\u201d sets.We have need to union the category values together before comparisons to prevent such errors.The full implementation along with handling for these edge cases can be found hereMultiprocessing: Setup &amp; ChallengesBy now, I had nicely partitioned up the data and differencing within these partitions no longer caused memory utilization to explode.There were many partitions to churn through, but we were doing them one at a time with cores and memory to spare.I took the simple route of parallelizing this tool by using Python\u2019s built-in multiprocessing module, specifically Pool.starmap, which very easily lets us map a function of multiple arguments across a list of arguments, in a parallel manner!# Seriallysplitted = starmap(split_csv_by_col, split_args)# In parallelwith Pool(ncpu) as pool:    splitted = pool.starmap(split_csv_by_col, split_args)However, it turns out that processing some partitions required much more memory than others due to the nature of the signals.Having too large of a ncpu value would still result in OOM kills when too many of these large partitions got processed together.The simple approach I took to fixing this was to more granular ncpu settings for each stage of the pipeline, having large ncpu for stages like CSV partitioning that used very little memory, and smaller ncpu counts for memory-intensive stages like CSV differencing.This was probably not the most efficient way of tackling the issue, but I think it was an acceptable trade-off between being able to kick-start tool on the backups, and spending more time developing a better solution.In hindsight, I would have probably tried to implement a simple scheduler that estimated the memory needed for each partition from file sizes and then scheduled more small partitions together.I also came across deadlock problems when trying using logging with multiprocessing, but managed to solve it with the multiprocessing_logging 
package. See here for more details on how I incorporated all these with the rest of the pipeline. Conclusion I learnt a lot about processing large data, optimizing for memory usage and multiprocessing in the development of this tool. It was made even harder as I could only test it on small sample back-ups and did not have access to the real server it would run on! I am glad it was able to churn through all the backups in the end to perform the migration, after several days of optimization and debugging. I tried to include detailed documentation in the tool\u2019s source code. It is my hope that if you are tackling a similar problem without some Hadoop or Spark cluster, these reflections and documentation will come in useful.","content_html":"<h2 id=\"what-are-issues\">What are Issues?<\/h2><p>In the previous <a href=\"\/2020\/09\/01\/issues-migration\">post<\/a>, I wrote about issues, what they represented and their potential impact on modeling. To recap, a timeseries signal with time \\(t\\) for location \\(s\\) typically has values \\(x_{s,t}\\). However, when we account for phenomena like back-fill, each of these values can have multiple issues \/ versions, which would then be more like \\(x_{s,t,u}, u \\ge t\\) where \\(u\\) is another dimension of time that describes the modification date. By keeping track of issues, we can have nice features like as-of views of the data.<\/p><p>So tracking issues is great, but what if we did not do that from the start? We could start tracking issues for newly-ingested data, but existing data would require some sort of migration. This post aims to document some of the (mainly memory) hurdles I came across when trying to develop a tool to assist in such a migration.<\/p><h2 id=\"context-what-do-we-have\">Context: What Do We Have?<\/h2><p>It was fortunate that at Delphi, daily SQL database backups were stored which allowed us to approximately reconstruct issues for existing data. The main idea is:<\/p><ol>  <li>For some timeseries 
\\(x_{s,t}\\), we have backups of it at times \\(u-1\\) and \\(u\\).<\/li>  <li>Find the differences from backup \\(u-1\\) to \\(u\\).<\/li>  <li>If \\(x_{s,t}\\) is identical between both backups, then there probably was no new issue of it.<\/li>  <li>If \\(x_{s,t}\\) differs between both backups, then there probably was an issue \\(x_{s,t,u}\\) with value from the latter backup.<\/li><\/ol><p>We can keep repeating this process between adjacent days of backups to produce our issues! The problem was that these backups were <em>large<\/em>, compressed text files (in SQL <code class=\"language-plaintext highlighter-rouge\">INSERT INTO<\/code> format) that got bigger the more recent they got.<\/p><p>There were several months of backup files to churn through, and no readily available clusters to run Hadoop or Spark on (at that time). The migration tool eventually managed to process all the backup files within a day on a single remote server with a few cores and 64 GB of RAM, but not without some extensive tuning! Here are some of the things I learnt while working on this tool.<\/p><h2 id=\"stick-to-iterators\">Stick to Iterators<\/h2><p>One of the steps in the tool\u2019s pipeline was to partition large CSV files into smaller subsets based on a particular categorical column, signal source. Such partitioning allows us to process the data at a more granular level, controlling how many subsets we want to process in parallel or which subsets to omit totally.<\/p><p>A simplified version of this task involves just filtering down a CSV by a value for a specific column. A common list-based approach might look like this:<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"k\">def<\/span> <span class=\"nf\">filter_csv_list<\/span><span class=\"p\">(<\/span>    <span class=\"n\">csv_in<\/span><span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">,<\/span> <span 
class=\"n\">csv_out<\/span><span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">,<\/span>    <span class=\"n\">col_idx<\/span><span class=\"p\">:<\/span> <span class=\"nb\">int<\/span><span class=\"p\">,<\/span> <span class=\"n\">col_val<\/span><span class=\"p\">:<\/span> <span class=\"n\">Any<\/span><span class=\"p\">):<\/span>    <span class=\"k\">with<\/span> <span class=\"nb\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">csv_in<\/span><span class=\"p\">,<\/span> <span class=\"s\">\"r\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span class=\"n\">f_in<\/span><span class=\"p\">:<\/span>        <span class=\"k\">with<\/span> <span class=\"nb\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">csv_out<\/span><span class=\"p\">,<\/span> <span class=\"s\">\"w\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span class=\"n\">f_out<\/span><span class=\"p\">:<\/span>             <span class=\"n\">lines<\/span> <span class=\"o\">=<\/span> <span class=\"n\">f_in<\/span><span class=\"p\">.<\/span><span class=\"n\">readlines<\/span><span class=\"p\">()<\/span>            <span class=\"k\">for<\/span> <span class=\"n\">line<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">lines<\/span><span class=\"p\">:<\/span>                <span class=\"n\">value<\/span> <span class=\"o\">=<\/span> <span class=\"n\">line<\/span><span class=\"p\">.<\/span><span class=\"n\">split<\/span><span class=\"p\">(<\/span><span class=\"s\">\",\"<\/span><span class=\"p\">)[<\/span><span class=\"n\">col_idx<\/span><span class=\"p\">]<\/span>                <span class=\"k\">if<\/span> <span class=\"n\">value<\/span> <span class=\"o\">==<\/span> <span class=\"n\">col_val<\/span><span class=\"p\">:<\/span>                    <span class=\"n\">f_out<\/span><span class=\"p\">.<\/span><span class=\"n\">write<\/span><span class=\"p\">(<\/span><span class=\"n\">line<\/span><span 
class=\"p\">)<\/span><\/code><\/pre><\/div><\/div><p>If we profile the memory usage on a ~200 MB CSV file, we get:<\/p><div class=\"language-sh highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>%memit filter_csv_list<span class=\"o\">(<\/span><span class=\"s2\">\"data\/cc-est2018-alldata.csv\"<\/span>, <span class=\"s2\">\"data\/cc_list.csv\"<\/span>, 1, <span class=\"s2\">\"Pennsylvania\"<\/span><span class=\"o\">)<\/span><span class=\"o\">&gt;&gt;&gt;<\/span> peak memory: 254.37 MiB, increment: 205.34 MiB<\/code><\/pre><\/div><\/div><p>What we are focusing on is the <code class=\"language-plaintext highlighter-rouge\">increment<\/code>, or how much memory this particular line used.It seems like filtering this ~200 MB CSV file required ~200 MB of memory, but can we do better?<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"k\">def<\/span> <span class=\"nf\">filter_csv_iter<\/span><span class=\"p\">(<\/span>    <span class=\"n\">csv_in<\/span><span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">,<\/span> <span class=\"n\">csv_out<\/span><span class=\"p\">:<\/span> <span class=\"nb\">str<\/span><span class=\"p\">,<\/span>    <span class=\"n\">col_idx<\/span><span class=\"p\">:<\/span> <span class=\"nb\">int<\/span><span class=\"p\">,<\/span> <span class=\"n\">col_val<\/span><span class=\"p\">:<\/span> <span class=\"n\">Any<\/span><span class=\"p\">):<\/span>    <span class=\"k\">with<\/span> <span class=\"nb\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">csv_in<\/span><span class=\"p\">,<\/span> <span class=\"s\">\"r\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span class=\"n\">f_in<\/span><span class=\"p\">:<\/span>        <span class=\"k\">with<\/span> <span class=\"nb\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">csv_out<\/span><span class=\"p\">,<\/span> <span class=\"s\">\"w\"<\/span><span 
class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span class=\"n\">f_out<\/span><span class=\"p\">:<\/span>             <span class=\"k\">for<\/span> <span class=\"n\">line<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">f_in<\/span><span class=\"p\">:<\/span>                <span class=\"n\">value<\/span> <span class=\"o\">=<\/span> <span class=\"n\">line<\/span><span class=\"p\">.<\/span><span class=\"n\">split<\/span><span class=\"p\">(<\/span><span class=\"s\">\",\"<\/span><span class=\"p\">)[<\/span><span class=\"n\">col_idx<\/span><span class=\"p\">]<\/span>                <span class=\"k\">if<\/span> <span class=\"n\">value<\/span> <span class=\"o\">==<\/span> <span class=\"n\">col_val<\/span><span class=\"p\">:<\/span>                    <span class=\"n\">f_out<\/span><span class=\"p\">.<\/span><span class=\"n\">write<\/span><span class=\"p\">(<\/span><span class=\"n\">line<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/div><\/div><p>The change is minor. We omit using <code class=\"language-plaintext highlighter-rouge\">readlines()<\/code>, a commonly used function that returns a <strong>list<\/strong> of lines, and directly use the file iterator <code class=\"language-plaintext highlighter-rouge\">f_in<\/code> instead, which returns an <strong>iterator<\/strong> of lines. 
Profiling this similarly, we get:<\/p><div class=\"language-sh highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>%memit filter_csv_iter<span class=\"o\">(<\/span><span class=\"s2\">\"data\/cc-est2018-alldata.csv\"<\/span>, <span class=\"s2\">\"data\/cc_list.csv\"<\/span>, 1, <span class=\"s2\">\"Pennsylvania\"<\/span><span class=\"o\">)<\/span><span class=\"o\">&gt;&gt;&gt;<\/span> peak memory: 49.89 MiB, increment: 0.00 MiB<\/code><\/pre><\/div><\/div><p>Now <code class=\"language-plaintext highlighter-rouge\">increment<\/code> is ~0 MB, a great difference! This sounds more like it, as we really do not need the whole file in memory to filter it down. We can immediately decide for each row whether to keep it or not.<\/p><p>The change is also really subtle, as Python makes using either <strong>iterators<\/strong> or <strong>lists<\/strong> really seamless, which could be a good or bad thing. In the spirit of keeping memory utilization down, this meant really carefully sticking to using <strong>iterators<\/strong> as much as possible and really only loading data as it is needed. The same ideas were used for CSV partitioning <a href=\"https:\/\/github.com\/eujing\/diff-sql-backups\/blob\/78696100fcffaf1d2fdfc9c8da4de1f58cd371fe\/proc_db_backups_pd.py#L301-L362\">here<\/a> in the tool.<\/p><p>Save memory by preferring generators, using the built-in <a href=\"https:\/\/docs.python.org\/3.8\/library\/itertools.html\">itertools<\/a>, and verifying you are using <strong>iterator<\/strong>-based functions!<\/p><h2 id=\"csv-differencing\">CSV Differencing<\/h2><h3 id=\"csvdiff\">csvdiff<\/h3><p>The core of this tool performs differencing between CSV files (reformatted SQL backup files). Being the step with the bulk of the processing and memory utilization, optimizing this is of particular importance.<\/p><p>For prototyping, I initially looked at existing CSV differencing tools to not re-invent the wheel if possible. <a 
href=\"https:\/\/pypi.org\/project\/csvdiff\/\">csvdiff<\/a> was great, and I ended up using it quite frequently in other areas when dealing with CSVs.It finds the indexed differences quickly and returns JSON describing adds, changes, and deletes like: (credit: <a href=\"https:\/\/pypi.org\/project\/csvdiff\/\">csvdiff docs<\/a>)<\/p><div class=\"language-javascript highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"p\">{<\/span>  <span class=\"dl\">\"<\/span><span class=\"s2\">_index<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">id<\/span><span class=\"dl\">\"<\/span> <span class=\"p\">],<\/span>  <span class=\"dl\">\"<\/span><span class=\"s2\">added<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span>    <span class=\"p\">{<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">amount<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">81<\/span><span class=\"dl\">\"<\/span><span class=\"p\">,<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">id<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">5<\/span><span class=\"dl\">\"<\/span><span class=\"p\">,<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">name<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">mira<\/span><span class=\"dl\">\"<\/span> <span class=\"p\">},<\/span>    <span class=\"p\">...<\/span>  <span class=\"p\">],<\/span>  <span class=\"dl\">\"<\/span><span class=\"s2\">changed<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span>    <span class=\"p\">{<\/span>      <span class=\"dl\">\"<\/span><span class=\"s2\">fields<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span 
class=\"p\">{<\/span>        <span class=\"dl\">\"<\/span><span class=\"s2\">amount<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">{<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">from<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">20<\/span><span class=\"dl\">\"<\/span><span class=\"p\">,<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">to<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">23<\/span><span class=\"dl\">\"<\/span> <span class=\"p\">}<\/span>      <span class=\"p\">},<\/span>      <span class=\"dl\">\"<\/span><span class=\"s2\">key<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span><span class=\"dl\">\"<\/span><span class=\"s2\">1<\/span><span class=\"dl\">\"<\/span><span class=\"p\">]<\/span>    <span class=\"p\">},<\/span>    <span class=\"p\">...<\/span>  <span class=\"p\">],<\/span>  <span class=\"dl\">\"<\/span><span class=\"s2\">removed<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"p\">[<\/span>    <span class=\"p\">{<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">amount<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">63<\/span><span class=\"dl\">\"<\/span><span class=\"p\">,<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">id<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">2<\/span><span class=\"dl\">\"<\/span><span class=\"p\">,<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">name<\/span><span class=\"dl\">\"<\/span><span class=\"p\">:<\/span> <span class=\"dl\">\"<\/span><span class=\"s2\">eva<\/span><span class=\"dl\">\"<\/span> <span class=\"p\">},<\/span>    <span class=\"p\">...<\/span>  <span class=\"p\">]<\/span><span 
class=\"p\">}<\/span><\/code><\/pre><\/div><\/div><p>This is a very detailed report and is great for parsing with tools like <a href=\"https:\/\/stedolan.github.io\/jq\/\">jq<\/a>.However, we are really only interested in the adds and changes, and in particular the <code class=\"language-plaintext highlighter-rouge\">to<\/code> value of the changes only.It was fast to create a prototype tool with <code class=\"language-plaintext highlighter-rouge\">csvdiff<\/code> and it worked well on small partitions.However, it quickly exploded in memory utilization on larger partitions and frequently became a victim of the <a href=\"https:\/\/www.kernel.org\/doc\/gorman\/html\/understand\/understand016.html\">OOM killer<\/a>.<\/p><p>Why?Probably because the internal data structures and output JSON format of <code class=\"language-plaintext highlighter-rouge\">csvdiff<\/code> are mainly Python ones like <code class=\"language-plaintext highlighter-rouge\">List<\/code> and <code class=\"language-plaintext highlighter-rouge\">Dict<\/code>.When the instance count of structures like <code class=\"language-plaintext highlighter-rouge\">List<\/code> and <code class=\"language-plaintext highlighter-rouge\">Dict<\/code> scale with the data set, we incur increasing memory overhead for the growing number of Python objects.<\/p><h3 id=\"pandas\">Pandas<\/h3><p>Pandas is a great library for working with indexed data, and manages memory utilization well.I ended up implementing my own CSV differ in Pandas, which allowed me to perform further memory-usage optimizations elaborated on further on.<\/p><p>To recap, values are roughly indexed in each backup by the signal name \\(x\\), time \\(t\\) and location \\(s\\) to give rise to values \\(x_{s,t}\\).Between the backups for times \\(u-1\\) and \\(u\\), we are only interested in added values (new times or locations) or changed values (back-filled values) in backup \\(u\\), which gives rise to issue values \\(x_{s,t,u}\\).<\/p><p>In order to just 
find the added and changed values, we can do an index-aligned data-frame comparison to select such rows with something like:<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"k\">def<\/span> <span class=\"nf\">pd_csvdiff<\/span><span class=\"p\">(<\/span><span class=\"n\">before_csv<\/span><span class=\"p\">,<\/span> <span class=\"n\">after_csv<\/span><span class=\"p\">,<\/span> <span class=\"n\">index_cols<\/span><span class=\"p\">):<\/span>    <span class=\"c1\"># Load data and set indices<\/span>    <span class=\"n\">df_before<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"p\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"n\">before_csv<\/span><span class=\"p\">,<\/span> <span class=\"p\">...)<\/span>    <span class=\"n\">df_after<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"p\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"n\">after_csv<\/span><span class=\"p\">,<\/span> <span class=\"p\">...)<\/span>    <span class=\"n\">df_before<\/span><span class=\"p\">.<\/span><span class=\"n\">set_index<\/span><span class=\"p\">(<\/span><span class=\"n\">index_cols<\/span><span class=\"p\">,<\/span> <span class=\"n\">inplace<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span>    <span class=\"n\">df_after<\/span><span class=\"p\">.<\/span><span class=\"n\">set_index<\/span><span class=\"p\">(<\/span><span class=\"n\">index_cols<\/span><span class=\"p\">,<\/span> <span class=\"n\">inplace<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span>    <span class=\"p\">...<\/span>    <span class=\"c1\"># Align and compare<\/span>    <span class=\"n\">same_mask<\/span> <span class=\"o\">=<\/span> <span class=\"p\">(<\/span><span class=\"n\">df_before<\/span><span class=\"p\">.<\/span><span 
class=\"n\">reindex<\/span><span class=\"p\">(<\/span><span class=\"n\">df_after<\/span><span class=\"p\">.<\/span><span class=\"n\">index<\/span><span class=\"p\">)<\/span> <span class=\"o\">==<\/span> <span class=\"n\">df_after<\/span><span class=\"p\">)<\/span>    <span class=\"n\">is_diff<\/span> <span class=\"o\">=<\/span> <span class=\"o\">~<\/span><span class=\"p\">(<\/span><span class=\"n\">same_mask<\/span><span class=\"p\">.<\/span><span class=\"nb\">all<\/span><span class=\"p\">(<\/span><span class=\"n\">axis<\/span><span class=\"o\">=<\/span><span class=\"mi\">1<\/span><span class=\"p\">))<\/span>    <span class=\"c1\"># Extract the different rows only<\/span>    <span class=\"k\">return<\/span> <span class=\"n\">df_after<\/span><span class=\"p\">.<\/span><span class=\"n\">loc<\/span><span class=\"p\">[<\/span><span class=\"n\">is_diff<\/span><span class=\"p\">,<\/span> <span class=\"p\">:]<\/span><\/code><\/pre><\/div><\/div><p>Now why did we bother recreating this functionality in Pandas?Because Pandas provides good control over <em>how<\/em> we want to store the data in memory, in the form of column <code class=\"language-plaintext highlighter-rouge\">dtypes<\/code>.Specifically, huge memory savings came from representing categorical columns as the Pandas <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/categorical.html\">categorical type<\/a> instead of individual Python objects (strings in this case).<\/p><p>For example, this toy example shows ~60x less memory usage just by using <code class=\"language-plaintext highlighter-rouge\">dtype=category<\/code>:<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">colors<\/span> <span class=\"o\">=<\/span> <span class=\"n\">np<\/span><span class=\"p\">.<\/span><span class=\"n\">random<\/span><span class=\"p\">.<\/span><span class=\"n\">choice<\/span><span class=\"p\">([<\/span><span 
class=\"s\">\"red\"<\/span><span class=\"p\">,<\/span> <span class=\"s\">\"blue\"<\/span><span class=\"p\">,<\/span> <span class=\"s\">\"yellow\"<\/span><span class=\"p\">],<\/span> <span class=\"mi\">1000<\/span><span class=\"p\">,<\/span> <span class=\"n\">replace<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><span class=\"n\">s1<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"p\">.<\/span><span class=\"n\">Series<\/span><span class=\"p\">(<\/span><span class=\"n\">colors<\/span><span class=\"p\">)<\/span><span class=\"n\">s2<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pd<\/span><span class=\"p\">.<\/span><span class=\"n\">Series<\/span><span class=\"p\">(<\/span><span class=\"n\">colors<\/span><span class=\"p\">,<\/span> <span class=\"n\">dtype<\/span><span class=\"o\">=<\/span><span class=\"s\">\"category\"<\/span><span class=\"p\">)<\/span><span class=\"k\">print<\/span><span class=\"p\">(<\/span><span class=\"s\">f\"Bytes used: <\/span><span class=\"si\">{<\/span><span class=\"n\">s1<\/span><span class=\"p\">.<\/span><span class=\"n\">memory_usage<\/span><span class=\"p\">(<\/span><span class=\"n\">deep<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><span class=\"si\">}<\/span><span class=\"s\"> B\"<\/span><span class=\"p\">)<\/span><span class=\"k\">print<\/span><span class=\"p\">(<\/span><span class=\"s\">f\"Bytes used: <\/span><span class=\"si\">{<\/span><span class=\"n\">s2<\/span><span class=\"p\">.<\/span><span class=\"n\">memory_usage<\/span><span class=\"p\">(<\/span><span class=\"n\">deep<\/span><span class=\"o\">=<\/span><span class=\"bp\">True<\/span><span class=\"p\">)<\/span><span class=\"si\">}<\/span><span class=\"s\"> B\"<\/span><span class=\"p\">)<\/span><span class=\"o\">&gt;&gt;&gt;<\/span> <span class=\"n\">Bytes<\/span> <span class=\"n\">used<\/span><span class=\"p\">:<\/span> <span class=\"mi\">613560<\/span> 
<span class=\"n\">B<\/span>    <span class=\"n\">Bytes<\/span> <span class=\"n\">used<\/span><span class=\"p\">:<\/span> <span class=\"mi\">10392<\/span> <span class=\"n\">B<\/span><\/code><\/pre><\/div><\/div><p>When using categoricals, Pandas only has to store small integers for each entry, along with a mapping going from these small integers to the original object or string. Instead of storing many copies of duplicate strings, we only have to store the unique ones in this mapping. Thus we save on the number of Python objects we have in memory. This was especially useful for representing columns like signal name, state, county, etc.<\/p><h3 id=\"edge-conditions\">Edge Conditions<\/h3><p>First, <strong>missing values<\/strong> are present in the data as <code class=\"language-plaintext highlighter-rouge\">NaN<\/code> values. However, a naive comparison of <code class=\"language-plaintext highlighter-rouge\">NaN == NaN<\/code> is actually false-y by convention. In our case, this actually represents a missing value before that is still missing, so we do not want these to flag up as changed values.<\/p><p>Secondly, there could be a mis-match of <strong>category values<\/strong>. The category values are inferred from the individual CSVs, and new values may appear across time. Pandas has to know about the complete set of values when doing comparisons between categoricals, otherwise we are effectively comparing \u201cdifferent\u201d sets. We need to union the category values together before comparisons to prevent such errors.<\/p><p>The full implementation along with handling for these edge cases can be found <a href=\"https:\/\/github.com\/eujing\/diff-sql-backups\/blob\/78696100fcffaf1d2fdfc9c8da4de1f58cd371fe\/proc_db_backups_pd.py#L384-L448\">here<\/a>.<\/p><h2 id=\"multiprocessing-setup--challenges\">Multiprocessing: Setup &amp; Challenges<\/h2><p>By now, I had nicely partitioned up the data and differencing within these partitions no longer caused memory utilization 
to explode. There were many partitions to churn through, but we were doing them one at a time with cores and memory to spare. I took the simple route of parallelizing this tool by using Python\u2019s built-in multiprocessing module, specifically <a href=\"https:\/\/docs.python.org\/3\/library\/multiprocessing.html#multiprocessing.pool.Pool.starmap\"><code class=\"language-plaintext highlighter-rouge\">Pool.starmap<\/code><\/a>, which very easily lets us map a function of multiple arguments across a list of arguments, in a parallel manner!<\/p><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># Serially<\/span><span class=\"n\">splitted<\/span> <span class=\"o\">=<\/span> <span class=\"n\">starmap<\/span><span class=\"p\">(<\/span><span class=\"n\">split_csv_by_col<\/span><span class=\"p\">,<\/span> <span class=\"n\">split_args<\/span><span class=\"p\">)<\/span><span class=\"c1\"># In parallel<\/span><span class=\"k\">with<\/span> <span class=\"n\">Pool<\/span><span class=\"p\">(<\/span><span class=\"n\">ncpu<\/span><span class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span class=\"n\">pool<\/span><span class=\"p\">:<\/span>    <span class=\"n\">splitted<\/span> <span class=\"o\">=<\/span> <span class=\"n\">pool<\/span><span class=\"p\">.<\/span><span class=\"n\">starmap<\/span><span class=\"p\">(<\/span><span class=\"n\">split_csv_by_col<\/span><span class=\"p\">,<\/span> <span class=\"n\">split_args<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/div><\/div><p>However, it turns out that processing some partitions required much more memory than others due to the nature of the signals. Too large an <code class=\"language-plaintext highlighter-rouge\">ncpu<\/code> value would still result in OOM kills when too many of these large partitions got processed together.<\/p><p>The simple approach I took to fixing this was to use more granular <code class=\"language-plaintext 
highlighter-rouge\">ncpu<\/code> settings for each stage of the pipeline, having large <code class=\"language-plaintext highlighter-rouge\">ncpu<\/code> for stages like CSV partitioning that used very little memory, and smaller <code class=\"language-plaintext highlighter-rouge\">ncpu<\/code> counts for memory-intensive stages like CSV differencing.<\/p><p>This was probably not the most efficient way of tackling the issue, but I think it was an acceptable trade-off between being able to kick-start the tool on the backups, and spending more time developing a better solution. In hindsight, I would have probably tried to implement a simple scheduler that estimated the memory needed for each partition from file sizes and then scheduled more small partitions together.<\/p><p>I also came across deadlock problems when trying to use logging with multiprocessing, but managed to solve it with the <code class=\"language-plaintext highlighter-rouge\">multiprocessing_logging<\/code> <a href=\"https:\/\/pypi.org\/project\/multiprocessing-logging\/\">package<\/a>. See <a href=\"https:\/\/github.com\/eujing\/diff-sql-backups\/blob\/78696100fcffaf1d2fdfc9c8da4de1f58cd371fe\/proc_db_backups_pd.py#L196-L217\">here<\/a> for more details on how I incorporated all these with the rest of the pipeline.<\/p><h2 id=\"conclusion\">Conclusion<\/h2><p>I learnt a lot about processing large data, optimizing for memory usage and multiprocessing in the development of this tool. It was made even harder as I could only test it on small sample back-ups and did not have access to the real server it would run on! I am glad it was able to churn through all the backups in the end to perform the migration, after several days of optimization and debugging.<\/p><p>I tried to include detailed documentation in the <a href=\"https:\/\/github.com\/eujing\/diff-sql-backups\">tool\u2019s source code<\/a>. It is my hope that if you are tackling a similar problem without some Hadoop or Spark cluster, these reflections 
and documentation will come in useful.<\/p>","url":"https:\/\/eujing.github.io\/2020\/09\/28\/memory-issues","tags":["covid19","memory","data"],"date_published":"2020-09-28T00:00:00+00:00","date_modified":"2020-09-28T00:00:00+00:00","author":{"name":"Eu Jing Chua","url":null,"avatar":null}},{"id":"https:\/\/eujing.github.io\/2020\/09\/01\/issues-migration","title":"What Back-fill Looks Like","summary":"Visualizations of back-fill across time, and its impact on training models","content_text":"What is Back-fill?In most traditional forecasting problems like weather forecasting, data like temperatures and wind speeds from the past up to present are measured with reasonable accuracy across time and space.However in the context of COVID-19 forecasting, the current pandemic has shown a lack of similar infrastructure to produce accurate measurements of important metrics.For example, it would be great if we knew the true COVID-19 case incidence on a daily basis. However, one major obstacle in achieving that is how testing lag tends to increase with demand, especially when a region unexpectedly becomes a hotspot without ramping up testing infrastructure beforehand.This generally manifests in data as the phenomenon of back-fill, where measurements are retrospectively modified (as far as several months in the past).This happens when hospitals, labs and institutions take more time to process large backlogs of data with limited resources, or even when definitions change.Being able to account for uncertainty due to the back-fill process allows us to make better forecasts (or nowcasts) to assess the state of the current pandemic.Issues and As-ofsOne of the areas I worked on at Delphi this summer was to keep track of back-fill behavior, and make it easily accessible through the COVIDcast API.Back-fill gets its name primarily from how count-based data, like number of new positive cases in a day, tends to get retrospectively \u201cfilled\u201d or increased towards its true count, 
only sometimes decreasing due to rare events like false-positives or human error. However, once we start using these counts to derive other metrics like case positivity rate for the day, then the notion of \u201cfilling\u201d no longer makes so much sense and the values simply vary. At Delphi, the more general term issue is used to refer to how a measurement can be issued multiple times. For example, the percentage of COVID-19-related doctor visits on June 24, 2020 in Kansas was retrospectively issued more than 10 times:                    \u00a0        geo_value        time_value        issue        value                            0        KS        2020-06-24        2020-06-30        11.1994                    1        KS        2020-06-24        2020-07-01        9.7082                    2        KS        2020-06-24        2020-07-02        9.22323                    3        KS        2020-06-24        2020-07-03        9.09493                    4        KS        2020-06-24        2020-07-04        8.60064                    5        KS        2020-06-24        2020-07-05        8.56377                    6        KS        2020-06-24        2020-07-06        8.55982                    7        KS        2020-06-24        2020-07-07        8.41921                    8        KS        2020-06-24        2020-07-08        8.27452                    9        KS        2020-06-24        2020-07-09        8.04732            This brings us to the next idea: as-of, which is really just what the data looked like as-of a certain date. Using the example above, the percentage of doctor visits in Kansas on 24 June was around 11.2% as-of 30 June, but decreased to about 8.0% as-of 9 July as more records got processed. Knowing that every data point has multiple issues, the as-of feature gives us an accurate view of what data looked like in the past, not just what it looks like now. I personally like to think of issue dates as a second dimension of time that measurements can vary 
over. Keeping that in mind, we normally think of a signal varying over time and space as follows: The above is a snapshot of a signal for percentage of doctor-visits related to COVID-19, as-of 7 July. But now that we know each measurement actually varied across issue dates too, what does that look like? We can also slide the as-of date across time to come up with a more nuanced visualization: In particular, imagine you live in Mississippi and happen to find the percentage of COVID-19-related doctor visits really important. Over the month of June, you track the most recently released data every day, and you get your hopes up as it keeps seeming to trend downwards. However, if you actually waited for the back-fill \/ issues to roll in, you would start realizing how premature that was. This brings us to the next topic, on how back-fill affects model training. Impact on Training Models When developing statistical models with such signals and metrics, back-fill affects modeling in at least two significant ways:  Most-recent data probably has higher uncertainty than the less-recent ones.  
Models trained on historical data and deployed on recent data should try to match the levels of uncertainty. The first is a direct result of how most-recent data has fewer issues compared to less-recent ones, as data further back in the past are more likely to have had most of their back-fill already rolled in. That is not to say that most-recent data is always uncertain, as back-fill is very location dependent. From the sliding-as-ofs plot before, I would put less weight on recent data for Mississippi, but maybe not for Pennsylvania. Inverse-variance weighting and kernels come to mind, but actually coming up with these weights in a fair way could be tricky when we consider the second point. The second point is a matter of matching up training and test conditions. If we naively kept track of only the latest issue of each data point, then the \u201chistorical\u201d data has had the advantage of having its back-fill roll in till recent times. We do not get this advantage when actually trying to predict with most-recent data. Thus one has to be careful to use the right views of historical data when training, which is exactly where having an as-of view becomes useful. Train on historical data with appropriate as-of dates to match what will be done during test time! What\u2019s Next A lot of work was put into the COVIDcast API to make issue dates accessible and as-of views easy to use, especially through the R and Python clients, so be sure to check it out! Code used for all examples in this post can be found in this Jupyter Notebook to play around with. In the next post, I hope to go into technical detail about some of the challenges faced in implementing issue tracking into the COVIDcast API.","content_html":"<h2 id=\"what-is-back-fill\">What is Back-fill?<\/h2><p>In most traditional forecasting problems like weather forecasting, data like temperatures and wind speeds from the past up to present are measured with reasonable accuracy across time and space. However, in the context of 
COVID-19 forecasting, the current pandemic has shown a lack of similar infrastructure to produce accurate measurements of important metrics.<\/p><p>For example, it would be great if we knew the true COVID-19 case incidence on a daily basis. However, one major obstacle in achieving that is how <a href=\"https:\/\/www.cnn.com\/2020\/07\/07\/politics\/coronavirus-testing-delays-invs\/index.html\">testing lag tends to increase with demand<\/a>, especially when a region unexpectedly becomes a hotspot without ramping up testing infrastructure beforehand.<\/p><p>This generally manifests in data as the phenomenon of <strong>back-fill<\/strong>, where measurements are retrospectively modified (as far as several months in the past). This happens when hospitals, labs and institutions take more time to process large backlogs of data with limited resources, or even when definitions change. Being able to account for uncertainty due to the back-fill process allows us to make better forecasts (or nowcasts) to assess the state of the current pandemic.<\/p><h2 id=\"issues-and-as-ofs\">Issues and As-ofs<\/h2><p>One of the areas I worked on at Delphi this summer was to keep track of back-fill behavior, and make it easily accessible through the <a href=\"https:\/\/cmu-delphi.github.io\/delphi-epidata\/api\/covidcast.html\">COVIDcast API<\/a>. Back-fill gets its name primarily from how count-based data, like number of new positive cases in a day, tends to get retrospectively \u201cfilled\u201d or increased towards its true count, only sometimes decreasing due to rare events like false-positives or human error.<\/p><p>However, once we start using these counts to derive other metrics like case positivity rate for the day, the notion of \u201cfilling\u201d no longer makes so much sense and the values simply vary. At Delphi, the more general term <strong>issue<\/strong> is used to refer to how a measurement can be <strong>issued<\/strong> multiple times. For example, the percentage of 
COVID-19-related doctor visits on June 24, 2020 in Kansas was retrospectively <strong>issued<\/strong> more than 10 times:<\/p><div style=\"overflow-x:auto;\">  <table>    <thead>      <tr>        <th style=\"text-align: right\">\u00a0<\/th>        <th style=\"text-align: left\">geo_value<\/th>        <th style=\"text-align: left\">time_value<\/th>        <th style=\"text-align: left\">issue<\/th>        <th style=\"text-align: right\">value<\/th>      <\/tr>    <\/thead>    <tbody>      <tr>        <td style=\"text-align: right\">0<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-06-30<\/td>        <td style=\"text-align: right\">11.1994<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">1<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-01<\/td>        <td style=\"text-align: right\">9.7082<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">2<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-02<\/td>        <td style=\"text-align: right\">9.22323<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">3<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-03<\/td>        <td style=\"text-align: right\">9.09493<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">4<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-04<\/td>        <td style=\"text-align: right\">8.60064<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">5<\/td>        <td style=\"text-align: left\">KS<\/td>        <td 
style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-05<\/td>        <td style=\"text-align: right\">8.56377<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">6<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-06<\/td>        <td style=\"text-align: right\">8.55982<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">7<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-07<\/td>        <td style=\"text-align: right\">8.41921<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">8<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-08<\/td>        <td style=\"text-align: right\">8.27452<\/td>      <\/tr>      <tr>        <td style=\"text-align: right\">9<\/td>        <td style=\"text-align: left\">KS<\/td>        <td style=\"text-align: left\">2020-06-24<\/td>        <td style=\"text-align: left\">2020-07-09<\/td>        <td style=\"text-align: right\">8.04732<\/td>      <\/tr>    <\/tbody>  <\/table><\/div><p>This brings us to the next idea: <strong>as-of<\/strong>, which is really just what the data looked like <strong>as-of<\/strong> a certain date. Using the example above, the percentage of doctor visits in Kansas on 24 June was around 11.2% <strong>as-of<\/strong> 30 June, but decreased to about 8.0% <strong>as-of<\/strong> 9 July as more records got processed. Knowing that every data point has multiple issues, the <strong>as-of<\/strong> feature gives us an accurate view of what data looked like in the past, not just what it looks like now. I personally like to think of issue dates as a second dimension of time that measurements can vary over. Keeping that in mind, we normally 
think of a signal varying over time and space as follows:<\/p><p><img src=\"\/assets\/dv-adj-single-asof.png\" alt=\"Single as-of\" class=\"fullimg\" \/><\/p><p>The above is a snapshot of a signal for percentage of doctor-visits related to COVID-19, as-of 7 July. But now that we know each measurement actually varied across issue dates too, what does that look like? We can also slide the as-of date across time to come up with a more nuanced visualization:<\/p><p><img src=\"\/assets\/dv-adj-sweeping-asofs.png\" alt=\"Sweeping as-ofs\" class=\"fullimg\" \/><\/p><p>In particular, imagine you live in Mississippi and happen to find the percentage of COVID-19-related doctor visits really important. Over the month of June, you track the most recently released data every day, and you get your hopes up as it keeps seeming to trend downwards. However, if you actually waited for the back-fill \/ issues to roll in, you would start realizing how premature that was. This brings us to the next topic, on how back-fill affects model training.<\/p><h2 id=\"impact-on-training-models\">Impact on Training Models<\/h2><p>When developing statistical models with such signals and metrics, back-fill affects modeling in at least two significant ways:<\/p><ol>  <li>Most-recent data probably has higher uncertainty than the less-recent ones.<\/li>  <li>Models trained on historical data and deployed on recent data should try to match the levels of uncertainty.<\/li><\/ol><p>The first is a direct result of how most-recent data has fewer issues compared to less-recent ones, as data further back in the past are more likely to have had most of their back-fill already rolled in. That is not to say that most-recent data is always uncertain, as back-fill is very location dependent. From the sliding-as-ofs plot before, I would put less weight on recent data for Mississippi, but maybe not for Pennsylvania. Inverse-variance weighting and kernels come to mind, but actually coming up with these weights in a fair way could be 
tricky when we consider the second point.<\/p><p>The second point is a matter of matching up training and test conditions. If we naively kept track of only the latest issue of each data point, then the \u201chistorical\u201d data has had the advantage of having its back-fill roll in till recent times. We do not get this advantage when actually trying to predict with most-recent data. Thus one has to be careful to use the right views of historical data when training, which is exactly where having an as-of view becomes useful. Train on historical data with appropriate as-of dates to match what will be done during test time!<\/p><h2 id=\"whats-next\">What\u2019s Next<\/h2><p>A lot of work was put into the <a href=\"https:\/\/cmu-delphi.github.io\/delphi-epidata\/api\/covidcast.html\">COVIDcast API<\/a> to make issue dates accessible and as-of views easy to use, especially through the <a href=\"https:\/\/cmu-delphi.github.io\/covidcast\/covidcastR\/articles\/covidcast.html#tracking-issues-and-updates\">R<\/a> and <a href=\"https:\/\/cmu-delphi.github.io\/covidcast\/covidcast-py\/html\/getting_started.html#tracking-issues-and-updates\">Python<\/a> clients, so be sure to check it out! Code used for all examples in this post can be found in this <a href=\"\/assets\/Asof_visualizations.ipynb\">Jupyter Notebook<\/a> to play around with.<\/p><p>In the next post, I hope to go into technical detail about some of the challenges faced in implementing issue tracking into the COVIDcast API.<\/p>","url":"https:\/\/eujing.github.io\/2020\/09\/01\/issues-migration","tags":["covid19","visualization","data"],"date_published":"2020-09-01T00:00:00+00:00","date_modified":"2020-09-01T00:00:00+00:00","author":{"name":"Eu Jing Chua","url":null,"avatar":null}}]}