{"id":858,"date":"2021-02-16T10:30:45","date_gmt":"2021-02-16T18:30:45","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/azure-sql\/?p=858"},"modified":"2021-02-17T17:39:51","modified_gmt":"2021-02-18T01:39:51","slug":"the-insert-if-not-exists-challenge-a-solution","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/azure-sql\/the-insert-if-not-exists-challenge-a-solution\/","title":{"rendered":"The \u201cinsert if not exists\u201d challenge: a solution"},"content":{"rendered":"<p><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-scaled.jpg\"><img decoding=\"async\" class=\"aligncenter wp-image-859 size-full\" src=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-scaled.jpg\" alt=\"Image pexels pixabay 262488\" width=\"2500\" height=\"1875\" srcset=\"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-scaled.jpg 2500w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-300x225.jpg 300w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-1024x768.jpg 1024w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-768x576.jpg 768w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-1536x1152.jpg 1536w, https:\/\/devblogs.microsoft.com\/azure-sql\/wp-content\/uploads\/sites\/56\/2021\/02\/pexels-pixabay-262488-2048x1536.jpg 2048w\" sizes=\"(max-width: 2500px) 100vw, 2500px\" \/><\/a><\/p>\n<h2>The Challenge<\/h2>\n<p>Recently I found a quite common request on\u00a0<a href=\"https:\/\/stackoverflow.com\/questions\/65648013\/how-to-insert-data-in-azure-sql-using-tedious-js-after-verifying-that-the-data-d\/65660502?noredirect=1#comment116112268_65660502\">StackOverflow<\/a>. Generalizing the problem, it can be described as the requirement of insert some data into a table only if that data is not there already.<\/p>\n<p>Many developers will solve it by trying to execute two steps:<\/p>\n<ul>\n<li>check if the data exists already,<\/li>\n<li>if not, insert it<\/li>\n<\/ul>\n<h2><a class=\"anchor\" href=\"https:\/\/dev.to\/yorek\/the-insert-if-not-exists-challenge-a-solution-3g20-temp-slug-7306478?preview=39be506457fe51f0f42d47f061d1d94fee478142c8ee504d2990d8914812581d8202db46ff7e2b0092e8b3ed2675f329ff4c51a7dfb568b061f3c1e9#the-issue\" name=\"the-issue\"><\/a>The issue<\/h2>\n<p>This approach has a flaw, whatever the database you are using and no matter the database if relational or not. The problem, in fact, lies in the algorithm itself.<\/p>\n<p>The moment you have two actions, where one depends on another, you need to make sure that the data used by both doesn&#8217;t change in the meantime because of some other action done by someone else. As you can easily understand, in fact, if you have done the first action, you really need to be sure that in-scope data doesn&#8217;t change before you can do the second action, otherwise the result may be wrong or inconsistent.<\/p>\n<h2><a class=\"anchor\" href=\"https:\/\/dev.to\/yorek\/the-insert-if-not-exists-challenge-a-solution-3g20-temp-slug-7306478?preview=39be506457fe51f0f42d47f061d1d94fee478142c8ee504d2990d8914812581d8202db46ff7e2b0092e8b3ed2675f329ff4c51a7dfb568b061f3c1e9#the-solution-in-principle\" name=\"the-solution-in-principle\"><\/a>The solution, in principle<\/h2>\n<p>You need a transaction, that will provide the needed isolation so that interference won&#8217;t happen.<\/p>\n<p>As the chosen algorithm is comprised of two separate steps, you must create a transaction big enough to keep both steps under the same umbrella, so that the two physical separate steps will logically work like one. This behavior is also known as\u00a0<em>atomicity<\/em>: the two steps are indivisible. Everything works and changes will be persisted, or nothing works and changes will be undone.<\/p>\n<p>While this will work perfectly from a functional perspective, if you are in a highly concurrent system and you need to have a very scalable solution where an extreme number of executions can be done in parallel, you really want to have such protection around your transaction \u2013 the\u00a0<em>isolation<\/em>\u00a0\u2013 for the shortest time possible, as for as long as you are using that data in your algorithm, other may have to wait for accessing it, to avoid interfering with your activity.<\/p>\n<p>How to solve this problem elegantly and without having such big transactions? One option is use what is defined as &#8220;Optimistic Concurrency&#8221;. This approach uses a resource version token \u2013 for example, an\u00a0<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/HTTP\/Headers\/ETag\">ETag<\/a>\u00a0\u2013 to make sure that data didn&#8217;t change between the first and the second step. If data was changed, you simply restart from step 1 and loop continuously until you manage to complete the algorithm or you reach the maximum number of attempts allowed.<\/p>\n<h2><a class=\"anchor\" href=\"https:\/\/dev.to\/yorek\/the-insert-if-not-exists-challenge-a-solution-3g20-temp-slug-7306478?preview=39be506457fe51f0f42d47f061d1d94fee478142c8ee504d2990d8914812581d8202db46ff7e2b0092e8b3ed2675f329ff4c51a7dfb568b061f3c1e9#a-smarter-solution\" name=\"a-smarter-solution\"><\/a>A smarter solution<\/h2>\n<p>Now, that&#8217;s a very basic approach. It works, but I think we can do better. Much better.<\/p>\n<p>What if, for example, we can do both steps in just one command? No need for big transaction, no need for less-then-optimal loops.<\/p>\n<p>With Azure SQL, doing that is easy: you can\u00a0<code>INSERT<\/code>\u00a0a row into a table using the result of a\u00a0<code>SELECT<\/code>\u00a0on that table. Does it start to ring a bell?<\/p>\n<p>By using an\u00a0<code>INSERT...SELECT<\/code> command, we can achieve exactly what is needed. One command, without any explicit transaction. Let&#8217;s say we have a table, named tags, that stores all the tags associated with a blogs post. A tag can be used in different posts, but only once per post. The table would look like the following:<\/p>\n<pre class=\"prettyprint\">create table [dbo].[tags] ( \r\n    [post_id] int not null, \r\n    [tag] nvarchar(50) not null, \r\n    constraint pk__tags primary key clustered ([post_id], [tag]) \r\n)<\/pre>\n<p><span style=\"font-size: 1rem;\">Using such table as example, an\u00a0<\/span><code>INSERT...SELECT<\/code><span style=\"font-size: 1rem;\"> to implement the insert-if-not-exists logic would look like:<\/span><\/p>\n<pre class=\"prettyprint\">insert into [dbo].[tags] ([post_id], [tag]) \r\nselect * from ( \r\n    values (10, 'tag123') -- sample value \r\n) as s([post_id], [tag]) \r\nwhere not exists ( \r\n    select * from [dbo].[tags] t with (updlock) \r\n    where s.[post_id] = t.[post_id] and s.[tag] = t.[tag] \r\n)<\/pre>\n<p><span style=\"font-size: 1rem;\">The first\u00a0<\/span><code>SELECT<\/code><span style=\"font-size: 1rem;\">\u00a0will create a virtual table with the data we want to insert. One or more rows can be created with that technique (it works very nicely up to a few hundred rows. If you need more rows then JSON, Table Valued Parameters or Bulk Insert are a\u00a0<\/span><a style=\"background-color: #f7f7f9; font-size: 1rem;\" href=\"https:\/\/github.com\/yorek\/azure-sql-db-samples\/tree\/master\/samples\/07-network-latency\">better choice<\/a><span style=\"font-size: 1rem;\">). The virtual table will be called\u00a0<\/span><code>s<\/code><span style=\"font-size: 1rem;\">\u00a0and will have two columns:\u00a0<\/span><code>post_id<\/code><span style=\"font-size: 1rem;\">\u00a0and\u00a0<\/span><code>tag<\/code><span style=\"font-size: 1rem;\">. Data types will be automatically inferred; if you want to have some specific data type, you can always\u00a0<\/span><code>CAST<\/code><span style=\"font-size: 1rem;\">\u00a0the value to make sure the data type you want will be used. The\u00a0<\/span><code>UPDLOCK<\/code><span style=\"font-size: 1rem;\">\u00a0is a\u00a0<\/span><a style=\"background-color: #f7f7f9; font-size: 1rem;\" href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/t-sql\/queries\/hints-transact-sql-table\">hint<\/a><span style=\"font-size: 1rem;\">\u00a0to tell Azure SQL that we are reading with the goal to update the row. By allowing the engine to know that, the internal mechanism of lock conversion can be optimized to guarantee the best concurrency and consistency.<\/span><\/p>\n<p>The\u00a0<code>WHERE<\/code>\u00a0clause will make sure only those rows that&#8217;s doesn&#8217;t already exists in the target table &#8211;\u00a0<code>tags<\/code>\u00a0&#8211; will be returned from the virtual table and passed to the\u00a0<code>INSERT<\/code>\u00a0statement.<\/p>\n<p>The\u00a0<code>INSERT<\/code>\u00a0statement will do exactly what it says: insert rows into the\u00a0<code>tags<\/code>\u00a0table, if any.<\/p>\n<h2><a class=\"anchor\" href=\"https:\/\/dev.to\/yorek\/the-insert-if-not-exists-challenge-a-solution-3g20-temp-slug-7306478?preview=39be506457fe51f0f42d47f061d1d94fee478142c8ee504d2990d8914812581d8202db46ff7e2b0092e8b3ed2675f329ff4c51a7dfb568b061f3c1e9#a-more-concise-solution\" name=\"a-more-concise-solution\"><\/a>A more concise solution<\/h2>\n<p>That may sound a little verbose and quite complicated for such a simple operation, so you&#8217;ll be happy to know that all that code can be simplified a lot using the\u00a0<code>MERGE<\/code> statement. The logic behind the scenes is the same, but the code is much leaner in my opinion:<\/p>\n<pre class=\"prettyprint\">merge into \r\n    [dbo].[tags] with (holdlock) t  \r\nusing \r\n    (values (10, 'tag1233')) s([post_id], [tag]) \r\non \r\n    t.[post_id] = s.[post_id] and t.[tag] = s.[tag] \r\nwhen not matched then \r\n    insert values (s.[post_id], s.[tag]);<\/pre>\n<p>The\u00a0<code>MERGE<\/code>\u00a0is a very powerful command and would also allow you to implement in the same command also the\u00a0<em>upsert<\/em>\u00a0(insert-or-update) behavior. When executing multiple operations on the same data (after all the\u00a0<code>MERGE<\/code>\u00a0is\u00a0<code>INSERT<\/code>,\u00a0<code>UPDATE<\/code>\u00a0and\u00a0<code>DELETE<\/code>\u00a0all together) you may have to be careful if you have triggers or if one row could potentially be affected by multiple actions&#8230;but in the simple case of a insert-if-not-exists you shouldn&#8217;t be worried. If you are interested in learning how much more you can do with\u00a0<code>MERGE<\/code>, read here the\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/t-sql\/statements\/merge-transact-sql\">detailed doc page<\/a>.<\/p>\n<h2><a class=\"anchor\" href=\"https:\/\/dev.to\/yorek\/the-insert-if-not-exists-challenge-a-solution-3g20-temp-slug-7306478?preview=39be506457fe51f0f42d47f061d1d94fee478142c8ee504d2990d8914812581d8202db46ff7e2b0092e8b3ed2675f329ff4c51a7dfb568b061f3c1e9#conclusion\" name=\"conclusion\"><\/a>Conclusion<\/h2>\n<p>Challenge solved, with no big transactions involved, much simpler code on the client side, no matter with language you are using.<\/p>\n<p>To be clear, a transaction will still take place, but since everything is done in a single command, the exclusive lock taken on the tag resource will usually be measured in microseconds, assuming your case is like the discussed sample: if you&#8217;re inserting much more than one row at time, the elapsed time will be different, depending on how many rows you are trying to insert.<\/p>\n<p>If you are worried about this, keep in mind that Azure SQL will allow readers to read from the\u00a0<code>tags<\/code>\u00a0table even while someone is changing its data as by default it uses the\u00a0<code>READ COMMITTED SNAPSHOT<\/code> isolation level. With that level, the last committed version of the row is kept available and served to anyone asking for read, so that they can still be isolated without being blocked.<\/p>\n<p>Of course, this ability has a cost in terms of resource usage (CPU, Memory and IO, as the previous versions of the row are kept in a version store) and we want to be careful on avoiding wasting resources (just like it happens in the real world!) so having a truly short transaction also helps on this side.<\/p>\n<p>The result of this little optimization is a cleaner code, better performances, less resource usage, more concurrency, and increased efficiency.<\/p>\n<h2><a class=\"anchor\" href=\"https:\/\/dev.to\/yorek\/the-insert-if-not-exists-challenge-a-solution-3g20-temp-slug-7306478?preview=39be506457fe51f0f42d47f061d1d94fee478142c8ee504d2990d8914812581d8202db46ff7e2b0092e8b3ed2675f329ff4c51a7dfb568b061f3c1e9#want-to-know-more\" name=\"want-to-know-more\"><\/a>Want to know more?<\/h2>\n<p>You may be interested to know if the one described is the only way to implement an insert-only-if-exist pattern.<\/p>\n<p>I already mentioned the\u00a0<code>MERGE<\/code>, but that there are couple of other ways to solve this matter.<\/p>\n<p>The two presented here, especially the\u00a0<code>MERGE<\/code>, will be more than enough for you if you don&#8217;t want to get into this only apparently simple topic even more. In case you want to, instead, I have found a remarkably interesting article the summarizes the most common techniques and does some evaluation on pros and cons of each one here:\u00a0<a href=\"https:\/\/cc.davelozinski.com\/sql\/fastest-way-to-insert-new-records-where-one-doesnt-already-exist\">https:\/\/cc.davelozinski.com\/sql\/fastest-way-to-insert-new-records-where-one-doesnt-already-exist<\/a>.<\/p>\n<hr \/>\n<p>Photo by\u00a0<a href=\"https:\/\/www.pexels.com\/@pixabay?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels\">Pixabay<\/a>\u00a0from\u00a0<a href=\"https:\/\/www.pexels.com\/photo\/white-jigsaw-puzzle-illustration-262488\/?utm_content=attributionCopyText&amp;utm_medium=referral&amp;utm_source=pexels\">Pexels<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to solve the common problem of &#8220;insert-if-not-exists&#8221; in a very elegant, clever and scalable way. <\/p>\n","protected":false},"author":24720,"featured_media":859,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[449,165,34,489],"class_list":["post-858","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure-sql","tag-development","tag-optimization","tag-t-sql","tag-transaction"],"acf":[],"blog_post_summary":"<p>How to solve the common problem of &#8220;insert-if-not-exists&#8221; in a very elegant, clever and scalable way. <\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/858","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/users\/24720"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/comments?post=858"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/posts\/858\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/media\/859"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/media?parent=858"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/categories?post=858"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/azure-sql\/wp-json\/wp\/v2\/tags?post=858"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}