Category Archives: SQL

So, what happened with postgres_air database?

As promised, a follow-up to this blog.

The funniest thing is that if this happened with any other database, not “my” database, I would figure it out instantly. However, the notion of knowing everything about this database seemed to turn the investigative side of my brain off! I knew that the plan was wrong because it was a short query (returning less than ten records), and the execution time was 5 seconds instead of 300 ms. I had thought that the statistics on the largest of the tables could be off, but it didn’t seem so. I even ran an ANALYZE, just in case, but nothing changed.

Then, I finally did what I should have done in the first place: I ran EXPLAIN ANALYZE and looked for a node with the biggest difference between the estimated and actual row counts, and it was a node for a SELECT from the smaller table. Just then, I realized that after I built a new index, I didn’t run ANALYZE on that table! (And for the record, the example in the book was to illustrate how the optimizer chooses the right index.)

Everything went back to normal after I ran ANALYZE :). It was a real “textbook case,” and I made sure to add the comment about running ANALYZE to this chapter. It was not because of a different Postgres version or because of the data cleanup – it was all because of me!
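The remedy is a one-liner; here is a sketch with made-up table, column, and index names (the original was on the postgres_air schema, but any table works the same way):

```sql
-- After building a new index, refresh the table's statistics so the
-- planner's estimates match reality. All names here are placeholders.
CREATE INDEX account_last_name_idx ON account (last_name);
ANALYZE account;

-- Then compare estimated vs. actual rows per plan node:
EXPLAIN ANALYZE
SELECT * FROM account WHERE last_name = 'Smith';
```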

Filed under SQL

My Own Creature is Out of Control!

It’s embarrassing; it really is.

We are working on the second edition of the PostgreSQL Query Optimization book. Our postgres_air database has been moved three years ahead, so it is not in the pandemic anymore :). Postgres Air airlines do not fly to Russia anymore. Boris completed a tremendous job of fixing unevenly distributed bookings, filling in the flights which had almost no passengers, and making tons of other improvements. And I ran all examples on PG 15. Yes, things are different, but not that different; after all, I am explaining very fundamental concepts.

Now, imagine that: Postgres is giving me the wrong execution plan! In the chapter where I explain how to make sure that Postgres uses the right execution plan, where I explain how to be sure that proper indexes are used, in this very chapter, I can’t make my example work the way it should, the way it worked (correctly) for the first edition of the book!

I’ve already spent two hours on this one example, and I am going to sleep on it and hope that tomorrow that magic will happen. I still can’t believe it! I promise to report when I find out what the problem is!

… Yes, I ran ANALYZE :)

Filed under Data management, SQL

Citus Con 2023

I barely mentioned my participation in that event, partially because it was virtual and partially because it was too close to PG Day Chicago, but I want to say it again: I am glad I participated! Also, I am glad I recorded the video early, and I am very thankful to the organizers, because it was the least stressful recording I have ever done!

And it was such a delight to receive the speaker’s gifts today!

Filed under Data management, events, SQL, talks

PGSQL Phriday #007 – A word about triggers

I am two days late for the PGSQL Phriday #007 deadline, but I can’t skip the question asked by Laeticia Avrot: do you love triggers? Do you hate triggers? Also, I really enjoyed reading Ryan Booz’s response. I can relate to many of his statements about triggers.

Since Ryan already said everything I could possibly say about triggers, I will be brief. Triggers can do a lot of harm! Triggers can go really wrong – those who attended Nordic PG Day are sure to remember this war story. I do not support those who say that triggers should be banned, but I understand their frustration.

Still, PostgreSQL is impossible without triggers. Did you hear about internal triggers? Postgres creates these triggers behind the scenes to support constraints (how would you support them otherwise?!). When I wanted to add foreign key constraints to pg_bitemporal, I had to create a bitemporal version of these triggers.

Triggers are exceptionally powerful, which is both good and bad. I believe that there is no problem in Postgres that you can’t solve by creating triggers that will do exactly what you want. A couple of months ago, I complained about permissions in Postgres: what I do not like about the default permissions setup and what I would change. Since patience is not my virtue, and since I do not like to complain and prefer “being a change,” I decided to model my ideal permissions settings. I achieved it by building a system of event triggers. I am not going to lie: the result is inhumanly complex, but event triggers made it possible.

My short summary for this short post is the following: please stop pretending that we can live without triggers. We can’t!

Filed under SQL

Can we make permissions management more user-friendly?

Thank you to everybody who contributed to the discussion about grants and permissions in PostgreSQL. I understand that my post from two weeks ago was more emotional than factual. Today I will present specific cases of how the flexibility of setting permissions in PostgreSQL might cause problems in production settings.

Let me make it clear: the flexibility of permission inheritance is a treasure. I wouldn’t be able to do my job if PostgreSQL didn’t offer that option. Or perhaps I would, but it would take significantly more effort both to set up and to maintain. I would never want to go back to the pre-8 situation, when roles could not be granted to other roles.

However, just as we have a system of checks and balances in our political system, and just as we need laws to prevent democracy from becoming anarchy, I believe some additional rules are needed to limit the otherwise unlimited flexibility of roles and permissions.

I come from a very practical need: to support a massive production environment with hundreds of PostgreSQL instances, touched by hundreds of users who continuously deploy new features, add partitions, drop and create tables and schemas; users who want to be completely isolated from other users without the necessity to spin up a new host for each new application. In short, I need an environment in which access control can be automated, no user can break existing permissions conventions, and no user has more privileges than necessary to complete the task.

Here are some of the problems I face on the path to accomplishing this task:

Viewing all access privileges for a specific user

To manage access, the first thing you need is to be able to see what privileges any given user has, but that task is not easy. There is no single command which would provide this information. For a given role, you can see the list of roles this role is granted, but you need to write a recursive query to go through the whole chain of roles. After that, you need to collect information about all individual grants for each of the low-level roles. In addition, you need to take a look at default privileges on schemas and individual permissions granted directly to that role.

For DBAs’ convenience, I wrote a function that returns the complete list of atomic permissions for a given user, and it ended up being 110 lines of code. Please do not tell me things have to be that complicated.
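The recursive part of such a function, walking the whole chain of role memberships, can be sketched roughly like this ('app_user' is a placeholder; a complete version would also collect object-level grants and default privileges for every role found):

```sql
-- Recursively collect all roles granted, directly or indirectly, to a user.
WITH RECURSIVE role_chain AS (
    SELECT oid FROM pg_roles WHERE rolname = 'app_user'
    UNION
    SELECT m.roleid
    FROM pg_auth_members m
    JOIN role_chain rc ON m.member = rc.oid
)
SELECT r.rolname
FROM role_chain rc
JOIN pg_roles r ON r.oid = rc.oid;
```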

We scream when we see applications connecting to the database as a superuser, and this practice is indeed alarming. But at the same time, we do nothing to help application developers see what permissions are granted and identify what’s missing.

Direct privileges

Almost all recommendations for standardizing and simplifying access management call for never granting privileges directly to login users. Instead, the best practices call for assigning all privileges required for one task to group roles and then granting group roles to login users. The advantage of this approach is obvious: the set of privileges is defined once and then granted as a set, significantly reducing the risk of inconsistency. Why do we still allow granting privileges directly to login users, although the majority of practitioners consider it a bad practice? If we want just one user to have certain privileges, we can still create a group role and grant it to one user. 
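Following this practice costs only a couple of extra statements, even for a single user; a sketch with made-up role and schema names:

```sql
-- Privileges live on the NOLOGIN group role; login users only get membership.
CREATE ROLE app_readonly NOLOGIN;
GRANT USAGE ON SCHEMA app_schema TO app_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA app_schema TO app_readonly;

CREATE ROLE alice LOGIN;
GRANT app_readonly TO alice;  -- even if alice is the only member
```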

If the above seems too restrictive, let’s at least disallow granting a login role to other group or login roles. I have witnessed so many horror stories caused by assigning a login role to a group role, and then that group role being granted to yet another login role, that I do not even want to start.

Default schema permissions

The option of assigning default schema permissions is wonderful and amazingly helpful, but with a twist. Imagine you have a schema app_schema that stores data for the application “app.” The owner of this schema is a role app_owner, and we have login users power_user_1 and power_user_2 who are granted the app_owner role. We want to grant all read/write privileges on all tables in this schema to the app_api group role, so we issue a command:

ALTER DEFAULT PRIVILEGES FOR ROLE app_owner IN SCHEMA app_schema GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_api;

Now, what happens when the power_user_1 connects to the database and creates a new table? If they remember to execute

SET ROLE app_owner

before creating a table, everything will be fine, and the read/write privileges on this new table will be granted to the app_api role. But if they don’t switch to the app_owner role, the table will still be created; however, the default privileges will not be granted. And there is no way to change this behavior. You can set default privileges for multiple users, but then you will need to add a new default each time a new power user is created (and remove it before a user is removed). I am not saying it’s impossible to force the correct behavior; in fact, I succeeded in setting it up correctly, but the way it is done looks more like acrobatics than regular access management.
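The multi-user workaround looks like this (all names are placeholders), and every new power user means another statement of this kind:

```sql
-- One ALTER DEFAULT PRIVILEGES per creating role; tables created by any
-- other role are not covered.
ALTER DEFAULT PRIVILEGES FOR ROLE power_user_1 IN SCHEMA app_schema
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_api;
ALTER DEFAULT PRIVILEGES FOR ROLE power_user_2 IN SCHEMA app_schema
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_api;
```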

Other inconsistencies

You can build a hierarchy of roles, but you can’t build a hierarchy of revoking grants. Let’s say you have a group role g1 and grant this role to a group role g2, along with some additional permissions. You can then grant the role g2 to a user u1. Now, if you try to revoke the g1 role from that user, nothing will happen (and PostgreSQL won’t report any error; it will report the success of the REVOKE command). In the same way, you can’t revoke any individual permission that is part of a granted role. This might be desirable behavior, but coupled with no error reported on REVOKE and no easy way to see all privileges granted to a user, it can have a Kafkaesque effect on somebody trying to restrict a user’s access.
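A minimal reproduction of the silent no-op, with throwaway role names:

```sql
CREATE ROLE g1 NOLOGIN;
CREATE ROLE g2 NOLOGIN;
CREATE ROLE u1 LOGIN;

GRANT g1 TO g2;  -- g2 is a member of g1
GRANT g2 TO u1;  -- u1 gets g1 indirectly, through g2

-- Completes without error, but u1 still has g1 via g2:
REVOKE g1 FROM u1;
```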

All of the above is a small portion of the issues I face daily while trying to provide better guidance to database and application developers. If we do not want users to put everything into the public schema and connect applications as a superuser, we need to make permissions management a little more user-friendly :).

Filed under Data management, SQL

I have a question…

Below is an excerpt from the PostgreSQL documentation. Please do not take me wrong: I take permissions and security very seriously, probably more seriously than many others. I believe I have a very decent understanding of how PostgreSQL permissions work, better than many. Still, my favorite database never fails to surprise me!

If a superuser chooses to issue a GRANT or REVOKE command, the command is performed as though it were issued by the owner of the affected object. Since all privileges ultimately come from the object owner (possibly indirectly via chains of grant options), it is possible for a superuser to revoke all privileges, but this might require use of CASCADE as stated above.

REVOKE can also be done by a role that is not the owner of the affected object, but is a member of the role that owns the object, or is a member of a role that holds privileges WITH GRANT OPTION on the object. In this case the command is performed as though it were issued by the containing role that actually owns the object or holds the privileges WITH GRANT OPTION. For example, if table t1 is owned by role g1, of which role u1 is a member, then u1 can revoke privileges on t1 that are recorded as being granted by g1. This would include grants made by u1 as well as by other members of role g1.

Please read these two paragraphs, preferably out loud. I have two questions. First, why in the world does it have to be so complicated?! And second, why does the PostgreSQL documentation have to be written like that?!

Filed under SQL

PGSQL Phriday #005: Relational and Non-relational Data

The topic for the February edition of PGSQL Phriday is Relational and Non-Relational Data.

I was a little puzzled that the question “How do you define non-relational data?” is the last one. It only makes sense to answer the first three questions once you clearly define what you are talking about, so that is the first question I will address. 

Do we have a formal definition of non-relational data (or relational, for that matter)? If you can’t think of one, there is a good reason for that: “relational” is a characteristic of the model, not the data. Thereby, the only definition we can give would be “the data you can’t represent using the relational model.” But is there anything in the definition of the relational model that limits it to certain types of data? The answer is no, so the best way to interpret the term “non-relational data” would be “the data which can’t benefit from being stored in a relational database.” Most often, that would be documents, images, and other blobs.

If we never need to search inside the document (in other words, if we never expect to use any full-text search), then, in my opinion, there is no reason to store these documents in a database. Like many others, I can recall several cases like the one mentioned by Pat Wright in his blog post. The only thing we need to store in the database is a pointer to the place where the actual document resides; there is no benefit in storing the document itself.

However, it’s a different story when we need to perform a full-text search. Knowing that PostgreSQL’s full-text search capabilities are not the best tools available on the market, I would always try to justify the necessity of that feature. In many cases, after talking to the end users, I would find out that, in reality, the required search touches a limited number of fields/tables and can be transformed into a dynamically constructed query supported by b-tree indexes. (And we know that nothing can perform better than b-tree indexes!)

Finally, what if we truly need to implement full-text search? Would I use an external tool or what PostgreSQL has to offer? My experience with Elasticsearch was quite negative, mainly because the search database lags behind the actual database, and this delay is often critical. That was a major argument in favor of using PostgreSQL. However, I never had a chance to perform precise measurements, so my opinion is more emotional than scientific.

Since I want to play by the rules and actually publish this blog on Friday, I will leave the topic of using JSON/JSONB for later coverage in a separate post! 

Thank you, Ryan Lambert, for the topic, and thanks to everybody who has already contributed to this discussion!

Filed under publications and discussions, SQL

One of many dumb things I did recently…

I know that many people are reluctant to admit they made some stupid mistake (like accidentally dropping a table in production). Some of these mistakes are very visible to the rest of the world, like the infamous Amazon S3 rm *, but raise your hand if you have never done anything like this! That’s why I always try to share my “oops” moments, in the hope that they will make other people feel better.

Now, here is my recent story.

I spent two workdays trying to figure out why my stored procedure failed on commit. It was failing with the error “cannot commit while a subtransaction is active.” I looked at the code, and there was no subtransaction! And I was trying to figure it out for two days! The use case was a classic one for “why we need to commit inside a stored procedure”: we have a loop over a hundred million rows, and we want to commit, say, every 100K or 0.5M rows.

I read and re-read all related PostgreSQL documentation, all mailing lists, and all blog posts I could find, and I still could not figure out how my example differs from other people’s examples which work.

Out of total desperation, I started to remove parts of the code, hoping that, at some point, I would remove something which prevented it from working.

Guess what? I am down to i:=i+1, and it still returns the same error!

And then, finally, I realized what the problem was! I had exception handling!!! Which implicitly creates a savepoint!!! Which I theoretically knew, but for some reason, I didn’t associate it with a subtransaction.

Now I knew what was wrong, and I knew that the solution was to create a block inside the loop and put the exception handling there. Great.
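A sketch of that fixed shape, assuming a commit every 100K rows (procedure, table, and helper names are all made up):

```sql
-- Commit every 100K rows; the exception handler lives in an inner block,
-- so its implicit savepoint is released before COMMIT is reached.
-- big_table and process_row are placeholders.
CREATE OR REPLACE PROCEDURE process_in_batches()
LANGUAGE plpgsql
AS $$
DECLARE
    i   bigint := 0;
    rec record;
BEGIN
    FOR rec IN SELECT * FROM big_table LOOP
        BEGIN  -- inner block: exception handling stays inside it
            PERFORM process_row(rec);
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'row skipped: %', SQLERRM;
        END;
        i := i + 1;
        IF i % 100000 = 0 THEN
            COMMIT;  -- legal here: no subtransaction is active
        END IF;
    END LOOP;
    COMMIT;
END;
$$;
```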

So I did this, then sat waiting for commit messages, and nothing happened. I knew that the first commit message should appear within 20 minutes, and I had already been waiting for two hours. Since this was happening not in production but in my sandbox, I let it run… until it hit “out of disk space.”

The next thing I realized: when I was putting all of the meaningful pieces back into my code, I had accidentally deleted i:=i+1.

😂 😂 😂

Filed under Development and testing, SQL

The postgres_air makeover

We created postgres_air, the largest publicly available Postgres database, while we worked on our book PostgreSQL Query Optimization, which is why that version has “today” set to August 18, 2020. While we want to keep this version available to our readers until the book’s next edition is out, we also want to provide a more up-to-date version.

This updated version is now available as postgres_air_2022.backup. 

The following changes were made to the data set:

* “today” is now August 18, 2022

* to adequately reflect the current situation, all Russian airports are excluded from the bookings. The Postgres Air airline company complies with international sanctions and no longer flies to Russia.

Flights within the Russian Federation are covered by PostgresPro in their airline database.

Filed under news, SQL

Functions and Migration

I’ve always been a fierce advocate for using functions and stored procedures with any database, not just PostgreSQL. When properly used, they provide the best opportunities to optimize application performance without making changes to the application itself. I can go on and on about why I am the biggest fan of functions.

However, many people believe that there are only a limited number of situations in which functions are helpful. Two arguments are used most often.

The first argument is that “the business logic belongs to the application,” and I spoke quite a bit in the past about why I can’t agree with this statement.

The second argument is that if we migrate the system to some other database in the future, we can leave the SQL code intact but will need to rewrite all the functions and stored procedures. In the past, I used to respond that migrating to another database is a major project which does not happen often, and most likely, many changes to different parts of the system will be required in any case.

I used to reply like this until I joined EDB and worked with several customers who migrated from Oracle to PostgreSQL. Most of them used functions and stored procedures and had to convert them to PostgreSQL. I found that converting these functions to PostgreSQL syntax was the easiest part. Following a set of simple rules, one can write a converter that will do it (almost) automatically and produce functions that are not only syntactically correct but return the expected result. 

However, in many cases, the performance of these functions was worse than that of their Oracle counterparts, and they would consume more resources, especially CPU. The users blame PostgreSQL, but in fact, it’s not PostgreSQL’s fault: you just need to write your queries differently, and in some cases, you need to structure your application differently.

When we say that “Postgres is as performant as Oracle,” we omit the fact that the application in question is not pgbench, and that the queries executed are not straightforward reads and writes.

The differences include the absence of the query cache, different (and less efficient) implementation of cursors, query optimization performed at a later stage of query processing, functions not being compiled, and others. 

On the other hand, there are multiple PostgreSQL features not available in Oracle, which can significantly improve performance. Still, once again, that means some changes should be made to the queries and/or the application itself.

The point I am making is this: 

  • No matter whether you use functions or not, migrating to the new RDBMS can’t be seamless.
  • “The SQL is the SQL,” but the implementation is different in different databases. SQL is declarative, which means that instructions say “what” but do not tell “how.”   
  • The suggestion that “everything will work automatically” is not realistic, and when a migration is planned, resources for application rewrites should be allocated proactively.

Then, everything indeed will look “better than with Oracle.”

Filed under Data management, SQL