Hacker News
PostgreSQL Magic (project-a.com)
237 points by websec on Sept 11, 2015 | hide | past | favorite | 64 comments


As a fellow Postgres amateur wizard, I love the positive attention that Postgres seems to be getting more and more, and some half-databases less and less (unless you actually need map-reduce, of course. You probably don't. /trollface)

What I miss in this article, though, and where I think Postgres shines compared to other relational DBs, is window functions.

They let you apply a partition to a set. You can do some great wizardly magic with this, like 'give me each row matching this and that which matches the last occurrence of a given column'.

Edit: WITH clauses (CTEs) are great for avoiding a lot of nesting with subqueries and/or reusing subqueries throughout the main query. They have added functionality for recursion, but I suppose that unless you do some kind of tree traversal on big data sets, the benefits of that are so-so, readability and all that.

Edit2: I had to double check this, I never use custom types, using UNNEST() ARRAY[] on a custom type is superfluous. Just use ROW().
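The 'last occurrence' window-function trick above can be sketched like this; the `orders` table and its columns are made up for illustration:

```sql
-- For each customer, keep only their most recent matching order.
-- row_number() numbers rows within each partition by the given order.
SELECT *
FROM (
    SELECT o.*,
           row_number() OVER (PARTITION BY customer_id
                              ORDER BY created_at DESC) AS rn
    FROM orders o
    WHERE status = 'shipped'
) ranked
WHERE rn = 1;
```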


For window functions, a good tutorial can be found at http://tapoueh.org/blog/2013/08/20-Window-Functions

PostgreSQL also has good support for SQL-99: http://www.slideshare.net/MarkusWinand/modern-sql

WITH clauses (CTEs) are great, but they are "optimization fences" in PostgreSQL: http://blog.2ndquadrant.com/postgresql-ctes-are-optimization...


I did not know that about CTEs. Thanks!


Window functions are not a Postgres thing, but part of the SQL standard with (varying, of course) support across most of the major RDBMS's.

I've seen comments like this several times that seem to imply that Postgres is exceptional either due to having window functions, or like in this one where the tone sounds as if Postgres does them particularly better.

I'm curious, as I do almost 100% of my production work in MS SQL Server, where I'm able to do everything that comes up when "Postgres window functions" are mentioned: is there extra functionality in Postgres compared to other RDBMS's window function implementations?

Is there a reason that Postgres seems to get special attention for window functions?

Thanks.


I think it's largely a question of community.

Many users of MS SQL or Oracle learned advanced SQL or platform-specific features which they take for granted. They work at companies that spend lots of money on DBs and hire people who use them very well.

Postgres also has a core community of very knowledgeable folks. But most of its user base, and especially those new to using it, would otherwise use MySQL. They only know small bits of SQL, have only a basic understanding of why any company would need a DBA, and only learn things like this when they absolutely need them for a task. And, for that group of people, MS and Oracle probably aren't serious choices, so the fact that a free database has these cool features seems exciting to them.

(Note that I don't say this with disdain. I'm a former MySQL user who now uses Postgres on Heroku and am constantly learning things like this. I wouldn't even have understood your perspective until I started working with people who knew Oracle and MS SQL so thoroughly.)


> so the fact that a free database has these cool features seems exciting to them

It IS exciting to them. And to anybody who can't afford Oracle or other expensive enterprise solutions.


Thanks very much. It is helpful to understand more of the background from someone who's seen multiple sides of the situation.

For my work especially (BI consulting), fluency in SQL is incredibly useful, and specifically I don't think there's a single person at my company who doesn't use window functions regularly in all their development work. Other advanced constructs are very common as well.

Again, I appreciate understanding the world outside my own little bubble. Thanks.


This may be related to how long Postgres has had support for various window functions, vs. MSSQL only gaining equivalent support in v2012 (AFAIK). Also, the various query optimization struggles older versions of MSSQL had.

http://sqlperformance.com/2013/03/t-sql-queries/the-problem-...


I think because Postgres is commonly being considered as an alternative to MySQL rather than MSSQL or Oracle.


This does make some sense. Sometimes I forget how abnormal my own frame of reference is in comparison with the typical (at least of the vocal subset) HN poster.


I dunno, getting the hell off Oracle to Postgres is fashionable as hell. We're doing it and EVERYTHING IS BETTER.


Half the reason you go with solutions like Oracle is because of (a) enterprise support and (b) easy access to talent pool. PostgreSQL has neither of these. So it may be fashionable but I don't know anyone who is doing it.


> Half the reason you go with solutions like Oracle is because of (a) enterprise support and (b) easy access to talent pool. PostgreSQL has neither of these.

If you need enterprise support for Postgres, there are vendors that offer it (EnterpriseDB is probably the closest to a "first party" equivalent.)


Our third-party Oracle support were most keen to offer PG support. They can tell which way the wind's blowing.

(We haven't taken them up on it. We seriously can't see how we'll need it. But if you do need it, it's there.)


Sure. But they are few and far between and absolutely do not have the same SLAs as someone like Oracle, NEC, IBM, etc. They are generally mom-and-pop shops. That doesn't really cut it for enterprises, which is generally where you find Oracle instances.


That's half the justification. Aaand it turns out the justification doesn't hold against practice.

We're discovering that literally everything is better with Postgres. Mostly because instead of a single expensive point of failure, every app gets its own clustered PG pair. Because we can, because we don't have to think about licensing ever again.

Just everything not having to play nicely with anything else makes a huge difference.

The other nice thing is that PG is administrable by clear-thinking non-specialists who understand relational databases and can read a manual. You don't actually need big-ticket support unless you do.

And, guess what? Our Oracle support was most keen to offer Postgres support, because they too can tell which way the wind is blowing.

(PG 9.3 out of Ubuntu 14.04 repos. Failover pair with a primary and standby. Primary streams write-ahead log records to standby as they’re generated. Some script gaffer-tape to watch for primary failure and fail over (I think we haven’t ever yet actually had to invoke this). Conversions done by hand with ora2pg then faff and twiddling and unit tests. Gotchas: malformed sql that Oracle accepts but PG chokes on. All cobbled together just following the docs, almost certainly better ways to do all this.)

As for anyone else doing it ... we were buying AppDynamics (which is frickin' awesome btw) and talking to them about our plans to move from Oracle to PG. They said quite a few of their customers were thinking similarly. So maybe it's our own personal bubbles differing, but I think it's happening in at least some quarters.


AppDynamics which I know very well is used by web shops. Very few of those are using Oracle and it wouldn't surprise me at all if they were moving to something else.

The majority of Oracle instances however are in the back rooms of enterprises.


Yeah, that's pretty plausible.

Key point: it's possible. And unlike MySQL, Postgres is actually a proper database that can do what Oracle, MSSQL etc do. (We're also looking askance at our remaining MSSQL and have replaced one instance of the two with PG.)


Whenever I read articles like this, Postgres seems to have so many more small niceties and syntaxes than MSSQL though (I don't have a list of things, it just seems "easier" and more flexible).


I don't disagree in the general case, but I'm particularly interested in window functions here, as they are part of the SQL standard and seem equivalent between MS SQL Server and Postgres (I don't have near enough experience in other RDBMS's to speak to their implementations), despite being called out with some regularity as a Postgres-specific nicety.


The only one I know of is that PostgreSQL has some nice ordinary aggregate functions which can be used well with window functions, like bool_and() and array_agg(). I can't think of an example off the top of my head, but I know I have used them both.
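A minimal sketch of pairing those aggregates with an OVER clause, assuming a hypothetical `tasks` table:

```sql
-- bool_and() and array_agg() are ordinary aggregates, so they can
-- also be used as window functions: each row sees its whole partition.
SELECT task_id,
       done,
       bool_and(done)     OVER (PARTITION BY project_id) AS project_complete,
       array_agg(task_id) OVER (PARTITION BY project_id) AS sibling_tasks
FROM tasks;
```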


Thanks. There are definitely features in Postgres that I'm jealous of in the MS SQL Server world, and you've listed a couple of them.


There are window functions in SQL Server, too:

https://www.simple-talk.com/sql/learn-sql-server/window-func...


I am well aware. My confusion is, therefore, why do window functions get called out specifically, fairly commonly, as a Postgres-specific piece of awesomeness.

Window functions are awesome, and they are a part of the SQL standard, thus my confusion. As a sibling to your post pointed out, Postgres is often being compared to MySQL, which leads to highlighting the differences between these two, ignoring the features of DB2, MS SQL Server, and Oracle.


> As a sibling to your post pointed out, Postgres is often being compared to MySQL, which leads to highlighting the differences between these two, ignoring the features of DB2, MS SQL Server, and Oracle.

License issues (and not just licensing costs, though that's often a factor) often mean that DB2, MS SQL, and Oracle are excluded for non-technical reasons when MySQL and/or Postgres are under consideration.


Isn't a CTE in Postgres (unlike in MS SQL, AFAIR) also an optimization fence?

Just something to keep in mind when using it as a substitute for subquery, readability vs performance and all that :)


Yes, at present, the query executor can't optimize across CTEs. Sometimes, that's even the behavior you want.


> Sometimes, that's even the behavior you want.

Sadly, in most cases, this means you have to decide between ugly-but-performant code and nice-but-slow code.

I hope this gets fixed soon. Nobody should have to write queries like this:

      select blah blah blah
      from x, (select blah blah blah
        from y, (select blah blah blah
          from z, (select blah blah
            from w
            where a = b
            and c = d) sub_w
          where z.id = sub_w.id
          and p = 2
          and q = 4) sub_z
        where sub_z.id = y.different_id
        and r = 3
        and t = 'BLAH'
        and u not in (select u from w)) l
      where l.id = x.id
CTEs let you build a "lisp-like" pipeline where you transform your data as you go and can give the intermediate results useful names.
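For comparison, the nested query above could be restated as such a pipeline. This keeps the original placeholder names (blah, a, b, etc.), so it is shape only, not runnable SQL:

```sql
WITH w_rows AS (
    SELECT blah FROM w WHERE a = b AND c = d
), z_rows AS (
    SELECT blah FROM z JOIN w_rows ON z.id = w_rows.id
    WHERE p = 2 AND q = 4
), y_rows AS (
    SELECT blah FROM y JOIN z_rows ON z_rows.id = y.different_id
    WHERE r = 3 AND t = 'BLAH' AND u NOT IN (SELECT u FROM w)
)
SELECT blah FROM x JOIN y_rows ON y_rows.id = x.id;
```

Each intermediate step gets a name, and the query reads top-down instead of inside-out.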


Of course, you can first write the query with CTEs, then convert to the nested form. It's just substitution.

(For bonus points, keep the original CTE form, but commented out, to help with troubleshooting further down the road.)


As far as I know, generally it doesn't matter. But please prove me wrong, this sounds important!

Edit: It matters! See comment above :-)


http://cramer.io/2010/05/30/scaling-threaded-comments-on-dja... This is where recursion is useful. For many scenarios, an adjacency list + recursive CTE scales better than any other hierarchy model.
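A recursive-CTE walk over such an adjacency list might look like this; the `comments(id, parent_id, body)` table and the root id are made up:

```sql
WITH RECURSIVE thread AS (
    -- anchor: the root comment of the thread
    SELECT id, parent_id, body, 0 AS depth
    FROM comments
    WHERE id = 42
    UNION ALL
    -- recursive step: pull in direct children of rows found so far
    SELECT c.id, c.parent_id, c.body, t.depth + 1
    FROM comments c
    JOIN thread t ON c.parent_id = t.id
)
SELECT * FROM thread ORDER BY depth;
```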


Do you mean Postgres's implementation of Excel pivot tables? ;)


tablefunc() :-)


Cool.


Quibble: the "now()" function doesn't return the time of statement start; it returns the time of transaction start.

  $ psql -q
  rosser=# begin;
  rosser=# select now();
                now              
  -------------------------------
   2015-09-11 02:28:54.262142-07
  (1 row)
  
  rosser=# select now();
                now              
  -------------------------------
   2015-09-11 02:28:54.262142-07
  (1 row)


That's also not completely accurate:

  postgres=> SELECT now(), now(), clock_timestamp(), clock_timestamp();
                now              |              now              |       clock_timestamp         |       clock_timestamp
  -------------------------------+-------------------------------+-------------------------------+------------------------------
   2015-09-11 09:57:00.414422+00 | 2015-09-11 09:57:00.414422+00 | 2015-09-11 09:57:00.419087+00 | 2015-09-11 09:57:00.41909+00

now() stays the same for the entire statement as well; clock_timestamp() doesn't.


If you ever manage to have multiple transactions in one statement please email me.


Easy. Just use dblink or a foreign table to execute part of the statement inside another transaction. dblink to the same server/db is often used as a way to create an "AUTONOMOUS_TRANSACTION".


Thanks for the chuckle.

(Yes, I know this adds nothing to the discussion)


Came here to mention this. Aside from now()/transaction_timestamp() and clock_timestamp() there's also statement_timestamp() that returns the start time of the current statement.
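Side by side, as a sketch:

```sql
BEGIN;
SELECT transaction_timestamp(),  -- alias of now(); fixed at transaction start
       statement_timestamp(),    -- fixed when the current statement started
       clock_timestamp();        -- real wall clock; advances mid-statement
COMMIT;
```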


So many amazing features that are incredibly relevant to modern web development. I've switched over a personal project from MySQL to PG recently and this time I'm not looking back.

On 9.4 I have the HSTORE and JSONB types as well as range types which are incredibly useful. If you have a solid language library wrapper for PG you can spend far less time mangling data from one format to the next and just get to work. I love it.
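A small sketch of how those types combine; the `events` table here is invented:

```sql
CREATE TABLE events (
    id      serial PRIMARY KEY,
    details jsonb,       -- schemaless payload
    during  tstzrange    -- timestamp-with-timezone range
);

-- @> is "contains": for jsonb, containment of a sub-document;
-- for ranges, containment of a point within the range.
SELECT *
FROM events
WHERE details @> '{"kind": "signup"}'
  AND during  @> now();
```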


I have seen magic done with custom types and aggregates. For example, I know of projects with run-length-encoded bitsets and custom aggregates for doing set operations and counts, sort of like Redis' bitset commands but built into PG. This is a massive space optimization, because otherwise you'd just have a join table with zillions of tiny rows.


I have to admit that I'm still not quite sure about arrays in relational databases. Don't get me wrong, I use them all the time, but it kinda feels like when you've got tables that you just know could be normalized more thoroughly.

Also: If someone has a good version-control wrapper for stored procedures, that would be swell. And while I'm doing the wishful thinking shtick, maybe a Coffeescript-like preprocessor and a good linter?


We use them a lot where the join table could be over a million rows. Saves a join and a ton of time. We're talking seconds here: a 3-4s query before, ~1s after.


Yes, it can make a huge difference. I did this once when I had arrays of ~1 million floats and had to compute statistics on them. I wound up implementing a bunch of stats functions as C stored procedures that operate on arrays, and brought query times down from ~12 seconds to ~20 milliseconds:

https://github.com/pjungwir/aggs_for_arrays/

Another time arrays are handy is when you don't know how many "columns" you need to return. SQL can't do this, but a variable-length array can. Here is a writeup for one time that came in handy:

http://illuminatedcomputing.com/posts/2013/03/fun-postgres-p...

Another time they are helpful is to throw an `array_agg` into an aggregate query to see what values are getting rolled up. This can be really useful if you're trying to debug weird behavior.

Also `(array_agg(...))[1]` is a poor-man's `first` function. :-)


Thanks -- I had a query on 9.1 using MEDIAN (implemented in PL/pgSQL) go from 12 seconds to 0.2 seconds by switching to array_to_median(array_agg(the_column)). I found it was faster to actually query a million values and calculate the median in Python than to use that PL/pgSQL MEDIAN, so it's nice this can be done more easily with array_agg + your extension.


They are incredibly useful in stored procedures. Also in indexes: PG's full-text search relies heavily on array indexes, ditto for the JSON storage.

I think I've never used them as column type, but I can imagine a few uses there too.


> I think I've never used them as column type, but I can imagine a few uses there too.

A lot of cases where you would use a one:many are useful to store in an array instead. Tags would be a good example, multi-select lists, etc.
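A sketch of the tags-as-array pattern (table name invented), with a GIN index so containment searches can use an index:

```sql
CREATE TABLE posts (
    id   serial PRIMARY KEY,
    tags text[] NOT NULL DEFAULT '{}'
);

-- GIN indexes the individual array elements.
CREATE INDEX posts_tags_idx ON posts USING gin (tags);

-- "tags contains all of these elements"
SELECT * FROM posts WHERE tags @> ARRAY['postgres'];
```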


Except that if you want to search by those tags, using an array instead of a one:many means you'll need a bigger and more expensive index.

It's good for some kind of list that you won't search by. That's very limiting, but not an empty set.


> I have to admit that I'm still not quite sure about arrays in relational databases. Don't get me wrong, I use them all the time, but it kinda feels like when you've got tables that you just know could be normalized more thoroughly.

Why is that? There are many use cases for data (like vector data) which really needs an array. And it would be unwise to store it as columns. Think of, for example, matrix data or a practically unbounded number of double values coming from a sensor. Plus, PostgreSQL has a limit in the number of columns (1600) of a table, of which you could run out soon if representing this kind of data as regular columns rather than array values.


Why wouldn't an unbounded set of values related to a sensor just fit in a one-to-many data model? (Instead of a constantly updating row, simply insert more rows.)

It seems like you would be trading one fat row for something the database does well (unless you always want all of the sensor data every time)


An array is basically a set of homogeneous fields indexed by number, which are probably accessed in constant time. A relational table is a completely different beast. Even after disregarding performance (which can be an enormous difference in many use cases), it's not obvious how to access by index without "hacks". And if you consider multidimensional data, things get tougher without arrays...


That's a nice justification for their inclusion, but I don't really see that as the common actual use case. More often it seems that you'll get arrays where normally a join table would be used (e.g. tags for a blog post, authors for a book).


I've not actually used arrays in relational databases, but often wished I had them when doing grouping and aggregates. It would often be nice to do the aggregate on the server side, but also return all the IDs in each group in a clean way.


IANADBA, but "normalizing something more" without taking into account the performance impact or the overall usability for the end user sounds like a design mistake.

Everything is a nail and all that...


Arrays are very useful in ad-hoc queries, since you can create them with the array_agg() aggregate function and then use those arrays for lots of handy things.


I use arrays as a substitute for app-side caching via materialized views.


What tool do you use for visualizing table relationships (foreign key constraints, for example) in PostgreSQL?


To generate table diagrams that you can navigate (in HTML), SchemaSpy (http://schemaspy.sourceforge.net/) is really good.


http://www.pgadmin.org/ perhaps? But I'm also new to PostgreSQL. There are also some commercial tools available.


how?



IntelliJ


The best part of this for me was the batch updates. I am a little confused about the need for a user-defined type that is a copy of the schema of the table. Wouldn't it be better to use:

::table_name%ROWTYPE;

That way you wouldn't need to maintain a table and a type with the exact same schema, no?
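A rough sketch of that approach in PL/pgSQL; the `items` table and its columns are hypothetical:

```sql
CREATE OR REPLACE FUNCTION touch_items() RETURNS void AS $$
DECLARE
    r items%ROWTYPE;  -- row variable shaped like the items table
BEGIN
    FOR r IN SELECT * FROM items LOOP
        UPDATE items SET updated_at = now() WHERE id = r.id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

This way the row shape tracks the table automatically instead of living in a separately maintained type.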




