CubicWeb using PostgreSQL at its best
We had a chat today with a PostgreSQL core contributor from whom we may buy consulting services in the future. We discussed how CubicWeb could get the best out of PostgreSQL:
- making use of the LISTEN/NOTIFY mechanism built into PostgreSQL could be useful, for example to tell caches that items were modified, and PgQ is its good friend when reliable queuing is needed (sketch below);
- views (materialized or not) are another way to implement computed attributes and relations (see CWEP 002), and the entities table itself could become a view over other tables (sketch below);
- implementing RQL as an in-database language could open the door to new things: PostgreSQL already has PL/pgSQL and PL/Python, so why not PL/RQL?
- Foreign Data Wrappers written with Multicorn would be another way to implement data feeds (see the LDAP integration for an example, and the sketch below);
- managing dates is tricky when users live in different timezones; storing everything as UTC and converting at the boundaries is the safe default, much like Python's unicode/str distinction, where you decode at input and encode at output (sketch below);
- for the transitive closures that are often needed when implementing access control policies with __permissions, PostgreSQL goes a long way with recursive and writable common table expressions (WITH RECURSIVE ... SELECT ... UNION ALL SELECT ..., possibly combined with UPDATE ... FROM or RETURNING; sketch below);
- the fastest way to load tabular data that does not need much pre-processing is to create a temporary table (session-local and cheap to write), COPY the data into it, index it, then write the transform-and-load step in SQL, possibly with PL/Python (sketch below);
- when executing more than ten updates in a row, it is better to write the new values into a temporary table, then update the actual tables with UPDATE ... FROM (worth checking whether the psycopg driver does this when executemany is called; sketch below);
- reaching on the order of 10^8 rows in a table is, at the time of this writing, the point where you should start monitoring your database seriously and start considering replication, partitioning and sharding (sketch below);
- full-text search in PostgreSQL is much better than the general public thinks, and recent developments have made it orders of magnitude faster than it used to be, putting it in the same league as dedicated engines like Lucene, Solr and ElasticSearch (sketch below);
- when dealing with complex queries (graph searches, say), one option is to implement a dedicated data type, use it in a materialized view and put GIN or GiST indexes on it (sketch below);
- for large scientific data sets, it could be interesting to link the numpy library into PostgreSQL and turn numpy arrays into a native data type (sketch below);
- Oh, and one last thing: PostgreSQL's object-oriented tables are not such a great idea unless your use case fits them perfectly and avoids their limitations; CubicWeb's is_instance_of does not seem to be such a case (sketch below).
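
To make some of these ideas concrete, here are a few sketches, in the same order as the list. Every table, column and option in them is illustrative, not CubicWeb's actual schema or the exact setup we discussed.

First, LISTEN/NOTIFY for cache invalidation: a trigger broadcasts the id of any modified row on a channel, and cache processes subscribe to that channel (this assumes an entities table with an eid column).

```sql
CREATE FUNCTION notify_cache() RETURNS trigger AS $$
BEGIN
    -- broadcast the modified id on the 'cache_invalidation' channel
    PERFORM pg_notify('cache_invalidation', NEW.eid::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entities_notify
    AFTER UPDATE ON entities
    FOR EACH ROW EXECUTE PROCEDURE notify_cache();

-- any interested session subscribes with:
LISTEN cache_invalidation;
```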
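A computed attribute as a materialized view could look like this (PostgreSQL 9.3 or later; the project/ticket schema is made up):

```sql
CREATE MATERIALIZED VIEW project_stats AS
    SELECT p.eid, count(t.eid) AS open_tickets
    FROM project p
    LEFT JOIN ticket t ON t.project = p.eid AND t.state = 'open'
    GROUP BY p.eid;

-- recompute when the underlying data has changed:
REFRESH MATERIALIZED VIEW project_stats;
```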
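A Multicorn-based foreign data wrapper is declared entirely in SQL once the extension is installed; the LDAP options below are illustrative, check the Multicorn documentation for the exact ones:

```sql
CREATE EXTENSION multicorn;

CREATE SERVER ldap_srv FOREIGN DATA WRAPPER multicorn
    OPTIONS (wrapper 'multicorn.ldapfdw.LdapFdw');

-- illustrative column list and options
CREATE FOREIGN TABLE ldap_people (
    uid  text,
    cn   text,
    mail text
) SERVER ldap_srv OPTIONS (
    address 'ldap.example.com',
    path    'dc=example,dc=com'
);

SELECT cn, mail FROM ldap_people;
```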
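The UTC point boils down to using timestamptz, which stores an absolute instant (internally UTC) and converts on display, whereas a plain timestamp stores a naive wall-clock reading:

```sql
CREATE TEMP TABLE tz_demo (naive timestamp, aware timestamptz);
SET timezone = 'Europe/Paris';
INSERT INTO tz_demo VALUES (now(), now());
SET timezone = 'US/Pacific';
-- 'aware' is converted to the new zone, 'naive' is displayed unchanged
SELECT * FROM tz_demo;
```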
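A transitive closure with a recursive CTE, over a made-up parent_of(child, parent) relation assumed to be acyclic:

```sql
WITH RECURSIVE ancestors(child, ancestor) AS (
    SELECT child, parent FROM parent_of
    UNION ALL
    SELECT a.child, p.parent
    FROM ancestors a JOIN parent_of p ON p.child = a.ancestor
)
SELECT * FROM ancestors;
```

Using UNION instead of UNION ALL deduplicates rows along the way and makes the query terminate even on cyclic graphs.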
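The bulk-loading recipe, end to end (file path, columns and target table are made up):

```sql
CREATE TEMP TABLE staging (code text, label text, value numeric);

-- server-side file; from a client, use COPY ... FROM STDIN instead
COPY staging FROM '/tmp/data.csv' WITH (FORMAT csv);

CREATE INDEX ON staging (code);
ANALYZE staging;

-- transform-and-load, in plain SQL
INSERT INTO measurements (code, label, value)
SELECT code, upper(label), value
FROM staging
WHERE value IS NOT NULL;
```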
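The batch-update variant of the same idea: stage the new values once, then apply them in a single UPDATE ... FROM instead of n round trips:

```sql
CREATE TEMP TABLE new_prices (product_id integer, price numeric);
-- fill it with COPY ... FROM STDIN or a multi-row INSERT, then:

UPDATE products p
SET price = n.price
FROM new_prices n
WHERE p.id = n.product_id;
```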
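When a table grows that large, range partitioning is one of the tools to reach for; the declarative syntax below needs PostgreSQL 10 or later (older versions use inheritance plus triggers):

```sql
CREATE TABLE events (
    eid     bigint,
    created timestamptz NOT NULL
) PARTITION BY RANGE (created);

CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE events_2025 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
```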
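The built-in full-text machinery in a few statements (a documents table with title and body columns is assumed):

```sql
ALTER TABLE documents ADD COLUMN fts tsvector;
UPDATE documents
SET fts = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''));
CREATE INDEX documents_fts_idx ON documents USING gin (fts);

-- indexed search, ranked by relevance
SELECT title
FROM documents
WHERE fts @@ plainto_tsquery('english', 'cache invalidation')
ORDER BY ts_rank(fts, plainto_tsquery('english', 'cache invalidation')) DESC;
```

In production a trigger (for instance the built-in tsvector_update_trigger) would keep the fts column up to date.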
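For the custom-type-plus-index idea, here is the indexing half sketched with a plain integer array instead of a genuine custom type; a real data type would plug into GIN or GiST the same way (edges(src, dst) is a made-up graph table):

```sql
CREATE MATERIALIZED VIEW reachable AS
    WITH RECURSIVE r(src, dst) AS (
        SELECT src, dst FROM edges
        UNION   -- UNION, not UNION ALL: terminates even on cycles
        SELECT r.src, e.dst FROM r JOIN edges e ON e.src = r.dst
    )
    SELECT src, array_agg(dst) AS reachable_set
    FROM r GROUP BY src;

CREATE INDEX reachable_gin ON reachable USING gin (reachable_set);

-- indexed membership test: which nodes can reach node 42?
SELECT src FROM reachable WHERE reachable_set @> ARRAY[42];
```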
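Short of a native numpy data type, PL/Python already lets numpy run inside the server (assuming the plpython3u extension and numpy are installed on it):

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE FUNCTION np_mean(vals float8[]) RETURNS float8 AS $$
    import numpy
    # PL/Python hands float8[] over as a Python list
    return float(numpy.mean(vals))
$$ LANGUAGE plpython3u;

SELECT np_mean(ARRAY[1.0, 2.0, 3.5]);
```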
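Finally, the object-oriented tables and their classic gotcha:

```sql
CREATE TABLE entity   (eid serial PRIMARY KEY, created timestamptz DEFAULT now());
CREATE TABLE document (title text) INHERITS (entity);

-- child rows are visible through the parent...
INSERT INTO document (title) VALUES ('hello');
SELECT count(*) FROM entity;   -- counts the document too

-- ...but indexes, PRIMARY KEY and UNIQUE constraints are NOT inherited,
-- so nothing enforces eid uniqueness across the hierarchy.
```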
Hopin' I got you thinkin' :)