EnterpriseDB: The Enterprise Postgres Company Postgres Plus Forums: The PostgreSQL Open Source Database from EnterpriseDB
  [Search] Search   [Recent Topics] Recent Topics   [Hottest Topics] Hottest Topics   [Members]  Member Listing   [Groups] Back to home page 

GridSQL and Muticore architectures  XML

Forum Index » GridSQL - General
Author Message
Shoaib_M

New member

Joined: 30/03/2008 20:33:49
Messages: 7
Offline

Hi Mason,

Going through the overview it appears to me the system is suited for an environment where you have several processors available (all separate boxes) but in case I am using one box with multi cores, can I still make use of GridSQL in an efficient way?

Regards,
Shoaib
Mason_S

Senior member

Joined: 01/04/2008 09:03:08
Messages: 366
Offline

Hi Shoaib,

GridSQL makes use of Postgres Plus. Postgres Plus can take advantage of SMP architectures when there is concurrency with each core in effect servicing a request.

If GridSQL is deployed for a reporting application where there is low concurrency, then by default these additional cores won't be taken advantage of if it is used (one node per server). However, you can define multiple logical nodes per physical server. Anecdotal evidence suggests that it will help, but will not scale linearly per core-- they are sharing the same memory, using the same bus, and perhaps even the same controller.

One user who loads about 50 million rows per day sent me stats where he compared a single server versus two dual CPU systems. Going from 1 to 2 servers, he saw a much better than 100% improvement (more memory available means more caching, and also Postgres Plus may plan differently if for the data it needs to process fits in memory for certain steps). Setting up 4 logical nodes on those 2 servers resulted in another 60%+ improvement in performance.

Again, I think that sounds reasonable given the fact that they shared a lot of resources, but were able to take advantage of the extra core.

In another internal test we did, we set up 8 logical nodes on a single dual quad core system. Performance increased by 5+ times, not quite 8, but respectable I think in light of the shared resources.

If planning on purchasing systems, I would look at getting separate physical servers, but if you have the extra hardware, you could use GridSQL as described, with the approximate performance expectations.

One thing I have been thinking about recently is allowing multiple logical node databases share the same physical database (right now they are always separate). This would allow for replicated tables to be cacheable just once per physical server, which would be more efficient, particularly if you have large dimensions or lookup tables that you want to replicate.

Of course, an additional possibility is to add more parallelism to Postgres Plus itself.

Regards,

Mason

This message was edited 2 times. Last update was at 02/04/2008 20:05:25

[WWW]
Shoaib_M

New member

Joined: 30/03/2008 20:33:49
Messages: 7
Offline

Mason_S wrote:Hi Shoaib,

GridSQL makes use of Postgres Plus. Postgres Plus can take advantage of SMP architectures when there is concurrency with each core in effect servicing a request.

If GridSQL is deployed for a reporting application where there is low concurrency, then by default these additional cores won't be taken advantage of if it is used (one node per server). However, you can define multiple logical nodes per physical server. Anecdotal evidence suggests that it will help, but will not scale linearly per core-- they are sharing the same memory, using the same bus, and perhaps even the same controller.

One user who loads about 50 million rows per day sent me stats where he compared a single server versus two dual CPU systems. Going from 1 to 2 servers, he saw a much better than 100% improvement (more memory available means more caching, and also Postgres Plus may plan differently if for the data it needs to process fits in memory for certain steps). Setting up 4 logical nodes on those 2 servers resulted in another 60%+ improvement in performance.

Again, I think that sounds reasonable given the fact that they shared a lot of resources, but were able to take advantage of the extra core.

In another internal test we did, we set up 8 logical nodes on a single dual quad core system. Performance increased by 5+ times, not quite 8, but respectable I think in light of the shared resources.

If planning on purchasing systems, I would look at getting separate physical servers, but if you have the extra hardware, you could use GridSQL as described, with the approximate performance expectations.

One thing I have been thinking about recently is allowing multiple logical node databases share the same physical database (right now they are always separate). This would allow for replicated tables to be cacheable just once per physical server, which would be more efficient, particularly if you have large dimensions or lookup tables that you want to replicate.

Of course, an additional possibility is to add more parallelism to Postgres Plus itself.

Regards,

Mason


Thanks a lot for the detailed reply Mason, that really helped me understand what I need to do moving forward.

Can I get some documentation on the internals of GridSQL (as its opensource i guess) and also some manuals from the user end? If that is possible it will be really helpful.

Regards,
Shoaib

This message was edited 1 time. Last update was at 02/04/2008 20:05:58

Mason_S

Senior member

Joined: 01/04/2008 09:03:08
Messages: 366
Offline

Hi Shoaib,

I think your best bet is to start with this page: http://www.enterprisedb.com/gridsql/architecture.do to get an architectural overview of the system, and then dig into the source code. If you have specific questions you can ask them in the GridSQL Developer forum. Also, user manuals are included if you download either the source or the executables.

Regards,

Mason


[WWW]
Shoaib_M

New member

Joined: 30/03/2008 20:33:49
Messages: 7
Offline

Mason_S wrote:Hi Shoaib,

I think your best bet is to start with this page: http://www.enterprisedb.com/gridsql/architecture.do to get an architectural overview of the system, and then dig into the source code. If you have specific questions you can ask them in the GridSQL Developer forum. Also, user manuals are included if you download either the source or the executables.

Regards,

Mason




Thanks Mason, I have just downloaded the source. As I start browsing through it, I will ask you for any questions I have regarding it.

Regards,
Shoaib
Emilio F_P

New member

Joined: 07/12/2008 23:13:25
Messages: 4
Offline

I'm actually interested in making use of a 8-core server (2 Quad-core CPUs) for reporting with low concurrency. Can you provide more or less detailed instructions on how to configure GridSQL for this? Basically, I'm interested in parallelizing queries where there's a single large table (20GB) and several small to very small tables.
It seems unclear if I have to create a separate instance of Postgres (which its own set of DB files) or how can I achieve this?

Thanks
Mason_S

Senior member

Joined: 01/04/2008 09:03:08
Messages: 366
Offline

Hi Emilio,

GridSQL will be able to help you here.

You can use one single instance of PostgreSQL / Postgres Plus here. Each underlying node database is named differently, to avoid any conflict, but you are shielded from needing to worry about these details.

In the gridsql.config file, you will need to configure all of the nodes to use the same host/IP. Then at database creation time, it will create 8 different logical node databases on the same server. So, in your config file you want something like this:

xdb.nodecount=8

xdb.node.1.dbhost=127.0.0.1
xdb.node.2.dbhost=127.0.0.1
xdb.node.3.dbhost=127.0.0.1
xdb.node.4.dbhost=127.0.0.1
xdb.node.5.dbhost=127.0.0.1
xdb.node.6.dbhost=127.0.0.1
xdb.node.7.dbhost=127.0.0.1
xdb.node.8.dbhost=127.0.0.1

Again, if though you are using the same physical system, GridSQL will treat these as 8 separate logical nodes.

Normally I would mention that you should try to put each logical node's data on a separate physical device (see CREATE TABLESPACE in the SQL Reference), but since you only have 20GB, it should not be that big of a deal, chances are you may be able to fit everything in memory...

Partition out your large fact tables using a good column to partition on. For your smaller tables, just set these up to be replicated. That way joins can happen locally within the logical node databases. If you do have one particularly large dimension, consider partitioning it and the fact table on the columns that they join on.

For faster query times try using constraint exclusion partitioning for your fact tables by something you might often query by, such as date range. Note this is different from GridSQL's partitioning across nodes. Please see CREATE TABLE for more details.

There is no need for you to worry about using agents if it is all on a single system.

With so many logical nodes on one system, you may want to bump up the memory used in the bin/gs-server.sh script file, or you may run into an OutOfMemoryException.

Also, review the Admin Guide to configure your OS so you don't run into issues like not enough file descriptors.

I hope this is enough to get you started.

Regards,

Mason Sharp


[WWW]
Emilio F_P

New member

Joined: 07/12/2008 23:13:25
Messages: 4
Offline

Thanks.

Yes, it's indeed good info to start with.
The storage is a RAID 5, so I'm not too concerned bout IO for reading. Now, as for partition schemas, The queries range for data in the range of 30 minutes to 15 days. In general, it's about 24 hours worth of data and that is already noticeable in response times. Would partition based in hours of the day make sense? I'm thinking about 8 partitions, one for 0-2 hs, the other for 4-6 .. 21-23 (Or with variations to accommodate busy hours). Would that make sense?
Mason_S

Senior member

Joined: 01/04/2008 09:03:08
Messages: 366
Offline

Hi Emilio,

If it is just 24 hours worth of data that you have at a time, partitioning by hour or every 2 hours would make sense.

You also mentioned having up to 15 days worth of data. In that case you might want to by every 12 or 24 hours. If it is a mixture and depends, perhaps something in between, like every 4 hours.

With partitioning, you might want to run an EXPLAIN query or two against one of the underlying databases directly, just to make sure it is using the partitioning properly, sometimes with data type casting and such it does not always eliminate subtables as it should.

Regards,

Mason
[WWW]
 
Forum Index » GridSQL - General
Go to:   
Powered by JForum 2.1.8 © JForum Team