EnterpriseDB: The Enterprise Postgres Company Postgres Plus Forums: The PostgreSQL Open Source Database from EnterpriseDB
  [Search] Search   [Recent Topics] Recent Topics   [Hottest Topics] Hottest Topics   [Members]  Member Listing   [Groups] Back to home page 

very slow gs-loader  XML

Forum Index » GridSQL - General
Author Message
Min_Zhou

Member

Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline

Hi all,

I've set up a 38 nodes gridsql cluster. One of these is a coordinator, the others are for running agents.
Hardware: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz * 8, 8 GB RAM,1TB×4 SATA Disks。
Software: underlying databases are Postreges 8.4.3


I loaded data to gridsql csv file using the shell gs-loader.
$ll /disk1/movies
-rwxrwxrwx 1 zhoumin zhoumin 2610540665 Mar 25 16:34 /disk1/movies
for loading only about 2.6GB data, it's too slow. Data about 1TB is very common for us.

gs-loader.sh -d coderplay -u admin -p admin -i /disk1/movies -t netflix -f ',' -k 100000,20,1 -y /home/zhou/bad_records
Loading is finished
100480507 input rows have been sent to Server in 1005 chunks

Time : 1522.617 (seconds)


here is a log fragment from query.log

2010-03-25 17:22:35,128 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:36,552 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:37,914 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:39,346 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:40,705 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:42,150 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:43,513 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:45,114 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:46,674 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:48,267 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:49,715 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','


Is there any better way for this?

This message was edited 2 times. Last update was at 25 Mar 2010 07:18:27

Andrei_M

Senior member

Joined: 19 Dec 2008 01:37:13
Messages: 116
Offline

Fastest way to load data is to run COPY netflix FROM '/disk1/movies' ...
from gs-cmdline, psql or any other tool.
If data file is not in coordinator's file system you can use \copy command from psql
gs-loder.sh provides additional benefits, like better error handling.
However if data are OK, it can not slow down things much.

Probably problem is with network or disk I/O. How long would it take to load this file to Postgres database running on other node in the cluster?
Is your table indexed? Was it created "with xrowid"?

Thanks
Andrei
Min_Zhou

Member

Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline

Hi,

Following your suggestion, I tried loading data from gs-cmdline, and it was still very slow.


On the other hand, a try loading the data into another single postgres instance costed 6m 22s 777ms here.


Any suggestions?

Min

Andrei_M

Senior member

Joined: 19 Dec 2008 01:37:13
Messages: 116
Offline

Is your table indexed? Was it created "with xrowid"?

Your table is partitioned? What is the data type of partitioning column?

Thanks
Andrei
Min_Zhou

Member

Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline

Here is the sql i created this table, with a partitioning key, and w/o any index or xrowid


Thanks,
Min
Andrei_M

Senior member

Joined: 19 Dec 2008 01:37:13
Messages: 116
Offline

Hmm, so approximately 1GB out of 2.6GB is the matherial to calculate hashes to determine node where row should be sent to.
That may turn out into such significant overhead.
May be things are not so bad, as they seem. If your other data have relatively short partitioning key rate would be better.
Even if all your data are like these - relatively long partitioning key, you would need about 6 days to load 1TB of data, while postgres would load those in about 2 days.
Maybe this one-time loss is acceptable - you will benefit from faster indexing and faster queries.

Thanks
Andrei
Min_Zhou

Member

Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline

Very disappointed on this conclusion. Every day we have more than 2TB data should be loaded into our cluster. Our greenplum cluster with IB network card achieves a rate nearly 4TB/hour, and our loader of Hadoop cluster under the same LAN with GridSQL is nearly 40x quicker than GridSQL's.

I think there is a long way to GridSQL on this issue.

Regards,
Min

This message was edited 1 time. Last update was at 26 Mar 2010 07:17:26

Mason_S

Senior member

Joined: 1 Apr 2008 09:03:08
Messages: 380
Offline

Hi Min,

I am surprised that loading is slower compared to vanilla PostgreSQL. I have never seen that happen before, it should be faster.

Did you encounter any errors when trying to load the data?

We'll look into this and see if we can reproduce.

Thanks,

Mason
[WWW]
Min_Zhou

Member

Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline

Hi,

Thanks for your attention. It never go wrong when running copy from gs-loader or from gridsql cmd line, just slowly loading.
I thought the loader of gridsql should be in a parallel way, thus theoretically, it should be much faster than postgres.

Thanks,
Min

This message was edited 1 time. Last update was at 26 Mar 2010 11:10:17

 
Forum Index » GridSQL - General
Go to:   
Powered by JForum 2.1.8 © JForum Team