| Author |
Message |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 25 Mar 2010 06:36:26
|
Min_Zhou
Member
Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline
|
Hi all,
I've set up a 38 nodes gridsql cluster. One of these is a coordinator, the others are for running agents.
Hardware: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz * 8, 8 GB RAM,1TB×4 SATA Disks。
Software: underlying databases are Postreges 8.4.3
I loaded data to gridsql csv file using the shell gs-loader.
$ll /disk1/movies
-rwxrwxrwx 1 zhoumin zhoumin 2610540665 Mar 25 16:34 /disk1/movies
for loading only about 2.6GB data, it's too slow. Data about 1TB is very common for us.
gs-loader.sh -d coderplay -u admin -p admin -i /disk1/movies -t netflix -f ',' -k 100000,20,1 -y /home/zhou/bad_records
Loading is finished
100480507 input rows have been sent to Server in 1005 chunks
Time : 1522.617 (seconds)
here is a log fragment from query.log
2010-03-25 17:22:35,128 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:36,552 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:37,914 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:39,346 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:40,705 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:42,150 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:43,513 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:45,114 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:46,674 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:48,267 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
2010-03-25 17:22:49,715 - 5 - COPY netflix FROM STDIN WITH DELIMITER ','
Is there any better way for this?
This message was edited 2 times. Last update was at 25 Mar 2010 07:18:27
|
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 25 Mar 2010 11:56:31
|
Andrei_M
Senior member
Joined: 19 Dec 2008 01:37:13
Messages: 116
Offline
|
Fastest way to load data is to run COPY netflix FROM '/disk1/movies' ...
from gs-cmdline, psql or any other tool.
If data file is not in coordinator's file system you can use \copy command from psql
gs-loder.sh provides additional benefits, like better error handling.
However if data are OK, it can not slow down things much.
Probably problem is with network or disk I/O. How long would it take to load this file to Postgres database running on other node in the cluster?
Is your table indexed? Was it created "with xrowid"?
|
Thanks
Andrei |
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 02:03:03
|
Min_Zhou
Member
Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline
|
Hi,
Following your suggestion, I tried loading data from gs-cmdline, and it was still very slow.
On the other hand, a try loading the data into another single postgres instance costed 6m 22s 777ms here.
Any suggestions?
Min
|
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 03:21:48
|
Andrei_M
Senior member
Joined: 19 Dec 2008 01:37:13
Messages: 116
Offline
|
Is your table indexed? Was it created "with xrowid"?
Your table is partitioned? What is the data type of partitioning column?
|
Thanks
Andrei |
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 04:29:09
|
Min_Zhou
Member
Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline
|
Here is the sql i created this table, with a partitioning key, and w/o any index or xrowid
Thanks,
Min
|
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 05:56:02
|
Andrei_M
Senior member
Joined: 19 Dec 2008 01:37:13
Messages: 116
Offline
|
Hmm, so approximately 1GB out of 2.6GB is the matherial to calculate hashes to determine node where row should be sent to.
That may turn out into such significant overhead.
May be things are not so bad, as they seem. If your other data have relatively short partitioning key rate would be better.
Even if all your data are like these - relatively long partitioning key, you would need about 6 days to load 1TB of data, while postgres would load those in about 2 days.
Maybe this one-time loss is acceptable - you will benefit from faster indexing and faster queries.
|
Thanks
Andrei |
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 06:21:05
|
Min_Zhou
Member
Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline
|
Very disappointed on this conclusion. Every day we have more than 2TB data should be loaded into our cluster. Our greenplum cluster with IB network card achieves a rate nearly 4TB/hour, and our loader of Hadoop cluster under the same LAN with GridSQL is nearly 40x quicker than GridSQL's.
I think there is a long way to GridSQL on this issue.
Regards,
Min
This message was edited 1 time. Last update was at 26 Mar 2010 07:17:26
|
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 10:54:48
|
Mason_S
Senior member
Joined: 1 Apr 2008 09:03:08
Messages: 380
Offline
|
Hi Min,
I am surprised that loading is slower compared to vanilla PostgreSQL. I have never seen that happen before, it should be faster.
Did you encounter any errors when trying to load the data?
We'll look into this and see if we can reproduce.
Thanks,
Mason
|
|
|
 |
![[Post New]](/templates/default/images/icon_minipost_new.gif) 26 Mar 2010 11:09:47
|
Min_Zhou
Member
Joined: 14 Mar 2010 04:47:31
Messages: 16
Offline
|
Hi,
Thanks for your attention. It never go wrong when running copy from gs-loader or from gridsql cmd line, just slowly loading.
I thought the loader of gridsql should be in a parallel way, thus theoretically, it should be much faster than postgres.
Thanks,
Min
This message was edited 1 time. Last update was at 26 Mar 2010 11:10:17
|
|
|
 |
|
|