Distributing Tables

Table creation with default data distribution

There is a default placement of the data segments of new tables on the devices in a cluster. The default is that new tables are given 3 data segments on every device.

As an example, this is a CREATE TABLE statement with default data segment placement:

CREATE TABLE default_table (col1 int, col2 dec(5,4));

The resulting layout can be displayed using a SHOW TABLES statement. The --output-file option of the client_cli command can be used to get untruncated output. The TABLE_METADATA field shows the equivalent CREATE TABLE statement that would explicitly define the table layout. In the example cluster that has 3 nodes each with 3 devices, the above CREATE TABLE command is equivalent to the following command that specifies the segments and their locations explicitly:

CREATE TABLE "default_table" ("col1" int DEFAULT NULL, "col2" numeric(5,4) DEFAULT NULL)
	WITH (devices =
		(m3d0,m3d0,m3d0,m3d1,m3d1,m3d1,m3d2,m3d2,m3d2,
		 m4d0,m4d0,m4d0,m4d1,m4d1,m4d1,m4d2,m4d2,m4d2,
		 m5d0,m5d0,m5d0,m5d1,m5d1,m5d1,m5d2,m5d2,m5d2));

As you can see, each device is specified 3 times, thus creating 3 segments on each device.

Overriding the default table creation parameters

You can override those defaults. You can decide to place more or fewer segments on a device, or place segments only on a subset of the devices. For example, this may be done in order to avoid using a device which is already nearly full. As another example, you may know that a table will always remain small, in which case the wide distribution of data across many nodes/devices creates more overhead that outweighs the parallel execution in terms of performance.

To take control of the distribution of segments to devices, a WITH DEVICES clause is added to the CREATE TABLE, as shown in the output of SHOW TABLES above. Below is an example of a CREATE TABLE command with a WITH DEVICES clause specifying 3 data segments on 2 storage devices:

CREATE TABLE placed_table (col1 int, col2 dec(5, 4))
	WITH (DEVICES = (m5d0, m5d0, m5d2));

A large table will get higher performance when its data segments are spread on multiple nodes and on multiple devices. On the other hand, a small table would do better on just one device. For the purpose of this discussion, small is up to a few thousand rows.

As described in Data distribution of a table above, parallel processing on a single segment is maximized when it reaches a few 10s of millions of rows. If the data distribution scheme of a table does not reach that scale, consider creating 3 segments on each device (by specifying the same device 3 times in the command). Note that any number beyond 3 segments per device won’t increase performance and should be avoided.

Tuning Performance

​Table creation with default data distribution

​Overriding the default table creation parameters

Table creation with default data distribution

Overriding the default table creation parameters