Distributing Tables
Table creation with default data distribution
There is a default placement of the data segments of new tables on the devices in a cluster. The default is that new tables are given 3 data segments on every device.
As an example, this is a CREATE TABLE
statement with default data segment placement:
The resulting layout can be displayed using a SHOW TABLES
statement. The --output-file
option of the client_cli
command can be used to get untruncated output. The TABLE_METADATA
field shows the equivalent CREATE TABLE
statement that would explicitly define the table layout. In the example cluster that has 3 nodes each with 3 devices, the above CREATE TABLE
command is equivalent to the following command that specifies the segments and their locations explicitly:
As you can see, each device is specified 3 times, thus creating 3 segments on each device.
Overriding the default table creation parameters
You can override those defaults. You can decide to place more or fewer segments on a device, or place segments only on a subset of the devices. For example, this may be done in order to avoid using a device which is already nearly full. As another example, you may know that a table will always remain small, in which case the wide distribution of data across many nodes/devices creates more overhead that outweighs the parallel execution in terms of performance.
To take control of the distribution of segments to devices, a WITH DEVICES
clause is added to the CREATE TABLE
, as shown in the output of SHOW TABLES
above. Below is an example of a CREATE TABLE
command with a WITH DEVICES
clause specifying 3 data segments on 2 storage devices:
A large table will get higher performance when its data segments are spread on multiple nodes and on multiple devices. On the other hand, a small table would do better on just one device. For the purpose of this discussion, small is up to a few thousand rows.
As described in Data distribution of a table above, parallel processing on a single segment is maximized when it reaches a few 10s of millions of rows. If the data distribution scheme of a table does not reach that scale, consider creating 3 segments on each device (by specifying the same device 3 times in the command). Note that any number beyond 3 segments per device won’t increase performance and should be avoided.