An adventure on SQL Server performance and features

Test: confronting various methods of bulk loading data from a table to another, what’s the fastest?

T

Most of the support request I get involved with can be summarized with the following keywords: “slow” + “datawarehouse” + “ETL” + “Save us”
What about thinking about ETL performance before it goes bad? Before the system is in production for some time , the data size reaches a decent size which wasn’t tested in DEV and you find out that the system is not able to scale at all?

Here are the steps to success:

  1. Don’t think about the data that you have right now, think about how many data you’ll have way after the Go Live
  2. TEST with the biggest dataset possibile.

On the subject, I’m saving you some job by testing how the performance changes when bulk importing data into a table in various conditions, let’s get to the data.

I’m testing this on my local machine, nothing fancy, SQL Server 2017 CU3, only 16GB of ram, 4 cores with HT and a very fast NVME drive for both Data files and Logs, I wanted to test the following conditions:

  • Table: Heap vs Clustered Index vs Clustered Columnstore Index
  • Recovery Mode: FULL vs BULK_LOGGED (which for bulk operations is basically the same as SIMPLE)
  • Hints: No Hints vs TABLOCK (My favorite hint)

In each run I monitored, other than the time to complete the operation, the log growth from a fixed dimension and the wait stats occurred during the insert; Each run was repeated twice, once with the log file starting from the standard 10GB size, and another one repeating the test with the size the Log had grown into to verify how much log growing affected performances.
As you may have noticed, an interactive PowerBI report to analyze the data I’ve gathered (don’t forget to drill down, neat!) and at the end of the post you’ll find a summary table with all the tests.

As source data, I used the lineitem table in the TPC-H database created by HammerDB that I already had on my Instance; You can install it for free and it’s a nice tool to literally hammer your instance with a simulated workload of either TPC-C or TPC-H scenarios. The table has 60 million rows , weights about 2.3GB and it’s a Clustered Columnstore, so reading performance are not a bottleneck.
This is the definition of the table I’m going to insert into:

CREATE TABLE [dbo].[lineitem](
  [identity] [bigint] IDENTITY(1,1) NOT NULL,
  [l_shipdate] [date] NULL,
  [l_orderkey] [bigint] NOT NULL,
  [l_discount] [money] NOT NULL,
  [l_extendedprice] [money] NOT NULL,
  [l_suppkey] [int] NOT NULL,
  [l_quantity] [bigint] NOT NULL,
  [l_returnflag] [char](1) NULL,
  [l_partkey] [bigint] NOT NULL,
  [l_linestatus] [char](1) NULL,
  [l_tax] [money] NOT NULL,
  [l_commitdate] [date] NULL,
  [l_receiptdate] [date] NULL,
  [l_shipmode] [char](10) NULL,
  [l_linenumber] [bigint] NOT NULL,
  [l_shipinstruct] [char](25) NULL,
  [l_comment] [varchar](44) NULL,
 -- CONSTRAINT [PK_lineitem] PRIMARY KEY CLUSTERED 
 --(
 -- [identity] ASC
 --) ON [PRIMARY]
) ON [PRIMARY]

This is the query that I’m using to test the inserts:

DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);
GO

DECLARE @t1 datetime2, @t2 datetime2;

SET @t1= SYSDATETIME();

TRUNCATE TABLE dbo.lineitem;

INSERT INTO dbo.lineitem --WITH (TABLOCK) 
       ([l_shipdate],[l_orderkey],[l_discount],[l_extendedprice],
       [l_suppkey],[l_quantity],[l_returnflag],[l_partkey],
       [l_linestatus],[l_tax],[l_commitdate],[l_receiptdate],
       [l_shipmode],[l_linenumber],[l_shipinstruct],[l_comment])
SELECT
       [l_shipdate],[l_orderkey],[l_discount],[l_extendedprice],
       [l_suppkey],[l_quantity],[l_returnflag],[l_partkey],
       [l_linestatus],[l_tax],[l_commitdate],[l_receiptdate],
       [l_shipmode],[l_linenumber],[l_shipinstruct],[l_comment]
FROM   [tpch].[dbo].[lineitem];

--Insert Index Creation Statement here

SET @t2 = SYSDATETIME()

SELECT DATEDIFF(Millisecond,@t1,@t2) as Duration

I’m clearing the Waitstats right before starting the insert query to be able to get a snapshot of what the query was waiting on during the execution.

Run #1: Clustered Index, FULL Recovery model, No Hint, 10GB Log

Time: 239085ms
Log Size: 10GB -> 23GB

This should be the slowest run, that we’re taking as a baseline.
The log has grown by 13GB over the preallocated size and it shows in the waitstats too  as the waits related to the Log write and Log File file growth

Run #2: Clustered Index, FULL Recovery model, No Hint, 23GB Log

Time: 219969ms
Log Size: 23GB

By having the log already grown to size, the PREEMPTIVE_OS_WRITEFILEGATHER wait is gone, since Windows didn’t have to grow the Log file on disk, it was  10 seconds faster, not much.

Run #3: Clustered Index, FULL Recovery model, TABLOCK, 10GB Log

Time: 239574ms
Log Size 23GB

Regardless of the TABLOCK index, nothing really changed here.

Run #4: Clustered Index, FULL Recovery model, TABLOCK, 23GB Log

Same as #2

Run #5: Clustered Index, BULK_LOGGED Recovery model, No Hint, 10GB Log

Time: 170744ms
Log: 10GB -> 32GB

By switching to the BULK_LOGGED recovery model we finally see some noticeable improvement in time, the WRITELOG wait stat has gone way down in time but the PREEMPTIVE_OS_WRITEFILEGATHER is still present, as the log file had to grow and for some reason it has grown even more than in the FULL recovery model, while one would expect to grow the same or less; One possible explanation that I have is that, while the operation is running, the Log in the BULK_LOGGED  contains both log records and the data pages that were changed by bulk operations (bol)

Run #6: Clustered Index, BULK_LOGGED Recovery model, No Hint, 32GB Log

 

Time: 140099ms
Log: 32GB

With the Log file already grown, we only have a few relevant PAGELATCH waits and the WRITELOG has gone way down, indicating that we’re not waiting on the log anymore, we saved almost 30 second!

Run #7: Clustered Index, BULK_LOGGED Recovery model, TABLOCK Hint, 10GB Log

Time: 152851ms
Log: 10GB -> 32GB

As expected, we’re still waiting on the Log file to grow, but the TABLOCK hint made this query a little bit faster, so we’re still better off than run #5

Run #8: Clustered Index, BULK_LOGGED Recovery model, TABLOCK Hint, 32GB Log

Time: 136668ms
Log: 32GB

The fastest run yet, as expected not having to wait on the Log File to grow and with a TABLOCK hint we’re not waiting on anything.

Run #9: Heap, FULL Recovery model, no Hint, 10GB Log

Time: 304027ms
Log: 10GB -> 28GB

What? Wasn’t inserting into a Heap always faster? Not in Full recovery model, a small Log and no Hints, it seems. Slowest run ever, the WRITELOG wait is killing the run.

Run #10: Heap, FULL Recovery model, no Hint, 28GB Log

Time: 270207ms
Log: 28GB

The Log file didn’t have to grow, but we aren’t really in a better position.

Run #11: Heap, FULL Recovery model, TABLOCK Hint, 10GB Log

Time: 97065ms
Log: 10GB

Finally we’re seeing good results! Even with the Full recovery model the insert with the TABLOCK hint more than halved the insert time, the Log file didn’t grow at all

Run #12: Heap, FULL Recovery model, TABLOCK Hint, 10GB Log

Same as #11, just because, just to verify.

Run #13: Heap, BULK_LOGGED Recovey mode, no Hint, 10GB Log

Time: 291038ms
Log: 10GB -> 29GB

Losing the hint makes us lose all the performance benefits, the situation is basically the same as using the FULL recovery model.

Run #14: Heap, BULK_LOGGED Recovey mode, no Hint, 29GB Log

Time: 276963ms
Log: 29GB

As expected, a little bit faster since we didn’t have to grow the log file, but nothing else.

Run #15: Heap, BULK_LOGGED Recovey mode, TABLOCK Hint, 10GB Log

Time: 77588ms
Log: 10GB

Here is where the real money is! Lowest time ever, no additional Log Growth, as fast as it can get.
But what if I then need to query this table, so I need at least a Clustered Index? Let’s see how much time it takes to create it:

Run #16: Run #15 + Create the Clustered Index

Time: 135376
Log: 10GB

In this case, by first inserting the data into the Heap with a TABLOCK, we create the index right after inserting all the records.
The total time of the operation is basically the same one of inserting the records directly in the Clustered index with TABLOCK and BULK_LOGGED recovery model, the only difference is that the log didn’t grow to 32GB but stayed at the original size.

Run #17: Clustered Columnstore, Full Recovery Model, No Hint, 10GB Log

Time: 245172ms
Log: 10GB

Throwing Columnstore into the mix, we’re seeing that for this kind of table structure/Data doesn’t really performs any better than a classic Clustered Index, but in this case the log didn’t need to grow more than the prefixed 10GB size.
I expected better performance from this test, probably I’ll get some additional benefits in a machine with more ram, even if I’m dealing only with 2.3GB of data in 16GB of RAM the columnstore may benefit from a little bit more memory.

Run #18 to #21: All the various combinations of HINT/Recovery Model

Since Columnstore doesn’t need the TABLOCK hint to run minimally logged, these tests offer no real difference than the last one and were made/included just for completeness.
Log size never exceeds 10GB and the time is always around the same with some small variability related to the recovery model.

Run #22: Run #15 + Columnstore Creation

Time: 133218ms
Log 10GB

Even if in these test Columnstore wasn’t ideal to straight insert into (again, probably a limitation of my machine), when creating the index in a second time we basically get the same results as #16
However, it has the most interesting wait stats yet, as it shows that the index creation was done in parallel and that indeed I need way more than 16GB of ram to demonstrate the Columstore glory.

Round Up:

So, what’s best? As always, it depends.
If you have to get it done and just need the data in the table in the fastest way, then go ahead with a Heap and TABLOCK (I guess you’ll create your indexes Online right after)
If the data needs to be ready in an indexed table, then you can either create the index before or after the insert, the time will be the same but the transaction log will grow to a bigger size while the operation is running in the first case, maybe keep that in mind if you’re in a log shipping scenario.

The definitive answer is: TEST, TEST, TEST.

Finally, here’s table with all the results neatly organized:

Run #ToHintRecovery ModeStarting Log Size (MB)Ending Log Size (MB)Time
1ClusteredNoneFULL1024823176239085
2ClusteredNoneFULL2317623176219969
3ClusteredTABLOCKFULL1024823688239574
4ClusteredTABLOCKFULL2368823688222865
5ClusteredNoneBULK_LOGGED1024832392170744
6ClusteredNoneBULK_LOGGED3239232392140099
7ClusteredTABLOCKBULK_LOGGED1024832392152851
8ClusteredTABLOCKBULK_LOGGED3239232392136668
9HeapNoneFULL1024828680304027
10HeapNoneFULL2868028680270207
11HeapTABLOCKFULL102481024897065
12HeapTABLOCKFULL286802868099071
13HeapNoneBULK_LOGGED1024828936291038
14HeapNoneBULK_LOGGED2893628936276963
15HeapTABLOCKBULK_LOGGED102481024877588
16Heap + CX CreationTABLOCKBULK_LOGGED1024810248135376
17Clustered ColumnstoreNoneFULL1024810248245172
18Clustered ColumnstoreNoneFULL1024810248242231
19Clustered ColumnstoreTABLOCKFULL1024810248251289
20Clustered ColumnstoreNoneBULK_LOGGED1024810248235741
21Clustered ColumnstoreTABLOCKBULK_LOGGED1024810248245408
22Heap + Columnstore CreationTABLOCKBULK_LOGGED1024810248133218

 

About the author

Emanuele Meazzo

2 comments

An adventure on SQL Server performance and features

Emanuele Meazzo

My Social Profiles

If you're interested in following me around in the social world, here are some URLs for you