MS SQL Server - Free Queries and Tools for Data Analysis
4/20/2021
One of my many job duties is to perform database migrations: translating data from other proprietary databases into our MS SQL database. When a database is first loaded and a random table is queried, the foreign data can be a bit overwhelming to look at. Some of the databases I have migrated have been gigabytes in size, and some tables contained millions of rows across many columns, so I certainly needed to improve my analysis skills.
Over time, I’ve learned some neat tips on how I like to prep my database for analysis, along with free tools I use for finding data I know exists, but cannot locate.
Prep the Data (1) – Eliminate Empty Tables
Sometimes, depending on the database you are analyzing, you may notice that A LOT of tables have zero rows in them, and the list of tables in your Object Explorer can be extremely long.
Below is a script you can run against your database to identify any empty tables and generate DROP statements to remove them. Please note any reference constraints before removing a table; they can be very useful clues about how the schema fits together!
-- Collect a row count for every user table, then script out DROPs for the empty ones
CREATE TABLE #counts
(
    table_name varchar(255),
    row_count int
)

-- sp_MSForEachTable runs the command once per table, substituting ? with the table name
EXEC sp_MSForEachTable @command1 = 'INSERT #counts (table_name, row_count) SELECT ''?'', COUNT(*) FROM ?'

-- Generate a DROP statement for each empty table (review the output before running it!)
SELECT 'drop table ' + table_name FROM #counts WHERE row_count = 0 ORDER BY table_name

DROP TABLE #counts
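If you want to check those reference constraints before dropping anything, a quick query like the one below can help. This is a minimal sketch using the standard sys.foreign_keys catalog view; dbo.SomeEmptyTable is a placeholder for whichever table the script flagged.

-- List any foreign keys that reference a table before you drop it
-- (dbo.SomeEmptyTable is a hypothetical placeholder name)
SELECT
    fk.name AS ConstraintName,
    OBJECT_NAME(fk.parent_object_id) AS ReferencingTable,
    OBJECT_NAME(fk.referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys fk
WHERE fk.referenced_object_id = OBJECT_ID(N'dbo.SomeEmptyTable')
ORDER BY ReferencingTable;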
Prep the Data (2) – Eliminate Unused Columns
Like the previous point, there are often many columns that serve no purpose at all. To avoid overwhelming myself when running queries and trying to determine which columns actually hold data, I found a procedure that identifies unused (entirely NULL) columns so they can be dropped from the database.
Below is a procedure you can create on your database to identify any unused columns and generate ALTER statements to remove them. As with tables, please note any reference constraints before removing a column; they could be very useful!
CREATE PROCEDURE [dbo].[mysp_DropEmptyColumns]
    @tableName nvarchar(max)
AS
BEGIN
    DECLARE @FieldName nvarchar(max)
    DECLARE @SQL nvarchar(max)
    DECLARE @CountDef nvarchar(max)
    DECLARE @FieldCount int

    -- Cursor over every column of the target user table
    DECLARE fieldNames CURSOR LOCAL FAST_FORWARD FOR
        SELECT c.name
        FROM syscolumns c
        INNER JOIN sysobjects o ON c.id = o.id
        WHERE o.xtype = 'U' -- user tables only
          AND o.Name = @tableName

    OPEN fieldNames
    FETCH NEXT FROM fieldNames INTO @FieldName

    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        -- Count the non-NULL values in this column
        SET @SQL = N'SELECT @Count = COUNT(*) FROM ' + QUOTENAME(@tableName) + N' WHERE ' + QUOTENAME(@FieldName) + N' IS NOT NULL'
        SET @CountDef = N'@Count int output';
        EXEC sp_executesql @SQL, @CountDef, @Count = @FieldCount OUTPUT

        IF (@FieldCount = 0)
        BEGIN
            -- Column is entirely NULL: print the ALTER statement
            -- (uncomment the exec line to actually drop the column)
            SET @SQL = N'ALTER TABLE ' + QUOTENAME(@tableName) + N' DROP COLUMN ' + QUOTENAME(@FieldName)
            /* EXEC sp_executesql @SQL */
            PRINT @SQL
        END

        FETCH NEXT FROM fieldNames INTO @FieldName
    END

    CLOSE fieldNames
    DEALLOCATE fieldNames -- release the cursor
END
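Once the procedure is created, running it is a one-liner. The table name below is just a hypothetical placeholder; note the procedure only prints the ALTER statements until you uncomment the exec line inside it.

-- 'Customers' is a hypothetical table name; pass the bare name without a schema prefix
EXEC [dbo].[mysp_DropEmptyColumns] @tableName = N'Customers'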
Identify Primary / Large Tables
The query below identifies the largest tables, both by file size and by row count. This will help you quickly find the most important tables data-wise, and identify any tables that may store files in a binary format, which will need to be extracted.
USE [IMPORTDB] -- replace with your dbname
GO
SELECT
s.Name AS SchemaName,
t.Name AS TableName,
p.rows AS RowCounts,
CAST(ROUND((SUM(a.used_pages) / 128.00), 2) AS NUMERIC(36, 2)) AS Used_MB,
CAST(ROUND((SUM(a.total_pages) - SUM(a.used_pages)) / 128.00, 2) AS NUMERIC(36, 2)) AS Unused_MB,
CAST(ROUND((SUM(a.total_pages) / 128.00), 2) AS NUMERIC(36, 2)) AS Total_MB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
GROUP BY t.Name, s.Name, p.Rows
ORDER BY Total_MB DESC -- largest tables first
GO
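To follow up on the binary-format point, the sketch below lists columns with binary-style data types so you know which tables may hold embedded files. It is a minimal sketch relying only on the standard sys.columns / sys.types / sys.tables catalog views; adjust the type list to suit your database.

-- Find columns that likely store embedded files
SELECT
    OBJECT_SCHEMA_NAME(c.object_id) AS SchemaName,
    OBJECT_NAME(c.object_id) AS TableName,
    c.name AS ColumnName,
    ty.name AS DataType
FROM sys.columns c
INNER JOIN sys.types ty ON c.user_type_id = ty.user_type_id
INNER JOIN sys.tables t ON c.object_id = t.object_id -- limit to user tables
WHERE ty.name IN ('varbinary', 'image', 'binary')
ORDER BY SchemaName, TableName, ColumnName;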
Free Tool – Apex SQL Search
An amazing free tool I found for digging through data is ApexSQL Search, an add-in for both SSMS and Visual Studio that helps you search for pieces of information you know exist in a database but are not sure where they live.
I normally use ApexSQL Search to find which tables and columns store certain pieces of information, based on client-provided screenshots that help me understand how the information is presented to the user.
ApexSQL Search can be downloaded for free from the ApexSQL website.
Not Free But Awesome – Cloud VPS (AWS, Azure, Google)
Sometimes, when handling extremely large databases, restore times can suck. For example, restoring our 150 GB production database on an AWS r5d.xlarge (4 cores, 16 GB RAM) can take anywhere from 30 minutes to an hour to complete! Now certainly your database (or the client's) may not be as big, but sometimes it's nice to have a little extra horsepower available to do the processing when needed.
Cloud VPS offerings from providers such as Amazon Web Services (AWS) are a great way to quickly spin up short-term instances for heavy processing, at much less cost than owning the hardware yourself. For example, if you have a large-scale task to run (a DB restore, modification, or backup) followed by lighter tasks (data analysis, transferring the backup file, etc.), you can follow the outline below, with a restore sketch after it, to get the job done quickly and cost-efficiently.
- Start Large Server
- Perform Heavy Tasks
- Downsize Server
- Perform Rest of Lighter Tasks
- Terminate Servers/ Drives when complete
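As a concrete example of the "Perform Heavy Tasks" step, here is a minimal T-SQL restore sketch. The database name, backup path, and logical/physical file names are all hypothetical; they would match whatever backup you copied up to the instance.

-- A minimal restore sketch; all names and paths are hypothetical placeholders
RESTORE DATABASE [IMPORTDB]
FROM DISK = N'D:\Backups\IMPORTDB.bak' -- backup copied to the instance's fast local drive
WITH MOVE N'IMPORTDB_Data' TO N'D:\Data\IMPORTDB.mdf',
     MOVE N'IMPORTDB_Log' TO N'D:\Data\IMPORTDB.ldf',
     STATS = 10; -- report progress every 10%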
You can end up saving tons of money, both in time and resources! If you know you won't be using the machine for more than 6 hours, you may also want to look into Spot Instances, as those are often much cheaper.
Have any tips you would like to share as well? Share them in the comments below!