Data Warehousing IIS Logs

Previously, I discussed using LogParser 2.2 to load IIS logs into SQL Server. Continuing with my prototype for reporting on IIS log data, I roughed out a data warehouse for the log data – the schema is outlined below.

The fact table, LogFact, is largely composed of IIS log data as the measurements – specifically timetaken, scBytes and csBytes.  Normally I might prefer a cleaner naming convention for my fact table, but I thought it would be beneficial to keep the IIS log naming convention so we can refer back to the IIS log specification with less guesswork involved.
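
For reference, here is a minimal sketch of what LogFact might look like; the column list is inferred from the ETL script later in this post, and the data types are assumptions:

CREATE TABLE LogFact (
	LogID     int           NOT NULL,  --key back to the raw Logs row
	cIp       varchar(45)   NOT NULL,  --client IP address from the IIS log
	csUriStem varchar(2048) NULL,      --requested URL
	[date]    date          NOT NULL,
	[time]    time(0)       NOT NULL,
	[hour]    int           NOT NULL,  --derived from [time] for easy rollups
	timetaken int           NULL,      --rounded to the nearest 100 during ETL
	scBytes   int           NULL,      --bytes sent by the server, rounded
	csBytes   int           NULL,      --bytes received from the client, rounded
	UserID    int           NOT NULL,  --application context resolved during ETL
	CompanyID int           NOT NULL
);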

The dimensions – Users, Urls, Companies – represent the context of the application to which these logs belong, such as company, user and office data. Keep in mind that this is all sample data I created, and I didn’t go to such lengths as to create a fancy fictitious context.  In your applications, the dimensions you add for context would likely have more interesting data.
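
The dimension tables can stay equally simple for a prototype. A hypothetical sketch, assuming surrogate keys that match the UserID and CompanyID columns in the fact table (a Urls dimension could key on csUriStem in the same way):

CREATE TABLE Users (
	UserID   int          NOT NULL PRIMARY KEY,
	UserName varchar(100) NULL
);

CREATE TABLE Companies (
	CompanyID   int          NOT NULL PRIMARY KEY,
	CompanyName varchar(100) NULL
);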

Speaking of application context, what is it? What I mean by application context is the set of application concepts (e.g. domain entities and events) that add meaning or value to the IIS log entries.  In this sample, we have Users, Companies and Offices as our application context, so we can know which users from which company (and potentially which office) are accessing pages at specific dates and times.  I have chosen to associate the IIS logs with the application context based on IP address. So, when considering warehousing your IIS logs, you’ll want to make sure you have some way to associate your log data with your application data.

When deciding to log application-related data, more is not always better, and the same goes for deciding what IIS log data to capture.  The more verbose your logging, the more data you must clean and maintain.  If you don’t need it, don’t log it.  A great way to manage data greed is to capture only the information that helps answer the question(s) for the business – put another way, collect only the data that adds immediate value. This will save you time warehousing, cubing and otherwise crunching the data that is important – especially if you’re resource constrained (e.g. storage space or hardware).
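
In IIS, this trimming happens in the W3C logging field selection. For example, a log pared down to just the fields this warehouse actually uses would carry a #Fields directive like this (a hypothetical selection, not a recommendation for every site):

#Fields: date time c-ip cs-uri-stem sc-bytes cs-bytes time-taken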

Below are the tables I am working with.  Recall that in the previous post I moved all of the IIS logs into the Logs table via LogParser.  There are a few ways to go about loading data into the warehouse.
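
For orientation, here is a rough sketch of the source tables as the ETL script below implies them; the table and column names come from the script, while the data types are assumptions:

CREATE TABLE Logs (               --raw IIS log rows loaded by LogParser
	LogID     int IDENTITY(1,1) PRIMARY KEY,
	cIp       varchar(45),
	csUriStem varchar(2048),
	[date]    date,
	[time]    datetime,
	timetaken int,
	scBytes   int,
	csBytes   int
);

CREATE TABLE LoginLogs (          --application login events: who logged in from which IP, and when
	IP         varchar(45),
	UserID     int,
	[DateTime] datetime
);

CREATE TABLE OfficeUsers (        --maps users to offices
	UserID   int,
	OfficeID int
);

CREATE TABLE Offices (            --maps offices to companies
	OfficeID  int,
	CompanyID int
);

CREATE TABLE LastProcessedLogs (  --ETL watermark
	LastProcessedLog int,
	DateProcessed    datetime
);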

One way would be to edit our LogParser script to load directly into the LogFact table by adjusting our SELECT statement.  The pro is that we don’t need to store the data twice, once in a “raw” table and a second time in the fact table.  The con is that control of your ETL process becomes entirely dependent on LogParser, which arguably is not meant for ETL but for querying log data via SQL. Also, LogParser won’t help with merging IIS logs and application context, so a second step or a stronger ETL strategy would be necessary.
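
For illustration, a direct load using LogParser 2.2’s SQL output format might look something like this; the server, database and log path are placeholders:

LogParser "SELECT date, time, c-ip, cs-uri-stem, sc-bytes, cs-bytes, time-taken INTO LogFact FROM C:\inetpub\logs\LogFiles\W3SVC1\*.log" -i:W3C -o:SQL -server:localhost -database:IISLogsDW -driver:"SQL Server" -createTable:OFF

Note that this can only populate columns that exist in the log itself, which is exactly why the merging step discussed below would still be needed.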

In this case, I loaded the data into the Logs table and then created a SQL ETL script that moves select data out of the Logs table, merges it with my application context and loads it into the warehouse schema. For high-volume situations the bigger toys like SSIS may be useful – and might even replace LogParser at that point.

Here is the ETL script I fashioned for this prototype:

--Gets the most recent login time for each IP address
SELECT IP, MAX([DateTime]) AS LastLogin
INTO #IPs
FROM LoginLogs
GROUP BY IP

--Resolves each IP to the Company and User from its most recent login (above)
SELECT DISTINCT ll.IP, o.CompanyID, ll.UserID
INTO #TempIPs
FROM LoginLogs ll
JOIN #IPs i ON (i.IP = ll.IP)
JOIN OfficeUsers ou ON (ou.UserID = ll.UserID)
JOIN Offices o ON (o.OfficeID = ou.OfficeID)
WHERE ll.[DateTime] >= i.LastLogin

--Loads new log rows into the fact table, merged with the application context
INSERT INTO LogFact
SELECT
	l.LogID,
	l.cIp,
	l.csUriStem,
	l.[date],
	CAST(l.[time] AS time(0)) AS [time],
	DATEPART(hour, l.[time]) AS [hour],
	ROUND(l.timetaken, -2) AS timetaken, --round measures to the nearest 100 to reduce cardinality
	ROUND(l.scBytes, -2) AS scBytes,
	ROUND(l.csBytes, -2) AS csBytes,
	tmp.UserID,
	tmp.CompanyID
FROM #TempIPs tmp
JOIN Logs l ON (l.cIp = tmp.IP)
WHERE l.timetaken IS NOT NULL
AND l.LogID > COALESCE((SELECT MAX(LastProcessedLog) FROM LastProcessedLogs), 0)
GO
--Make an entry noting the last record processed so we don't process the data more than once
IF (SELECT MAX(LastProcessedLog) FROM LastProcessedLogs) < (SELECT MAX(LogID) FROM LogFact)
	OR (SELECT MAX(LastProcessedLog) FROM LastProcessedLogs) IS NULL
BEGIN
	INSERT INTO LastProcessedLogs
	SELECT MAX(LogID) AS LastProcessedLog, GETDATE() AS DateProcessed FROM LogFact
END
GO

DROP TABLE #TempIPs
DROP TABLE #IPs
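
With the fact table populated, it’s worth sanity-checking the warehouse with plain SQL before building anything fancier. For example, a simple rollup of average response time and total bytes sent per company per hour:

SELECT CompanyID, [hour],
	AVG(timetaken) AS AvgTimeTaken,
	SUM(scBytes) AS TotalBytesSent
FROM LogFact
GROUP BY CompanyID, [hour]
ORDER BY CompanyID, [hour]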

In a future post, I will discuss how to create a data cube from our new warehouse, which gets us a step closer to pulling some really cool knowledge from our data!
