Previously, I discussed using LogParser 2.2 to load IIS logs into SQL Server. Continuing on with my prototype for reporting on IIS Log data, I roughed out a data warehouse for the log data – the schema is below.
The fact table “LogFact” consists largely of IIS Log data as the measurements – specifically timetaken, scBytes and csBytes. Normally I might prefer a cleaner naming convention for my fact table, but I thought it would be beneficial to keep the IIS Log naming convention so we can refer back to the IIS Log specification with less guesswork involved.
The dimensions – Users, Urls, Companies – represent the context of the application to which these logs belong, such as company, user and office data. Keep in mind that this is all sample data I created, and I didn’t go to such lengths as to create a fancy fictitious context. In your applications, the dimensions you add for context would likely have more interesting data.
Speaking of application context, what is it? By application context, I mean the application concepts (e.g. domain entities, events, etc.) that add meaning or value to the IIS log entries. In this sample, we have Users, Companies and Offices as our application context, so we can know which users from which company (and potentially which office) are accessing pages at specific dates and times. I have chosen to associate the IIS Logs with the application context based on IP address. So, when considering warehousing your IIS Logs, you’ll want to make sure you have some way to associate your log data with your application data.
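At its simplest, that IP-based association is just a join between the raw log rows and the application’s login records. The sketch below uses the table and column names from the ETL script later in this post (Logs.cIp, LoginLogs.IP, LoginLogs.UserID); in your application the linking key could be a username, session ID or something else entirely:

```sql
-- Sketch: tie each raw IIS log row to an application user by matching
-- the client IP against the IP captured at login.
SELECT l.LogID, l.cIp, l.csUriStem, ll.UserID
FROM Logs l
JOIN LoginLogs ll ON (ll.IP = l.cIp);
```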
When deciding to begin logging application related data more is not always better, and the same goes for deciding what IIS Log data to capture. The more verbose your logging, the more data you must clean and maintain. If you don’t need it, don’t log it. A great way to manage data greed is to only capture information that helps answer the question(s) for the business – put another way, only collect the data that adds immediate value. This will save you time warehousing, cubing and otherwise crunching the data that is important – especially if you’re resource constrained (e.g. storage space or hardware).
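As an illustration, the LogParser import from the previous post could be trimmed to just the fields this prototype needs, rather than every W3C field. This is a sketch only – the server, database and log path are placeholders, and the command is wrapped here for readability:

```
LogParser.exe -i:IISW3C -o:SQL -server:MyServer -database:MyDb -createTable:ON
  "SELECT date, time, c-ip, cs-uri-stem, time-taken, sc-bytes, cs-bytes
   INTO Logs FROM ex*.log"
```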
Below are the tables I am working with. Recall in the previous post I moved all of the IIS Logs into the Logs table via LogParser. There are a few ways to go about loading data into the warehouse.
One way would be to edit our LogParser script to load directly into the LogFact table by modifying our SELECT statement. The pro is that we don’t need to store the data twice – once in a “raw” table and a second time in the fact table. The con is that your ETL process becomes entirely dependent on LogParser, which arguably is not meant for ETL, but rather for querying log data via SQL. Also, LogParser won’t help with merging IIS Logs and application context, so a second step or a stronger ETL strategy would be necessary.
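For what it’s worth, that direct-load variant would look roughly like this – the same kind of LogParser import pointed at LogFact instead of Logs (connection details are placeholders, command wrapped for readability). Notice there is no way to populate UserID or CompanyID here, which is exactly why a second step would still be needed:

```
LogParser.exe -i:IISW3C -o:SQL -server:MyServer -database:MyDb
  "SELECT c-ip, cs-uri-stem, date, time, time-taken, sc-bytes, cs-bytes
   INTO LogFact FROM ex*.log"
```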
In this case I loaded the data into the Logs table and then created a SQL ETL script to move select data out of the Logs table, merging it with my application context and loading it into the warehouse schema. For high-volume situations, bigger toys like SSIS may be useful – and may even replace LogParser at that point.
Here is the ETL script I fashioned for this prototype:
--Gets most current IP list
SELECT DISTINCT IP, MAX([DateTime]) AS LastLogin
INTO #IPs
FROM LoginLogs
GROUP BY IP
ORDER BY IP DESC

--Gets Company and User IDs based on most current IP list (above)
SELECT DISTINCT ll.IP, o.CompanyID, ll.UserID
INTO #TempIPs
FROM LoginLogs ll
JOIN #IPs i ON (i.IP = ll.IP)
JOIN OfficeUsers ou ON (ou.UserID = ll.UserID)
JOIN Offices o ON (o.OfficeID = ou.OfficeID)
WHERE ll.[DateTime] >= i.LastLogin
ORDER BY IP DESC

INSERT INTO LogFact
SELECT l.LogID, l.cIp, l.csUriStem, l.[date],
       CAST(l.[time] AS time(0)) AS [time],
       DATEPART(hour, l.[time]) AS [hour],
       ROUND(l.timetaken, -2) AS timetaken,
       ROUND(l.scBytes, -2) AS scBytes,
       ROUND(l.csBytes, -2) AS csBytes,
       tmp.UserID, tmp.CompanyID
FROM #TempIPs tmp
JOIN Logs l ON (l.cIp = tmp.IP)
WHERE timetaken IS NOT NULL
  AND LogID > COALESCE((SELECT MAX(LastProcessedLog) FROM LastProcessedLogs), 0)
GO
--Make an entry noting the last record processed so we don't process the data more than once
IF (SELECT MAX(LastProcessedLog) FROM LastProcessedLogs) < (SELECT MAX(LogID) FROM LogFact)
   OR (SELECT MAX(LastProcessedLog) FROM LastProcessedLogs) IS NULL
BEGIN
    INSERT INTO LastProcessedLogs
    SELECT MAX(LogID) AS LastProcessedLog, GETDATE() AS DateProcessed
    FROM LogFact
END
GO

DROP TABLE #TempIPs
DROP TABLE #IPs
In a future post, I will discuss how to create a data cube from our new warehouse, which gets us a step closer to pulling some really cool knowledge out of our data!