Archive for July, 2009

Slowly Changing Dimensions are another topic I’ve kept bumping into lately, so I thought I’d do a brief summary of my musings here.

First, a recap.

SCDs are typically placed in the context of a MOLAP solution, or more broadly a DW dimensional model, but the issue is not exclusive to those models: it arises with any storage of historical information.

The question is: how do you store history?  Data often changes over time, for various reasons.  The change may be intentional (a customer changing address) or an error (a customer’s name entered incorrectly).  The old information may or may not be meaningful – and here an error may become meaningful if, for example, you dispatched a parcel to an incorrect address.

Ideally, all historical information would be stored as it was when used in production, but this can blow out the cost and complexity of storing and of retrieving the information.  As always, the solution lies in the balance between differing aims – the technical cost vs the business cost, for example.  In an ideal world, the business imperatives are paramount, but such a conflict can be put in business terms, if the alternatives are expressed as business costs.

Again, the world is less than ideal: business stakeholders are rarely interested in the technical costs of holding and using the information.  Educated guesswork is then required on the part of the data custodian (whether BI, DW, or governance) – and the issue should at least be flagged to the business stakeholder.

So, we have alternatives for storing changing information [as dimensions].
SCD type 1: overwrite the old data.  Make the decision that the historical aspect of that particular datum is sufficiently unimportant.  Example: customer address.  There are few situations where maintenance of the old address is warranted (a counter-example, however, is tracking outstanding orders or issues – although there are other ways to effect this).
SCD type 2: save all history for that datum.  There are three possible approaches: a) store successive versions of the data as version 0, 1, 2, … [and make the current one either the highest or the lowest – either comes at a cost, depending again on business issues]; b) store each version with a from date and a to date – the most robust option, but sometimes the most complex to live with; c) flag a particular version as current (which may or may not set up a foolproof ordering for previous versions, albeit not date-specific).  Example: if a salesperson leaves, assigning all their customers to other salespeople will mask the individual’s effect on sales – yet this is something you may well need to analyse into the future.
SCD type 3: a ‘previous version’ is saved as a separate data item (column).  This limits the number of historical versions to a particular decision point, since adding further versions (columns) later can be quite problematic.  Example: one may wish to compare the performance of a supplier with that of the previous supplier of a product (if one exists).  Such a comparison is unlikely to be carried out across more than those two instances.  (Yet it needn’t be restricted to ‘current’ and ‘previous’ – it could, for example, include a ‘future’ instance.  The point here is that this approach is appropriate where a specifically finite set of historical data points is kept.)  A rough sketch of all three types in code follows.
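
Here is that sketch, in Python – an illustration of the idea only, using a made-up customer-address dimension rather than any particular platform’s implementation.  The type 2 function records both from/to dates and a current flag (options b and c above); in practice one or the other is often enough.

from datetime import date

# Made-up customer dimension: one row per version of each customer.
dim_customer = [
    {"customer_id": 42, "address": "1 Old St", "previous_address": None,
     "valid_from": date(2008, 1, 1), "valid_to": None, "is_current": True},
]

def scd_type1_update(rows, customer_id, new_address):
    """Type 1: overwrite the old value in place; no history is kept."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["address"] = new_address

def scd_type2_update(rows, customer_id, new_address, change_date):
    """Type 2: expire the current row and add a new one with its own from/to dates."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date      # close off the old version
            row["is_current"] = False
    rows.append({"customer_id": customer_id, "address": new_address,
                 "previous_address": None, "valid_from": change_date,
                 "valid_to": None, "is_current": True})

def scd_type3_update(rows, customer_id, new_address):
    """Type 3: keep just one prior value, in a dedicated 'previous' column."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["previous_address"] = row["address"]
            row["address"] = new_address

scd_type2_update(dim_customer, 42, "7 New Ave", date(2009, 7, 1))
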
Those are the main types, in common use – but there are others.

Type 0 – the most common! – is the default, that is, do nothing specific.  How the data is updated depends on the original design of the historical data store (DW), which means in some cases it may be possible to obtain historical information.

Type 4 involves a separate history table for non-current data – not common, but it does happen.

Some also distinguish a Type 6, which is effectively my type 2b above. Wikipedia’s SCD article describes it thus, and attributes it to Ralph Kimball.  However, Kimball put his name to a 2005 article in Intelligent Enterprise which categorises it as a type 2 variant.

Of course, the most appropriate approach depends entirely on the business uses for the data – but again there’s no ideal answer, as business uses can change.

The traditional approach for deciding these issues has been for the BI/DW/data professional to consult with stakeholders ad hoc.  More appropriate, on an enterprise scale, is formalised data governance: that is, significant decisions being devolved to a data governance body (or bodies) comprised of the relevant stakeholders (typically representatives from technical and one or more business areas, incorporating business data ownership).

In the end, it’s a balancing act between business needs and resources (including the human resources to implement/maintain).  That should be a business decision: technical budgets should originate with business stakeholders, and be subject to revision.  But our world is never ideal, and the nexus is never sufficiently robust – at least in part because business stakeholders are like drivers that want the car to be in good order, but don’t want costly fixes, and don’t want to know the technical details.  Decisions that are significant should be flagged thus by the business stakeholders, but failing that it would be good to at least have a business eye cast over the alternatives.  At the very least, fall back on documenting reasons for design decisions.

In fact, there is definitely extra complexity involved in implementing (and maintaining!) the higher-numbered types, so it is not recommended to push beyond the business requirements for the sake of an attractive (but convoluted) solution.

There’s been a discussion on changing between types 1 and 2, here.  If it becomes necessary to consider both options, I pull my hair out at the thought of incorporating both as separate dimensions in one model.  I’m very much in favour of keeping models as simple as possible – other, more important spanners will undoubtedly get thrown into the works to ramp up complexity.  If both are needed – usually for different business purposes – then separate data marts are the go.  The more fungible the better, as Karen Heath might say.



This is a tale of a frustrating experience installing a Microsoft product – which details failings in three separate areas.  That in itself may not surprise many, although it should.  Lessons to be learnt on both sides.

Hurdle 1 came in the purchase.  Microsoft Australia’s sales office told me it would cost at least $4,000 to purchase their database product, SQL Server, for a single-computer home installation.  Further, they said it would only run on Server editions of Windows, and not on Vista.  I ended up talking to a couple of Microsoft sales people, both of whom said the same thing.

That did not accord with my experience or recollection, so I went through unofficial Microsoft channels (thanks, Dave).  Of course, the sales people were wrong.  I ended up buying the Developer’s edition ($75) directly from an Australian retailer – and it ran on my Vista laptop.

Why did Microsoft not give me the right information?  The word is that a) their direct sales channel is not good at handling developers; b) they do not admit to SQL Server running on non-Server Windows because they would prefer to see it run on server boxes, which tend to be much more highly configured.  The latter is fair – in an enterprise context.  However, the whole experience demonstrates the difficulty Microsoft has in interfacing with professionals who don’t use Microsoft’s developer channel – as well as the fact that they don’t adequately gear this product to small business, which would certainly benefit from it, given the chance.

Hurdle 2: when run, the install process specifies to enter the product key (serial number) as printed on the Certificate of Authenticity.  That’s a small sticker panel on the outside of the software packaging – which doesn’t, unfortunately, contain the product key.  That is to be found on a smaller sticker discreetly sitting on the inside of the box.  Minor, but a show-stopper until it’s sorted.

Hurdle 3: When installed, SQL Server Management Studio (the crucial administration tool) wouldn’t run (because I’d had a previous version of SQL Server installed – although it was deinstalled).  Searching around a while, I found a Microsoft web page that said this was a ‘known bug’.  They gave instructions for fixing it… which were wrong.  I searched around some more, and found a non-Microsoft site that gave correct instructions for the bug fix.

One may ask, “why use Microsoft?”  Unfortunately, in terms of market penetration it is the only feasible database tool for small professional use (there are free/open-source products such as MySQL, but they are not sufficiently widespread to be worthwhile for honing one’s skills on a day-to-day basis; to my knowledge the other market leaders – Oracle, Teradata, and DB2 – do not, or no longer, offer viable options in this context).

It’s not enough to say “typical bloody Microsoft”, nor that they treat their customers with contempt.  Sometimes there is evidence they do, but my frequent experience has been that they attempt to treat their professional community in particular with some respect.  Yes, it is risible that there are three different hurdles in the process of starting up a Microsoft product.  Any of those hurdles could be overcome by throwing sufficient time or money at them – but that is risible too.  Surely such a large seller to both consumer and professional communities could get it right?

And that’s part of the problem with such a monolithic beast.  It is hard to run a business that large without an organisational bureaucracy, and in my experience the larger the bureaucracy, the more things can go wrong – no matter what type of organisation it is.  Still, one would expect them to run quality systems to ensure an appropriate level of QA is happening.  But my experience in implementing a quality system (to ISO 9000) is that quality is an ongoing process of improvement, with no endpoint (not least because organisations are constantly changing).  Yet I suspect Microsoft’s level of market dominance ipso facto engenders an opposite effect, working against efforts to improve quality.  Even in product areas where Microsoft isn’t (yet) market dominant.

For my part, this experience gives me three reminders: to persist, to innovate in seeking solutions… and, when installing new software/hardware, to try to chill – and stop expecting smooth sailing.


Data quality, data quality.  It should be high on the agendas of both IT and business stakeholders.  Yet organisations can trip up in two ways: not putting enough resources into it, or putting in too many.

I’ve been besieged recently by many voices on the matter, so I thought I might make a start on rounding them up.

First, a consensus that I can vouch for from personal experience: beware of the least important data getting the most attention.  Blogger Andrew Sturt quotes Kalida‘s Lowan Chetty: “If you try to perfect the quality of all your data you inevitably end up spending the most time on the least important data”, and goes on: “that’s because no one has really been looking at it and that has allowed significant entropy. Spending time focused on poor quality and unimportant data is a poor use of resources.”

Yes!  My strong experience is that data is cleanest when it is valued and frequently used, and poorest quality when it is seldom touched.  Still, that shouldn’t be license to simply scour only high-traffic data once, then leave it for users to clean.  On the one hand, it pays to be proactive with important data.  An initial sweep may uncover slack business practices that can be addressed, but that’s not to say further errors won’t creep in over time.  On the other hand, can low-traffic data be entirely neglected?  A data warehouse is just that: a warehouse where anyone could come poking around for any information, and have an expectation that what they find will not be dusty and unusable.  Business needs always change. (Brush that up for you?  Give me a few weeks!)

Sturt ultimately advocates an agile DW development process, which implies that by the time you get past the important data, prioritisation will dictate how quickly you get around to the less important stuff.

But a project-based approach to data quality can spell doom.  As OCDQ illustrates (missed it by that much), a project-based approach may improve quality, but it can leave business stakeholders losing confidence in the system if they are still focused on the gap and not the main game.

Ultimately, the answer is good data governance.  That is, all data having ownership, and IT and business collaborating to maintain the quality not just of the data, but especially of the meta-data.  Structured and, yes, bureaucratic, but a necessary bureaucracy in an organisation (especially at enterprise level) that values their data as a crucial business resource.

A paper from Melissa Data gives one example of how to work up to the governance process, in six steps:
1. Profiling
2. Cleansing
3. Parsing and standardisation
4. Matching
5. Enrichment
6. Monitoring

On an ongoing basis, a data governance group would focus on monitoring, but as new data comes in, or data gets used in new ways, steps 1 to 5 would often need to be iterated.  This can largely be done by IT, but for my money it should remain under the aegis of a data governance body.

I’d like to put in a word for profiling.  Some might find it an unnecessary step to build a picture of the data when you should already have it, but believe me, the time taken in cleansing can be reduced by an order of magnitude by insights gained in profiling first.
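
To make that concrete, a minimal profiling pass – sketched here in Python over some made-up customer records, not any particular tool’s method – need only tally nulls, distinct values and the most frequent values per column to surface the worst surprises before cleansing starts.

from collections import Counter

def profile(rows):
    """Very basic column profiling: row counts, nulls, distinct values, top values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v not in (None, "")]
        report[col] = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
            "top_values": Counter(non_null).most_common(3),
        }
    return report

# Made-up example records – real profiling would read from the source system.
customers = [
    {"name": "Smith", "postcode": "3000"},
    {"name": "smith", "postcode": ""},
    {"name": "Jones", "postcode": "3000"},
]
for col, stats in profile(customers).items():
    print(col, stats)

Even this toy pass would flag the surname duplicated with inconsistent casing, and the empty postcode.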

If interested, you can read a survey and debate on the relationship between data quality efforts and data governance – by no means nailed down yet.

That same site also has an assessment of the use of ETL tools for data quality, which uses the Data Quality Assessment framework created by Arkady Maydanchik (the Melissa Data paper above purports to serve a similar end, though that is not really borne out by its content).

Well, I’ve already got to the end, but there’s so much more to say.  Another day.  Meanwhile, here are some links on the topic that have useful insights:
Efformation (Andrew Sturt)
OCDQ (Jim Harris)
Data Quality Pro – bloggers are only a part of this site
IAIDQ – fundamentals
– and how about data quality in the context of The Art Of War, or the Tower Of Babel?
– plus more good blogs referenced by Jim Harris here.


Is there a single future for Data Warehousing?  Obviously the path is not set in concrete, but some clear trends are emerging when people advocate advancement.  Here, I quickly sketch the alternatives listed last week, and synthesise a direction.

I have rearranged the listed order of last week’s future visions, to group them more logically.

Classic DW (#1): no innovation, but in its various warehouse/mart permutations it will survive for some time to come.

ROLAP (#7): Not only do we see the ‘is ROLAP dead’ debate, but the ‘ROLAP is back’ movement (see ‘is MOLAP dead?’).  Something to the effect of a simple copy of production data, instead of transformation into a star/snowflake form.  The argument here is that improvements in hardware processing/storage obviate the need for ETL.  Simple, yes, and I heard this argument only a few months ago from a consultant at a TDWI meeting: “why a data warehouse at all? Not necessary with current technology”.  Not so fast: there will always be a need to optimise from transaction processing to querying, particularly with larger amounts of data.

Bus architecture (#3):  This is also described as a Kimball architecture, Ralph Kimball being one of the two original DW prophets.  His original article on this dates back to 1998, and can be found here.  It is also discussed here, where it is described as not advocating centralised DWs, and as consisting of two types of data marts, aggregated and atomic, while incorporating star schemas.  Although I’ve seen this mentioned recently as an alternative to a DW, it’s not exactly new.

MPP Warehouse appliance (#2): That’s Massively Parallel Processing, and a warehouse appliance is simply a packaged product – in theory, just feed it the data.  This debate could be framed in the same language as ERP commodification, an accepted reality – who still builds/maintains a custom ERP from scratch?  Likewise DWs.  Implementation would be an admixture of customisation and of adjusting an enterprise’s own business processes to fit the software – SAP being the archetype here.  Is it the future?  One version of it, yes.  This is no conceptual advance, just more of the same, albeit heavily commodified.

Column-oriented databases: Yes, that’s opposed to row-oriented.  See the overview in Wikipedia.   There are particular efficiencies in a Data Warehouse context, somewhat reminiscent of cubing.
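
A rough illustration of where the efficiency comes from (plain Python, made-up sales figures – not how any real column store is built): an aggregate over one measure only has to touch that column’s values, which sit together and compress well.

# Row-oriented: each record is stored (and read) whole.
rows = [
    {"date": "2009-07-01", "store": "A", "sales": 120.0},
    {"date": "2009-07-01", "store": "B", "sales": 95.5},
    {"date": "2009-07-02", "store": "A", "sales": 130.0},
]

# Column-oriented: the same data, kept as one array per column.
columns = {
    "date":  ["2009-07-01", "2009-07-01", "2009-07-02"],
    "store": ["A", "B", "A"],
    "sales": [120.0, 95.5, 130.0],
}

# Summing sales from the row store means visiting every field of every row...
total_from_rows = sum(row["sales"] for row in rows)

# ...whereas the column store reads one contiguous array and ignores the rest.
total_from_columns = sum(columns["sales"])

assert total_from_rows == total_from_columns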

Enterprise Information Integration (#6): In a nutshell, EII entails the abstraction of data from consumers, and the provision of a standardised interface.  This will be familiar as APIs such as ODBC and OLE DB.
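
As a consumer-side sketch (Python with the pyodbc library; the DSN, credentials and table name are made up for illustration), the point of such an interface is that this code looks the same whichever database the DSN ultimately points to:

import pyodbc  # the consumer codes against ODBC, not against a particular database

# Made-up DSN and credentials; the underlying server can change without
# this consumer-side code changing.
connection = pyodbc.connect("DSN=EnterpriseSales;UID=report_user;PWD=secret")
cursor = connection.cursor()

cursor.execute("SELECT store, SUM(sales) FROM fact_sales GROUP BY store")
for store, total_sales in cursor.fetchall():
    print(store, total_sales)

connection.close()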

Data Delivery Platform (#4):  This comes from Rick van der Lans.  He starts by enumerating flaws in the classical architecture, such as issues of latency, redundancy, flexibility, unstructured data, and non-shareable specifications.  His Platform, echoing EII, is a mediation between data sources and data consumers, such that consumers request information without regard to its structure or source – which, presumably, can change without affecting consumer or provider.  He doesn’t advocate removal of the DW, just abstraction of the data from its consumers.  Nor does he seem to prescribe the exact form the Platform (and its data) will take at any time, and that is the point – the model contains sufficient flexibility to change the underlying storage architecture, or even the hardware/software technologies, without affecting consumer or provider.  Rick’s description here.

Data Provisioning (#5):  Similar to the Data Delivery Platform, this was described in a previous post.  In a nutshell, it revolves around a staging area where the data is clean and governed, with data then delivered in whatever way is desired, especially as fungible (disposable) DWs.  I had asked its progenitor, Karen Heath, for some further documentation, but all she sent me was Rick’s Data Delivery Platform above, from which it obviously originated (so no web reference is available for this).  Not to detract from Rick’s work: Data Provisioning is distinguished by its additions, and would seem to be a work in progress.

DW 2.0 (#8): This term has actually been trademarked by Bill Inmon, the other DW prophet.  A central issue is coming to grips with unstructured data, and it looks like Inmon goes for integration via a (degree of) parsing of documents as they are stored (as ETL processing, presumably).  This includes extracting/standardising dates and other measures found in the data; presumably extraction of titular information is also not beyond the pale.  This would seem to be an advance on, rather than a restructuring of, the classic DW, to accommodate the already-pressing issue of unstructured data.  (See Inmon’s exposition here.)
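
Just to make the ‘extract and standardise dates’ idea tangible, here is a toy sketch (Python regular expressions over a made-up sentence – my illustration, not Inmon’s method) that pulls day/month/year strings out of free text at load time and normalises them to ISO format:

import re
from datetime import datetime

def extract_dates(text):
    """Find dd/mm/yyyy-style strings in free text and standardise them to ISO dates."""
    found = []
    for day, month, year in re.findall(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", text):
        try:
            found.append(datetime(int(year), int(month), int(day)).date().isoformat())
        except ValueError:
            pass  # looked like a date, but wasn't a valid one (e.g. 31/02/2009)
    return found

document = "Contract signed 02/07/2009, with delivery expected by 15/8/2009."
print(extract_dates(document))   # ['2009-07-02', '2009-08-15']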

In summary, the “alternatives” on offer generally entail either more of the same, or a decoupling of data consumers from the specific underlying technology or sources.  That is probably the best takeaway: some sort of SOA-style abstraction could well come to predominate in the DW business – which ties in nicely with the “anywhere” philosophy of cloud computing.  I, for one, would be happy if my clients – the data consumers – could extract their business information/knowledge without needing to be trained and retrained on technologies or underlying data structures.

It makes sense to abstract repository from user.  Whether or not there’s a separate staging area where the cleaning and transforms are done, and whether untransformed data is kept, are relatively minor points of separation.  I for one would like a) access to untransformed data, for reality checks, and b) strong data governance, which entails business ownership of data and an overarching responsibility for the data’s quality and standards. (This is something business stakeholders usually seem to flick to IT, who are not themselves fully equipped to make business sense of the data.)

Abstraction, governance, and access to untransformed data.  That’s a healthy wish list.

Update 30-Jul-09:  IBM’s new Smart Analytics Optimiser is yet another option – in a uniquely IBM vein.  It sounds like a rather involved way to overlay analytics on production data.  I’m unconvinced it will last, but you can read about it here.

Update 12-Aug-09:  Added Column-oriented databases to the above list.

Update 21-Oct-09: In my latest post, BI Readings 2, I outline and link to TDWI’s paper on next-gen DWs.


A while back, I started to make a list of all the concepts that were being presented to me as superseding the data warehouse.

Why?  Well, why not?  The name of the game is constant technological improvement and innovation; if there’s something better to be had, then it’s always worth looking at it.

The alternatives presented to me over the last couple of months are:

1. Retain the “classic” data warehouse

2. MPP warehouse appliance

3.  Bus [Kimball] architecture

4. Data Delivery Platform

5. Data Provisioning

6. EII

7. ROLAP (!)

8. DW 2.0

This is obviously quite a mixed bag that should cause some head-scratching.  Are they even commensurate concepts?  Most definitely not, so there’s some explaining to do… in the next post.

Update 30-Jul-09

9.  IBM’s Smart Analytics Optimizer (read about it here: as usual, a uniquely IBM approach!)
