Archive for April, 2009

At the same TDWI meeting at which I was introduced to the Data Provisioning paradigm, I asked a fellow consultant: Why a Data Warehouse?

His response: why indeed? With current hardware and technology, there is no real reason to invest so much in data infrastructure: more storage, more resources is all that is needed.

That’s not really the answer, but it’s part of it.

When I was first properly introduced to business intelligence, my brief was to support the delivery and development of reportage to (some of) the business units in the organisation. The toolset was Brio, and the database was a “development” copy of the production ERP’s Oracle database. It was not a data warehouse – the relational tables remained in the normalised OLTP format – and certainly no MOLAP (cubing) was involved.

Advantages: quick turnaround time; cheap delivery. I also developed an innovative solution that pretty much amounted to the fungible data marts that Ms Heath was talking about in Data Provisioning: effectively a takeaway, disposable app-and-data file.

Disadvantage: only practical for the smaller scale. When it came to delivering a more sophisticated dashboard-type solution, the response times became quite unwieldy, and I doubt that solution was widely adopted.

Stick with an OLTP relational database? Report off a star-format data warehouse? Report off a MOLAP cube? There’s no single answer.

A manager once said to me that technology delivery could not, in practice, deliver everything: you could have it fast, cheap, or accurate, or even two out of those three – but never all three. You would have to sacrifice one of the corners of this demand triangle. The solution could be fast and cheap, but not accurate. Or it could be fast and accurate, but not cheap. Or cheap and accurate, but not fast.

Depending on the context, “fast” could refer to either development time, or response time for the end user. But the point is that no single answer delivers on all of a business’ requirements. Otherwise, we’d all be doing that.

But we don’t all report from MOLAP data, nor ROLAP or HOLAP. MOLAP gives fast response times, but the development costs are higher, and build time is, too (which can be an issue when timely data is needed). ROLAP solutions give slower user response, but can report off near-realtime data. HOLAP, more of a balance between response and build times, is a compromise that can be good for non-recent data in particular.
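
To make that trade-off concrete, here is a minimal sketch in Python. It is not any vendor's engine: the fact rows and the function names (rolap_total, build_cube) are invented for illustration. It assumes "MOLAP" simply means pre-aggregating at build time and "ROLAP" means scanning the current rows at query time.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, month, sales amount)
facts = [
    ("North", "2009-03", 120.0),
    ("North", "2009-04", 80.0),
    ("South", "2009-04", 200.0),
]

def rolap_total(rows, region):
    """ROLAP-style: scan the current rows at query time (slower, but always fresh)."""
    return sum(amount for r, _, amount in rows if r == region)

def build_cube(rows):
    """MOLAP-style: pre-aggregate once at build time (fast lookups, stale until rebuilt)."""
    cube = defaultdict(float)
    for region, _, amount in rows:
        cube[region] += amount
    return dict(cube)

cube = build_cube(facts)
facts.append(("North", "2009-04", 50.0))  # new data arrives after the cube build

fresh = rolap_total(facts, "North")  # includes the new row
cached = cube["North"]               # does not, until the cube is rebuilt
```

The cube lookup is a single dictionary read, but it misses the late row; the scan sees everything and pays for it on each query. That is the build-time versus response-time tension in miniature.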

As for a Data Warehouse, it can fulfill several purposes (but it’s important to note that not all such purposes are intrinsic to the consequent star schema). Yes, it costs in terms of development time and ETL. But the denormalised star schema is better suited for query transactions (as opposed to adds, updates, deletes). The different logical format can also be easier to navigate – although if there are meaningful but more complex ways to navigate the OLTP database, they can easily be lost.
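
As a rough illustration of why the star format suits querying, the sketch below builds a toy star schema in an in-memory SQLite database using Python's standard sqlite3 module. The table and column names (fact_sales, dim_product, dim_date) and the data are assumptions made up for the example, not anything from a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales (product_key INTEGER, date_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO dim_date VALUES (1, '2009-03'), (2, '2009-04');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 2, 75.0);
""")

# One central fact table, one join per dimension: the typical shape of a star query.
rows = conn.execute("""
SELECT p.category, d.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()
```

Every analytical question here takes the same form: pick dimensions, join them to the fact table, aggregate. A normalised OLTP schema would usually need a longer and less predictable chain of joins to answer the same question.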

And a data warehouse is at once a) a repository for data from multiple sources; b) a locus for the enforcement of corporate data governance; and c) an opportunity to apply some data cleansing. This is not to mention the dimensional cubes that can be built off it for even faster analytical processing.

I’m not wedded to cubes, or even data warehouses. I retain a natural suspicion of any transformation that obscures meaning behind the original data – although cleansing can be involved, ideally I would like the ability to navigate – when necessary – the original formats and (sometimes dirty) values. Yet on the other hand, I love the theoretical opportunity to draw data together from multiple sources, clean it, and apply corporate data governance policies.

But even a data professional can’t always have everything.


Lord knows everyone wants it at the moment.  Enterprises want to cut costs, and small businesses – although they don’t usually appreciate it – want affordable BI (which is a whole other story).

There are plenty of open source products to cover the range of tools used to effect business intelligence, from MySQL to Mondrian, Pentaho, Jaspersoft, and the likes of Talend for ETL. Open source in itself does not necessarily constitute free BI, as labour, infrastructure and support costs remain. Depending on requirements and existing resources, they can even be more costly than paid-for tools.

There’s also Microsoft. If you happen to have an enterprise copy of SQL Server lying around (as many companies do), then you have out-of-the-box availability of the BI products: SQL Server Integration Services (ETL), Reporting Services, and Analysis Services (cubes).

That’s the theory.  It’s tempting… too tempting, sometimes.

Recently, I talked to a government department who wanted to implement a data warehouse/business intelligence system from ETL to cubes and reports.  They had a business analyst, a consultancy lined up for the ETL, licensing for SQL Server, and… a few spare people they had lying around who could be brought up to speed on the toolset.  All they needed was someone to realise the DW/cube/reporting environments… in a few months… and train up those people to take over.  Oh, and did I mention the spare staff was non-technical?

I have nothing against non-technical people, particularly those with an analytical temperament. And everyone else simply represents a challenge to rise to. But of all the BI tools I’ve worked with, Microsoft’s would be the last toolset I would foist on the unsuspecting. They’re just not that geared to the business user – nor the neophyte. Even apart from the need to come to grips with the development environment BIDS (a version of Visual Studio, MS’ IDE), there are hurdles of concept and of practical experience to overcome. Oh, and did I mention the capacity of BIDS to overwhelm a non-technical user?

All this because they were looking for a quick and inexpensive route to implementation of their BI/DW project – and they happened to have SQL Server lying around.

The two biggest risks I saw were bringing the putative BI staff up to speed – and the ETL project.

ETL can account for maybe 80% of a BI project, and there may be virtue in sequestering the complexities to a consultancy. On the other hand, they will merrily achieve their task, and come up with a theoretically pristine ETL solution… that may provide a theoretical solution, but leave the client with a front end that only performs its task well in theory. I’ve seen this happen at least twice before (and I’ve picked up the pieces) – where a consultancy built a structure, leaving behind a theoretically accomplished mission. In each case they left no documentation, and a system that a) could not easily be adapted to changing business conditions, b) may not have sufficiently engaged the source business stakeholders, and c) may handle only the majority of the data – if that – while leaving in limbo non-conforming data, i.e. a large part of the ETL project. Consultants paid, unquantified work still remaining.

Other challenges abounded, but they were just challenges.  Possible ways through may involve at least some of the following:
a) Have a tech and business savvy person (ideally) work alongside the ETL consultants, then take over that aspect of the work on an ongoing basis;
b) Choose a more business friendly toolset, or hire some people who were already not too far off being able to do the BI/DW work on an ongoing basis;
c) Hire a relatively experienced BI person to run with the project and then stay on to manage the system and the changing demands on it, mentoring existing staff to the best of their capabilities;
d) Allow for a longer implementation schedule;
e) Narrow the scope of the project;
f) Accept the need for a bigger budget for the requisite outcomes.

It pains me to have the size of a project blow out – I would like to deliver streamlined – and effective – BI; versions thereof should be available to all levels of need. Yet on the other hand, it’s too easy for business managers to grit their teeth and say “this is what we are going to achieve” without either building slack into the project or at least injecting some relevant expertise at the project planning phase.

As proposed, they would end up with a half-working ETL process and a bunch of people who would struggle to maintain the initial environment, far less meet evolving business needs.

Last heard, that government department was still looking for someone to run the project.

What do you think?


A presentation at a Sydney meeting of The Data Warehouse Institute (TDWI) introduced to me the concept of Data Provisioning.

Per se, that term is not uncommon: a google of “data provisioning” will give thousands of results with varying meanings and degrees of precision. Yet the speaker, Karen Heath, had a particular – new – meaning that she was attaching to it, so in deference we shall refer to this meaning as Data Provisioning – capitalised. Here I present my understanding of her model; I take full responsibility for all errors and omissions.

Karen brings with her a formidable depth of experience in data technology industries, and she depicts her take on that history with an S-curve chart.

The idea of an S curve originates with Everett Rogers’ 1962 book Diffusion of Innovations. Rogers postulated that innovation “would spread through a community in an S curve, as the early adopters select the innovation first, followed by the majority, until a technology or innovation has reached its saturation point in a community.”

Karen’s presentation of successive technology waves was similar to that below (courtesy of AVG Aerospace), where the bottom axis is time, and the advent of a new technology overlaps the maturing (and eventual waning) of the old.


Such characterisation could be applied in any number of ways but in Karen’s, successive curves represented Executive Information Systems, then Data Marts, then Enterprise Data Warehouses – then Data Provisioning.

The trend over time was from inflexible reporting for managers through a process of democratisation of data and reportage availability. Yet at the – current – time of ascendancy of EDWs, data/information/intelligence is portrayed as still inflexible and distant in terms of availability to business stakeholders. The need for Data Provisioning, crucially, is driven by the ‘time to market’ issue: business owners need their data now, not in six months’ time. (This resonates with me, from my time working in a large enterprise where, from the business side of the divide, I was fulfilling managers’ information needs, but was only able to access data from a cube – i.e. two levels of remove from the production information. Although the I.T. people swore all the necessary data was there, it wasn’t, and a) the cube redevelopment cycle was in the order of months, and b) I could not, in any case, fully navigate the data as I could if I had access to the production system’s relational tables.)

That’s the narrative and the motivation for a new paradigm. In Karen’s model of Data Provisioning, Master Data Management is crucial – and overarching this is the necessity for strong Data Governance: “to be successful you need a governance organisation driven and owned by the business”. I.T. needs to say to business people “I.T. is just the nanny, it’s your baby”. Business stakeholders “need to have some skin in the game [for it] to be successful.”

(It’s gratifying to see strongly burgeoning interest in data governance, and I expect it to rightly gain ascendancy in the next few years. I.T. should not own data – it’s a business asset that to date has been thrust in the hands of I.T. by the business stakeholders, partly out of disinterest, partly out of lack of insight, and partly due to I.T.’s over-willingness to take control of and manage something that they do not understand in a business sense.)

So far so good. Karen presented some fairly complex diagrams, some of which I should do her justice by reproducing. Yet out of reluctance to misconstrue her model, I tried to distil it even further, particularly in contrasting it with the typical EDW model.

After the presentation, I asked her if a sketch I drew represented her thoughts. She corrected it, and the much simplified kernel becomes thus:

data sources -> meta data (transforms) -> staging area -> provisioning to business users

One of the business users would effectively be the Enterprise Data Warehouse. Contrast this with the typical EDW model:

data sources -> staging -> metadata -> EDW -> business users

So in the Data Provisioning model, business users’ access to data is not intermediated by the EDW (unless specifically desired); they get [more immediate] access to the data, untransformed into data warehouse format, yet transformed from the data sources by data models that are under a corporate-wide governance.
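
The difference in ordering between the two flows can be sketched in a few lines of Python. This is only my reading of the two simplified pipelines above, not Karen's own formulation: the record, the trivial "governed" transform and the function names are all invented stand-ins.

```python
def governed_transform(record):
    """Stand-in for the governed 'meta data (transforms)' step: tidy a customer code."""
    return {**record, "customer_code": record["customer_code"].strip().upper()}

def to_warehouse_format(record):
    """Stand-in for reshaping a record into data warehouse (dimensional) format."""
    return {"dim_customer": record["customer_code"], "fact_amount": record["amount"]}

source = [{"customer_code": " ab12 ", "amount": 10.0}]

# Data Provisioning: data sources -> transforms -> staging -> business users
staging = [governed_transform(r) for r in source]
provisioned = staging  # users read governed data that is not warehouse-shaped
edw_as_consumer = [to_warehouse_format(r) for r in staging]  # the EDW is just one consumer

# Typical EDW: data sources -> staging -> metadata -> EDW -> business users
raw_staging = list(source)
edw = [to_warehouse_format(governed_transform(r)) for r in raw_staging]
```

In both cases the warehouse ends up holding the same thing. What changes is where business users plug in: at the governed staging area in the first flow, only after the warehouse reshaping in the second.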

With more time, I would have argued the toss to a point of clear understanding. But I can lay no claims to a dialectic understanding, so right now I can only test the model in isolation. And the following are my questions.

1) Is this new paradigm simply a matter of agglomeration of data sources under the auspices of good data governance, plus a putting out to pasture of the data warehouse?
2) If ‘time to market’ is crucial, is this the best and fastest model?

From a perspective of data analysis, I would always want the ability to get as close to the source data as possible. And my instinct would lead me to a model where source data was drawn raw into a common staging area, with data management/governance transforms made either before or during the provisioning process. My first instinct was that this would enable faster data access for business users than in the sketch outlined above, but I can’t fully justify this feeling. So I express what I can as follows.

Corporate data governance is crucial, whatever the model. This is not always the same as master data management – the former can amount to applying minimum corporate standards, while I see the latter as moving towards a more maximal process, which includes some overhead (that may or may not be essential).

So, I already profess to an amount of skepticism over the Enterprise Data Warehouse. In this light, I can’t see the Data Provisioning model as being significantly different from the simple sidelining of the EDW. And I still need to be convinced that all transforms should be applied before data reaches the staging area. Ultimately, therefore, I’m left wondering what I’ve missed if Data Provisioning is truly a paradigm shift.

(There remain a few other peripheral conversations to be had to round out the discussion. These will come.)

Comments welcome.

8-Aug-09 Addendum: I note that this is the most popular post on this blog. Karen hasn’t (to my knowledge) published anything about this model, but you can reach her here via LinkedIn.
