Archive for the ‘data warehouse’ Category

It’s hard to get through 2010 without stumbling across the term ‘agile’, which is being spilt everywhere.  Like most bandwagonesque ideas, the exact meaning is by turns carelessly mislaid, blithely trampled on, or deliberately stolen.

The origins of “agile software development” go right back to 2001, when The Agile Manifesto was published.  In theory, anything not referencing it is wilful, ignorant, or indifferent.  But language is organic; these things will happen.

The Wikipedia definition of agile software development accords with the Manifesto.  And an example of the breakout process comes from Maureen Clarry of CONNECT: “Confusing Agile with a capital A and agility is a common mistake. Agile as a methodology is a small piece compared to organizational agility. Closely related to that, we sometimes see BI organizations that use Agile methodology as an excuse so that they don’t have to define standards or document anything. This is another example of trading speed and adaptability for standardization and reuse. It does not need to be an either/or proposition.”

Ouch.  The battle lines are clearly drawn; it can’t be surprising to see it in the business intelligence arena.

This current discussion will look at Agile with a capital A, which does have a definition.  As such, the Agile Manifesto is not for everyone.  Up front, its authors say:

“we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan”.

That’s not motherhood – and it’s obviously not universally applicable.  Enterprise-level organisations will necessarily favour processes and tools, simply because they need good communication and integration between parts of the body to make it work – and grow – well.  In that context, the Manifesto could be seen as permission for a cancer to grow: it may be successful, but out of step with the rest of the body.  On the other hand, it may be good for pilots that don’t need tight integration with the body corporate.

The Agile Principles should be viewed in full here, but a short version could be summarised as:

  1. Highest priority: to satisfy the customer through early and continuous delivery
    of valuable software.
  2. Embrace changing requirements, even late in development.
  3. Deliver working software frequently.
  4. Business people and developers must work together daily.
  5. Build projects around motivated individuals, and resource them.
  6. Face-to-face meetings!
  7. Working software is the primary measure.
  8. Sustainable development: the ability to maintain a constant pace indefinitely.
  9. Continuous attention to good design.
  10. Simplicity: maximise the amount of work not done.
  11. Self-organising teams.
  12. Reflect as a team at regular intervals, on how to be more effective.

MIP, an Australian data management consultancy, first brought MicroStrategy, Brio and Informatica to Australia.  Recently they gave a presentation on Agility in its formal sense, in the context of presenting RED, a data warehouse development tool from the New Zealand company WhereScape.

WhereScape RED has:

– automatic creation of surrogate keys, load timestamps, etc;
– code generation, execution, re-execution;
– a source repository;
– change management/version control, including comparison tools;
– generated user and technical documentation, with auto commenting, diagrams, glossaries;
– prototyping facilities;
– notification of data issues (although it is not a data quality tool per se, it uses an error mart).

MIP presented WhereScape RED as inextricably linked to Agile development: a simpler IDE than Microsoft’s Visual Studio, and an intuitive ETL tool.  Customers have been quoted calling it a “perfect complement” to SQL Server technology (although I can’t say how well it fits in with other database technology).

What I saw did look good.  It makes sense that it would suit an Agile development project.  I noted one caveat at the time: that with such tools and methodology, it would be easy to lose the development thread in the process of rapid iteration.  A challenge, but not an insurmountable one, for the data professional.

Update 05-Aug-10: The Data Warehouse Institute’s World Conference has Agile as its theme.  Some of the adjunct discussions can be seen to muddy the waters somewhat (is it a methodology? a process? a style? – depends on who’s talking, and how loose their language is).  An earlier discussion – “Being” Agile vs. “Doing” Agile – is salient, especially the comments.  One of the author’s own comments is worth repeating, that promoting Agile on the basis of speed specifically is “wrong-headed”:

“When speed is the primary objective, quality and value often suffer. But when the focus is on incremental value delivery (of high quality); increased productivity occurs naturally”.


A quick listing of HP’s latest analysis of trends within Business Intelligence:


1.  Data and BI program governance

– ie managing BI [and especially data] more strategically.


2. Enterprise-wide data integration

– recognising the value of such investment.


3. (the promise of) semantic technologies

– especially taking taxonomical (categorising) and ontological (relating) approaches to data.


4. Use of advanced analytics

– going beyond reporting/OLAP, to data mining, statistical analysis, visualisation, etc.


5. Narrowing the gap between operational systems and data warehouses


6. New generation, new priorities in BI and DW – ie updating BI/DW systems

– HP identifies renewals of systems, greater investment in new technology – perhaps in an emerging economic recovery context.


7. Complex event processing

– correlating many, varied base events to infer meaning (especially in the financial services sector) – see the sketch after this list.


8. Integrating/analysing content

– including unstructured data and external sources.


9. Social Computing [for BI]

– yet at the moment it takes great manual effort to incorporate such technology into BI


10. Cloud Computing [for BI]
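For point 7, here’s roughly what “correlating base events” can look like in practice – a minimal sketch only, in Python, with made-up event names and thresholds rather than any particular CEP engine:

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical rule: three failed logins followed by a large transfer
# within five minutes is inferred to be a suspicious composite event.
WINDOW = timedelta(minutes=5)

def correlate(events):
    """events: (timestamp, type, payload) tuples, in time order."""
    recent_failures = deque()
    alerts = []
    for ts, etype, payload in events:
        # Discard failures that have fallen outside the correlation window.
        while recent_failures and ts - recent_failures[0] > WINDOW:
            recent_failures.popleft()
        if etype == "login_failed":
            recent_failures.append(ts)
        elif etype == "transfer" and payload.get("amount", 0) > 10000:
            if len(recent_failures) >= 3:
                alerts.append((ts, "possible account takeover"))
    return alerts

sample = [
    (datetime(2010, 8, 5, 9, 0), "login_failed", {}),
    (datetime(2010, 8, 5, 9, 1), "login_failed", {}),
    (datetime(2010, 8, 5, 9, 2), "login_failed", {}),
    (datetime(2010, 8, 5, 9, 3), "transfer", {"amount": 50000}),
]
print(correlate(sample))  # one composite alert inferred from four base events
```

The point is the inference: no single base event is interesting on its own.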


You can find the full 60-minute presentation here.  HP noted that these points are very much inter-related.  I would also note a general tenor I took from the discussion: these are clearly more aspirational trends than widespread current initiatives.  HP’s research separately highlighted the four most important current BI initiatives:

– data quality

– advanced analytics [again]

– data governance

– Master Data Management


Other current buzzwords, such as open source, Software as a Service, and outsourcing, didn’t emerge at the forefront of concerns.  For the first two, the comment was made that these were more background enabling technologies.  As for outsourcing, it looked like those who were going to do it had largely done it, and there was current stability around that situation.


Business Intelligence has obviously moved away from simple reporting from a single repository.   Concerns are now around data quality, integration/management – and making greater sense of it, particularly for decision-making.  Those trends are clear and current.  But I’d also like to note one small point almost buried in the above discussion: the use of external data sources.  Business value of data must inevitably move away from simple navel-gazing towards facing the whole of the world, and making business sense of it.  That’s a high mountain, and we’re only just becoming capable of moving towards that possibility in a meaningful way.


Why the buzz over columnar databases recently?  They’ve been around since the 1970s at least.  At the moment they remain the realm of niche players, although Sybase has had such a product for years in Sybase IQ.

Gartner has been giving them a big tick since as far back as 2007.

Yet for some reason, I’ve been assailed by the concept from several disparate sources over the past month or so, some of which are heavy on the blah blah blah advantages but light on specifics such as reasons for those advantages.

I don’t pretend to have done the research, so I’ll just present a quick overview and links.  (At the very least, you can revert to Wikipedia’s article at top.)

Overview
In a nutshell, it is as it says, in that data is stored by column rather than by row (however, retrieval implementation seems to vary, with both row- and column-based querying variously supported).  Typified as meaningful in an OLAP more than OLTP context, it is said to be particularly beneficial when frequently working with a small subset of columns (in a table which has a large number of columns).  And, you guessed it, aggregation particularly thrives.
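To make the row-versus-column distinction concrete, here’s a toy sketch in Python – illustrative only, as real columnar engines add compression, vectorised execution and much more – showing why aggregating one column out of many is cheaper when the data is laid out by column:

```python
# Row-oriented: each record is stored together, so a SUM over one column
# still has to walk every field of every row.
rows = [
    {"order_id": 1, "customer": "A", "region": "NSW", "amount": 120.0},
    {"order_id": 2, "customer": "B", "region": "VIC", "amount": 80.0},
    {"order_id": 3, "customer": "A", "region": "NSW", "amount": 45.5},
]
total_from_rows = sum(r["amount"] for r in rows)

# Column-oriented: each column is stored contiguously, so the same SUM
# touches only the 'amount' column -- a big win when the table has
# hundreds of columns and the query needs only a handful.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["A", "B", "A"],
    "region":   ["NSW", "VIC", "NSW"],
    "amount":   [120.0, 80.0, 45.5],
}
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns  # same answer, different I/O profile
```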

Links

  • There’s a simple overview in a blog by Paul Nielsen, who puts his hand up for an implementation in SQL Server;
  • There’s a small simulation in Oracle, in a blog by Hemant Chitale (with a caveat in a comment).


Bill Inmon is one of the two gurus of data warehousing.  His claim is to have invented the modern concept of data warehousing, and he favours the top-down approach to design.

[Ralph Kimball is the other modern guru, who is credited with dimensional modelling – facts and dimensions.  He favours bottom-up design, first building data marts.]

Inmon is associated with the BeyeNetwork, maintaining his own “channel” there, on Data Warehousing.

Recently discussing data quality, he canvassed the issue of whether to correct data in a warehouse when it’s known to be wrong.

One approach is that it is better to correct data – where known to be in error – before it reaches the warehouse (Inmon credits Larry English for this perspective).

In contrast, there’s the notion that data should be left in the warehouse as it stands, incorrect but accurately representing the production databases. Inmon attributes this approach to Geoff Holloway.

Of course, Inmon easily demonstrates cases for both perspectives.  This is understandable because both versions of the data – corrected or incorrect – provide information.  On the one hand, business data consumers would want correct information, no mucking around.

But on the other hand, incorrect data is an accurate reflection of production values – and it can be misleading to represent it otherwise.  In particular, bad data highlights the business process issues that led to the entry of the errors, and that in itself is valuable business information.

And here’s where I branch beyond Inmon.  I would argue the case for both forms of the data to be preserved in one form or another.

We have all experienced the exasperation of being faced with poor quality data flowing into business reports/information.  On a day-to-day basis, the information consumer doesn’t want to know about errors – they just want to use the information as it should rightly be, as a business input.  They may well be aware of the issues, but prefer to put them to one side, and deal with BAU* as it stands.

What this is saying is that the approach to data quality fixes should really be a business decision.  At the very least, the relevant business stakeholders should be aware of the errors – especially when systemic – and make the call on how to approach them.  In fact, ideally this is a case for… a Data Governance board – to delegate as they see fit.  But unless the issues are fully trivial, errors should not be fully masked from the business stakeholders.

So if the stakeholders are aware of the data issues, but the fix is not done and they don’t want to see the errors in day-to-day reportage, how to deal with the need to fix – at least in how the data is represented?

I see four options here, and I think the answer just pops out.

Option 1: correct the data in the reports
Option 2: correct the DW’s representation of the data with a view
Option 3: correct the data itself in the DW
Option 4: correct it in ETL processing

Option 1 is fully fraught.  I have done this on occasion when it has been demanded of me, but it is a poor contingency.  You’re not representing the data as it exists in the DW, but more importantly, if you have to run a transform in one report, you may well have to reproduce that transform.  Over and over.

Option 2: creating a view is adding a layer of complexity to the DW that is just not warranted.  It makes the schema much harder to maintain, and it slows down all processing – both ETL and reporting.

Fixing the data directly in the DW (option 3) is sometimes done.  But again, it may have to be done over and over, if ETL overwrites it again with the bad data.  And there is a very sensible dictum I read recently, paraphrased thus: any time you touch the data, you can introduce more errors.  Tricky.  Who can say with certainty that they have never done that?

Of course, I would favour handling it in ETL.  More specifically, I would like to see production data brought to rest in a staging area that is preserved, then transformed into the DW.  That way, you have not touched the data directly, but you have executed a repeatable, documentable process that performs the necessary cleansing.
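As a rough illustration of that approach – a minimal sketch only, with hypothetical table and column names, using SQLite purely to keep it self-contained – the raw extract comes to rest untouched in a staging table, and a repeatable, documented transform populates the warehouse from it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staging: production data brought to rest exactly as extracted.
cur.execute("CREATE TABLE stg_customer (customer_id INTEGER, state TEXT)")
cur.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)",
    [(1, "NSW"), (2, "N.S.W."), (3, None)],  # raw values, warts and all
)

# Warehouse: populated only via a repeatable, documented cleansing step.
cur.execute("CREATE TABLE dw_customer (customer_id INTEGER, state TEXT)")
cur.execute(
    """
    INSERT INTO dw_customer (customer_id, state)
    SELECT customer_id,
           CASE
               WHEN REPLACE(state, '.', '') = 'NSW' THEN 'NSW'  -- standardise known variants
               WHEN state IS NULL THEN 'UNKNOWN'                -- flag, rather than silently drop
               ELSE state
           END
    FROM stg_customer
    """
)
conn.commit()

print(cur.execute("SELECT * FROM dw_customer").fetchall())
# The untouched values remain in stg_customer for reality checks.
```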

Not always possible, with resource limitations.  Storage space is one problem, but it may be more likely (as I have experienced) that the ETL processing window is sufficiently narrow that an extra step of ETL processing is just not possible.  Oh well.  There’s no perfect answer; the solution always has to fit the circumstance.  Again, of course, it’s a matter of collaboration with the business (as appropriate via the data steward or DG board).

Oh, and most importantly: get back to the business data owner, and get them working (or work with them) on the process issue that led to the bad data.

*BAU = Business As Usual – at risk of spelling out the obvious.  I find acronyms anathema, but spelling them out can interrupt the flow of ideas.  So I will endeavour to spell them out in footnotes, where they don’t have to get in the way.



Slowly Changing Dimensions are another topic I’ve kept bumping into lately, so I thought I’d do a brief summary of my impressions here.

First, a recap.

SCDs are typically placed in the context of a MOLAP solution, or even a DW dimensional model, but the issue is not exclusive to those models so much as any storage of historical information.

The question is: how do you store history?  Data often changes over time, for various reasons.  The change may be due to intent (a customer changing address) or error (a customer’s name may be entered incorrectly).  The old information may be meaningful or not – and here an error may become meaningful if, for example, you remitted a parcel to an incorrect address.

Ideally, all historical information would be stored as it was when used in production, but this can blow out the cost and complexity of storing and of retrieving the information.  As always, the solution lies in the balance between differing aims – the technical cost vs the business cost, for example.  In an ideal world, the business imperatives are paramount, but such a conflict can be put in business terms, if the alternatives are expressed as business costs.

Again, the world is less than ideal, as business stakeholders are far from interested in the technical costs of holding/using the information.  It often comes down to educated guesswork on the part of the data custodian (whether BI, DW, or governance) – and the issue should at least be flagged to the business stakeholder.

So, we have alternatives for storing changing information [as dimensions].
SCD type 1: overwrite the old data.  Make the decision that the historical aspect of that particular datum is sufficiently unimportant.  Example: customer address.  There are few situations where maintenance of the old address is warranted (however, a counter-example is the tracking of outstanding orders or issues – although there are other ways to effect this).
SCD type 2: save all history for that datum.  Three possible approaches: a) store prior versions of the data as version 0, 1, 2, … [and make the current one either the highest or lowest – either comes at a cost, depending again on business issues]; b) store each version with a from and to date – the most robust option, but sometimes the most complex to live with; c) flag a particular version as current (which may or may not set up a foolproof ordering for previous versions, albeit not date-specific).  Example: if a salesperson leaves, assigning all their customers to other salespeople will mask the individual’s effect on sales.  Yet this is something you may well need to analyse into the future.
SCD type 3: a ‘previous version’ is saved as a separate data item (column).  This limits the number of versions of history to a particular decision point, since adding further versions (columns) can be quite problematic.  Example: one may wish to compare the performance of a supplier with that of the previous supplier of a product (if one exists).  Such a comparison is unlikely to be carried out for more than the two instances.  (Yet it needn’t be restricted to ‘current’ and ‘previous’ – it could, for example, include a ‘future’ instance.  The point here is that this approach is appropriate where a specifically finite set of historical data points is kept.)
Those are the main types, in common use – but there are others.

Type 0 – the most common! – is the default, that is, do nothing specific.  How the data is updated depends on the original design of the historical data store (DW), which means in some cases it may be possible to obtain historical information.

Type 4 involves a separate history table for non-current data – not common, but it does happen.

Some also distinguish a Type 6, which is effectively my type 2b above.  Wikipedia’s SCD article describes it thus, and attributes it to Ralph Kimball.  However, Kimball put his name to a 2005 article in Intelligent Enterprise which categorises it as a type 2 variant.
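For what it’s worth, here’s a minimal sketch of that from/to-date variant (my type 2b) in Python – hypothetical column names, no surrogate-key or indexing plumbing, just the shape of the update:

```python
from datetime import date

# One dimension row per customer per period of validity (SCD type 2b).
dim_customer = [
    {"customer_id": 42, "salesperson": "Jones",
     "valid_from": date(2008, 1, 1), "valid_to": None},  # None = current row
]

def scd2_update(dim, customer_id, new_salesperson, change_date):
    """Close off the current row and open a new one, preserving history."""
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["salesperson"] == new_salesperson:
                return  # nothing has changed
            row["valid_to"] = change_date  # expire the old version
    dim.append({"customer_id": customer_id,
                "salesperson": new_salesperson,
                "valid_from": change_date, "valid_to": None})

scd2_update(dim_customer, 42, "Nguyen", date(2010, 7, 1))
# Sales before 1-Jul-2010 still roll up to Jones; later sales to Nguyen.
for row in dim_customer:
    print(row)
```

A type 3 version would instead keep a single row with, say, ‘salesperson’ and ‘previous_salesperson’ columns – simpler, but capped at that one decision point.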

Of course, the most appropriate approach depends entirely on the business uses for the data – but again there’s no ideal answer, as business uses can change.

The traditional approach for deciding these issues has been for the BI/DW/data professional to consult with stakeholders ad hoc.  More appropriate, on an enterprise scale, is formalised data governance: that is, significant decisions being devolved to a data governance body (or bodies) comprised of the relevant stakeholders (typically representatives from technical and one or more business areas, incorporating business data ownership).

In the end, it’s a balancing act between business needs and resources (including the human resources to implement/maintain).  That should be a business decision: technical budgets should originate with business stakeholders, and be subject to revision.  But our world is never ideal, and the nexus is never sufficiently robust – at least in part because business stakeholders are like drivers that want the car to be in good order, but don’t want costly fixes, and don’t want to know the technical details.  Decisions that are significant should be flagged thus by the business stakeholders, but failing that it would be good to at least have a business eye cast over the alternatives.  At the very least, fall back on documenting reasons for design decisions.

In fact, there is definitely extra complexity involved in implementing (and maintaining!) the higher-level types, so it is not recommended to push beyond the business requirements for the sake of an attractive (but convoluted) solution.

There’s been a discussion on changing between types 1 and 2, here.  If it becomes necessary to consider both options, I pull my hair at the thought of incorporating both as separate dimensions in one model.  I’m very much in favour of keeping models as simple as possible – other, more important spanners will undoubtedly get thrown into the works to ramp up complexity.  If both are needed – usually for different business purposes – then separate data marts are the go.  The more fungible the better, as Karen Heath might say.


Is there a single future for Data Warehousing?  Obviously the path is not set in concrete, but some clear trends are emerging when people advocate advancement.  Here, I quickly sketch the alternatives listed last week, and synthesise a direction.

I have rearranged the listed order of last week’s future visions, to group them more logically.

Classic DW (#1): no innovation, but in its various warehouse/mart permutations it will survive for some time to come.

ROLAP (#7): Not only do we see the ‘is ROLAP dead’ debate, but the ‘ROLAP is back’ movement (see ‘is MOLAP dead?’).  The idea is something to the effect of querying a simple copy of production data, instead of transforming it into a star/snowflake form.  The argument here is that improvements in hardware processing/storage obviate the need for ETL.  Simple, yes, and I heard this argument only a few months ago from a consultant at a TDWI meeting: “why a data warehouse at all? Not necessary with current technology”.  Not so fast: there will always be a need to optimise from transaction processing to querying, particularly with larger amounts of data.

Bus architecture (#3):  This is also described as a Kimball architecture, Ralph Kimball being one of the two original DW prophets.  His original article on this dates back to 1998, and can be found here.  It is also discussed here, where it is described as not advocating centralised DWs, and as consisting of two types of data marts, aggregated and atomic, while incorporating star schemas.  Although I’ve seen this mentioned recently as an alternative to a DW, it’s not exactly new.

MPP Warehouse appliance (#2): That’s Massively Parallel Processing, and a warehouse appliance is simply a packaged product – in theory, just feed it the data.  This debate could be framed in the same language as ERP commodification, an accepted reality – who still builds/maintains a custom ERP from scratch?  Likewise DWs.  Implementation would be an admixture of customisation and adjusting an enterprise’s own business processes to meet the software – SAP being the archetype here.  Is it the future?  One version, yes.  This is no conceptual advance, just more of the same, albeit heavily commodified.

Column-oriented databases: Yes, that’s opposed to row-oriented.  See the overview in Wikipedia.   There are particular efficiencies in a Data Warehouse context, somewhat reminiscent of cubing.

Enterprise Information Integration (#6): In a nutshell, EII entails the abstraction of data from consumers, and the provision of a standardised interface.  This will be familiar as APIs such as ODBC and OLE DB.
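As a loose analogy (not EII itself), Python’s DB-API plays a similar role to ODBC/OLE DB: the consumer writes to one standardised interface, and the source behind it can change.  A minimal sketch, with sqlite3 standing in for whatever the real source might be:

```python
import sqlite3

def sales_by_region(conn):
    """Consumer code: written against the generic DB-API cursor interface,
    with no knowledge of which engine or physical store sits behind it."""
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return cur.fetchall()

# Today the source is an in-memory SQLite database; tomorrow the same
# consumer code could be handed a connection to something else entirely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("NSW", 120.0), ("VIC", 80.0), ("NSW", 45.5)])
print(sales_by_region(conn))
```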

Data Delivery Platform (#4):  This from Rick van der Lans.  He starts by enumerating flaws in the classical architecture, such as issues of latency, redundancy, flexibility, unstructured data, and non-shareable specifications.  His Platform, echoing EII, is a mediation between data sources and data consumers, such that the consumers request information without regard to the structure or source – which, presumably, can change without affecting consumer or provider.  He doesn’t advocate removal of the DW, just abstraction of the data from its consumers.  He doesn’t seem to prescribe the exact form the Platform (and its data) will take at any time, and that is the point – that the model contains sufficient flexibility to be able to change the inherent storage architecture, or even hardware/software technologies, without affecting consumer or provider.  Rick’s description here.

Data Provisioning (#5):  Similar to the Data Delivery Platform, this was described in a previous post.  In a nutshell, it revolves around a staging area where the data is clean and governed, and data is delivered any way desired, especially as fungible (disposable) DWs.  I had asked its progenitor, Karen Heath, for some further documentation, but all she sent me was Rick’s Data Delivery Platform above, from which it obviously originated (so no web reference is available for this).  Not to detract from Rick’s work – Data Provisioning stands apart through its additions, and would be a work in progress.

DW 2.0 (#8): This term has actually been trademarked by Bill Inmon, the other DW prophet.  A central issue is to come to grips with unstructured data, and it looks like Inmon goes for integration via a(n amount of) parsing of the documents as they are stored (as ETL processing, presumably).  This includes extracting/standardising dates and other measures found in the data.  Presumably extraction of titular information is also not beyond the pale.  This would seem to be an advance, rather than a restructuring, of the classic DW, to accommodate the already-pressing issue of unstructured data.  (See Inmon’s exposition here.)
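As a crude illustration of the kind of parsing involved – a sketch only, nothing like real DW 2.0 tooling – pulling dates out of free text and standardising them might look like this:

```python
import re
from datetime import datetime

# A couple of hypothetical date shapes to recognise and standardise.
DATE_PATTERNS = [
    ("%d/%m/%Y", re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")),
    ("%d-%b-%y", re.compile(r"\b(\d{1,2}-[A-Za-z]{3}-\d{2})\b")),
]

def extract_dates(text):
    """Return ISO-formatted dates found in a free-text document."""
    found = []
    for fmt, pattern in DATE_PATTERNS:
        for match in pattern.findall(text):
            try:
                found.append(datetime.strptime(match, fmt).date().isoformat())
            except ValueError:
                pass  # matched the shape, but not a real date
    return found

doc = "Contract signed 14/07/2009; first delivery due 30-Jul-09."
print(extract_dates(doc))  # ['2009-07-14', '2009-07-30']
```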

In summary, the “alternatives” on offer generally entail either more of the same, or a decoupling of data consumers from the specific underlying technology or sources.  That is probably the best takeaway: that some sort of SOA-style abstraction could well come to predominate in the DW business – which ties in nicely with the “anywhere” philosophy of cloud computing.  I, for one, would be happy if my clients – the data consumers – could extract their business information/knowledge without needing to be trained/retrained on technologies or underlying data structures.

It makes sense to abstract repository from user.  Whether there’s a separate staging area or not, where the cleaning and transforms are done, whether untransformed data is kept, are relatively minor points of separation.  I for one would like a) access to untransformed data, for reality checks, and b) strong data governance, which entails business ownership of data, an overarching responsibility for the data’s quality and standards. (This is something business stakeholders usually seem to flick to I.T., who are not of themselves fully skilled to make business sense of the data.)

Abstraction, governance, and access to untransformed data.  That’s a healthy wish list.

Update 30-Jul-09:  IBM’s new Smart Analytics Optimiser is yet another option – in a uniquely IBM vein.  It sounds like a rather involved way to overlay analytics on production data.  I’m unconvinced it will last, but you can read about it here.

Update 12-Aug-09:  Added Column-based architecture to the above list.

Update 21-Oct-09: In my latest post, BI Readings 2, I outline and link to TDWI’s paper on next-gen DWs.


Lord knows everyone wants it at the moment.  Enterprises want to cut costs, and small businesses – although they don’t usually appreciate it – want affordable BI (which is a whole other story).

There are plenty of open source products covering the range of tools used to effect business intelligence, from MySQL to Mondrian, Pentaho, Jaspersoft, and the likes of Talend for ETL.  Open source in itself does not necessarily constitute free BI, as labour, infrastructure and support costs remain.  Depending on requirements and existing resources, they can even be more costly than paid-for tools.

There’s also Microsoft.  If you happen to have an enterprise copy of SQL Server lying around (as many companies do), then you have out-of-the-box availability of the BI products: SQL Server Integration Services (ETL), Reporting Services, and Analysis Services (cubes).

That’s the theory.  It’s tempting… too tempting, sometimes.

Recently, I talked to a government department who wanted to implement a data warehouse/business intelligence system from ETL to cubes and reports.  They had a business analyst, a consultancy lined up for the ETL, licensing for SQL Server, and… a few spare people they had lying around who could be brought up to speed on the toolset.  All they needed was someone to realise the DW/cube/reporting environments… in a few months… and train up those people to take over.  Oh, and did I mention the spare staff was non-technical?

I have nothing against non-technical people.  Particularly those with an analytical temperament.  And everyone else just represents a challenge to achieve for.  But of all the BI tools I’ve worked with, Microsoft would be the last toolset I would foist on the unsuspecting.  They’re just not that geared to the business user – nor the neophyte.  Even apart from the need to come to grips with the development environment BIDS (a version of Visual Studio, MS’ IDE), there are conceptual and practice-based experience hurdles to overcome.  Oh, and did I mention the capacity of BIDS to overwhelm a non-technical user?

All this because they were looking for a quick and inexpensive route to implementation of their BI/DW project – and they happened to have SQL Server lying around.

The two biggest risks I saw were bringing the putative BI staff up to speed – and the ETL project.

ETL can account for maybe 80% of a BI project, and there may be virtue in sequestering the complexities to a consultancy.  On the other hand, they will merrily achieve their task, and come up with a pristine ETL solution that only performs well in theory, leaving the client with a front end that is hard to live with.  I’ve seen this happen at least twice before (and I’ve picked up the pieces) – where a consultancy built a structure, leaving behind a theoretically accomplished mission.  In each case they left no documentation, and a system that a) could not easily be adapted to changing business conditions, b) may not have sufficiently engaged the source business stakeholders, and c) may handle only the majority of the data – if that – while leaving in limbo non-conforming data, ie a large part of the ETL project.  Consultants paid, unquantified work still remaining.

Other challenges abounded, but they were just challenges.  Possible ways through may involve at least some of the following:
a) Have a tech- and business-savvy person (ideally) work alongside the ETL consultants, then take over that aspect of the work on an ongoing basis;
b) Choose a more business-friendly toolset, or hire some people who are already not too far off being able to do the BI/DW work on an ongoing basis;
c) Hire a relatively experienced BI person to run with the project and then stay on to manage the system and the changing demands on it, mentoring existing staff to the best of their capabilities;
d) Allow for a longer implementation schedule;
e) Narrow the scope of the project;
f) Accept the need for a bigger budget for the requisite outcomes.

It pains me to have the size of a project blow out – I would like to deliver streamlined – and effective – BI; versions thereof should be available to all levels of need.  Yet on the other hand, it’s too easy for business managers to grit their teeth and say “this is what we are going to achieve” without either building slack into the project or at least injecting some relevant expertise at the project planning phase.

As proposed, they would end up with a half-working ETL process and a bunch of people who would struggle to maintain the initial environment, far less meet evolving business needs.

Last heard, that government department was still looking for someone to run the project.

What do you think?
