Archive for the ‘data’ Category

Although data quality exercises have a variety of business paybacks, they are often low on the radar until a particular business need (or failure!) arises.  That can be short-sighted; they can often enough be justified by a cost-benefit analysis of the existing data quality.  Business decision-makers who don’t want to allocate budget should be properly aware of the ramifications of saying no – all too often, the case is not presented clearly enough in a business context.  But once undertaken, a data quality project should eventually transform into ongoing data quality processes, including audits and governance, which are far less costly than revisiting the same issues later when they resurface through different failures.

Data quality projects can emerge from many issues affecting the business.  General examples are problems with:

  • accuracy
  • completeness
  • timeliness
  • consistency

which can especially derail change in an organisation, whether new business development or new IT functionality.
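To make those four dimensions concrete, here is a minimal sketch of how each might be scored on a small record set. The records, field names, validity check and code list are all hypothetical, and real definitions would need to be agreed with the business.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
customers = [
    {"id": 1, "email": "ann@example.com", "country": "AU", "updated": date(2024, 5, 1)},
    {"id": 2, "email": "", "country": "Australia", "updated": date(2019, 1, 15)},
    {"id": 3, "email": "bob-at-example.com", "country": "AU", "updated": date(2024, 3, 9)},
]

def completeness(rows, field):
    """Share of rows with a non-empty value in the field."""
    return sum(1 for r in rows if r[field]) / len(rows)

def accuracy(rows, field, is_valid):
    """Share of populated values passing a validity check (a proxy for accuracy)."""
    populated = [r[field] for r in rows if r[field]]
    return sum(1 for v in populated if is_valid(v)) / len(populated)

def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of as_of."""
    return sum(1 for r in rows if (as_of - r[field]).days <= max_age_days) / len(rows)

def consistency(rows, field, allowed):
    """Share of rows whose value comes from the agreed code list."""
    return sum(1 for r in rows if r[field] in allowed) / len(rows)

as_of = date(2024, 6, 1)
scores = {
    "completeness(email)": completeness(customers, "email"),
    "accuracy(email)": accuracy(customers, "email", lambda v: "@" in v),
    "timeliness(updated)": timeliness(customers, "updated", as_of, 365),
    "consistency(country)": consistency(customers, "country", {"AU", "NZ"}),
}
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

The validity check here only tests the format of an email address; true accuracy usually needs comparison against a trusted source.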

A talk given through TDWI by Chris King illustrated some typical experiences with a data quality project, in the context of a regional level of an international hotel chain.  There were 8,000 employees in the region, half of whom could directly impact the data.

The starting point was the ‘single customer view’ objective.  This is a very common confluence of business need with IT strategy: as common as it is to hear that a company’s customers are its most significant business resource, it is almost axiomatic that the biggest data quality issues are to be found with customer data.  Customer data tends to come from a variety of sources – of variable quality; too often the customer can enter their own information without human mediation.

Yet the relationship between a customer-centric business and its data quality strategy is variable.  Jim Harris at OCDQ has a tragicomic tale to relate about an MDM/EDW* project with 20 customer sources.  He characterised the company as having a business need to identify its most valuable customers, yet they “just wanted to get the data loaded” (sounds familiar) and intended to rely on MDM and “real-time data quality” via the ETL processing.

How valid is that approach?  It should be decided by the key business stakeholders, with input from the technical analysts on current data quality (and project constraints).  From the sound of it, that’s not how the decision-making was done – yet even if it was, how confident were the key business stakeholders that they had a good handle on the issues (and weren’t obfuscated by the technical details)?

In the case of the hotel chain, 40% of bookings arrived centrally, while 60% were people applying directly.  Generally there was better quality data in the former, as it tended to be repeat customers with an established history – resulting in some informal cleansing in the past.

Issues were sourced to the variety of collection points, such as:

  • call centre: cost containment requirements had crunched call time, with an attendant reduction in data capture;
  • third-party collectors of information, such as travel websites: they may have their own data capture requirements, but they’re just as likely to regard the customer as their own, and forward minimal details;
  • email marketing: less focus on eliciting the full gamut of customer details.

Mandation of fields presents a typical quandary: you want as much as possible, but people will always find a reason to circumvent mandatory fields, and a way.  But what’s worse than no data? Bad data – especially when shuffled into good data.  Among the ideas tested were simply highlighting some fields rather than mandating them, and a trial of requesting driver’s licences.

They separated information from I.T., as Information Services.  This was to better deliver information management, champion data quality, and support decision-making.  As opposed to Jim Harris’ example above, they worked on data quality before data integration projects – which can significantly reduce the cost of such projects when their turn comes.  In fact, Chris commented that once the objectives of the data quality project were well understood, it was far easier to introduce the changes, and the exercise softened up the stakeholders for other objectives like integration.

Data Stewardship is an important part of the ongoing process.  Once you’ve brought people together initially, it’s easier to set up a structure to manage data continuously, not just as a centralised dictionary, but as a necessary and useful dialogue with affected stakeholders.  This can head off situations such as one Chris uncovered, where one person’s VIP code had been set up by someone else to flag inclusion in a blacklist.

Data quality thresholds were addressed by incentives as basic as ice cream in call centres, through to General Manager bonuses.

Chris commented that there remained some wider business issues for resolution, such as tracking business vs leisure travel, and upselling into different brands [of hotel].  But as I said, further developments are less likely to be stymied by poor data, with a cleaning exercise under one’s belt and a quality structure in place.

* Master Data Management, Enterprise Data Warehouse


A quick listing of HP’s latest analysis of trends within Business Intelligence:

1.  Data and BI program governance

– ie managing BI [and especially data] more strategically.

2. Enterprise-wide data integration

– recognising the value of such investment.

3. (the promise of) semantic technologies

– especially taking taxonomical (categorising) and ontological (relating) approaches to data.

4. Use of advanced analytics

– going beyond reporting/OLAP, to data mining, statistical analysis, visualisation, etc.

5. Narrowing the gap between operational systems and data warehouses

6. New generation, new priorities in BI and DW – ie updating BI/DW systems

– HP identifies renewals of systems, greater investment in new technology – perhaps in an emerging economic recovery context.

7. Complex event processing

– correlating many, varied base events to infer meaning (especially in the financial services sector)

8. Integrating/analysing content

– including unstructured data and external sources.

9. Social Computing [for BI]

– yet at the moment it takes great manual effort to incorporate such technology into BI

10. Cloud Computing [for BI]

You can find the full 60-minute presentation here.  HP noted that these points are very much inter-related.  I would also add a general tenor that I got from the discussion: that these are clearly more aspirational trends than widespread current initiatives.  HP’s research additionally highlighted the four most important current BI initiatives separately:

– data quality

– advanced analytics [again]

– data governance

– Master Data Management

Other current buzzwords, such as open source, Software as a Service, and outsourcing, didn’t emerge at the forefront of concerns.  For the first two, the comment was made that these were more background enabling technologies.  As for outsourcing, it looked like those who were going to do it had largely done it, and there was current stability around that situation.

Business Intelligence has obviously moved away from simple reporting from a single repository.   Concerns are now around data quality, integration/management – and making greater sense of it, particularly for decision-making.  Those trends are clear and current.  But I’d also like to note one small point almost buried in the above discussion: the use of external data sources.  Business value of data must inevitably move away from simple navel-gazing towards facing the whole of the world, and making business sense of it.  That’s a high mountain, and we’re only just becoming capable of moving towards that possibility in a meaningful way.


This week I have pointers to three discussions I’ve been reading.

BI Workspace: another ‘future of BI’ is discussed here, traceable back to a report from industry analysts Forrester (executive summary here).  What it is: the concept of a fully navigable data environment, geared specifically to the power user who has sufficient understanding of the data and its business context to make rational use of the full extent of data exploration.

Data Quality as an issue of context: A discussion at the always-useful OCDQ on data quality being a wider issue than simply accuracy.  Data accuracy was fully acknowledged, but other dimensions were raised.  My contribution to the discussion focused (as usual) on the quality – fitness – of the data as a business resource: including timeliness, format, usability, relevance – and delivery mechanisms. (To give the discussion its due, it was prompted by Rick Sherman’s report on a TDWI Boston meeting.)

Quality Attributes as a point of architecture: An ambitious point was raised as a discussion point on LinkedIn’s TDWI group.  The essence was a suggestion that data quality dimensions be defined as standards or Architectural Criteria when designing repositories.  Should standards such as ‘availability’, ‘portability’, ‘recovery’ be built into a data repository’s initial design?  Sounds laudable, but how practical is it to define them to measurable detail?  How intrinsic should such measures be to such a project’s SLAs?

Finally, a comment by Atif Abdul-Rahman (blog Knowledge Works) on my previous post linking business intelligence to business process improvement.  Atif effectively said BI+EPM=BPM.  My first reaction was to treat it as spam 🙂   – what do you think?


Bill Inmon is one of the two gurus of data warehousing.  His claim is to have invented the modern concept of data warehousing, and he favours the top-down approach to design.

[Ralph Kimball is the other modern guru, who is credited with dimensional modelling – facts and dimensions.  He favours bottom-up design, first building data marts.]

Inmon is associated with the BeyeNetwork, maintaining his own “channel” there, on Data Warehousing.

Recently discussing data quality, he canvassed the issue of whether to correct data in a warehouse when it’s known to be wrong.

One approach is that it is better to correct data – where known to be in error – before it reaches the warehouse (Inmon credits Larry English for this perspective).

In contrast, there’s the notion that data should be left in the warehouse as it stands, incorrect but accurately representing the production databases. Inmon attributes this approach to Geoff Holloway.

Of course, Inmon easily demonstrates cases for both perspectives.  This is understandable because both versions of the data – corrected or incorrect – provide information.  On the one hand, business data consumers would want correct information, no mucking around.

But on the other hand, incorrect data is an accurate reflection of production values – and it can be misleading to represent it otherwise.  In particular, bad data highlights the business process issues that led to the entry of the errors, and that in itself is valuable business information.

And here’s where I branch beyond Inmon.  I would argue the case for both forms of the data to be preserved in one form or another.

We have all experienced the exasperation of being faced with poor quality data flowing into business reports/information.  On a day-to-day basis, the information consumer doesn’t want to know about errors – they just want to use the information as it should rightly be, as a business input.  They may well be aware of the issues, but prefer to put them to one side, and deal with BAU* as it stands.

What this is saying is that the approach to data quality fixes should really be a business decision.  At the very least, the relevant business stakeholders should be aware of the errors – especially when systemic – and make the call on how to approach them.  In fact, ideally this is a case for… a Data Governance board – to delegate as they see fit.  But unless the issues are fully trivial, errors should not be fully masked from the business stakeholders.

So if the stakeholders are aware of the data issues, but the fix is not done and they don’t want to see the errors in day-to-day reportage, how do we deal with the need to fix – at least as the data is represented?

I see four options here, and I think the answer just pops out.

Option 1: correct the data in the reports
Option 2: correct the DW’s representation of the data with a view
Option 3: correct the data itself in the DW
Option 4: correct it in ETL processing

Option 1 is fully fraught.  I have done this on occasion when it has been demanded of me, but it is a poor contingency.  You’re not representing the data as it exists in the DW, but more importantly, if you have to run a transform in one report, you may well have to reproduce that transform.  Over and over.

Option 2: creating a view is adding a layer of complexity to the DW that is just not warranted.  It makes the schema much harder to maintain, and it slows down all processing – both ETL and reporting.

Fixing the DW data (option 3) does get done.  But again, it may have to be done over and over, if ETL overwrites it again with the bad data.  And there is a very sensible dictum I read recently, paraphrased thus: any time you touch the data, you can introduce more errors.  Tricky.  Who can say with certainty that they have never done that?

Of course, I would favour handling it in ETL.  More specifically, I would like to see production data brought to rest in a staging area that is preserved, then transformed into the DW.  That way, you have not touched the data directly, but you have executed a repeatable, documentable process that performs the necessary cleansing.
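As a rough sketch of that idea, the snippet below keeps a staged copy of the production extract untouched and builds the warehouse rows through a list of named cleansing rules, logging which rule changed which record. The records, rule names and field names are hypothetical, and a real ETL tool would do this quite differently.

```python
import copy

# Hypothetical raw extract from production, preserved as-is in staging.
staging = [
    {"customer_id": 101, "country": "Australia", "vip_code": "V"},
    {"customer_id": 102, "country": "AU", "vip_code": "v "},
]

# Cleansing rules agreed with the business; each is a named, repeatable step.
def standardise_country(row):
    mapping = {"Australia": "AU"}
    row["country"] = mapping.get(row["country"], row["country"])
    return row

def normalise_vip_code(row):
    row["vip_code"] = row["vip_code"].strip().upper()
    return row

CLEANSING_RULES = [standardise_country, normalise_vip_code]

def transform(staged_rows, rules):
    """Build warehouse rows from staging without modifying the staged data."""
    warehouse_rows = []
    audit_log = []
    for original in staged_rows:
        row = copy.deepcopy(original)  # work on a copy; staging stays untouched
        for rule in rules:
            before = dict(row)
            row = rule(row)
            if row != before:
                # Record which rule altered which record, for documentation.
                audit_log.append((original["customer_id"], rule.__name__))
        warehouse_rows.append(row)
    return warehouse_rows, audit_log

warehouse, audit = transform(staging, CLEANSING_RULES)
print(warehouse)
print(audit)
```

Because staging is preserved, both the production values and the corrected values remain available, and the audit log shows where the business process is generating errors.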

Not always possible, with resource limitations.  Storage space is one problem, but it may be more likely (as I have experienced) that the ETL processing window is sufficiently narrow that an extra step of ETL processing is just not possible.  Oh well.  There’s no perfect answer; the solution always has to fit the circumstance.  Again, of course, it’s a matter of collaboration with the business (as appropriate via the data steward or DG board).

Oh, and most importantly: get back to the business data owner, and get them working (or work with them) on the process issue that led to the bad data.

*BAU=Business As Usual – at risk of spelling out the obvious.  I find acronyms anathema, but spelling them out can interrupt the flow of ideas.  So I will endeavour to spell them out in footnotes, where they don’t have to get in the way.



What is Data Governance?  How does it relate to data stewardship? Meta-data management?

“Data Governance is a quality control discipline for managing, using, improving and protecting organizational information.

…It is an outcome oriented approach to treating data as a balance sheet:
asset (value)  –  liability (risk)”
Steve Adler

Although the terms at top are related, Data Governance presents an overarching philosophy for enterprise-level management of a business’ data resources.

It was, I believe, initiated by IBM’s Steven Adler who, responding to unaddressed management issues he encountered, in 2004 set up a Data Governance Council.  This includes a number of IBM customers (including top financial organisations such as Citigroup, Amex, Bank of America) and partners, plus some academic representatives, so yes, it was originally an IBM initiative.  But as a concept it has broken out of the box, so to speak, and there are initiatives all over, such as The Data Governance Institute, a fully vendor-neutral organisation.

My first encounter with the concept was a presentation by Adler at a 2008 IBM conference*, in which he gave his articulation of the various strands and mechanisms inherent in a ‘data governance organisation’.

Adler’s presentation started with a talk on the concept of toxic data (largely reproduced here, although he also discussed its role in the Global Financial Crisis), and its potential impact on an organisation and its customers and public.

Data Governance certainly appeals to those of us whose work intersects business and data issues.  It is concerned with managing the quality of an organisation’s data, mitigating the risks attached to data issues.  A Data Governance Committee constitutes a level of administration below executive level, but overseeing Data Stewards, who in turn mediate with the data consumers.

For my money, a DG committee should include a C-level sponsor, ideally both business and technology focused, such as CIO/CTO and CFO.  It should also include representatives of the business data owners, and data stewards, as well as, I believe, representatives at a data user level.  Obviously these voices would have differential weight on such a committee, but all those voices would contribute to the requisite quality outcome.

Data Governance is a business issue: data is an inherent part of a business and its processes.  There is no firm boundary between business and data – they flow into each other; they should reflect each other accurately.

DG is about identifying risks, implementing business policy, providing an auditable framework for management of data resources, and overseeing the actual management.  This is not as simple as managing bad data (although a committee can develop policy and accountabilities, and act as an escalation point).  Yet importantly, as Adler says, it should be a nexus for maintaining confidence – trust – in an organisation’s data resources.

But a DG committee can also be a forum for mediating competing claims on data (who owns it, how it should be represented).  It can define metrics and processes, including tolerance thresholds.  Data issues covered should include accuracy, completeness, currency, reasonability, consistency, and identifiability/uniqueness – although in practice, detail work can be devolved to other roles reporting to the committee, such as stewards or owners.  The important thing is to have a forum in which potentially competing interests (such as I.T., finance, and different business units) can come to agreement in a structured, auditable way.  Not only can competing interests be mediated, but potential gaps in responsibilities can be identified and covered.
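One way to picture metrics with tolerance thresholds is as a simple comparison of measured scores against agreed minimums, with any shortfall escalated. This is only an illustrative sketch; the metric names, threshold values and measured scores are invented.

```python
# Hypothetical minimum acceptable scores a DG committee might agree.
thresholds = {"completeness": 0.98, "accuracy": 0.95, "uniqueness": 0.999}

# Hypothetical measured scores for one data set, e.g. from a profiling run.
measured = {"completeness": 0.991, "accuracy": 0.93, "uniqueness": 1.0}

def breaches(measured_scores, minimums):
    """Return the metrics that fall below their agreed threshold, for escalation."""
    return {
        metric: (score, minimums[metric])
        for metric, score in measured_scores.items()
        if metric in minimums and score < minimums[metric]
    }

for metric, (score, minimum) in breaches(measured, thresholds).items():
    print(f"{metric}: {score:.1%} is below the agreed {minimum:.1%}")
```

Keeping the thresholds as data rather than burying them in code makes them easier to review and audit when the committee revises them.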

According to Dave Loshin, best practices include:
– clarifying semantics
– evaluating business impacts of failure (putting a value on data assets – this can also protect data quality/governance initiatives)
– formalising quality rules
– managing SLAs
– defining roles
– deploying tools such as data profiling, mapping, etc
– [overseeing management of] metadata repositories
– scorecarding.

That last point is something both Adler and Loshin place importance on.  Scorecarding is a way of encapsulating initiatives and achievements in a way that can be socialised from C-level down.

Of the resources listed below, I most strongly recommend Adler’s presentation: it has copious detail.
Useful resources

*In a conversation afterwards, Steve proved to be a really nice guy.  As well as discussing business, we found we shared the burden of a famous namesake in the music industry (as you can see via Wikipedia or Google). He also had met Obama, and expressed appreciation for him at a time he was yet to prove himself properly in the US primaries.

Interestingly, as I write, his latest blog entry includes a cryptically brief comment about the effect of frameworks on innovation, and that he’s now “working on new ideas”.


I was once tasked with increasing revenue through data analysis.  Have all our sales resulted in ongoing service contracts?  Catch the opportunity for selling service contracts when the opportunity first arises, and identify any past sales that are not currently covered.

Sound easy?  Well the first part was.  It was a matter of writing reports on impending service contract (or warranty) expiries.
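A report of that kind boils down to a date-window filter. The sketch below assumes a hypothetical list of equipment records with an end_date field and a 60-day look-ahead window; neither comes from the original system.

```python
from datetime import date, timedelta

# Hypothetical equipment records; end_date is the contract or warranty expiry.
equipment = [
    {"serial": "A100", "customer": "Acme", "end_date": date(2024, 7, 10)},
    {"serial": "B200", "customer": "Bolt", "end_date": date(2025, 1, 31)},
    {"serial": "C300", "customer": "Crest", "end_date": None},
]

def expiring_soon(rows, as_of, window_days=60):
    """Rows whose cover ends within window_days of as_of, soonest first."""
    cutoff = as_of + timedelta(days=window_days)
    due = [r for r in rows if r["end_date"] and as_of <= r["end_date"] <= cutoff]
    return sorted(due, key=lambda r: r["end_date"])

report = expiring_soon(equipment, as_of=date(2024, 6, 1))
for row in report:
    print(row["serial"], row["customer"], row["end_date"])
```

Note that the record with no end date silently drops out of the report, which is exactly the gap the second part of the task ran into.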

The second part became a hair-puller because it uncovered a number of different general data quality issues.  We’re talking about historical data here.  It’s easy to say that we now enter all service contract information, but since when?  Why were there masses of blanks for warranty or contract end dates?

Data quality issues are business issues.  Not just because business stakeholders care about it (often they don’t, if it doesn’t touch them), but because they reveal issues with business processes.  If we expect a field to have a value in it, then why doesn’t it?

The warranty end date is a good illustration.  It turned out that a) we sold some equipment second-hand, with no warranty; b) some minor parts did not come with warranties.  But if we already know data quality is patchy, then we don’t have a way of telling whether the equipment was sold without warranty or whether the data wasn’t entered.  I eventually traced this to source, and got agreement from the person who entered the sales information: she found it simplest to enter ‘NONE’ if equipment was sold without warranty.  (Albeit process changes should be finalised – if not initiated – in a structured way with the business unit’s manager, to ensure retention as a permanent procedure.)
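The value of an explicit ‘NONE’ is that a blank no longer has to carry two meanings. A small sketch of how the field could then be interpreted follows; the function name and status labels are made up for illustration.

```python
def warranty_status(value):
    """Classify a raw warranty end date field as entered by sales staff."""
    if value is None or value.strip() == "":
        return "unknown"        # nothing entered: cannot tell either way
    if value.strip().upper() == "NONE":
        return "no warranty"    # explicitly sold without warranty
    return "has warranty"       # anything else is treated as an end date

raw_values = ["2025-03-31", "NONE", "", None, " none "]
statuses = [warranty_status(v) for v in raw_values]
print(statuses)
```

Only the "unknown" rows then need chasing, and historical blanks entered before the process change still fall into that bucket.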

Okay, reports written, business process updated, what’s left?  Trawling through historical data for service contract opportunities.  This proved more elusive than it sounded – because of data quality issues.  We didn’t know whether a customer still had the equipment – although they should be covered by a Service Account Manager, some SAMs were better than others, and customers were under no obligation to notify.  Nor was it easy to identify high-value opportunities – we didn’t have access to a master list of equipment and their value – only one that covered currently sold items.  The list of hurdles went on… but finally we got to a point where we felt opportunity returns would be too low for further pursuit.

Some time after this, I attended a presentation – under the aegis of an earlier incarnation of the Sydney branch of TDWI – on data profiling.  It illustrated how a quick pass with data profiling tools could replace months of manual analysis.  You beauty!, I thought.  I wish I’d had such a tool.  Saved labour costs would far exceed expenditure, and I can’t see that spending that time down and dirty with the data gave me enough extra insight that a tool wouldn’t provide.
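Commercial profiling tools do far more than this, but the core idea can be sketched in a few lines: for each column, count the blanks, count the distinct values, and look at the most frequent value. The sample rows and column names below are hypothetical.

```python
from collections import Counter

# Hypothetical extract of sales records.
rows = [
    {"serial": "A100", "warranty_end": "2025-03-31", "model": "X1"},
    {"serial": "B200", "warranty_end": "", "model": "X1"},
    {"serial": "C300", "warranty_end": "NONE", "model": "x1"},
    {"serial": "C300", "warranty_end": "", "model": "Z9"},
]

def profile(rows):
    """Per-column counts of blanks and distinct values, plus the top value."""
    result = {}
    for column in rows[0]:
        values = [r[column] for r in rows]
        populated = [v for v in values if v not in ("", None)]
        result[column] = {
            "blank": len(values) - len(populated),
            "distinct": len(set(populated)),
            "most_common": Counter(populated).most_common(1),
        }
    return result

summary = profile(rows)
for column, stats in summary.items():
    print(column, stats)
```

Even this crude pass surfaces the kinds of problems described above: a repeated serial number, blank warranty dates, and inconsistent capitalisation of model codes.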

Some of the lessons in this:
– understand the quality of the data before embarking on an extended journey in data analysis;
– data profiling should be a first step in a data quality exercise;
– data profiling tools rock!
– attempt an ROI for such an exercise, and try to quantify the end point (albeit sometimes there is a “just do it” command; for example in the above case, the business unit needed to increase revenue);
– poor data quality can generally be traced back to an originating business process; yet bad data sometimes reflects only historical practices that no longer happen;
– poor data quality often only surfaces when a business stakeholder deems a new use (or old use with renewed vigour!) for the data in question.

Data quality issues are business issues – unless the technical people are goofs, quality issues originate with business processes. This is great: identifying root cause is most of the battle, and the solutions are usually the easiest part.  However, that doesn’t make the investigation mission critical; it represents a cost the business must be willing to bear.

Of course, it should be the business stakeholders rather than the technical analyst who decide the scope or magnitude (priority) of a data quality issue.  That doesn’t make it so, unfortunately.  The flipside of “just do it” is “don’t bother me” – then when the data proves to be bad, it’s possible to take just as much flak for not doing anything (based on business direction) as for inappropriately prioritising tasks.  Still, the technical analyst needs to remain mindful of getting caught up in the “last mile” of quality assurance when it takes an inordinate effort or there are potentially higher priorities.

I recommend a blog entry on cleansing priority: see Cleanse Prioritisation for Data Migration Projects – Easy as ABC?.  While aimed at data migration projects, it gives some good suggestions for placing qualitative priorities on data cleansing tasks, especially where a deadline is involved.  It’s a matter of attributing a business impact to not doing each cleansing task, and inter alia it flags the trap of spending time on the easy/obvious over the task with greater business impact.  The Importance Of Language, too: if you couch the priorities in sufficiently clear business impact terms, it’s easier to avoid the other great trap of rehashing old ground.  “Target Of Opportunity”, for example, accords no business impact, but it’s not like distracting the business with “I won’t ever bother addressing this”.  Then again, there are pitfalls if too much bad data falls into a lower priority bucket; there’s little worse than a stakeholder’s loss of confidence in the data.
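As a loose illustration of ranking cleansing work by the business impact of not doing it, rather than by how easy it is, consider the sketch below. The task list, impact labels and effort figures are invented, and this is not the scheme from the linked article.

```python
# Rank order for the (hypothetical) business impact of NOT doing a task.
IMPACT_ORDER = {"high": 0, "medium": 1, "low": 2}

# Hypothetical cleansing tasks with estimated effort.
tasks = [
    {"task": "Fix trailing spaces in names", "impact": "low", "effort_days": 1},
    {"task": "Deduplicate customer accounts", "impact": "high", "effort_days": 10},
    {"task": "Backfill missing warranty dates", "impact": "medium", "effort_days": 5},
    {"task": "Correct invalid billing addresses", "impact": "high", "effort_days": 3},
]

def prioritise(task_list):
    """Order by business impact first, and only then by effort."""
    return sorted(task_list, key=lambda t: (IMPACT_ORDER[t["impact"]], t["effort_days"]))

ordered = [t["task"] for t in prioritise(tasks)]
for position, name in enumerate(ordered, start=1):
    print(position, name)
```

Sorting on impact before effort is what keeps the one-day cosmetic fix from jumping the queue ahead of the ten-day deduplication.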

