Posts Tagged ‘data quality’

Although data quality exercises have a variety of business paybacks, they are often low on the radar until a particular business need (or failure!) arises.  That can be short-sighted; they can often be justified by a cost-benefit analysis of the existing data quality.  Business decision-makers who don’t want to allocate budget should be properly aware of the ramifications of saying no – all too often, the case is not presented clearly enough in a business context.  But once undertaken, a data quality project should eventually transform into ongoing data quality processes, including audits and governance, which are far less costly than revisiting the same issues later when they resurface through different failures.

Data quality projects can emerge from many affective issues.  General examples are problems with:

  • accuracy
  • completeness
  • timeliness
  • consistency

which can especially derail change in an organisation, whether new business development or new IT functionality.

A talk given through TDWI by Chris King illustrated some typical experiences with a data quality project, in the context of a regional level of an international hotel chain.  There were 8,000 employees in the region, half of whom could directly impact the data.

The starting point was the ‘single customer view’ objective.  This is a very common confluence of business need with IT strategy: as common as it is to hear that a company’s customers are its most significant business resource, it is almost axiomatic that the biggest data quality issues are to be found with customer data.  Customer data tends to come from a variety of sources – of variable quality; too often the customer can enter their own information without human mediation.

Yet the relationship between a customer-centric business and its data quality strategy is variable.  Jim Harris at OCDQ has a tragicomic tale to relate about an MDM/EDW* project with 20 customer sources.  He characterised the company as having a business need to identify its most valuable customers, yet they “just wanted to get the data loaded” (sounds familiar) and intended to rely on MDM and “real-time data quality” via the ETL processing.

How valid is that approach?  It should be decided by the key business stakeholders, with input from the technical analysts on current data quality (and project constraints).  From the sound of it, that’s not how the decision-making was done – yet even if so, how confident were the key business stakeholders that they had a good handle on the issues (and weren’t obfusticated by the technical details)?

In the case of the hotel chain, 40% of bookings arrived centrally, while 60% were people applying directly.  Generally there was better quality data in the former, as it tended to be repeat customers with an established history – resulting in some informal cleansing in the past.

Issues were sourced to the variety of collection points, such as:

  • call centre: cost containment requirements had crunched call time, with an attendant reduction in data capture;
  • third-party collectors of information, such as travel websites: they may have their own data capture requirements, but they’re just as likely to regard the customer as their own, and forward minimal details;
  • email marketing: less focus on eliciting the full gamut of customer details.

Mandation of fields presents a typical quandary: you want as much as possible, but people will always find a reason to circumvent them, and a way.  But what’s worse than no data? Bad data – especially when shuffled into good data.  Among the ideas tested were simply highlighting some fields rather than mandating them, and a trial of requesting drivers licenses.

They separated information from I.T., as Information Services.  This was to better deliver information management, champion data quality, and support decision-making.  As opposed to Jim Harris’ example above, they worked on data quality before data integration projects – which can significantly reduce the cost of such projects when it comes their turn.  In fact, Chris commented that once the objectives of the data quality project were well understood, it was far easier to introduce the changes, and the exercise softened up the stakeholders for other objectives like integration.

Data Stewardship is an important part of the ongoing process.  Once you’ve brought people together initially, it’s easier to set up a structure to manage data continuously, not just as a centralised dictionary, but as a necessary and useful dialogue with affected stakeholders.  This can prevent in advance situations Chris uncovered, such as finding one person’s VIP code has been set up by someone else to flag inclusion in a blacklist.

Data quality thresholds were addressed by incentives as basic as ice cream in call centres, through to General Manager bonuses.

Chris commented that there remained some wider business issues for resolution, such as tracking business vs leisure travel, and upselling into different brands [of hotel].  But as I said, further developments are less likely to be stymied by poor data, with a cleaning exercise under one’s belt and a quality structure in place.

* Master Data Management, Enterprise Data Warehouse


Read Full Post »

This week I have pointers to three discussions I’ve been reading.

BI Workspace: another ‘future of BI’ is discussed here, traceable back to a report from industry analysts Forrester (executive summary here).  What it is: the concept of a fully navigable data environment, geared specifically to the power user who has sufficient understanding of the data and its business context to make rational use of the full extent of data exploration.

Data Quality as an issue of context: A discussion at the always-useful OCDQ on data quality being a wider issue than simply accuracy.  Data accuracy was fully acknowledged, but other dimensions were raised.  My contribution to the discussion focused (as usual) on the quality – fitness – of the data as a business resource: including timeliness, format, usability, relevance – and delivery mechanisms. (To give the discussion its due, it was prompted by Rick Sherman’s report on a TDWI Boston meeting.)

Quality Attributes as a point of architecture: An ambitious point was raised as a discussion point on LinkedIn’s TDWI group.  The essence was a suggestion that data quality dimensions be defined as standards or Architectural Criteria when designing repositories.  Should standards such as ‘availability’, ‘portability’, ‘recovery’ be built into a data repository’s initial design?  Sounds laudable, but how practical is it to define it to measurable detail?  How intrinsic should such measures be to such a project’s SLAs?

Finally, a comment by Atif Abdul-Rahman (blog Knowledge Works) on my previous post linking business intelligence to business process improvement.  Atif effectively said BI+EPM=BPM.  My first reaction was to treat it as spam 🙂   – what do you think?

Read Full Post »

Bill Inmon is one of the two gurus of data warehousing.  His claim is to have invented the modern concept of data warehousing, and he favours the top-down approach to design.

[Ralph Kimball is the other modern guru, who is credited with dimensional modelling – facts and dimensions.  He favours bottom-up design, first building data marts.]

Inmon is associated with the BeyeNetwork, maintaining his own “channel” there, on Data Warehousing.

Recently discussing data quality, he canvassed the issue of whether to correct data in a warehouse when it’s known to be wrong.

One approach is that it is better to correct data – where known to be in error – before it reaches the warehouse (Inmon credits Larry English for this perspective).

In contrast, there’s the notion that data should be left in the warehouse as it stands, incorrect but accurately representing the production databases. Inmon attributes this approach to Geoff Holloway.

Of course, Inmon easily demonstrates cases for both perspectives.  This is understandable because both versions of the data – corrected or incorrect – provide information.  On the one hand, business data consumers would want correct information, no mucking around.

But on the other hand, incorrect data is an accurate reflection of production values – and it can be misleading to represent it otherwise.  In particular, bad data highlights the business process issues that led to the entry of the errors, and that in itself is valuable business information.

And here’s where I branch beyond Inmon.  I would argue the case for both forms of the data to be preserved in one form or another.

We have all experienced the exasperation of being faced with poor quality data flowing into business reports/information.  On a day-to-day basis, the information consumer doesn’t want to know about errors – they just want to use the information as it should rightly be, as a business input.  They may well be aware of the issues, but prefer to put them to one side, and deal with BAU* as it stands.

What this is saying is that the approach to data quality fixes should really be a business decision.  At the very least, the relevant business stakeholders should be aware of the errors – especially when systemic – and make the call on how to approach them.  In fact, ideally this is a case for… a Data Governance board – to delegate as they see fit.  But unless the issues are fully trivial, errors should not be fully masked from the business stakeholders.

So if the stakeholders are aware of the data issues, but the fix is not done and they don’t want to see the errors in day-to-day reportage, how to deal with the need to fix – at least as the data is represented?

I see four options here, and I think the answer just pops out.

Option 1: correct the data in the reports
Option 2: correct the DW’s representation of the data with a view
Option 3: correct the data itself in the DW
Option 4: correct it in ETL processing

Option 1 is fully fraught.  I have done this on occasion when it has been demanded of me, but it is a poor contingency.  You’re not representing the data as it exists in the DW, but more importantly, if you have to run a transform in one report, you may well have to reproduce that transform.  Over and over.

Option 2: creating a view is adding a layer of complexity to the DW that is just not warranted.  It makes the schema much harder to maintain, and it slows down all processing – both ETL and reporting.

Fixing the DW data (option 3) does get done.  But again, it may have to be done over and over, if ETL overwrites it again with the bad data.  And there is a very sensible dictum I read recently, paraphrased thus: any time you touch the data, you can introduce more errors.  Tricky.  Who can say with certainty that they have never done that?

Of course, I would favour handling it in ETL.  More specifically, I would like to see production data brought to rest in a staging area that is preserved, then transformed into the DW.  That way, you have not touched the data directly, but you have executed a repeatable, documentable process that performs the necessary cleansing.
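To make the staging idea concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the staging rows, the `COUNTRY_FIXES` rule and the function names are hypothetical, not taken from any particular ETL tool.  The point is only that the staged copy is never edited, and the cleansing is a repeatable step that records what it changed.

```python
# Hypothetical staging rows, as extracted from production (never modified).
STAGING = [
    {"customer_id": 1, "country": "australia "},
    {"customer_id": 2, "country": "AU"},
    {"customer_id": 3, "country": ""},
]

# Hypothetical cleansing rule, as agreed with the business data owner.
COUNTRY_FIXES = {"australia": "AU", "au": "AU"}


def cleanse(row):
    """Return a cleansed copy of a staging row and whether it was changed."""
    clean = dict(row)  # copy, so the staging row itself is left untouched
    raw = row["country"]
    # Values with no matching rule (including blanks) become None.
    clean["country"] = COUNTRY_FIXES.get(raw.strip().lower())
    return clean, clean["country"] != raw


def load_warehouse(staging_rows):
    """Transform staging rows into warehouse rows, keeping an audit trail."""
    warehouse, audit = [], []
    for row in staging_rows:
        clean, changed = cleanse(row)
        warehouse.append(clean)
        if changed:
            # Record id, original value and cleansed value for the data steward.
            audit.append((row["customer_id"], row["country"], clean["country"]))
    return warehouse, audit


warehouse, audit = load_warehouse(STAGING)
for entry in audit:
    print(entry)
```

Because the staging rows survive, the audit list can be handed back to the business data owner as evidence of the process issue, and the load can be re-run if the rule changes.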

Not always possible, with resource limitations.  Storage space is one problem, but it may be more likely (as I have experienced) that the ETL processing window is sufficiently narrow that an extra step of ETL processing is just not possible.  Oh well.  There’s no perfect answer; the solution always has to fit the circumstance.  Again, of course, it’s a matter of collaboration with the business (as appropriate via the data steward or DG board).

Oh, and most importantly: get back to the business data owner, and get them working (or work with them) on the process issue that led to the bad data.

*BAU=Business As Usual – at risk of spelling out the obvious.  I find acronyms anathemic, but spelling them out can interrupt the flow of ideas.  So I will endeavour to spell them out in footnotes, where they don’t have to get in the way.


Read Full Post »

“we had the data, but we did not have any information”
– CIO to Boris Evelson (Forrester), on the global financial crisis.

Vendor marketing messages have been said to contend that only 20% of employees in BI-using organisations are actually consuming BI technologies (“and we’re going to help you break through that barrier”).

Why is the adoption of BI so low?

That was my original question, brought about by a statistic from this year’s BI Survey (8).  As discussed in a TDWI report, in any given organisation that uses business intelligence, only 8% of employees are using BI tools.

But does it matter?  Why should we pump up the numbers?  It should not be simply because we have a vested interest.

The questions are begged:

What is BI, and why is it important?
BI is more than the query, analysis and reporting from a database:

“Business intelligence (BI) refers to skills, technologies, applications and practices used to help a business acquire a better understanding of its commercial context” – Wikipedia

It’s a very broad definition.  A rather more technical one from Forrester:

“Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insight and decision-making. . . .”

But it can be explained more simply as:

data -> information -> knowledge -> insight -> wisdom

Data can be assembled into information.  Information provides knowledge.  Knowledge can lead to insights (deeper knowledge), which can beget wisdom.  Is there any part of an organisation that would not benefit from that process?  If there are any roles sufficiently mundane that insights won’t help them improve the job, improve their service delivery, then I guess those roles would not benefit from BI.  Yet I would suggest they are few and far between, and they should be automated as soon as possible, because you can bet that employees filling those roles won’t feel fulfilled, won’t feel motivated.

Business intelligence has a part to play in that whole process above.  At the lowest level, it can provide data for others to analyse.  But at every step of the process of generating wisdom from data, BI has a part to play.  In that sense, it is both intrinsic to an organisation’s aims, and everyone has a part to play in it.

I started into this subject aiming to canvass the reasons behind poor BI takeup.  After some research and reflection on my own experiences, though, I found a whole book’s worth of material in that simple question.  So it’s not something I can lay out simply, in one take.

First, let’s see an example of good use of data – one, in fact, that demonstrates both the adding of value to the data, and the presentation and impartment of insight.

That wonderful organisation TED (“ideas worth spreading”) has a presentation by Hans Rosling, a Swedish professor of International Health.  Start with Rosling’s entry at TED, and look at any one of the presentations there.  The first has the most oomph, but they are all good.  Why?  Meaningful data, good presentation tools and a Subject Matter Expert.  (Thanks to Mike Urbonas for the reference).

Rosling’s presentations are a prime example of business intelligence done right.  The data was gathered from multiple sources, its quality assessed, and it was assembled and presented in a fashion that gave its audience insights.  In fact, the presentation tool he uses, Trendalyzer, although later bought by Google, was originally developed by his own foundation Gapminder.org.  (There are similar tools such as Epic System’s Trend Compass; MicroStrategy also has a similar tool.)

Much as it might look like it, I wouldn’t say the job began and ended with Rosling.  Whatever other parts he played, here his role is SME.  Yet his presentations clearly demonstrate the involvement of other roles, from data analyst to system integrator to vendor/software developer.

Barriers to BI takeup

So where to start?  Everyone has an opinion.

Rosling: “people put prices on [the data], stupid passwords, and boring statistics”.  In other words, he wanted data to be free, searchable, and presentable.  Integration and system issues aside, he found his barriers to be data availability and the expressiveness of his tools.

Pendse:  he gave a number of barriers, including “security limitations, user scalability, and slow query performance… internal politics and internal power struggles (sites with both administrative and political issues reported the narrowest overall deployments)… hardware cost is the most common problem in sites with wide deployments; data availability and software cost;… software [that] was too hard to use…”

In grouping together the issues, I found the opportunity to apportion the responsibility widely.  All roles are important to the successful dissemination of a business’ intelligence: CEO, CIO, CFO, IT Director, IT staff, BI manager, BI professional (of whatever ilk), implementation consultant, vendor, SME (too often under- or not rated!), all the way down to the information consumer.

Comments welcome.  See part two for some discussion about gaps that exist in the delivery of BI.

Read Full Post »

What is Data Governance?  How does it relate to data stewardship? Meta-data management?

“Data Governance is a quality control discipline for managing, using, improving and protecting organizational information.

…It is an outcome oriented approach to treating data as a balance sheet:
asset (value)  –  liability (risk)”
Steve Adler

Although the terms at top are related, Data Governance presents an overarching philosophy for enterprise-level management of a business’ data resources.

It was, I believe, initiated by IBM’s Steven Adler who, responding to unaddressed management issues he encountered, in 2004 set up a Data Governance Council.  This includes a number of IBM customers (including top financial organisations such as Citigroup, Amex, Bank of America) and partners, plus some academic representatives, so yes, it was originally an IBM initiative.  But as a concept it has broken out of the box, so to speak, and there are initiatives all over, such as The Data Governance Institute, a fully vendor-neutral organisation.

My first encounter with the concept was a presentation by Adler at a 2008 IBM conference*, in which he gave his articulation of the various strands and mechanisms inherent in a ‘data governance organisation’.

Adler’s presentation started with a talk on the concept of toxic data (largely reproduced here, although he also discussed its role in the Global Financial Crisis), and its potential impact on an organisation and its customers and public.

Data Governance certainly appeals to those of us whose work intersects business and data issues.  It is concerned with managing the quality of an organisation’s data, mitigating the risks attached to data issues.  A Data Governance Committee constitutes a level of administration below executive level, but overseeing Data Stewards, who in turn mediate with the data consumers.

For my money, a DG committee should include a C-level sponsor, ideally both business and technology focused, such as CIO/CTO and CFO.  It should also include representatives of the business data owners, and data stewards, as well as, I believe, representatives at a data user level.  Obviously these voices would have differential weight on such a committee, but all those voices would contribute to the requisite quality outcome.

Data Governance is a business issue: data is an inherent part of a business and its processes.  There is no firm boundary between business and data – they flow into each other; they should reflect each other accurately.

DG is about identifying risks, implementing business policy, providing an auditable framework for management of data resources, and overseeing the actual management.  This is not as simple as managing bad data (although a committee can develop policy and accountabilities, and act as an escalation point).  Yet importantly, as Adler says, it should be a nexus for maintaining confidence – trust – in an organisation’s data resources.

But a DG committee can also be a forum for mediating competing claims on data (who owns it, how it should be represented).  It can define metrics and processes, including tolerance thresholds.  Data issues covered should include accuracy, completeness, currency, reasonability, consistency, and identifiability/uniqueness – although in practice, detail work can be devolved to other roles reporting to the committee, such as stewards or owners.  The important thing is to have a forum in which potentially competing interests (such as I.T., finance, and different business units) can come to agreement in a structured, auditable way.  Not only can competing interests be mediated, but potential gaps in responsibilities can be identified and covered.

According to Dave Loshin, best practices include:
– clarifying semantics
– evaluating business impacts of failure (putting a value on data assets – this can also protect data quality/governance initiatives)
– formalising quality rules
– managing SLAs
– defining roles
– deploying tools such as data profiling, mapping, etc
– [overseeing management of] metadata repositories
– scorecarding.

That last point is something both Adler and Loshin place importance on.  Scorecarding is a way of encapsulating initiatives and achievements in a way that can be socialised from C-level down.
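As a rough illustration of how tolerance thresholds and a scorecard fit together, here is a small Python sketch.  The metric names, measured values and thresholds are all made up for the example; in practice a DG committee would set them, and the measurements would come from profiling or monitoring.

```python
# Hypothetical measurements: fraction of records passing each quality check.
MEASURED = {"completeness": 0.97, "accuracy": 0.97, "uniqueness": 0.999}

# Hypothetical tolerance thresholds, as a governance committee might set them.
THRESHOLDS = {"completeness": 0.95, "accuracy": 0.98, "uniqueness": 0.99}


def scorecard(measured, thresholds):
    """Mark each metric 'pass' or 'fail' against its minimum threshold."""
    return {
        name: "pass" if measured[name] >= minimum else "fail"
        for name, minimum in thresholds.items()
    }


card = scorecard(MEASURED, THRESHOLDS)
for name, status in card.items():
    print(f"{name:<14}{MEASURED[name]:>7.1%}  (min {THRESHOLDS[name]:.0%})  {status}")
```

A one-line-per-metric summary like this is easy to socialise; the detail behind a "fail" stays with the stewards.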

Of the resources listed below, I most strongly recommend Adler’s presentation: it has copious detail.
Useful resources

*In a conversation afterwards, Steve proved to be a really nice guy.  As well as discussing business, we found we shared the burden of a famous namesake in the music industry (as you can see via Wikipedia or Google). He also had met Obama, and expressed appreciation for him at a time he was yet to prove himself properly in the US primaries.

Interestingly, as I write, his latest blog entry includes a cryptically brief comment about the effect of frameworks on innovation, and that he’s now “working on new ideas”.

Read Full Post »

I was once tasked with increasing revenue through data analysis.  Have all our sales resulted in ongoing service contracts?  Catch the opportunity for selling service contracts when the opportunity first arises, and identify any past sales that are not currently covered.

Sound easy?  Well the first part was.  It was a matter of writing reports on impending service contract (or warranty) expiries.

The second part became a hair-puller because it uncovered a number of different general data quality issues.  We’re talking about historical data here.  It’s easy to say that we now enter all service contract information, but since when?  Why were there masses of blanks for warranty or contract end dates?

Data quality issues are business issues.  Not just because business stakeholders care about it (often they don’t, if it doesn’t touch them), but because they reveal issues with business processes.  If we expect a field to have a value in it, then why doesn’t it?

The warranty end date is a good illustration.  It turned out that a) we sold some equipment second-hand, with no warranty; b) some minor parts did not come with warranties.  But if we already know data quality is patchy, then we don’t have a way of telling whether the equipment was sold without warranty or whether the data wasn’t entered.  I eventually traced this to source.  I got agreement from the person who entered the sales information; she found it simplest to enter ‘NONE’ if equipment was sold without warranty.  (Albeit process changes should be finalised – if not initiated – in a structured way with the business unit’s manager, to ensure retention as a permanent procedure.)
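The value of the ‘NONE’ convention is that a blank now only ever means "not entered".  A minimal Python sketch of that distinction follows; the field name `warranty_end` and the sample sales rows are hypothetical.

```python
from datetime import date

NO_WARRANTY = "NONE"  # the marker agreed with the person entering sales data


def classify_warranty(value):
    """Separate 'sold without warranty' from 'nobody entered anything'."""
    text = "" if value is None else str(value).strip()
    if text == "":
        return "missing"        # a data entry gap: needs follow-up
    if text.upper() == NO_WARRANTY:
        return "no warranty"    # a deliberate, meaningful value
    return "has warranty"


# Hypothetical sales rows for illustration.
sales = [
    {"sale": 1, "warranty_end": date(2010, 6, 30)},
    {"sale": 2, "warranty_end": "NONE"},
    {"sale": 3, "warranty_end": None},
    {"sale": 4, "warranty_end": ""},
]

statuses = [classify_warranty(row["warranty_end"]) for row in sales]
print(statuses)
```

This only helps for data entered after the convention was adopted; historical blanks remain ambiguous.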

Okay, reports written, business process updated, what’s left?  Trawling through historical data for service contract opportunities.  This proved more elusive than it sounded – because of data quality issues.  We didn’t know whether a customer still had the equipment – although they should be covered by a Service Account Manager, some SAMs were better than others, and customers were under no obligation to notify.  Nor was it easy to identify high-value opportunities – we didn’t have access to a master list of equipment and their value – only one that covered currently sold items.  The list of hurdles went on… but finally we got to a point where we felt opportunity returns would be too low for further pursuit.

Some time after this, I attended a presentation – under the aegis of an earlier incarnation of the Sydney branch of TDWI – on data profiling.  It illustrated how a quick pass with data profiling tools could replace months of manual analysis.  You beauty!, I thought.  I wish I’d had such a tool.  Saved labour costs would far exceed expenditure, and I can’t see that spending that time down and dirty with the data gave me enough extra insight that a tool wouldn’t provide.

Some of the lessons in this:
– understand the quality of the data before embarking on an extended journey in data analysis;
– data profiling should be a first step in a data quality exercise;
– data profiling tools rock!
– attempt an ROI for such an exercise, and try to quantify the end point (albeit sometimes there is a “just do it” command; for example in the above case, the business unit needed to increase revenue);
– poor data quality can generally be traced back to an originating business process; yet bad data sometimes reflects only historical practices that no longer happen;
– poor data quality often only surfaces when a business stakeholder deems a new use (or old use with renewed vigour!) for the data in question.
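For anyone without a profiling tool to hand, even a few lines of script give a first picture of a column.  The sketch below is a bare-bones stand-in in Python, not a substitute for a real profiling product; the `contract_end` column and its sample values are invented for the example.

```python
from collections import Counter


def profile(rows, column):
    """Summarise one column: row count, blanks, distinct values, top value."""
    values = [row.get(column) for row in rows]
    filled = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(filled)
    return {
        "rows": len(values),
        "blank": len(values) - len(filled),
        "distinct": len(counts),
        "most_common": counts.most_common(1),
    }


# Hypothetical contract rows, including blanks and an inconsistent date format.
contracts = [
    {"contract_end": "2009-12-31"},
    {"contract_end": ""},
    {"contract_end": None},
    {"contract_end": "2009-12-31"},
    {"contract_end": "31/12/2010"},
]

summary = profile(contracts, "contract_end")
print(summary)
```

Running this over every column before starting the analysis would have flagged the masses of blank end dates on day one.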

Data quality issues are business issues – unless the technical people are goofs, quality issues originate with business processes. This is great: identifying root cause is most of the battle, and the solutions are usually the easiest part.  However that doesn’t make the investigation mission critical; that represents cost the business must be willing to bear.

Of course, it should be the business stakeholders rather than the technical analyst who decide the scope or magnitude (priority) of a data quality issue.  That doesn’t make it so, unfortunately.  The flipside of “just do it” is “don’t bother me” – then when the data proves to be bad, it’s possible to take just as much flak for not doing anything (based on business direction) as for inappropriately prioritising tasks.  Still, the technical analyst needs to remain mindful of getting caught up in the “last mile” of quality assurance when it takes an inordinate effort or there are potentially higher priorities.

I recommend a blog entry on cleansing priority: see Cleanse Prioritisation for Data Migration Projects – Easy as ABC?.  While aimed at data migration projects, it gives some good suggestions for placing qualitative priorities on data cleansing tasks, especially where a deadline is involved.  It’s a matter of attributing a business impact to not doing each cleansing task, and inter alia it flags the trap of spending time on the easy/obvious over the task with greater business impact.  The Importance Of Language, too: if you couch the priorities in sufficiently clear business impact terms, it’s easier to avoid the other great trap of rehashing old ground.  “Target Of Opportunity”, for example, accords no business impact, but it’s not like distracting the business with “I won’t ever bother addressing this”.  Then again, there are pitfalls if too much bad data falls into a lower priority bucket; there’s little worse than a stakeholder’s loss of confidence in the data.

Read Full Post »

Data quality, data quality.  It should be high on the agendas of both IT and business stakeholders.  Yet it can trip up in two ways: not putting enough resources into it, or putting too much resource into it.

I’ve been besieged recently by many voices on the matter, so I thought I might make a start on rounding them up.

First, a consensus that I can vouch for from personal experience: beware of the least important data getting the most attention.  Blogger Andrew Sturt quotes Kalida’s Lowan Chetty: “If you try to perfect the quality of all your data you inevitably end up spending the most time on the least important data”, and goes on: “that’s because no one has really been looking at it and that has allowed significant entropy. Spending time focused on poor quality and unimportant data is a poor use of resources.”

Yes!  My strong experience is that data is cleanest when it is valued and frequently used, and poorest quality when it is seldom touched.  Still, that shouldn’t be license to simply scour only high-traffic data once, then leave it for users to clean.  On the one hand, it pays to be proactive with important data.  An initial sweep may uncover slack business practices that can be addressed, but that’s not to say further errors won’t creep in over time.  On the other hand, can low-traffic data be entirely neglected?  A data warehouse is just that: a warehouse where anyone could come poking around for any information, and have an expectation that what they find will not be dusty and unusable.  Business needs always change. (Brush that up for you?  Give me a few weeks!)

Sturt ultimately advocates an agile DW development process, which implies that by the time you get past the important data, prioritisation will dictate how quickly you get around to the less important stuff.

But a project-based approach to data quality can spell doom.  As OCDQ illustrates (missed it by that much), a project-based approach may improve quality, but it can leave business stakeholders losing confidence in the system if they are still focused on the gap and not the main game.

Ultimately, the answer is good data governance.  That is, all data having ownership, and IT and business collaborating to maintain the quality not just of the data, but especially of the meta-data.  Structured and, yes, bureaucratic, but a necessary bureaucracy in an organisation (especially at enterprise level) that values its data as a crucial business resource.

A paper from Melissa Data gives one example of how to work up to the governance process, in six steps:
1. Profiling
2. Cleansing
3. Parsing and standardisation
4. Matching
5. Enrichment
6. Monitoring

On an ongoing basis, a data governance group would focus on monitoring, but as new data comes, or data gets used in new ways, steps 1 to 5 would often need to be iterated.  This can largely be done by IT, but for my money it should remain under the aegis of a data governance body.
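To show why standardisation (step 3) comes before matching (step 4), here is a small Python sketch.  It is a deliberate simplification: the customer records are hypothetical, and it matches on exact equality of the standardised name and email, whereas real matching tools generally use fuzzier comparisons.

```python
import re


def standardise(name):
    """Lower-case a name, drop punctuation and collapse whitespace."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(name.split())


def match(records):
    """Group record ids that share a standardised name and email."""
    groups = {}
    for rec in records:
        key = (standardise(rec["name"]), rec["email"].strip().lower())
        groups.setdefault(key, []).append(rec["id"])
    # Only groups with more than one id are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical customer records from two sources.
customers = [
    {"id": 1, "name": "J. Smith ", "email": "JS@example.com"},
    {"id": 2, "name": "j smith", "email": "js@example.com "},
    {"id": 3, "name": "Jane Smith", "email": "jane@example.com"},
]

duplicates = match(customers)
print(duplicates)
```

Without the standardisation step, records 1 and 2 would not compare equal, and the duplicate would slip through to the warehouse.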

I’d like to put in a word for profiling.  Some might find it an unnecessary step to build a picture of the data when you should already have it, but believe me, the time taken in cleansing can be reduced by an order of magnitude by insights gained in profiling first.

If interested, you can read a survey and debate on the relationship between data quality efforts and data governance – by no means nailed down yet.

That same site also has an assessment of the use of ETL tools for data quality, which uses the Data Quality Assessment framework created by Arkady Maydanchik (the Melissa Data paper above purports to a similar end, which is not really borne out in the content).

Well, I’ve already got to the end, but there’s so much more to say.  Another day.  Meanwhile, here are some links on the topic that have useful insights:
Efformation (Andrew Sturt)
OCDQ (Jim Harris)
Data Quality Pro – bloggers are only a part of this site
IAIDQ – fundamentals
– and how about data quality in the context of The Art Of War, or the Tower Of Babel?
– plus more good blogs referenced by Jim Harris here.
