Excerpted from IBM:
Data Miner
Solution-Based Data Mining
Bob Angell
Specific, solution-based data marts could save businesses time and money.
On a recent consulting engagement, my team was asked to solve a specific business problem for a financial services company. We had to format, manipulate, transform, summarize, and run calculations on the company-provided customer data to prepare it for data mining, a process that took several months. And at the end of all the time and effort, the company learned ... nothing. Or, more accurately, it learned that the information it had was not the right information for solving that particular business problem.
Projects such as this one squander a company's time and resources, but it doesn't have to be that way. If companies consider, at the data mart design stage, what information they might need for each business problem they wish to solve, they could save time and money and make the data mining analyst's job a lot easier. This solution-based approach to database design, development, and implementation plays off the current generation of data mining tools to maximize the success of data mining projects.
I'll explain how a solution-based design can help, plus offer examples of three common business problems and the design considerations for each.
The Foundation
New developments in data mining tools combined with current databases' ability to accommodate knowledge discovery activities create the foundation for a solution-based approach. Although it wasn't long ago that analysts had to generate their own data sets, data mining tools have come a long way since then. Flat files used in some of the earlier data mining tools have given way to data repositories. Most data mining tools today read from and write to most federated databases, and some can manipulate, update, and mine directly from these data sources.
Now that a few of the data mining tools are mature, data miners can turn their attention to constructing data marts that simplify data mining efforts. Thinking about the data elements needed to solve a particular business problem before building the data mart, and considering what extraction, processing, transformation, and analysis will be needed, will help data mining projects succeed. Although data miners aren't generally responsible for data mart design, getting involved early will help those who design, develop, and deploy the data marts get it right the first time.
Solution-based data mining leads to smaller, more concise data marts: Each one stores only the information necessary to solve the business problem at hand. A few specialized data marts would serve as extensions of your data warehouse.
Many older database designs are insufficient or inappropriate for decision support, though they might be fine for transaction processing. But even different decision-support tasks require different information. To identify your most profitable customers, the data extracts, loads, and analysis you'd perform differ greatly from those you'd use to find out which products your customers are purchasing. Analyst involvement in the design process is essential to get the design right the first time.
Design Solutions
Many companies suffer through years of evolving data structures. Often, their struggles result from overlooking some basics, such as constraint checking or orphaned dependencies. Another common error is to create molecular data when successful data mining requires atomic data. Combining critical data elements creates a molecular mess. (For example, when the data element "Mr. John Robert Smith Jr." is stored as a single entry in one database field, it is molecular data. When the same information is broken into individual pieces in separate fields, the data is atomic.) Calculations, deltas, standard deviations, and other derived values fit naturally into an atomic database structure; you simply need to be able to trace the elements from which they were derived.
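To make the distinction concrete, here is a minimal Python sketch that breaks a molecular name field into atomic fields. The field names and parsing rules are illustrative assumptions; real name parsing is considerably messier.

```python
# A sketch of the molecular-vs-atomic distinction. Field names
# (title, first_name, ...) are illustrative, not from the article.

MOLECULAR = "Mr. John Robert Smith Jr."  # one field holding several data elements

def to_atomic(full_name: str) -> dict:
    """Split a simple 'Title First Middle Last Suffix' string into
    atomic fields. Handles only the happy path."""
    titles = {"Mr.", "Mrs.", "Ms.", "Dr."}
    suffixes = {"Jr.", "Sr.", "II", "III"}
    parts = full_name.split()
    record = {"title": None, "first_name": None,
              "middle_name": None, "last_name": None, "suffix": None}
    if parts and parts[0] in titles:
        record["title"] = parts.pop(0)
    if parts and parts[-1] in suffixes:
        record["suffix"] = parts.pop()
    if parts:
        record["first_name"] = parts.pop(0)
    if parts:
        record["last_name"] = parts.pop()
    if parts:
        record["middle_name"] = " ".join(parts)
    return record

print(to_atomic(MOLECULAR))
# {'title': 'Mr.', 'first_name': 'John', 'middle_name': 'Robert',
#  'last_name': 'Smith', 'suffix': 'Jr.'}
```

With the pieces stored separately, derived values such as deltas can always be traced back to the atomic elements that produced them.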
These considerations hold true for almost any data mart. Now I'll explain some design considerations for specific, common business problems.
Customer profitability. To answer the question, "Who are my most profitable customers?", you first need to define what your business considers a profitable customer. Because profitability can be defined and calculated in many different ways, this step is critical. For this example, let's assume you're calculating profitability at the customer (rather than household or account) level. After you settle on a definition of profitability, you can use data-driven discovery techniques (such as clustering) to segment the customer records. The database records for this problem should be rolled up into a summary format at the appropriate level, giving the analyst only one record per level (or per customer, in this example). In addition, the database must contain complete demographic information (such as name, city, state, ZIP Code, and so on). Precalculating age, duration with the company (most likely represented in days), and a profitability score (usually from 0 to 1, with 1 representing the most profitable customers) would save the analyst time when solving this business problem. By doing everything possible ahead of time, the solution-based data mart lets the analyst spend less time manipulating and transforming the data to be mined. And the preparations for the data mart can reveal missing data, wrong data elements, and other factors that could seriously undermine any data mining project. Once the business problem is solved, the process can be automated.
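As a rough illustration of that roll-up, the sketch below uses pandas with a hypothetical transaction schema (customer_id, txn_date, revenue, cost) to produce one summary record per customer, tenure in days, and a 0-to-1 profitability score. It assumes a simple revenue-minus-cost definition of profit and min-max scaling; your own definition of profitability will differ.

```python
import pandas as pd

# Hypothetical transaction-level data; column names are assumptions.
txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "txn_date":    pd.to_datetime(["2002-01-05", "2002-06-01",
                                   "2001-03-10", "2002-02-20", "2002-05-15"]),
    "revenue":     [120.0, 80.0, 50.0, 40.0, 30.0],
    "cost":        [60.0, 30.0, 45.0, 35.0, 20.0],
})

as_of = pd.Timestamp("2002-07-01")

# Roll up to one summary record per customer (the level chosen above).
summary = txns.groupby("customer_id").agg(
    first_txn=("txn_date", "min"),
    revenue=("revenue", "sum"),
    cost=("cost", "sum"),
)
summary["tenure_days"] = (as_of - summary["first_txn"]).dt.days
summary["profit"] = summary["revenue"] - summary["cost"]

# Scale profit to a 0-to-1 score, 1 being the most profitable customer.
lo, hi = summary["profit"].min(), summary["profit"].max()
summary["profit_score"] = (summary["profit"] - lo) / (hi - lo)

print(summary)
```

A data-driven technique such as k-means clustering could then segment customers directly on these precalculated summary columns.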
Fraud detection. Because of the many implications of the word "fraud," you might hear this task referred to as entity profiling. This business problem consists of profiling an entity (usually a person) and creating an alert when the entity triggers an event that deviates too much (or too little) from the normal range. Credit card profiling is a common example. Several years ago, decision-support techniques revealed that the following scenario would most likely occur with a lost or stolen card (a sketch of flagging this sequence follows the list):
An unauthorized individual recovers the card.
The individual goes to a gas station and tries to place a $3 to $5 transaction to determine whether the card is "hot." Initially, such a small purchase attempt was not enough to trigger suspicion at the credit card company.
If the transaction is successful, the individual will try to max out the card within a short period of time. If the transaction is unsuccessful, the individual will make an excuse (that the card is old, for example) and pay cash instead. Generally, the clerk suspects nothing.
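Here is a minimal sketch of flagging that sequence: a small "test" charge followed shortly by much larger charges on the same card. All thresholds (the $3-to-$5 test range, the tenfold jump, the 24-hour window) are illustrative assumptions standing in for whatever a real issuer would tune.

```python
from datetime import datetime, timedelta

SMALL_LOW, SMALL_HIGH = 3.0, 5.0   # assumed "test charge" range
WINDOW = timedelta(hours=24)       # assumed window for the follow-up charges
JUMP_FACTOR = 10.0                 # assumed size of a suspicious jump

def flag_test_charge_pattern(txns):
    """txns: list of (timestamp, amount) for one card, in time order.
    Returns True if a small test charge is followed within WINDOW
    by a much larger charge."""
    for i, (t_small, amt_small) in enumerate(txns):
        if SMALL_LOW <= amt_small <= SMALL_HIGH:
            for t_big, amt_big in txns[i + 1:]:
                if t_big - t_small > WINDOW:
                    break
                if amt_big >= JUMP_FACTOR * amt_small:
                    return True
    return False

card = [
    (datetime(2002, 6, 1, 22, 5), 4.50),     # gas-station test charge
    (datetime(2002, 6, 1, 23, 40), 900.00),  # max-out attempt
]
print(flag_test_charge_pattern(card))  # True
```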
Today, many companies have extensive profiles on purchase behavior and other customer propensities. I live in Salt Lake City and recently traveled to Houston for business. Late one night, I received a phone call from my credit card company minutes after making a large purchase. The company had noticed that a series of recent transactions on my card didn't fit my past credit card behavior and didn't take place where I was most likely to use the card. The sequence of events flagged what is called a "statistical outlier," a deviation from the normal parameters established by my previous credit card activity. In this example, although it appeared that the credit card was stolen, legitimate but unusual circumstances created the outlier. The company's database will update my profile to reflect this new behavior.
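The outlier check itself can be sketched as a simple z-score test of a new transaction against the card's historical profile. The cutoff and the city-based location check below are illustrative assumptions, not the issuer's actual rules.

```python
import statistics

# Assumed historical profile for one card: past amounts and usual cities.
history_amounts = [22.0, 35.5, 18.0, 41.0, 27.5, 30.0, 25.0]
usual_cities = {"Salt Lake City"}

def is_outlier(amount, city, amounts=history_amounts, z_cutoff=3.0):
    """Flag a transaction that is both far above the card's historical
    spending (z-score past the cutoff) and in an unfamiliar location."""
    mu = statistics.mean(amounts)
    sigma = statistics.stdev(amounts)
    z = (amount - mu) / sigma
    return z > z_cutoff and city not in usual_cities

print(is_outlier(850.0, "Houston"))        # True: large amount, unfamiliar city
print(is_outlier(28.0, "Salt Lake City"))  # False: fits the profile
```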
When designing for fraud detection, make sure the data mart is designed to record every transaction and capture sequences of events (for example, the small charge followed by a series of large charges detailed in the steps I mentioned). An identifier for flagging fraudulent transactions is also very important. Other calculations and deltas might be needed, depending on the quality and quantity of data that exists. Once again, the entity profile data mart could become an automated solution, allowing companies to respond quickly to potential business risks and safeguard their business assets.
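One possible shape for such a data mart, sketched here in SQLite with hypothetical column names: one row per transaction, a timestamp so event sequences can be reconstructed per card, and an explicit fraud flag.

```python
import sqlite3

# A sketch only; real column names and types would come from your own design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    txn_id        INTEGER PRIMARY KEY,
    card_number   TEXT NOT NULL,
    txn_timestamp TEXT NOT NULL,               -- ISO-8601; orders events per card
    amount        REAL NOT NULL,
    merchant_type TEXT,
    fraud_flag    INTEGER NOT NULL DEFAULT 0   -- identifier for flagged transactions
);
-- Sequences of events per card are read back in time order:
CREATE INDEX idx_card_time ON transactions (card_number, txn_timestamp);
""")
```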
Loyalty card profiling. Many companies want to understand not only who their customers are, but also what kind of behavior they might exhibit. Combining customer purchase histories with demographics and other information can help companies design retail store layout, develop cross-sell and up-sell opportunities, and create targeted or customized marketing opportunities. As a result, many retail outlets are now tracking customer purchases through the use of what is called a "loyalty card." This card lets the company link your items purchased (your market basket) with your demographic and, in some cases, psychographic data. Understanding customer preferences and behavior is the goal of most businesses today.
Database design is the key to market basket analysis and loyalty card profiling. The data mart must collect and store all point-of-sale (POS) data. It must also link the appropriate loyalty card number with the items purchased. Once you know the basket relationships, you can incorporate all the demographic information to create a profile of the individuals creating these relationships. The POS data should be formatted so that each item purchased is a separate record. Once a solution is properly constructed, an analyst could take advantage of combining it with other solution-based data marts for other retail problems. Solution-based data marts are modular and can be used as building blocks to help solve a more complex set of problems.
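As a sketch of that format, the snippet below represents each item purchased as a separate record linked to a loyalty card number, groups records into baskets, and counts item-pair co-occurrences, the raw material of market basket analysis. The record layout and data are illustrative.

```python
from collections import Counter
from itertools import combinations

# Item-level POS records: (loyalty_card, basket_id, item), one row per item.
pos_records = [
    ("card-001", "b1", "bread"), ("card-001", "b1", "butter"),
    ("card-001", "b2", "bread"), ("card-001", "b2", "jam"),
    ("card-002", "b3", "bread"), ("card-002", "b3", "butter"),
]

# Group items by basket.
baskets = {}
for card, basket_id, item in pos_records:
    baskets.setdefault(basket_id, set()).add(item)

# Count how often each item pair appears in the same basket.
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))
# [(('bread', 'butter'), 2), (('bread', 'jam'), 1)]
```

Because the loyalty card number rides along with every record, these basket relationships can then be joined to demographic data to profile the customers behind them.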
Making an Impact
Despite your best efforts, you may not be able to convince your company to use separate solution-based data marts. Nevertheless, you can still have a positive impact on database design by making sure these questions are addressed:
Will the design accommodate easy access to data (for example, addressing internal and external data, multiple sources, and so on)?
Will the design make it easier to extract data?
How easy will it be to manipulate data (such as roll-ups, joins, and derived attributes)?
Which tools will be included for seeing the data (for example, OLAP tools, reports, descriptive statistics, and interactive visualization)?
Will it be easy to store and manage data? Is temporary storage available? What should be done with remaining information? Can data preparation steps be archived for reuse as model metadata?
Does the design accommodate sharing work with others (by providing code libraries, collaboration tools, source code control, and so on)?
Are elements in place for distributing results (such as model libraries for scoring in production, models that know about data preparation requirements, libraries that make it easy to run scoring of archived models without intervention from the original analyst)?