SIZE: Breaking the Warehouse Barrier

By Philip J. Gill


E-Business Demands Rapid Increases in Data-Warehouse Capability and Size

With e-commerce sites open 24 hours a day to accept transactions from millions of people, the data warehouses that support these sites are growing at internet speed, and they must run at internet speed as well. For companies such as digital-marketing-services provider MatchLogic, a database with the scalability and reliability of Oracle8i is critical.

As with everything else it touches, the internet is causing big changes in data warehouses and data marts, particularly those that support e-commerce sites. The busiest Web sites can experience millions of mouse clicks per day, and because they capture every click cybershoppers make, data warehouses and data marts are growing at internet speed. According to the IT research and advisory firm META Group, Inc. (www.metagroup.com), 30 percent of the world's 2,000 largest companies (META's Global 2000) already have data warehouses larger than one terabyte, and several exceed 100 terabytes. E-commerce data is one of the factors driving up data-warehouse sizes.

More dramatic, however, is the change in the fundamental nature of data warehouses and data marts. Businesses have always regarded these systems as critical, because they help companies understand their operations, customers' behavior, market dynamics, and other factors. But until now, companies had never viewed these information storehouses in the same way as they viewed operational systems—such as online-transaction-processing (OLTP) systems—that must be running at peak performance at all times.

The internet changed all that. Most conventional brick-and-mortar retailers close their doors at night, at least for a few hours—but "e-tailers" don't. Like all-night drugstores, they're open for business 24 hours a day, seven days a week, every day of the year. As a result, e-business data warehouses are taking on some of the characteristics of OLTP systems.

"E-commerce is not only making data warehouses larger but it's making them more operational as well," says Aaron Zornes, vice president at the META Group. Like OLTP systems, he says, data warehouses and data marts that support e-commerce sites need to be up and running 24x7x365.

The sizing of systems, adds Zornes, amplifies all other issues, including backup and recovery, performance, and tuning requirements. And because e-commerce requires data warehouses to be online around the clock, database administrators need to keep the systems running at top speed. That makes advances such as greater scalability and partitioning crucial for Web e-commerce. With partitioning, for instance, DBAs can block off a single segment of the overall database for backup and recovery, allowing the rest of the system to continue operating.
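The partitioning idea can be illustrated with a small sketch. This is a toy model in Python, not Oracle8i's partitioning feature; the class name `PartitionedTable`, its methods, and the monthly partition key are all assumptions made for illustration. It shows only the principle described above: one segment is blocked off for backup while the rest of the table keeps serving reads and writes.

```python
from datetime import date


class PartitionedTable:
    """Toy range-partitioned table: one partition per calendar month (hypothetical)."""

    def __init__(self):
        self.partitions = {}   # (year, month) -> list of rows
        self.offline = set()   # partitions currently blocked off for backup

    def insert(self, day, row):
        key = (day.year, day.month)
        if key in self.offline:
            raise RuntimeError(f"partition {key} is offline")
        self.partitions.setdefault(key, []).append(row)

    def take_offline(self, key):
        self.offline.add(key)

    def bring_online(self, key):
        self.offline.discard(key)

    def scan(self):
        """Return rows from every partition that is still online."""
        rows = []
        for key, part in sorted(self.partitions.items()):
            if key not in self.offline:
                rows.extend(part)
        return rows


table = PartitionedTable()
table.insert(date(1999, 11, 30), {"clicks": 15})
table.insert(date(1999, 12, 1), {"clicks": 20})

table.take_offline((1999, 11))                   # back up November only
online_rows = table.scan()                       # December is still readable
table.insert(date(1999, 12, 2), {"clicks": 7})   # and still writable
table.bring_online((1999, 11))
```

In a real database the backup itself would run while the partition is blocked off; the sketch leaves that step out and shows only that the other partitions remain available.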

Features such as preprocessed queries are also important for maintaining high performance and delivering shorter response times in these very large Web e-commerce systems. (See "Using Materialized Views to Speed Up Queries.")


Targeting Cyberconsumers: MatchLogic

Typical of this new wave of operational data warehouses are those operated by MatchLogic, Inc., a wholly owned subsidiary of Excite@Home (www.excite.com) that provides digital-marketing services to Fortune 100 companies, direct marketers, and digital merchants. MatchLogic helps advertisers target banner ads—those ubiquitous ads that pop up or scroll across Web pages—at cyberconsumers.

Snapshot: MatchLogic

MatchLogic, Inc. (www.matchlogic.com), based in Westminster, Colorado, was founded in 1996. As a full-service digital-marketing company, it offers online-advertisement serving, site registration, customer-data acquisition and analysis, and e-mail marketing campaigns. MatchLogic's ad servers support more than 1,500 Web sites, and the company tracks and records more than 300 million Web-site impressions daily. MatchLogic's core data-mart services help customers analyze ad response, customer buying trends, and more. The data marts all rely on Oracle8i Database Server, for its performance, scalability, and manageability—manageability that allows MatchLogic to run almost 12 terabytes of Oracle databases with only six DBAs.

Hardware

  • Digital Alpha servers
  • EMC 39/30 Storage Server
  • Sun Microsystems Ultra Enterprise 6500 and 4500 servers
  • Sun Microsystems A-1000 StorageArray

Software & Services

  • Oracle8i Database Server
  • Oracle Parallel Server
  • Oracle Designer
  • Oracle Discoverer
  • Oracle Enterprise Manager
  • Ardent DataStage Suite
  • BusinessObjects' BusinessObjects
  • Digital UNIX
  • EMC TimeFinder and PowerPath
  • Sun Solaris 2.6

For consumers and business buyers, the internet offers an easy, convenient way to shop, allowing them to log on at any time to surf for products, services, and best prices. For marketers and advertisers, it's a bit more complicated. Manufacturers and retailers know the internet to be a convenient new sales and distribution medium that lets them reach potential customers directly, quickly, and inexpensively. But they also know it's not always easy to find the right place to advertise. Although the internet is a powerful communications medium and distribution channel that reaches more than 190 million people worldwide, according to International Data Corporation (www.idc.com), the very magnitude of its reach and scope has become a central part of the marketing dilemma: Not all Web surfers are potential customers.

"On the internet, nobody knows if you're a 12-year-old child, a dog, or a serious shopper," says Jack Garzella, MatchLogic's director of core systems engineering. Of course, he adds, advertisers want to reach as many serious cybershoppers as possible. To that end, MatchLogic uses sophisticated, large-scale data warehouses and data marts based on Oracle8i to help some of the Web's largest and most prestigious marketers and advertisers serve up ads, collect customer data, and track consumer behavior. The company's goal is to help advertisers reach the largest number of potential buyers with their marketing investments.


New Medium, New Measures

With traditional mass-media outlets, such as television, advertisers rely on popular numerical measures, particularly the Nielsen ratings, to measure their success with consumers. With many Web sites relying on advertising revenue to support themselves, similar measures have emerged as an important indication of a site's popularity and, therefore, its desirability as an advertising location. Some sites track the number of users frequenting the site, through either daily or monthly unique impressions. Other sites, particularly the popular portals such as Yahoo and Lycos, count the number of registered users who have voluntarily submitted their names, e-mail addresses, and demographics to the site. The higher these numbers, the better those sites supposedly are at attracting cyberbuyers.

But a recent News.com (www.news.com) article points out a glaring discrepancy in some of the numbers—when all the user-registration numbers from the largest internet portals are combined, they far exceed the actual number of internet users worldwide. "The numbers of unique users or impressions are becoming less and less useful measures of how well advertisers are reaching audiences on the Web," says Christopher Charron, research director for new media at Forrester Research Corporation (www.forrester.com). "The typical marketing measures of the old media—eyeballs—just aren't effective in measuring the return on investment of Web advertising."

On the Web, Charron says, "marketers are looking for the only metrics that count—results." That means targeting the sites that deliver high click-through rates that ultimately lead to online purchases.


Beyond the Numbers

MatchLogic's various services help advertisers get beyond the eyeball numbers. The company provides not only advertisement serving but also ad targeting, ad data acquisition and analysis, mass-distribution e-mails, and other services that help Web retailers reach serious shoppers. With a customer list that includes General Motors Corporation, Procter & Gamble, Intel Corporation, Charles Schwab & Co., Beyond.com, Peapod, and Worldprints.com, MatchLogic serves up ads to more than 1,500 Web sites around the world and tracks more than 300 million impressions per day from Web sites.

Since its founding in 1996, MatchLogic has collected more than 72 million profiles of internet users, matching the 72 million cookies (anonymous ID tags that Web sites store in a user's browser) associated with those profiles with six demographics: age, income, gender, education, marital status, and parent/nonparent status. This data lets MatchLogic's clients target ad placement according to demographic as well as geographic criteria.

In addition, a MatchLogic data-acquisition service helps companies collect, store, and analyze information from Web-site visitors. MatchLogic collates the information in an Oracle8i data warehouse and uses a high-level, summary view of the data to help companies identify potential customers.

The company also offers an e-mail service that is based on information in an Oracle8i data warehouse. It sends out two or three mass e-mailings each day, targeted at potential customers who have already indicated their willingness to receive unsolicited e-mail messages.


Most Bang for the Buck

To help companies get the highest ROI for their ad dollars, MatchLogic offers its DataMart product line, a series of Oracle8i data warehouses and data marts, which provides three types of outsourced internet data warehouses—two anonymous and one self-reported.

For maximum effect, MatchLogic's customers combine data from the different data warehouses. For example, one of the anonymous data warehouses, the Aggregate Summary Warehouse, holds all the cookie-level data the company has collected over the years. "We don't know who you are; we just know that this browser has this ID on it and it has seen ads and clicked on these ads and gone to this Web site, that sort of thing," says Garzella.

The Aggregate Summary Warehouse provides a high-level, amalgamated view of click-stream data. It's a general-reporting warehouse that uses only summary-level data. For instance, says Garzella, the warehouse "reports that this campaign showed 15 creatives yesterday and this creative had 100 impressions, 15 clicks, and 2 purchases."
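A summary of this kind can be sketched as a simple roll-up of raw click-stream events into per-creative counts. The Python below is illustrative only: the event tuples, the campaign and creative names, and the functions `summarize` and `click_through_rate` are hypothetical and do not describe MatchLogic's actual schema or code.

```python
from collections import defaultdict

# Hypothetical raw click-stream events: (campaign, creative, event_type)
events = [
    ("spring_sale", "banner_a", "impression"),
    ("spring_sale", "banner_a", "impression"),
    ("spring_sale", "banner_a", "click"),
    ("spring_sale", "banner_b", "impression"),
    ("spring_sale", "banner_a", "purchase"),
]


def summarize(events):
    """Roll raw events up to one row of counts per (campaign, creative)."""
    summary = defaultdict(lambda: {"impression": 0, "click": 0, "purchase": 0})
    for campaign, creative, event_type in events:
        summary[(campaign, creative)][event_type] += 1
    return dict(summary)


def click_through_rate(counts):
    """Clicks divided by impressions; 0.0 when there were no impressions."""
    if counts["impression"] == 0:
        return 0.0
    return counts["click"] / counts["impression"]


summary = summarize(events)
banner_a = summary[("spring_sale", "banner_a")]
```

Applied to the figures Garzella quotes, `click_through_rate({"impression": 100, "click": 15, "purchase": 2})` gives 0.15, a 15 percent click-through rate for that creative.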

The Aggregate Summary Warehouse summarizes raw data collected by the other anonymous warehouse, the Research & Analysis (R&A) Warehouse. With the R&A Warehouse in particular, says Garzella, Oracle8i's scalability shines: "With click-stream data, you very quickly get into billion-row tables."

In contrast, the self-reported warehouse, the Profile Repository, contains contact history, campaign information, response history, e-commerce purchase information, e-mail messages, cookies and IDs, and more. It stores more than 12 million names, postal addresses, and e-mail addresses of people who have voluntarily supplied that data to MatchLogic. With this information, MatchLogic and its customers can perform overlays to identify what statistics match specific demographics.

The Aggregate Summary Warehouse currently contains about 200GB of data and is growing at a relatively stable rate of about 5GB per week, says Garzella. The Profile Repository is much larger, about 500GB, and growing rapidly toward the terabyte mark. "A year and a half ago, it was 150GB," Garzella notes. He believes that its growth path will remain on a similar, upward trajectory.
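To put those growth figures in perspective, here is a rough back-of-the-envelope projection in Python. It assumes a decimal terabyte (1,000GB) and models the Profile Repository two ways, as steady linear growth and as steady compound growth; neither model comes from MatchLogic, and the function names are made up for this sketch.

```python
import math

GB_PER_TB = 1000  # assumption: decimal terabyte


def weeks_to_reach(current_gb, target_gb, gb_per_week):
    """Weeks until target size at a constant weekly increment."""
    return (target_gb - current_gb) / gb_per_week


def months_to_reach_linear(past_gb, current_gb, months_elapsed, target_gb):
    """Months until target size if the past absolute growth per month continues."""
    gb_per_month = (current_gb - past_gb) / months_elapsed
    return (target_gb - current_gb) / gb_per_month


def months_to_reach_compound(past_gb, current_gb, months_elapsed, target_gb):
    """Months until target size if the past percentage growth per month continues."""
    monthly_log_growth = math.log(current_gb / past_gb) / months_elapsed
    return math.log(target_gb / current_gb) / monthly_log_growth


# Figures quoted in the article: 200GB growing 5GB per week; 150GB -> 500GB in 18 months
aggregate_weeks = weeks_to_reach(200, GB_PER_TB, 5)
profile_linear = months_to_reach_linear(150, 500, 18, GB_PER_TB)
profile_compound = months_to_reach_compound(150, 500, 18, GB_PER_TB)
```

Under these assumptions, the Aggregate Summary Warehouse would need 160 weeks to reach a terabyte, while the Profile Repository would get there in roughly 26 months on the linear model or roughly 10 months on the compound model.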

MatchLogic also provides customer-specific Profile Repositories and R&A Warehouses for customers that want them. The largest of these R&A Warehouses belongs to General Motors, exceeds 1.5 terabytes, and contains a fact table with more than 2 billion rows; using a star schema, it has 5 or 6 primary dimensions and about 20 secondary dimensions.
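For readers unfamiliar with the term, a star schema places one large fact table at the center and joins it to small dimension tables that describe each fact. The sketch below uses Python's built-in sqlite3 module as a stand-in for Oracle8i; the table names, column names, and sample rows are invented for illustration and show only two dimensions, not the 5 or 6 primary dimensions of the General Motors warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small, descriptive lookup data (hypothetical)
    CREATE TABLE dim_site     (site_id INTEGER PRIMARY KEY, site_name TEXT);
    CREATE TABLE dim_creative (creative_id INTEGER PRIMARY KEY, creative_name TEXT);

    -- Fact table: one row per measurement, keyed by the dimensions
    CREATE TABLE fact_ad_event (
        site_id     INTEGER REFERENCES dim_site(site_id),
        creative_id INTEGER REFERENCES dim_creative(creative_id),
        impressions INTEGER,
        clicks      INTEGER
    );

    INSERT INTO dim_site VALUES (1, 'portal'), (2, 'news');
    INSERT INTO dim_creative VALUES (10, 'banner_a'), (11, 'banner_b');
    INSERT INTO fact_ad_event VALUES
        (1, 10, 100, 15), (1, 11, 50, 2), (2, 10, 40, 4);
""")

# A typical star query: join the fact table to a dimension and aggregate
rows = conn.execute("""
    SELECT s.site_name, SUM(f.impressions), SUM(f.clicks)
    FROM fact_ad_event AS f
    JOIN dim_site AS s ON s.site_id = f.site_id
    GROUP BY s.site_name
    ORDER BY s.site_name
""").fetchall()
```

The same pattern extends to more dimensions: each additional dimension table adds one key column to the fact table and one join to the query.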


Shining Through

All the MatchLogic data warehouses use Oracle8i and generally use several Oracle8i options, including Oracle Parallel Server, objects, advanced queuing, and partitioning. MatchLogic chose to base its digital-marketing services on Oracle Database Server for its scalability, performance, and manageability, as well as for the vendor relationship and the dominance of third-party support for the Oracle architecture.

"Scalability and performance are big issues for us—that's scalability from a size and a staffing perspective," explains Garzella. "Both are very good on the Oracle8i platform. I'm running almost 12 terabytes of Oracle databases with six DBAs."

With the General Motors database, he notes, "We can do a full-table scan, looking at every row, and clear that 2-billion-row table in about 15 minutes."
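Taken at face value, those figures imply a striking scan rate. The short Python calculation below simply divides the quoted numbers; because Garzella says "about 15 minutes" and "almost 12 terabytes," the results are approximations, and the function name `scan_rate` is an assumption of this sketch.

```python
def scan_rate(rows, minutes):
    """Average rows scanned per second over the whole scan."""
    return rows / (minutes * 60)


# Quoted figures: a 2-billion-row table scanned in about 15 minutes
rows_per_second = scan_rate(2_000_000_000, 15)

# Quoted figures: almost 12 terabytes of databases managed by six DBAs
terabytes_per_dba = 12 / 6
```

That works out to roughly 2.2 million rows per second for the full-table scan, and about 2 terabytes of database per DBA.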

For browsers and buyers in the online-commerce world, a company's data-warehouse capabilities are invisible—until they cause performance or response problems. But for e-business, the pressure is constant: Provide 24x7 availability with a database that is growing at the speed of light, or your customers will find a company that can. Like MatchLogic, those companies need to build from the ground up with a scalable, reliable database server created with the internet in mind.
