Data Domain Systems
Post date: Oct 07, 2013 2:37:3 AM
EMC Data Domain deduplication storage systems provide a next-generation backup and recovery solution for big data. Users get the retention and recovery benefits of inline deduplication as well as the offsite disaster recovery (DR) protection of replication over the wide area network (WAN). Data Domain systems reduce the amount of disk storage needed to retain and protect data by 10x to 30x. Data on disk is available online and onsite for longer retention periods, and restores become fast and reliable. Storing only unique data on disk also means that data can be cost-effectively replicated over existing networks to remote sites for DR. EMC further extends these benefits through EMC Data Domain Boost software (DD Boost). DD Boost enables advanced integration between Data Domain systems and Greenplum Database for faster, more efficient backup and recovery.
This section also provides details on Data Domain system integration and administration.
Faster, more efficient backup
• Distributed deduplication process dramatically increases throughput
• Reduced network bandwidth utilization
Cost-efficient disaster recovery
• Encrypted replication
• Up to 99 percent bandwidth reduction
• Faster “time-to-DR” readiness
• Configured using native Greenplum Database backup and restore utilities
Ultra-safe storage for fast and reliable recovery
• Data Invulnerability Architecture
• Continuous recovery verification, fault detection, and healing
• End-to-end data integrity
Scalable deduplication storage
EMC Data Domain is the industry’s fastest deduplication storage system for enterprise backup and archiving workloads. With a throughput of up to 31 TB/hour, Data Domain systems can protect up to 28.5 petabytes of logical capacity, enabling more backups to complete sooner while putting less pressure on limited backup windows.
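As a rough illustration of what the quoted throughput means for a backup window, the sketch below divides a dataset size by a sustained ingest rate. The function name and the 124 TB dataset are hypothetical, and 31 TB/hour is the stated maximum, so real windows will usually be longer.

```python
def backup_window_hours(dataset_tb: float, throughput_tb_per_hour: float = 31.0) -> float:
    """Return the hours needed to ingest dataset_tb at a sustained rate.

    The default rate is the peak figure quoted in the text; it is an
    upper bound, not a value to plan against.
    """
    if throughput_tb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return dataset_tb / throughput_tb_per_hour


# Hypothetical 124 TB full backup at the quoted peak rate.
print(backup_window_hours(124.0))  # 4.0
```

Passing a lower, measured rate for `throughput_tb_per_hour` gives a more realistic estimate for a specific model and network.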
Data Domain is qualified with all leading enterprise backup software and archiving applications. It easily integrates into existing software infrastructures without change for either data center or distributed office data protection.
Data Domain systems integrate easily into existing data centers. All Data Domain systems can be configured as storage destinations for leading backup and archiving applications using NFS, Common Internet File System (CIFS), Data Domain Boost, or virtual tape library (VTL) protocols. Consult the compatibility matrices for information about the applications that work with the different configurations. Multiple backup servers can share one Data Domain system.
Integration into an existing Greenplum DCA
The DCA architecture is designed to help you easily integrate Data Domain systems in a nondisruptive, seamless manner. You can reserve Port 19 in each of the Interconnect switches for Data Domain connectivity. All that is required is to connect the Data Domain system directly into the DCA environment and start the configuration steps for NFS or DD Boost.
Data Domain systems are simple to install and manage. Connect an appliance to the backup server either as a file server via Ethernet or as a VTL via Fibre Channel. All three interfaces (NFS, CIFS, and VTL) can be used simultaneously. Data Domain Boost is also compatible with many other backup applications. For more information, see the EMC Data Domain Boost Compatibility Matrix at the Data Domain support portal.
The Data Domain Data Invulnerability Architecture provides ultra-safe storage for reliable recovery and continuous protection. It provides the industry’s best defense against data integrity issues. Continuous recovery verification, along with extra levels of data protection, continuously detects and protects against data integrity issues during the initial backup and throughout the data lifecycle. Unlike any other enterprise array or file system, each appliance ensures recoverability is verified and then continuously re-verified.
The Data Domain operating system (DD OS) includes extra levels of data protection to protect itself against storage-related faults that threaten data recoverability. Dual disk parity RAID 6 is part of the foundation for continuous fault detection and healing on DD OS. RAID 6 protects against two simultaneous disk faults, can rebuild a failed disk even if there are read errors on other sectors, and can detect and correct errors on-the-fly during reads. This added protection ensures the highest levels of data availability.
In determining global uniqueness, DD OS leverages very strong cryptographic hashing for speed and security. It does not stop there: a universal hash protects against random and malicious hash collisions. An append-only write policy guards against overwriting valid data. After a backup is completed, a validation process looks at what was written to disk to check that all file segments are logically correct within the file system and that the data on disk is the same as it was before being written. In the background, the Online Verify operation continuously checks that the data on the disks is correct and unchanged since the earlier validation process.
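The combination of fingerprinting, append-only writes, and later verification can be sketched in a few lines. This is a toy model, not DD OS code: the class and method names are hypothetical, SHA-256 is an assumed stand-in because the text does not name the hash, and the universal-hash collision check is left out.

```python
import hashlib


class SegmentStore:
    """Toy append-only store that keeps one copy of each unique segment."""

    def __init__(self):
        self._log = []      # append-only list of stored segments
        self._index = {}    # fingerprint -> position in the log

    def __len__(self):
        return len(self._log)

    @staticmethod
    def fingerprint(segment: bytes) -> str:
        # SHA-256 is an assumption; the source does not name the hash.
        return hashlib.sha256(segment).hexdigest()

    def write(self, segment: bytes) -> str:
        """Append the segment only if its fingerprint is new."""
        fp = self.fingerprint(segment)
        if fp not in self._index:
            # New data is appended; existing entries are never overwritten.
            self._index[fp] = len(self._log)
            self._log.append(segment)
        return fp

    def read(self, fp: str) -> bytes:
        """Return a segment after re-checking it against its fingerprint."""
        segment = self._log[self._index[fp]]
        if self.fingerprint(segment) != fp:
            raise IOError("stored segment no longer matches its fingerprint")
        return segment

    def verify_all(self) -> bool:
        """Background-style check that every stored segment is unchanged."""
        return all(
            self.fingerprint(self._log[pos]) == fp
            for fp, pos in self._index.items()
        )


store = SegmentStore()
first = store.write(b"weekly full backup block")
second = store.write(b"weekly full backup block")  # duplicate, not stored again
print(first == second, len(store), store.verify_all())
```

Writing the same segment twice returns the same fingerprint and leaves a single stored copy, which is the property that makes repeated backups cheap to keep.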
The back-end storage is set up in a double parity RAID 6 configuration (two parity drives). Additionally, hot spares are configured within the system. Each parity stripe has block checksums to ensure that the data is correct. The checksums are constantly used during the online verify operation and when data is read from the Data Domain system. With double parity, the system can fix simultaneous errors on up to two disks.
To keep data synchronized during a hardware or power failure, the Data Domain system uses non-volatile RAM (NVRAM) to track outstanding I/O operations. An NVRAM card with fully charged batteries (the typical state) can retain data for a minimum of 48 hours. When reading data back on a restore operation, the DD OS uses multiple layers of consistency checks to verify that restored data is correct.
The DD OS stores only unique data. Through Global Compression™, a Data Domain system pools redundant data from each backup image. The storage of unique data is invisible to backup software, which sees the entire virtual file system. DD OS data compression is independent of data format. Data can be structured, such as databases, or unstructured, such as text files. Data can be from file systems or raw volumes. Typical compression ratios are 20:1 on average over many weeks, assuming weekly full and daily incremental backups. A backup that includes many duplicate or similar files (files copied several times with minor changes) benefits the most from compression. The amount of compression varies with backup volume, size, retention period, and rate of change.
The best compression happens with backup volume sizes of at least 10 mebibytes (MiB, a unit of data storage equal to exactly 1,048,576 bytes and the base-2 counterpart of MB). To take full advantage of multiple Data Domain systems, a site that has more than one Data Domain system should consistently back up the same client system or set of data to the same Data Domain system. For example, if a full backup of all sales data goes to Data Domain system A, the incremental backups and future full backups for sales data should also go to Data Domain system A.
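One way to enforce the "same data, same system" rule is to record the first assignment of each dataset and reuse it for every later backup. The sketch below does that with a small in-memory table. The class, the dataset names, and the system names are all hypothetical, and a real deployment would persist the table or configure the mapping in the backup software.

```python
from collections import Counter


class BackupRouter:
    """Remember which Data Domain system each dataset was first sent to."""

    def __init__(self, systems):
        if not systems:
            raise ValueError("at least one system is required")
        self._systems = list(systems)
        self._assignments = {}  # dataset name -> system name

    def target_for(self, dataset: str) -> str:
        """Return the system for a dataset, assigning one on first use."""
        if dataset not in self._assignments:
            # First backup of this dataset: pick the system that currently
            # holds the fewest datasets (ties go to the earliest listed).
            load = Counter(self._assignments.values())
            self._assignments[dataset] = min(self._systems, key=lambda s: load[s])
        # Later fulls and incrementals always reuse the first choice.
        return self._assignments[dataset]


router = BackupRouter(["dd-system-a", "dd-system-b"])
print(router.target_for("sales"))  # dd-system-a
print(router.target_for("hr"))     # dd-system-b
print(router.target_for("sales"))  # dd-system-a
```

A sticky table is used here rather than hashing the dataset name over the list of systems, because a hash-based choice would change for existing datasets whenever a system is added.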
A Data Domain system compresses data at two levels:
• Global compression: compares received data to data already stored on disk. Duplicate data does not need to be stored again, while new data is locally compressed before being written to disk.
• Local compression: a Data Domain system uses a local compression algorithm developed specifically to maximize throughput as data is written to disk. The default algorithm (lz) allows shorter backup windows for backup jobs but uses more space. Local compression options provide a trade-off between performance and space usage.
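The two levels can be pictured as a short pipeline: drop segments that are already stored, then compress whatever is new. The sketch below is illustrative only. Fixed 4 KiB segments, SHA-256 fingerprints, and zlib (standing in for the lz option) are all assumptions, and the function name is hypothetical.

```python
import hashlib
import zlib


def ingest(data: bytes, stored: dict, segment_size: int = 4096) -> int:
    """Store data segment by segment and return the bytes newly written.

    `stored` maps a segment fingerprint to its compressed bytes and is
    updated in place.
    """
    written = 0
    for start in range(0, len(data), segment_size):
        segment = data[start:start + segment_size]
        fp = hashlib.sha256(segment).digest()
        if fp in stored:
            continue  # global compression: duplicate segment, store nothing
        packed = zlib.compress(segment)  # local compression of new data
        stored[fp] = packed
        written += len(packed)
    return written


stored = {}
backup = b"customer record " * 4096
first_pass = ingest(backup, stored)
second_pass = ingest(backup, stored)  # same data again, as in a repeat full backup
print(first_pass > 0, second_pass)
```

The second pass writes nothing because every segment is already present, while the first pass is reduced only by local compression and by repetition within the backup itself.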
Data Domain Stream-Informed Segment Layout (SISL) enables high-throughput inline deduplication. SISL identifies 99 percent of the duplicate segments in RAM, inline, before storing to disk. In addition, it stores related segments and fingerprints together, so large groups can be read at once. With these patented techniques, Data Domain can utilize the full capacity of large SATA disks for data protection and minimize the number of disks needed to deliver high throughput. In the long term, SISL allows dramatic Data Domain system performance improvements as CPU speeds increase.
Multipath and load-balancing configuration
Data Domain systems that have at least two 10 GbE ports can support multipath configuration and load balancing. In a multipath configuration on the Data Domain system, each of the two 10 GbE ports on the system is connected to a separate port on the backup server.
EMC Data Domain Boost significantly increases performance by distributing parts of the deduplication process to the backup server, simplifies disaster recovery procedures, and serves as a solid foundation for additional integration between backup applications and Data Domain systems.
Data retention, backup frequency, rate of change, and backup policies all influence how much storage the Data Domain system requires. For this solution, the initial capacity was chosen to accommodate a simulated 10 weeks of backup of the DCA.
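A first-pass capacity estimate can be written as simple arithmetic over the factors listed above. The sketch below assumes weekly full and daily incremental backups and applies a single flat compression ratio. The function name, the 10 TB and 1 TB sizes, and the choice of six incrementals per week are hypothetical, and a flat 20:1 ratio is a simplification because the ratio typically builds up over many weeks.

```python
def estimate_physical_tb(
    full_tb: float,
    daily_incremental_tb: float,
    weeks: int = 10,
    incrementals_per_week: int = 6,
    compression_ratio: float = 20.0,
) -> float:
    """Estimate physical capacity for a retention window, in TB.

    Logical data is one full plus the incrementals for each retained week;
    physical data is the logical total divided by the assumed ratio.
    """
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    logical_tb = weeks * (full_tb + incrementals_per_week * daily_incremental_tb)
    return logical_tb / compression_ratio


# Hypothetical 10 TB weekly full with 1 TB daily incrementals, kept 10 weeks.
print(estimate_physical_tb(10.0, 1.0))  # 8.0
```

Re-running the estimate with a lower ratio, for example 10.0, gives a more conservative figure for data with a high rate of change.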
Data Domain Enterprise Manager
All Data Domain systems run the DD OS, which includes Data Domain Enterprise Manager, a simple web-based rich Internet application for managing Data Domain systems. DD System Manager provides both a GUI and a command line interface (CLI) for configuration management and monitoring all system operations. The web-based GUI, available through Ethernet connections, can manage up to 20 Data Domain systems (depending on the model) at any location. DD System Manager provides a single, consolidated management interface that allows for the configuration and operation of many system features and settings.
DD System Manager also provides real-time graphs and tables that enable users to monitor the status of system hardware components and configured features. Additionally, a command set that performs all system functions is available to users through the CLI. Commands configure system settings and provide displays of system hardware status, feature configuration, and operation.
The CLI is available through a serial console when a keyboard and monitor are directly attached to the Data Domain system, or remotely through an Ethernet connection using SSH or Telnet. For more information on Data Domain Enterprise Manager, refer to the Data Domain Operating System (DD OS) Administration Guide.
Data Domain file system
Data Domain systems are designed to be a highly reliable “storage of last resort” to provide longer-term onsite retention of backups. As new backups are added to the system, old backups are aged out. Such removals are normally done under the control of backup software (on the backup server) based on the configured retention period. This process is similar to configuring tape retention policies in which older backups are retired and the tapes are reused for new backups.
When backup software removes an old backup from a Data Domain system, the space on the Data Domain system becomes available only after the Data Domain system cleans the retired disk space. A good way to manage space on a Data Domain system is to retain as many online backups as possible, with some empty space (about 20 percent of the total space available) to comfortably accommodate backups until the next scheduled cleaning run. Space utilization on a Data Domain system is primarily affected by:
• The backup policy and redundancy in the data
• The size, redundancy, and rate of change of the backup data
• The retention period specified in the backup software
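The guideline of keeping about 20 percent of total space free until the next cleaning run can be expressed as a simple check. This is a sketch with hypothetical names and sample figures. It does not query a real system, so the used and total values would have to come from the system's own reporting.

```python
def below_free_space_reserve(
    used_tb: float, total_tb: float, reserve_fraction: float = 0.20
) -> bool:
    """Return True when free space has dropped under the reserve.

    The 20 percent default follows the rule of thumb in the text.
    """
    if total_tb <= 0:
        raise ValueError("total capacity must be positive")
    free_fraction = (total_tb - used_tb) / total_tb
    return free_fraction < reserve_fraction


print(below_free_space_reserve(85.0, 100.0))  # True: only 15 percent free
print(below_free_space_reserve(70.0, 100.0))  # False: 30 percent free
```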
High levels of compression result when backing up datasets with many duplicates and retaining them for long periods of time.
The Data Domain file system supports the following interfaces:
• Data Domain Boost
For more information on the file system, refer to the Data Domain Operating System (DD OS) Administration Guide.