Who invented data deduplication?

For example, with respect to the code representing the context, a backup server may choose to classify files by file type. It is the backup server's prerogative to decide what it wants to group together. Another possible grouping could be by file owner, such that all of a first user's files have one color and a second user's files have another color.
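
As a rough illustration of such grouping policies, the Python sketch below assigns a "color" to a file either by type or by owner. The function names and criteria are assumptions for illustration only, not the exact scheme described here.

```python
import os

def color_by_type(path: str) -> str:
    """All files sharing an extension get the same color, e.g. every .docx file."""
    return os.path.splitext(path)[1].lower() or "<no-extension>"

def color_by_owner(path: str) -> str:
    """All files belonging to the same owner (numeric uid) get the same color."""
    return str(os.stat(path).st_uid)
```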

Furthermore, the preferred character is represented using the code such that the representation retains the original meaning of the character. In one embodiment, the present invention identifies similarities between data chunks encoded using the aforementioned coding scheme by comparing the metadata represented by the Unicode characters and performing deduplication based on the level of similarity.
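
A minimal sketch of how such a metadata comparison might look is given below. The similarity measure (fraction of shared color characters) and the threshold are assumptions, not the source's exact metric.

```python
def metadata_similarity(colors_a: str, colors_b: str) -> float:
    """Compare two chunks' color metadata strings; 1.0 means identical color sets.

    Hypothetical helper: the Jaccard-style measure here is illustrative only.
    """
    set_a, set_b = set(colors_a), set(colors_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def are_dedup_candidates(colors_a: str, colors_b: str, threshold: float = 0.8) -> bool:
    # Chunks whose color metadata is similar enough become deduplication candidates.
    return metadata_similarity(colors_a, colors_b) >= threshold
```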

Using file coloring, not only is data bearing the same color quickly distinguished, but the intensity of the file colors is also rapidly identified. Thus, file coloring assists deduplicating appliances in marking and then locating similar candidate files instead of relying on hash values alone.

Another embodiment described herein adds a further dimension to the file coloring process, allowing sub-grouping into smaller and more manageable classes. An additional enhancement is provided: instead of relying on hash values alone, a hybrid of proximity-similarity and identity-similarity based deduplication is used. In other words, not only are the file colors of data blocks distinguished, but the intensity of the file colors is identified to improve the classification of colored files.

The color intensity is used to improve the classification of files within a given file color group. A signature may be used to identify similarly colored files. In file coloring, unique binary representations in a file serve as the color markers. It should be noted that the present invention is indifferent to the semantics of the data; it is only interested in the data stream's file color and the color intensity, which are used to sub-classify data streams that happen to share the same color.

In so doing, similar data streams are identified in order to improve the deduplication hit rate. Efficiency is increased, and the deduplication process remains unaware of the files' type, since the present invention focuses on attributes common to all files, especially text files, which are the best candidates for deduplication.

Measuring the intensity of the colored files enhances the classification of data files that were grouped by color, through comparison of the color intensity of the colored files. The color intensity is used to improve the classification of files within a given color group. As such, the present invention uses file coloring to assist the process of finding similar data chunks, so that the search is not based solely on hash signatures.

The present invention assists in clustering potentially similar chunks together and, in doing so, creates a hybrid solution of proximity similarity and identity similarity that increases the efficiency of the deduplication process and minimizes the chance of hash collisions that could lead to data loss. This hybrid of proximity- and identity-similarity based deduplication is used in a data deduplication system using a processor device in a computing environment.

A frequency distribution map of characters may also be built. However, rather than using the frequency distribution map only to build hash values that are less likely to change when minor changes occur in the data, the distribution of the various characters that appear in the data is used as the basis for the suggested approximate hash.

Each such data chunk C is analyzed as to the distribution of the bytes forming it and their frequencies. The sequence of different bytes, ordered by their frequency of occurrence in the chunk, is defined as the c-spectrum of C, and the corresponding sequence of frequencies as the f-spectrum of C.

In addition, the sequence of different byte pairs is considered, ordered by each pair's frequency of occurrence in the chunk, and is called the p-spectrum of C.
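
The following Python sketch computes the three spectra for a chunk C as just defined. It is a straightforward reading of the definitions, with ties in frequency broken arbitrarily.

```python
from collections import Counter

def spectra(chunk: bytes):
    """Return the c-spectrum, f-spectrum and p-spectrum of a chunk C.

    c-spectrum: distinct byte values, ordered by decreasing frequency.
    f-spectrum: the corresponding frequencies, in the same order.
    p-spectrum: distinct (overlapping) byte pairs, ordered by decreasing frequency.
    """
    byte_counts = Counter(chunk)
    ordered = byte_counts.most_common()            # [(byte_value, count), ...]
    c_spectrum = [b for b, _ in ordered]
    f_spectrum = [n for _, n in ordered]

    pair_counts = Counter(zip(chunk, chunk[1:]))   # consecutive byte pairs
    p_spectrum = [pair for pair, _ in pair_counts.most_common()]
    return c_spectrum, f_spectrum, p_spectrum
```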

The suggested approximate hash function ah(C) is a combination of certain elements of these spectra. The reasoning behind relying on these color distributions is that small perturbations in the data will often have no impact, or only a minor impact, on the corresponding spectra, which is exactly the goal in designing an approximate hash. Frequency alone cannot be used for similarity approximation, but frequency combined with the intensity of the character distribution can.
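
One plausible realization of such an approximate hash, built on the spectra helper sketched above, keeps only the most frequent bytes together with coarsely quantized frequencies, so small perturbations rarely change it. The choice of k and the quantization step are assumptions.

```python
def approximate_hash(chunk: bytes, k: int = 8) -> tuple:
    """Illustrative ah(C): the k most frequent bytes plus their quantized
    relative frequencies (in percent).  Minor edits to the chunk usually
    leave both parts unchanged, which is the stability property sought."""
    c_spectrum, f_spectrum, _ = spectra(chunk)
    total = len(chunk) or 1
    top_bytes = tuple(c_spectrum[:k])
    top_freqs = tuple(round(100 * f / total) for f in f_spectrum[:k])
    return top_bytes, top_freqs
```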

The file-color based technique requires only a single pass over the data to compute the chunk's color attributes. The file-color technique does not use the frequency distribution as a means to another end, but rather as an end in itself. Consider a scenario in which a museum that owns several original Van Gogh paintings is looking for new pieces to acquire. Continuing with the painting analogy, the museum naturally records all of its original paintings and keeps an index of their unique signatures.

A more thorough test is required to establish the authenticity of a piece that passed the earlier similarity test. In the case of the painting, the detailed comparison is done by comparing actual fractions of the two paintings to one another or, as done by other vendors, by comparing the respective signatures of the paintings' fractions, to determine whether it is a new piece of work or, even worse, a poor copy of the original painting.

Also, the museum does not spend its monetary resources up front and therefore keeps only brief descriptions of each of its paintings. The brief descriptions might be the same for a number of paintings. Using this analogy as a backdrop, the file-coloring technique described herein may require comparing a number of similar descriptions against a new piece of data. The file-coloring operation generally characterizes the data at a coarse level, much like the museum's brief descriptions.

It should be noted that similarity tests, by definition, cannot lead to data loss, in which a piece of data is treated as identical to newly arriving data even though it is not, since a thorough test is conducted to verify the identity of the respective data blocks.

A similarity test for identifying data chunks may be evaluated according to four categories. Failing in a category will result in less than optimal deduplication, as duplicate data chunks will be created instead of referencing earlier ones. The data chunks' colors are used as a means of identifying and then comparing the data chunks for similarity. The hybrid-based similarity test leverages the initial color grouping (category 1). Using file coloring to test the similarity of data chunks allows for a hierarchical structure, such that if an alleged similar chunk is found to be a false candidate, other chunks in the same proximity, according to their attributes, may be used.

This differs from using hash values alone, where the hash values carry no inherent description of the data they represent and bear no probable relation to other possible similar data chunk candidates.

To execute hybrid proximity- and identity-similarity based deduplication in a data deduplication system, the color intensities of colored files are compared to further refine the classification of files that are grouped together by file coloring. The files may be colored by representing a preferred character for the file coloring using a code selected from a multiplicity of codes that represent a variety of contexts.

A multistep similarity search operation may be used, searching first for a dominant characteristic of the colored files and then for associated characteristics of the colored files. The dominant characteristic may include a dominant color, and the associated characteristics may include at least the color intensity and the distribution of colors across the colored files.

A signature may be used to identify similarly colored files and classify them together. The file coloring of data chunks may be embedded in data streams.

The file coloring may include shapes or colors for one of servers, file owners, and applications. The colored files may be compared by comparing the vectors of at least two colored files. The color intensity is used to compare colored files by measuring the ratio of the actual average distance of a color in the colored file to the optimal average distance, in order to compare the distribution of colors in the data chunks of the colored files. The optimal average distance is equal to the file size divided by the total number of file colors that appear within the colored files.

The color intensity includes a distribution pattern characteristic of the file coloring. Similarities between data chunks of the colored files are identified using the color intensity. The data chunks identified as having a similar color intensity are then classified as similar data chunks.

A file color group contained in at least one of a multiplicity of file color permutations is identified for incoming data chunks. Generally, a permutation is the act of permuting, or rearranging, objects or values. Informally, a permutation of a set of objects is an arrangement of those objects into a particular order.

As another example, an anagram of a word is a permutation of its letters. The incoming data chunks are compared to existing data chunks in one of the multiplicity of file color permutations, or to an alternative one of the multiplicity of file color permutations if all of the incoming data chunks fail to be substantially similar to the existing data chunks.

The color intensity may be calculated by measuring the ratio of the actual average distance of a color in the colored file to the calculated average distance. The calculated average distance assumes that the color is evenly distributed throughout the colored file. If the calculated average distance is larger than the actual average distance, the color is identified as being clustered in the data file rather than evenly spread.
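
A sketch of this intensity calculation is shown below. It interprets "the total number of file colors that appear" as the number of occurrences of the color in question, which is an assumption about the source's intent.

```python
def color_intensity(positions: list, file_size: int) -> float:
    """Ratio of the actual average distance between occurrences of one color to
    the calculated (even-spread) average distance.  A value below 1.0 suggests
    the color is clustered rather than evenly distributed."""
    if len(positions) < 2 or file_size <= 0:
        return 1.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    actual_avg = sum(gaps) / len(gaps)
    even_spread_avg = file_size / len(positions)   # calculated average distance
    return actual_avg / even_spread_avg
```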

The hybrid file coloring technique described herein is used to improve and fine-tune the hash clusters that follow from the induced file coloring also described herein.

An n-tuple is defined inductively using the construction of an ordered pair. Each tuple may host the respective data chunks that are characterized by the tuple. The tuples may be ordered or unordered. In an ordered tuple, the weights of the file colors are significant.

In a non-ordered tuple, a file color is merely part of the tuple, and the weight of the color within that tuple is not recorded. This approach takes into account the possibility that changes in the data chunks may cause the color order to change, which is more realistic even though it increases the number of chunks per tuple. In the case of the ordered tuple, a number of related tuples may need to be actively searched before the search for similar chunks is exhausted.
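
As a small illustration of the difference, the keys below sketch how the two tuple flavors might be represented; the color names and chunk identifiers are purely illustrative.

```python
# Ordered tuple: the order encodes the colors' weights within the chunk.
ordered_key = ("blue", "red", "green")

# Non-ordered tuple: membership only, so edits that merely reorder the colors
# still map the chunk to the same tuple.
unordered_key = frozenset({"blue", "red", "green"})

# Chunks characterized by the same key are hosted in the same tuple (bucket).
tuples = {ordered_key: ["chunk-17", "chunk-42"], unordered_key: ["chunk-99"]}
```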

The hybrid approach described herein repeatedly uses the hashes and groups them for increased processing efficiency while reducing search time. In one embodiment, the maximum number of color-based trees is very large; despite this, the color-based trees are expected to be sparse and radically smaller in number. The order in which the chunks should be compared is twofold. First, the color vectors of two chunks are compared if the majority of the colors appear in both chunks.
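
A first-pass comparison along these lines might look like the sketch below; the "majority" rule is stated loosely in the text, so the exact threshold used here is an assumption.

```python
def majority_colors_shared(colors_a: dict, colors_b: dict) -> bool:
    """colors_a / colors_b map each color to its weight in a chunk.

    Returns True when a majority of the colors appear in both chunks, which is
    the precondition above for comparing the two color vectors in detail.
    """
    shared = set(colors_a) & set(colors_b)
    smaller = min(len(colors_a), len(colors_b))
    return smaller > 0 and len(shared) / smaller > 0.5
```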

The actual average distance is measured across the colored file from end to end. The optimal average distance is equal to the file size divided by the total number of file colors that appear within the colored files, and the color intensity includes a distribution pattern characteristic of the file coloring.

Data chunks with comparable color intensity are more likely to be similar than those with different levels of color intensity or character distribution pattern. Turning now to the figure, the computer system 10 includes a central processing unit (CPU) 12, which is connected to communication port 18 and memory device 16. The communication port 18 is in communication with a communication network 20. The communication network 20 and storage network may be configured to be in communication with server hosts 24 and storage systems, which may include storage devices 14. Memory device 16 may include such memory as electrically erasable programmable read-only memory (EEPROM) or a host of related devices.

Memory device 16 and storage devices 14 are connected to CPU 12 via a signal-bearing medium. In addition, CPU 12 is connected through communication port 18 to a communication network 20, which has an attached plurality of additional computer host systems. Memory device 16 and CPU 12 may also be embedded and included in each component of the computing system. Host computers are shown, each acting as a central processing unit for performing data processing as part of a data storage system. The hosts may be local or distributed among one or more locations and may be equipped with any type of fabric or fabric channel (not shown in the figure).

The data storage system depicted in the figure is accordingly equipped with a suitable fabric (not shown). The cluster hosts may include cluster nodes. To facilitate a clearer understanding of the methods described herein, a storage controller is shown in the figure.

It is noted that in some embodiments, the storage controller comprises multiple processing units, each with its own processor complex and system memory, interconnected by a dedicated network within the data storage system. The storage is labeled a, b, and n in the figure. In some embodiments, the devices included in the storage may be connected in a loop architecture. The storage controller manages the storage and facilitates the processing of write and read requests intended for it. The system memory of the storage controller stores program instructions and data, which the processor may access to execute the functions and method steps of the present invention for managing storage as described herein.

In one embodiment, the system memory includes, is associated with, or is in communication with the operation software for performing the methods and operations described herein. During deduplication processing, an index table of unique digests is created from the data blocks that are identified as candidates for deduplication. In a deduplication domain, each storage extent contains a range of data blocks. Within a data storage system 70, there may be multiple deduplication domains, such as deduplication domain-1 and deduplication domain-2. Within a deduplication domain, a goal of the deduplication process is to maintain only a single copy of each unique set of data.

Software or other logic executing the deduplication process examines data in the deduplication domain in fixed sized chunks and determines whether the data stored in a chunk is the same as the data stored in another chunk in the same deduplication domain.

If so, an address map for the LUNs is manipulated so that respective address map entries for the chunks reference the same physical chunk of data, and then the chunks that currently hold the extra copies of the data are freed up as unused storage. The address map for the LUNs stores a mapping of logical block addresses to physical block addresses.
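
In outline, the detect-and-remap step described above can be pictured as follows; the names (address_map, free_list) are illustrative, not taken from any particular product.

```python
def remap_duplicate(address_map: dict, free_list: list, lba_keep, lba_dup):
    """Point the duplicate logical block at the surviving physical chunk and
    free the now-redundant physical chunk as unused storage."""
    pba_keep = address_map[lba_keep]
    pba_dup = address_map[lba_dup]
    if pba_dup != pba_keep:
        address_map[lba_dup] = pba_keep   # both LBAs now reference one physical chunk
        free_list.append(pba_dup)         # the extra copy is released
```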

In at least some embodiments of the current technique, the fixed-sized chunk can be a data block. In at least one embodiment of the current technique, the deduplication director is a process that iterates through deduplication domains, including logical units, and schedules data deduplication jobs based on deduplication policies.

Further, the deduplication director works in conjunction with the deduplication engine to perform data deduplication on the deduplication domains. Thus, the deduplication director is the component responsible for coordinating data deduplication operations: it identifies data deduplication domains, manages storage space for performing data deduplication, and manages the deduplication engine to process each data deduplication domain.

In at least one embodiment of the current technique, the deduplication director performs operations such as discovering the deduplication domain configuration in a storage system, preparing system resources for scheduling deduplication jobs, scheduling execution of the deduplication jobs based on policies, performing the deduplication jobs on deduplication domains by working in conjunction with the deduplication engine, monitoring the status of the deduplication jobs and providing information regarding their execution, and managing the system resources for performing data deduplication.

Further, the data deduplication engine executes a deduplication job by performing data deduplication on a deduplication domain: it iterates through the data blocks of the deduplication domain, obtains digests for the data blocks, identifies deduplication candidates, and issues deduplication requests to the deduplication server. In at least one embodiment of the current technique, the deduplication server is a component that provides services to the deduplication director to iterate over sets of data in a set of deduplication domains. The deduplication server also computes digests and remaps blocks after the deduplication technique is applied to remove duplicate blocks of data.

A deduplication database is maintained for each deduplication domain. The deduplication engine communicates with the deduplication server to iterate through the set of deduplication domains and computes digests for the data blocks that are iterated through. A digest is created for each chunk of data that is processed.

The deduplication engine detects potential duplicate copies of data and issues a request to the deduplication server to deduplicate the data. The deduplication database is stored on one of the storage extents, which include one or more LUNs. An index table may also be maintained on a LUN located in the same pool as the deduplication domain. In at least some implementations, the index table is a persistent hash table of chunk IDs keyed by the digest of the data stored in the chunk.
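
A minimal, in-memory sketch of such an index table is shown below. A real implementation would persist the table on a LUN as described, and the choice of SHA-256 as the digest is an assumption.

```python
import hashlib

index_table = {}   # digest -> chunk ID

def lookup_or_add(chunk: bytes, chunk_id: int):
    """Return the ID of a candidate chunk with the same digest, or None after
    recording this chunk's digest as a new index-table entry."""
    digest = hashlib.sha256(chunk).digest()
    candidate = index_table.get(digest)
    if candidate is None:
        index_table[digest] = chunk_id
    return candidate
```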

The index table need not contain entries for every data chunk in the deduplication domain, but the effectiveness of deduplication is a function of the number of entries stored in the index table. The more entries in the index table, the more likely duplicate blocks will be detected during deduplication processing. To accommodate more entries, the index table requires more memory and storage resources. Additionally, if the amount of storage used is measured in terabytes, it can take days to identify chunks of data across such a large address space.

Thus, the index table typically contains an incomplete set of entries and does not include digests for all of the data inside all of the storage in the deduplication domain. Further, each iteration of a data deduplication process may perform data deduplication on each deduplication domain of a storage system for a fixed period of time.

Generally, a large amount of time, possibly days, may be required to iterate over each data block of a large deduplication domain. For example, 24 hours may be required to iterate over a deduplication domain that is 4 terabytes (TB) in size. Thus, in order to make progress towards data deduplication in each deduplication domain of a storage system, a timer may be defined such that each deduplication job executes for the specific period of time defined by the timer and then waits for the next iteration to start.

Further, multiple deduplication jobs may be executed concurrently during the time period indicated by the timer. Thus, in at least one embodiment of the current technique, the deduplication director schedules deduplication jobs in a time-period based approach, such that a first subset of deduplication jobs is executed concurrently for a specific time period and, upon expiration of that period, a next subset of deduplication jobs is scheduled.

Thus, for example, if a storage system includes 10 deduplication domains, the first iteration of data deduplication may include scheduling four deduplication jobs that may execute concurrently for a specific period of time to iterate through four deduplication domains. In such a case, a fifth deduplication job waits until either one of the four deduplication jobs finishes or the specific period of time ends for one of the four deduplication jobs.
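
The time-period based scheduling just described might be sketched as follows. The concurrency limit of four comes from the example; the job interface and everything else here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_iteration(jobs, max_concurrent=4):
    """Run at most four deduplication jobs at a time; as soon as one finishes
    (or hits its time limit inside the job itself), the next job is started."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        pending = set()
        for job in jobs:
            if len(pending) == max_concurrent:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
            pending.add(pool.submit(job))
        wait(pending)   # let the final batch drain before the next iteration
```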

Further, in such an example, when either the first four deduplication jobs finish or the specific period of time ends, the next four deduplication jobs are scheduled.

However, in such an example, as soon as any one of the four deduplication jobs finishes, the next deduplication job may be scheduled. Further, when each of the ten deduplication jobs has been executed during the first iteration, the second iteration starts execution of the first four deduplication jobs again. Thus, the second iteration of data deduplication is performed similarly to the first iteration.

Further, when deduplication jobs are initially started, each deduplication job is scheduled and executed with the same priority. However, based on the rate at which data of a deduplication domain is deduplicated, the priority of the deduplication job that performs data deduplication on that domain is adjusted. Thus, at the end of an iteration of data deduplication, the priority of a deduplication job may be adjusted based on information regarding how well the deduplication domain was deduplicated by the job.

Further, during an iteration of data deduplication, the deduplication director may decide to skip scheduling low-priority deduplication jobs. For example, the deduplication director may schedule three deduplication jobs out of a total of five during an iteration. Further, the deduplication director schedules deduplication jobs based on a policy, which may include the priority of a deduplication job.

Further, the deduplication engine maintains statistical information regarding the data deduplication performed by deduplication jobs on deduplication domains. Such statistical information may be used by the deduplication director to adjust the priority of deduplication jobs. The deduplication director may also determine how long ago a low-priority deduplication domain was last iterated through and, based on that determination, may forcibly schedule a data deduplication job for that domain.

Further, the deduplication director may set a time delay between executions of two deduplication jobs in order to avoid repeated and frequent iterations over the same data blocks. The deduplication director may also increase the time interval between two iterations of a deduplication job on a deduplication domain if the rate at which data of the deduplication domain is deduplicated is less than a specific threshold.

Deduplication of data happens in two logically distinct operations: detection and remapping. The detection operation identifies blocks containing the same data. The remapping operation updates the address maps that record physical locations of logical units of data, so that a single block of data is shared by multiple LUNs or by multiple positions within the same LUN. Detection is accomplished by building a database of digests, such as the index table described above.

When two blocks have the same digest, they have a sufficiently high probability of containing the same data to warrant a bit-for-bit comparison to confirm they are exact duplicates.
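
The two-stage check reads roughly as below: the digest match is the cheap filter, and the byte-for-byte comparison is the final guard against hash collisions.

```python
def is_exact_duplicate(digest_a: bytes, digest_b: bytes,
                       block_a: bytes, block_b: bytes) -> bool:
    """Matching digests only make identity highly probable; the bit-for-bit
    comparison confirms it before any address-map remapping takes place."""
    return digest_a == digest_b and block_a == block_b
```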

Remapping leverages the dynamic block-mapping technology of the file system mapping driver. A file system allows dynamic manipulation of the address maps that connect a LUN's logical address space to its physical address space.

The file system also allows a single block of storage to be mapped at multiple locations within the file system and handles writes to shared blocks by allocating new storage and updating the shared address mappings. Thus, the deduplication engine and the deduplication server, working in conjunction with one another, identify data blocks for deduplication, compare the digest information of the data blocks, identify candidate data blocks for deduplication, issue deduplication requests, and maintain the index table. The file system mapping driver performs a deduplication operation by freeing up redundant instances of a deduplicated data block.

The deduplication engine executes each deduplication job of the set of deduplication jobs to deduplicate the data of the set of deduplication domains. For each deduplication job scheduled by the deduplication director, the characteristics of the data deduplication performed on a deduplication domain by that job are evaluated, and based on that evaluation the priority of the deduplication job is updated. The deduplication server computes a digest for a data block in order to deduplicate it. The deduplication engine processes the digest and, upon identifying a candidate data block to which the data block may be deduplicated, sends a request to the deduplication server to deduplicate the data block. If a matching digest is found in the index table, the deduplication engine sends a deduplication request to the deduplication server indicating that a candidate data block associated with the matching digest has been identified for deduplicating the data block. A matching digest found in the index table indicates that the data block likely contains exactly the same data as the candidate data block corresponding to the matching digest.

However, if no matching digest is found in the index table, the digest of the data block is added to the index table. When a candidate data block has been identified, the deduplication engine issues a request to the deduplication server to update the block mapping information in order to deduplicate the data block with the candidate data block.

The deduplication server issues a read request for the candidate data block as well. The read request for the candidate data block is processed identically to the first read request. When the second read request completes, the deduplication server compares the data read from the data block with the data read from the candidate data block. If the data of the data block is not the same as the data of the candidate data block, the request to deduplicate the data blocks fails and an error is returned to the deduplication engine. If the data blocks are successfully deduplicated, the address mapping of the data block is updated to point to a single copy of the data, i.e., the candidate data block.

If the data blocks are not successfully deduplicated, an error is returned to the deduplication engine so that it can update its index table accordingly. Further, a data block may be deduplicated to more than one data block. In such a case, upon determining that the data block has been deduplicated previously, the matching entry found in the index table is updated to indicate the most recent data block that has been deduplicated to it. Thus, in at least one embodiment, the current technique improves the efficiency of a data deduplication process by scheduling deduplication jobs on deduplication domains that have a high probability of being deduplicated.

Thus, in at least one embodiment of the current technique, maximum and minimum thresholds are maintained for managing the time interval between two subsequent iterations of a deduplication job for a deduplication domain. Further, a priority is maintained for each deduplication job, indicating how long the deduplication director waits before scheduling the next iteration of that job and determining which deduplication job to schedule during the next iteration of data deduplication.

In such a case, when the deduplication job stops, the priority of the deduplication job is adjusted based on information such as the number of digests generated during iteration of the data blocks of the deduplication domain and the number of successful deduplication requests performed during the execution of the job. The priority of the deduplication job is increased if the number of digests generated during the iteration and the number of successful deduplication requests exceed a specific threshold value.

Similarly, the priority of the deduplication job is decreased if the number of digests generated during iteration of the data blocks of the deduplication domain and the number of successful deduplication requests are below a specific threshold value. Further, the priority of a deduplication job may also be adjusted based on specific events that may indicate the addition of a large amount of new data to the deduplication domain.
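
A sketch of this adjustment rule is given below. The text speaks only of "a specific threshold value", so the thresholds and the combined activity count used here are assumptions.

```python
def adjust_priority(priority: int, digests_generated: int, successful_requests: int,
                    raise_threshold: int = 1000, lower_threshold: int = 100) -> int:
    """Raise the job's priority when a domain deduplicates well, lower it when
    the iteration produced little benefit (illustrative thresholds)."""
    activity = digests_generated + successful_requests
    if activity > raise_threshold:
        return priority + 1
    if activity < lower_threshold:
        return max(priority - 1, 0)
    return priority
```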

Thus, by focusing system resources on deduplication domains where data blocks have a high probability of being deduplicated, utilization of system resources is improved and the efficiency of data deduplication is increased, because the process is dynamically adjusted between iterations based on how data was deduplicated during each of them.

While the invention has been disclosed in connection with the preferred embodiments shown and described in detail, modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present invention should be limited only by the following claims.

Disk vendors, on the other hand, will have had a lukewarm response to the new standard but nonetheless show a willingness to respond if real market momentum develops behind it.

We expect, however, that it will take six months to a year for them to move toward compliance with the standard. One reason users will have been forcing this issue is the realization that, at this point in technological development, data deduplication provides an easier and more natural path to significant cost savings than thin provisioning. Interest remains high in thin provisioning, however, and we still expect more action in this area in the future.

Finally, we will have seen the first hints of business and government action to push standards as a way to ensure that they can remain in control of their data and not become locked into proprietary traps in the face of increasing concern about the realities of managing information assets and liabilities. Specifically, the European Union will have started the first hearings toward potential regulatory requirements for data quality, data control and data distribution, forcing large institutions to drive their compliance efforts down closer to the technologies that service the applications that are central to business activity.

Action Item: Users will have adopted data deduplication as an important first step in transforming how they administer storage in a storage market that is currently experiencing significant change. However, they should not regard deduplication as a cure-all for out-of-control data growth. Dedup is most effective in data backup and recovery applications, where large volumes of unchanged data are stored over and over.

In other applications, users will be best advised to wait for the development of thin provisioning technology. However, one person's savings could easily cost other people plenty. There are clear trade-offs between lowering disk costs, impacting application performance, increasing RPOs, and increasing bandwidth costs.

It will be easy for poorly aligned budget systems to drive inappropriate decisions. Action Item: IT executives should not roll out deduplication as an IT infrastructure standard any time soon, especially while no deduplication standard is in place. CTO skunk works or controlled experiments on the total systems impact of deduplication on a few specific applications are appropriate initially, to build up practical experience and pragmatic guidelines.

IT executives should initially counter excessive vendor hype and dampen expectations for deduplication both within IT and outside IT. Tape is the most ancient of storage life forms, yet it, too, is poised to have to learn a new trick: data de-duplication.

Very likely, this will be the technology that demarcates a new generation of tape products, as tape vendors incorporate data de-duplication into a new class of tape controllers. In evaluating this option, users should carefully consider the full implications of such a move. There are clear advantages in automation and data recovery for high-value data, but equally clear disadvantages in longer RPOs (more data lost in the case of a disaster) and higher costs.

If data deduplication standards emerge, it will be good news for either option. Bandwidth requirements will be lower, and tape backups will be quicker. Action Item: Tape vendors are likely to introduce a wave of new invention around technologies like data de-duplication in response both to real high-end customer requirements and to revenue pressures.

True integration of deduplication into the IT infrastructure would allow data to be held in deduplicated format except when it is being processed. The data could be created, copied locally, copied remotely, migrated, backed up, and recovered in deduplicated form. Applications could know about and control whether deduplication is invoked, and control its parameters.


