Computer science: Deduplication




Technology shapes virtually every area of modern life, and progress in most fields now depends on it. Computing is among the most influential of these technologies, since almost every new development relies on computer assistance. Within computing, one important concept is a data compression technique that eliminates coarse-grained redundant data in order to improve storage utilization. This concept is known as data deduplication, which Cohen (2009) summarizes as a technology fundamentally capable of changing the role of disks in file backup. This paper looks at the concept of data deduplication, the philosophy behind it, and a number of issues associated with it.

The Concept of Data Deduplication

Computer users look for a cost-effective way to retain information on disk for months and to restore files quickly and easily. Disk-based backup appliances use deduplication technology, which brings several advantages: ease of use, extended retention of files, fast and efficient restoration, and cost savings. According to Lee (2008), deduplication has a traditional form and a modern form: the traditional approach targets whole files only, whereas the modern approach works at the block level. Essentially, the technology is used to reduce redundancy in electronically stored information (ESI) and thereby the amount of storage required.
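The difference between the file-level and block-level forms can be illustrated with a minimal sketch. It is not drawn from Lee (2008); the four-byte block size, the SHA-256 fingerprint, and the names `file_level_size` and `block_level_size` are assumptions made only for illustration.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # hypothetical, tiny block size chosen only to keep the example readable


def digest(data: bytes) -> str:
    """Fingerprint used to decide whether two objects are identical."""
    return hashlib.sha256(data).hexdigest()


def file_level_size(files: List[bytes]) -> int:
    """Bytes stored when only whole identical files are stored once."""
    unique = {digest(f): len(f) for f in files}
    return sum(unique.values())


def block_level_size(files: List[bytes]) -> int:
    """Bytes stored when identical fixed-size blocks are stored once."""
    unique = {}
    for f in files:
        for i in range(0, len(f), BLOCK_SIZE):
            block = f[i:i + BLOCK_SIZE]
            unique[digest(block)] = len(block)
    return sum(unique.values())


report_v1 = b"AAAABBBBCCCC"
report_v2 = b"AAAABBBBDDDD"    # same as v1 except for the last block
report_copy = b"AAAABBBBCCCC"  # exact copy of v1
files = [report_v1, report_v2, report_copy]

raw_bytes = sum(len(f) for f in files)
file_level_bytes = file_level_size(files)
block_level_bytes = block_level_size(files)
print(raw_bytes, file_level_bytes, block_level_bytes)  # prints: 36 24 16
```

In this sketch the file-level form removes only the exact copy, while the block-level form also stores the blocks shared by the two versions of the report once.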

Poelker (2008) describes data deduplication simply as a way of comparing objects, usually files or blocks, and removing all the duplicate (non-unique) objects. These objects are copies of data that already exists on the system, so they are not needed as long as the original is present. The aim is to free disk space and make room for more of the user's files. The idea can be illustrated with blocks: suppose there are eight blocks before deduplication, and half of them are duplicates. After deduplication, four unique blocks remain and the four duplicate blocks are discarded. Deduplication therefore removes the blocks or files that are not needed and keeps only the necessary ones.
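The eight-block illustration can be written out as a short sketch. The block contents, the SHA-256 fingerprint, and the function names `deduplicate` and `restore` are assumptions for illustration rather than a description of any particular product. The sketch also keeps an ordered list of references, which is one common way to make sure the original sequence can still be rebuilt after the duplicates are discarded.

```python
import hashlib


def deduplicate(blocks):
    """Store each distinct block once and remember the original order."""
    store = {}   # fingerprint -> block content, kept once
    recipe = []  # ordered fingerprints needed to rebuild the original sequence
    for block in blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = block  # first occurrence: keep the data
        recipe.append(fingerprint)      # every occurrence: keep only a reference
    return store, recipe


def restore(store, recipe):
    """Rebuild the original block sequence from the references."""
    return [store[fingerprint] for fingerprint in recipe]


# Eight blocks, half of which duplicate an earlier block.
blocks = [b"A", b"B", b"A", b"C", b"B", b"D", b"C", b"D"]
store, recipe = deduplicate(blocks)

print(len(blocks), "blocks in,", len(store), "unique blocks stored")
# prints: 8 blocks in, 4 unique blocks stored
```

Calling `restore(store, recipe)` returns the original eight blocks, which shows that nothing is lost when the four duplicates are replaced by references.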

Philosophy of the Security Aspects of Deduplication

The prime concern when opting for deduplication is the security of the stored data; large organizations, for example, worry that their data could be lost if a hard disk fails. Preston (2007) writes that most big companies, especially regulated ones, are subject to privacy legislation covering their information, which is why safeguarding these files is paramount. Duplicate files can also fall into the hands of unauthorized persons such as competitors, and when that happens the company's competitive advantage is jeopardized. In all of this, the disk receives the most attention: the concerns are how to remove excess files, how to make room for more storage, and how to keep the stored files secure.

Benefit of Using the Data Deduplication

As noted earlier, the modern form of data deduplication works on blocks, where the total number of blocks comprises the original blocks and their duplicates. If half of the blocks are duplicates, as in the earlier illustration, deduplication frees half of the space, which can then be used to store other important documents. Calder (2009) presents this as the key benefit of data deduplication in offices, one that can help turn an office into a green office. Where most stored emails are duplicates, their volume can be reduced by more than half, which makes room for more urgent and more valuable messages. There is also a reduction in the energy needed to process the data once the bulky files have been reduced.

Data deduplication also allows a backup program to copy all the files to a disk reserved for backup without having to differentiate between them or omit any. The storage system ensures that only one copy of each item is preserved, while the excess copies are either kept elsewhere in another format or discarded entirely. Because a copy is identical to the original, there is little value in keeping both in the same place; if the copy must not be lost, it is better kept in separate backup space. In this way backup storage needs can be reduced by more than 50%, and in some cases by close to 90%. Deduplication is therefore valuable to organizations large and small because of the storage capacity it frees.
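The percentages quoted above follow from simple arithmetic on the amount of data backed up and the amount actually stored. The sketch below works through it; the function names and the 100 GB figures are hypothetical and are not taken from the cited sources.

```python
def space_saved(logical_bytes: float, stored_bytes: float) -> float:
    """Fraction of backup storage saved: 1 - stored / logical."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 1 - stored_bytes / logical_bytes


def dedup_ratio(logical_bytes: float, stored_bytes: float) -> float:
    """Deduplication ratio, e.g. 10.0 for a 10:1 reduction."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return logical_bytes / stored_bytes


GB = 10**9

# Hypothetical backup of 100 GB that occupies 50 GB after deduplication.
half_saved = space_saved(100 * GB, 50 * GB)      # 0.5, a 2:1 ratio

# Hypothetical backup of 100 GB that occupies 10 GB after deduplication.
most_saved = space_saved(100 * GB, 10 * GB)      # about 0.9, a 10:1 ratio

print(f"{half_saved:.0%} saved at {dedup_ratio(100 * GB, 50 * GB):.0f}:1")
print(f"{most_saved:.0%} saved at {dedup_ratio(100 * GB, 10 * GB):.0f}:1")
```

Under these assumptions a saving of 50% corresponds to storing one copy of every two blocks written, and a saving of 90% to storing one copy of every ten.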

Fisher Investments (2010) notes that another core benefit of data deduplication is improved data integrity, which ultimately reduces overall data protection costs. Deduplication allows a user to reduce the amount of disk needed for backup by 90% or more, and the result is improved integrity in the way major organizations use their data and files. With lower power and acquisition costs, more data can be kept on disk, which adds to the integrity and effectiveness of the data. All data used in a company should be authentic, and excess data not only makes work redundant but also makes it ineffective. It is therefore paramount for organizations to make space for other useful data, which can only be done if excess or duplicated files are removed to make way for new ones.

Deduplication also boosts effective storage capacity and allows information and resources to be shared across multiple users. It is significant regardless of the underlying storage network technology, whether a storage area network (SAN) or network-attached storage (NAS). Because deduplication removes redundant files, unwanted data is either removed or placed where it can be referred to later. The files that remain are the ones that are needed: files of high integrity that drive the company forward. According to Fisher Investments (2010), data deduplication is like editing a book: the parts that add nothing or make the text unreadable are removed, and only the needed parts are left.

When Data Deduplication May Occur

Data deduplication can occur at one of two points: in-line, as the data is written, or post-process, after it has been stored (Laverick, 2010). In post-process deduplication, new data is first stored on the storage device and is then analyzed for duplicates by a later process. The benefit is that there is no need to wait for hash calculations and lookups to complete before the data is stored. Store performance is therefore not degraded, and implementations can offer policy-based operation that lets users defer optimization of active files. However, as Laverick (2010) notes, one drawback is that duplicate data is stored unnecessarily for a short period, which can be an issue if the storage system is nearly full.
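The post-process behaviour described above can be sketched as follows. This is a simplified model under stated assumptions, not the design of any product discussed by Laverick (2010): the class name `PostProcessStore`, the landing area, and the SHA-256 fingerprint are all hypothetical.

```python
import hashlib


class PostProcessStore:
    """Writes land untouched; duplicates are removed by a later pass."""

    def __init__(self):
        self.landing = []  # raw blocks, written with no hashing on the write path
        self.unique = {}   # fingerprint -> block, filled only by the later pass

    def write(self, block: bytes) -> None:
        # Fast path: nothing is hashed or looked up before the data is stored.
        self.landing.append(block)

    def run_dedup_pass(self) -> None:
        # Deferred work: fingerprint every landed block and keep one copy of each.
        for block in self.landing:
            fingerprint = hashlib.sha256(block).hexdigest()
            self.unique.setdefault(fingerprint, block)
        self.landing.clear()

    def blocks_on_disk(self) -> int:
        return len(self.landing) + len(self.unique)


post_store = PostProcessStore()
for block in [b"A", b"B", b"A", b"C", b"B", b"A"]:
    post_store.write(block)

before_pass = post_store.blocks_on_disk()  # duplicates still occupy space
post_store.run_dedup_pass()
after_pass = post_store.blocks_on_disk()   # only unique blocks remain

print(before_pass, after_pass)  # prints: 6 3
```

The gap between the two numbers is the temporary overhead that becomes a problem when the storage system is nearly full.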

In-line deduplication, on the other hand, is a process in which the hash calculations are performed on the target device in real time as the data enters it. Hiles (2010) is of the view that its benefit over post-process deduplication is that it needs less storage, because duplicate data is never stored. It has its drawbacks, however: the hash calculations and lookups take time. Data ingestion can therefore be slower, which reduces the overall backup throughput of the device. Despite this, some vendors of in-line deduplication have demonstrated equipment with performance similar to that of post-process deduplication.
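A corresponding sketch of the in-line case is given below. As before, it is only an illustrative model: the class name `InlineStore`, the lookup counter, and the SHA-256 fingerprint are assumptions and do not describe a specific vendor's equipment.

```python
import hashlib


class InlineStore:
    """Each block is fingerprinted and looked up before it is stored."""

    def __init__(self):
        self.unique = {}       # fingerprint -> block, one copy of each
        self.hash_lookups = 0  # work done on the write path
        self.peak_blocks = 0   # largest number of blocks ever held

    def write(self, block: bytes) -> bool:
        """Return True if the block was new and therefore stored."""
        fingerprint = hashlib.sha256(block).hexdigest()
        self.hash_lookups += 1  # every write pays for a hash and a lookup
        if fingerprint in self.unique:
            return False        # duplicate: never reaches the disk
        self.unique[fingerprint] = block
        self.peak_blocks = max(self.peak_blocks, len(self.unique))
        return True


inline_store = InlineStore()
for block in [b"A", b"B", b"A", b"C", b"B", b"A"]:
    inline_store.write(block)

print(inline_store.peak_blocks, inline_store.hash_lookups)  # prints: 3 6
```

For six blocks written, the store never holds more than the three unique blocks, but all six writes wait for a hash calculation and a lookup, which is the trade-off between storage and ingestion speed described above.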


Conclusion

Data deduplication reduces the storage capacity needed on a computer's disks, and the process is concentrated entirely on the computer system. Its essence is to put away data that is not needed, mostly files that duplicate the originals. The philosophy of data deduplication is to enhance the safety and security of data: if the original and the duplicated files coexist on the same hard disk and that disk crashes, everything is lost. It is therefore essential that every company have separate storage devices for the excess files. The process comes with a number of advantages, among them the safety and security of the organization's data. In addition, it helps maintain the integrity of the data and improves the way the data is used. It is therefore paramount to use this process to free storage space for new and more urgent data in an organization.


References

Calder, A. (2009). The green office. Cambridgeshire: IT Governance Publishing.

Cohen, A. (2009). ESI handbook: Sources, technology, and process. New York: Aspen Publishers.

Fisher Investments (2010). Fisher Investments on technology. New Jersey: John Wiley & Sons.

Hiles, A. (2010). The definitive handbook of business continuity management. New Jersey: John Wiley & Sons.

Laverick, M. (2010). VMware vSphere 4 implementation. New York: McGraw-Hill Companies.

Lee, R. (2008). Software engineering, artificial intelligence, networking and parallel/distributed computing. Chennai: Scientific Publishing Services Pvt. Ltd.

Poelker, C. (2008). Storage area networks for dummies. New Jersey: John Wiley & Sons.

Preston, W. (2007). Backup & recovery. Sebastopol: O’Reilly Media, Inc.