1.1. An Introduction to Hashing What is Hashing? A process whereby an item is placed into a structure based on a key-to-address transformation. Hashing is a way of performing optimal searches and retrievals. Essentially, it increases speed, betters ease of transfer, improves retrieval, optimizes searching of data, reduces overhead. Thus, one objective of hashing is to optimize disk accesses and packing density. The packing density, roughly equivalent to a load factor, is simply how much of the table is filled (normally, 70% is sufficient). The ultimate goal, however, is to minimize disk space and access time by inserting and retrieving a record from the table in only one seek. One way to minimize this time is to use small table sizes (say m < 10). There are 4 key components involved in hashing: the hash table the hashing method collisions collision resolution techniques The hash table is a storage location in memory or on disk that records the hashed values created by the hashing algorithm. Some familiarity with the data to be stored is desirable. The hash table is normally created with a certain number of buckets or storage locations. The number of buckets or size of the hash table is related to the quantity of data items that must be stored. It is ideal to choose hashing methods that result in the fewest collisions in a table, so you must take table size into consideration when choosing your hashing method. Hashing differs from other types of data structures such as binary trees, stacks, or lists primarily because of the fact that the data in a hash table does not have to be reorganized before being retrieved or inserted. Also, in the aforementioned other data structures, the items are stored either in lists or trees. This tends to be a problem for larger data sets, where a search and retrieval must travel through all nodes of a tree or all elements of a list. With a hash table, the size is set, so inserting or searching for an item is limited, and not as cumbersome. One final note is that usually with lists, trees or even stacks, the time for storage and / or retrieval is related to the size of the data set. The structure to which elements of data can be hashed is almost always a table (of size = m). These tables are stored in memory or on disk, and contain a set number of buckets (storage spaces). The items to insert will vary. Generally these consist of integers and strings. Each are handled differently. If the item is an integer, it can be directly used by a hashing method to find a key. On the other hand, if the item is a string, it must first be converted to an integer value using the ASCII conventions or some other consistently used method (this is called 'preconditioning'). 1.2 Preconditioning Preconditioning Converting a string to an integer using ASCII conventions. When an item is a string, it has to be converted to an integer before a key can be found. This process is known as preconditioning. Normally, ASCII conventions for converting characters are used. ASCII values are assigned to the 26 letters starting at 11 (numbers 0 to 10 are first). Thus, 'A' is 11, 'B' is 12, 'C' is 13, and so on until 'Z' is 36. The numbers 37 and up are assigned to 'special characters' such as +, -, =, /, * and so on. For example, "Joe" is converted to 202515 using J=20, O=25, and E=15. Sometimes, after preconditioning an item, the resulting integer is too large to be stored in a table. For example, if we had an item that converted to 301522152618252415 (i.e., "telephone"), then this would be too large to find a key. To solve this problem we can use one hash method to achieve a usable number and a second method to map the result to the table. The Advantages and Disadvantages of Hashing One advantage of hashing is that no index storage space is required, whereas inserting into other structures such as trees does normally require an index. Such an index could be in the form of a queue. In addition, hashing provides the advantage of rapid updates. One disadvantage to using hashing as an insert and retrieval tool is that it usually lacks locality and sequential retrieval by key. Consequently, insertion and retrieval are more random. Another disadvantage is the inability to use duplicate keys. This is a problem when key values are very small (i.e. one or two digits). 1.4 Some Real World Applications of Hashing Database indexing Some database systems have a separate file of files known as indexes. When a request is made for data, the key information is first found in the appropriate index file which references the exact record location of the data in the database file. The key information in the index file is sometimes stored as a hashed value. Symbol tables e.g. Fortran variable names It is very important in any language that the references to variables can be looked up quickly. This speeds up the execution of the program. File and Directory hashing in high performance file systems Two complementary techniques are often used to improve the performance of file access. One is caching which simply saves information in memory. The second is hashing. In large file systems that are typical of file servers, it is common to use hashing to improve performance. Looking up the file location in memory is much quicker using hashing than most other methods. 1.5 A Student Number Example University student information often needs to be stored and retrieved. The data to be stored (student numbers) is normally in integer format or numeric string format which can easily be converted to integers. The length of the integers is normally uniform for all students (e.g. a 6 digit number). In this case the range of values would be from 100000 to 999999. The first thing to do is to set up a hash table. If we set up a table with 999,999 positions, we will have a huge table with many positions that will never be used. The values from 0 to 99,999 will never be used, as well there are values which have not yet been assigned to students. A better method would be to create a hash table whose size is the number of students. Next, we must establish a hashing algorithm. A simple algorithm would be to simply subtract 100,000 from any student number. This will eliminate all of the values between 900,000 and 999,999, inclusive. The student number 100000 will be stored in position 0 in the hash table and so on. This still requires a very large hash table that will have many unused positions. Let us assume that we are starting out as a fairly new university with a population of 100 students. Since a hash table with 100 positions does not require very much storage space, we will create a table of that size. In order to store the student numbers in our hash table, we will use an algorithm that uses the modulus function to obtain a value from 0 to 99. 'y mod 100'. We will divide any student number by 100 and take the remainder as being the hash table position. If the students are assigned numbers sequentially beginning at 100000 and extending to 100099, we will have a 1 to 1 correspondence between student number and hash table position. Suppose we start a new semester and 10 new students attend our university. These students will be assigned the numbers 100100 through 100109. When we use our hashing function, these number will correspond to positions 0 through 9 in our hash table. These positions are already in use by previously existing student numbers (i.e. 100000 to 100009). We cannot store these numbers in the assigned positions because the positions are already in use. We have what is known as a conflict and now need a method of conflict resolution. Let us suppose that in our university example, we have been operating for some time in a non computerized environment. The students have been assigned numbers, but nothing has yet been stored on computer. It is decided that a computer system will be implemented and the student numbers will be stored in a hash table for quick retrieval. The university has 90 students and their numbers range from 100000 to 100089. This will be a new term and the university anticipates that there will be 10 new students. The table size is set to 100 entries. The hashing algorithm will use the modulus function to produce values between 0 and 100. 'y mod 100'. All goes well as the first 10 new students register and are assigned student numbers ranging from 100090 to 100099. These values are stored in the hash table in the appropriate locations. The university has a highly regarded Computer Sciences department and has attracted more students than anticipated. An extra 5 students want to register. The student numbers assigned will continue sequentially from 100100 to 100104. When the hashing algorithm is applied to these values, the results are the integers 0 through 4. Since positions 0 through 4 in the hash table are already used by student numbers 100000 through 100004, there is no place to put these new numbers. A conflict has occurred and must be resolved. Several methods exist for resolving conflicts of this type. We will review three in detail in the following pages. 2.0 Hashing Methods There are seven essential methods of hashing, or ways to insert values into a key accessed table. Listed in no particular order they are the division method, multiplication method, folding method, length-dependent method, midsquare method, digit-analysis and the addition method. We should note that most of these methods have a timing complexity of O(1). This tutorial discusses each of these with emphasis on the division, multiplication and folding methods. 2.1 The Division Method Algorithm: H(x) = x mod m + 1 Where: m is some predetermined divisor integer (i.e., the table size), x is the preconditioned key, and mod stands for modulo. Note that adding 1 is only necessary if the table starts at key 1 (if it starts at 0, the algorithm simplifies to (H(x) = x mod m). Please note that in the applet, we did not add 1. So, in other words: given an item, divide the preconditioned key of that item by the table size (+1). The remainder is the hash key. EXAMPLE: Given a hash table with 10 buckets, what is the hash key for 'Cat'? Since 'Cat' = 131130 when converted to ASCII, then x = 131130. We are given the table size (i.e., m = 10, starting at 0 and ending at 9). H(x) = x mod m H(131130) = 131130 mod 10 = 0 (there is no remainder) 'Cat' is inserted into the table at address 0. The Division method is distribution-independent. 2.2 The Multiplication Method The method that is used in the applet is the multiplication method. It multiplies of all the indiviudal digits in the key together, and takes the remainder after dividing the resulting number by the table size. In functional notation, the algorithm is: H(x) = (a * b * c * d *....) mod m Where: m is the table size, a, b, c, d, etc. are the individual digits of the item, and mod stands for modulo. Let's apply this algorithm to an example. EXAMPLE: Given a hash table of ten buckets (0 through 9), what is the hash key for 'Cat'? Since 'Cat' = 131130 when converted to ASCII, then x = 131130 We are given the table size (i.e., m = 10). The constant can be any number we want, let's use five (i.e., c = 5). H(x) = (a * b * c * d *....) mod m H(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10 = 0 mod 10 = 0 'Cat' is inserted into the table at address 0. Both of these Multiplication methods are distribution-independent. Click here for a complete description of a multiplication method that uses bit shifting. 2.3 The Folding Method Algorithm: H(x) = (a + b + c) mod m Where: a, b, and c represent the preconditioned key broken down into three parts, m is the table size, and mod stands for modulo. In other words: the sum of three parts of the preconditioned key is divided by the table size. The remainder is the hash key. EXAMPLE: Fold the key 123456789 into a hash table of ten spaces (0 through 9). We are given x = 123456789 and the table size (i.e., m = 10). Since we can break x into three parts any way we want to, we will break it up evenly. Thus a = 123, b = 456, and c = 789. H(x) = (a + b + c) mod m H(123456789) = (123+456+789) mod 10 = 1368 mod 10 = 8 123456789 is inserted into the table at address 8. The Folding method is distribution-independent. 2.4 Other Hashing Methods Now that we have shown, in detail, a few of the many hashing methods which you can use, we will briefly discuss the length-dependent method, midsquare method, digit-analysis and the addition method. Midsquare Multiply x by itself (i.e., x2) and select a number of digits from the middle of the result. How many digits you select will depend on your table size and the size of your hash key. Addition The sum of the digits of the preconditioned key is divided by the table size. The remainder is the hash key. Digit Analysis Certain digits of the preconditioned key are selected in a certain predetermined and consistent pattern. 3.0 Collision Collision: A failed attempt to insert an item in a table because there is already an item in the slot the new item has hashed to. Once a space in the table is occupied, no other item can be placed there unless the existing item is deleted first. Think of a collision as "interference." A practical example might help solidify this idea. You have a class in a classroom that seats 20 (20 would thus be the 'table size'). Today is the day of your midterm, so everyone shows up. The first person to show up could choose a seat at random because they would all be empty. This random seating could be done for the first 12 people or so. Then, the free seats would gradually diminish. Out of the 20 people in the class, you are the last person to show up. Since you cannot sit in a seat that someone else is sitting in, you have to find the empty chair (it happens to be in the back row). So, you sequentially pass all the full seats until you get to the empty seat. By the time this happens, the test is over Collision-resolution techniques are used to compensate for the occurrence of collisions. Collision Resolution Techniques Collision-resolution techniques are used to offset the effects of collisions. Should a collision occur, these techniques look for another place in the table for an item to be stored (similar to you finding an empty chair). There are different methods of doing this: linear probing, open addressing, chaining and double hashing. 3.1 Linear Probing After first being located to an already occupied table position, the item is sent sequentially down the table until an empty space is found. If m is the number of possible positions in the table, then the linear probe continues until either an empty spot is found, or until m-1 locations have been searched. One thing that happens with linear probing however is clustering, or bunching together. Primary clustering occurs when items are always hashed to the same spot in the table, and the lookup requires searching through 4 or 5 buckets before finding the empty spot. A graphic might help explain this: The opposite of clustering is uniform distribution. Simply, this is when the hash function uniformly distributes the items over the table. 3.2 Chaining A collision has occurred. The position resolved by the resolution technique is already full, therefore the element you wish to insert into the hash table cannot go there without losing data. Seperate chaining simply makes a backup list of elements that would have otherwise been inserted into a given position had that position been empty. For example, let us assume that position x in the table has been previously filled by a hash insertion. Now, let us assume that the element we wish to insert into the hash table resolves to position x as well. We cannot simply overwrite the element already in position x. Each hash table entry requires a pointer to a linked list of records that can hold the element data. This list is only used when a collision occurs. The element that we are inserting, E, is then appended onto the list associated with the target position, x. The result is a linked list associated with each table position, each list containing only those elements that collided on that table position with an element already in that position. 3.3 Double Hashing Double Hashing is another method of collision resolution, but unlike the linear collision resolution, double hashing uses a second hashing function which normally limits multiple collisions. The idea is that if two values hash to the same spot in the table, a constant can be calculated from the initial value using the second hashing function which can then be used to change the sequence of locations in the table, but still have access to the entire table. 3.4 Open Addressing Using the open addressing technique, after a collision, new locations are examined until an empty one is found. The item to be inserted is placed here. If an item is deleted, then it is replaced with the word "DELETED" . The sequence in which other locations are examined can be determined in many different ways. For example, after collision, new locations could be searched from bottom to top, top to bottom, or by every 2 places.