An Introduction to Hashing

Discussion in 'Community News' started by tamarisk, Jun 25, 2003.

  1. tamarisk New Member

    1.1. An Introduction to Hashing



    What is Hashing?

    A process whereby an item is placed into a structure based on a key-to-address
    transformation.

    Hashing is a way of performing optimal searches and retrievals. Essentially, it increases
    speed, betters ease of transfer, improves retrieval, optimizes searching of data, reduces
    overhead. Thus, one objective of hashing is to optimize disk accesses and packing density.
    The packing density, roughly equivalent to a load factor, is simply how much of the table is
    filled (normally, 70% is sufficient). The ultimate goal, however, is to minimize disk space and
    access time by inserting and retrieving a record from the table in only one seek. One way to
    minimize this time is to use small table sizes (say m < 10).

    There are 4 key components involved in hashing:

    the hash table
    the hashing method
    collisions
    collision resolution techniques

    The hash table is a storage location in memory or on disk that records the hashed values
    created by the hashing algorithm. Some familiarity with the data to be stored is desirable.
    The hash table is normally created with a certain number of buckets or storage locations.

    The number of buckets or size of the hash table is related to the quantity of data items that
    must be stored. It is ideal to choose hashing methods that result in the fewest collisions in a
    table, so you must take table size into consideration when choosing your hashing method.

    Hashing differs from other types of data structures such as binary trees, stacks, or lists
    primarily because of the fact that the data in a hash table does not have to be reorganized
    before being retrieved or inserted. Also, in the aforementioned other data structures, the
    items are stored either in lists or trees. This tends to be a problem for larger data sets,
    where a search and retrieval must travel through all nodes of a tree or all elements of a list.
    With a hash table, the size is set, so inserting or searching for an item is limited, and not as
    cumbersome. One final note is that usually with lists, trees or even stacks, the time for
    storage and / or retrieval is related to the size of the data set.

    The structure to which elements of data can be hashed is almost always a table (of size =
    m). These tables are stored in memory or on disk, and contain a set number of buckets
    (storage spaces).

    The items to insert will vary. Generally these consist of integers and strings. Each are
    handled differently. If the item is an integer, it can be directly used by a hashing method to
    find a key. On the other hand, if the item is a string, it must first be converted to an integer
    value using the ASCII conventions or some other consistently used method (this is called
    'preconditioning').






    1.2 Preconditioning



    Preconditioning

    Converting a string to an integer using ASCII conventions.

    When an item is a string, it has to be converted to an integer before a key can be found. This
    process is known as preconditioning.

    Normally, ASCII conventions for converting characters are used. ASCII values are assigned to
    the 26 letters starting at 11 (numbers 0 to 10 are first). Thus, 'A' is 11, 'B' is 12, 'C' is 13,
    and so on until 'Z' is 36. The numbers 37 and up are assigned to 'special characters' such as
    +, -, =, /, * and so on. For example, "Joe" is converted to 202515 using J=20, O=25, and
    E=15.

    Sometimes, after preconditioning an item, the resulting integer is too large to be stored in a
    table. For example, if we had an item that converted to 301522152618252415 (i.e.,
    "telephone"), then this would be too large to find a key. To solve this problem we can use
    one hash method to achieve a usable number and a second method to map the result to the
    table.





    The Advantages and Disadvantages of Hashing

    One advantage of hashing is that no index storage space is required, whereas inserting into
    other structures such as trees does normally require an index. Such an index could be in the
    form of a queue. In addition, hashing provides the advantage of rapid updates.

    One disadvantage to using hashing as an insert and retrieval tool is that it usually lacks
    locality and sequential retrieval by key. Consequently, insertion and retrieval are more
    random. Another disadvantage is the inability to use duplicate keys. This is a problem when
    key values are very small (i.e. one or two digits).


    1.4 Some Real World Applications of Hashing



    Database indexing
    Some database systems have a separate file of files known as indexes. When a
    request is made for data, the key information is first found in the appropriate index
    file which references the exact record location of the data in the database file. The
    key information in the index file is sometimes stored as a hashed value.

    Symbol tables e.g. Fortran variable names
    It is very important in any language that the references to variables can be looked
    up quickly. This speeds up the execution of the program.

    File and Directory hashing in high performance file systems
    Two complementary techniques are often used to improve the performance of file
    access. One is caching which simply saves information in memory. The second is
    hashing. In large file systems that are typical of file servers, it is common to use
    hashing to improve performance. Looking up the file location in memory is much
    quicker using hashing than most other methods.



    1.5 A Student Number Example



    University student information often needs to be stored and retrieved. The data to
    be stored (student numbers) is normally in integer format or numeric string format
    which can easily be converted to integers. The length of the integers is normally
    uniform for all students (e.g. a 6 digit number). In this case the range of values
    would be from 100000 to 999999.

    The first thing to do is to set up a hash table. If we set up a table with 999,999
    positions, we will have a huge table with many positions that will never be used.
    The values from 0 to 99,999 will never be used, as well there are values which have
    not yet been assigned to students. A better method would be to create a hash
    table whose size is the number of students.

    Next, we must establish a hashing algorithm. A simple algorithm would be to simply
    subtract 100,000 from any student number. This will eliminate all of the values
    between 900,000 and 999,999, inclusive. The student number 100000 will be stored
    in position 0 in the hash table and so on. This still requires a very large hash table
    that will have many unused positions.

    Let us assume that we are starting out as a fairly new university with a population
    of 100 students. Since a hash table with 100 positions does not require very much
    storage space, we will create a table of that size.

    In order to store the student numbers in our hash table, we will use an algorithm
    that uses the modulus function to obtain a value from 0 to 99. 'y mod 100'. We will
    divide any student number by 100 and take the remainder as being the hash table
    position. If the students are assigned numbers sequentially beginning at 100000
    and extending to 100099, we will have a 1 to 1 correspondence between student
    number and hash table position.

    Suppose we start a new semester and 10 new students attend our university. These
    students will be assigned the numbers 100100 through 100109. When we use our
    hashing function, these number will correspond to positions 0 through 9 in our hash
    table. These positions are already in use by previously existing student numbers
    (i.e. 100000 to 100009). We cannot store these numbers in the assigned positions
    because the positions are already in use. We have what is known as a conflict and
    now need a method of conflict resolution.

    Let us suppose that in our university example, we have been operating for some
    time in a non computerized environment. The students have been assigned
    numbers, but nothing has yet been stored on computer. It is decided that a
    computer system will be implemented and the student numbers will be stored in a
    hash table for quick retrieval.

    The university has 90 students and their numbers range from 100000 to 100089.
    This will be a new term and the university anticipates that there will be 10 new
    students. The table size is set to 100 entries.

    The hashing algorithm will use the modulus function to produce values between 0
    and 100. 'y mod 100'.

    All goes well as the first 10 new students register and are assigned student
    numbers ranging from 100090 to 100099. These values are stored in the hash table
    in the appropriate locations.

    The university has a highly regarded Computer Sciences department and has
    attracted more students than anticipated. An extra 5 students want to register.

    The student numbers assigned will continue sequentially from 100100 to 100104.
    When the hashing algorithm is applied to these values, the results are the integers
    0 through 4. Since positions 0 through 4 in the hash table are already used by
    student numbers 100000 through 100004, there is no place to put these new
    numbers. A conflict has occurred and must be resolved.

    Several methods exist for resolving conflicts of this type. We will review three in
    detail in the following pages.


    2.0 Hashing Methods



    There are seven essential methods of hashing, or ways to insert values into a key accessed
    table. Listed in no particular order they are the division method, multiplication method,
    folding method, length-dependent method, midsquare method, digit-analysis and the addition
    method. We should note that most of these methods have a timing complexity of O(1). This
    tutorial discusses each of these with emphasis on the division, multiplication and folding
    methods.

    2.1 The Division Method



    Algorithm: H(x) = x mod m + 1

    Where: m is some predetermined divisor integer (i.e., the table size), x is the
    preconditioned key, and mod stands for modulo.

    Note that adding 1 is only necessary if the table starts at key 1 (if it starts at 0, the
    algorithm simplifies to
    (H(x) = x mod m). Please note that in the applet, we did not add 1.

    So, in other words: given an item, divide the preconditioned key of that item by the table size
    (+1). The remainder is the hash key.


    EXAMPLE:

    Given a hash table with 10 buckets, what is the hash key for 'Cat'?

    Since 'Cat' = 131130 when converted to ASCII, then x = 131130.
    We are given the table size (i.e., m = 10, starting at 0 and ending at 9).

    H(x) = x mod m
    H(131130) = 131130 mod 10
    = 0 (there is no remainder)

    'Cat' is inserted into the table at address 0.

    The Division method is distribution-independent.


    2.2 The Multiplication Method




    The method that is used in the applet is the multiplication method. It multiplies of all the
    indiviudal digits in the key together, and takes the remainder after dividing the resulting
    number by the table size.

    In functional notation, the algorithm is:

    H(x) = (a * b * c * d *....) mod m

    Where: m is the table size, a, b, c, d, etc. are the individual digits of the item, and
    mod stands for modulo.

    Let's apply this algorithm to an example.


    EXAMPLE:

    Given a hash table of ten buckets (0 through 9), what is the hash key for 'Cat'?

    Since 'Cat' = 131130 when converted to ASCII, then x = 131130
    We are given the table size (i.e., m = 10).
    The constant can be any number we want, let's use five (i.e., c = 5).

    H(x) = (a * b * c * d *....) mod m
    H(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10
    = 0 mod 10
    = 0

    'Cat' is inserted into the table at address 0.


    Both of these Multiplication methods are distribution-independent.
    Click here for a complete description of a multiplication method that uses bit shifting.


    2.3 The Folding Method



    Algorithm: H(x) = (a + b + c) mod m

    Where: a, b, and c represent the preconditioned key broken down into three parts,
    m is the table size, and mod stands for modulo.

    In other words: the sum of three parts of the preconditioned key is divided by the table size.
    The remainder is the hash key.


    EXAMPLE:

    Fold the key 123456789 into a hash table of ten spaces (0 through 9).

    We are given x = 123456789 and the table size (i.e., m = 10).
    Since we can break x into three parts any way we want to, we will break it up evenly.
    Thus a = 123, b = 456, and c = 789.

    H(x) = (a + b + c) mod m
    H(123456789) = (123+456+789) mod 10
    = 1368 mod 10
    = 8

    123456789 is inserted into the table at address 8.

    The Folding method is distribution-independent.

    2.4 Other Hashing Methods



    Now that we have shown, in detail, a few of the many hashing methods which you can use,
    we will briefly discuss the length-dependent method, midsquare method, digit-analysis and
    the addition method.

    Midsquare

    Multiply x by itself (i.e., x2) and select a number of digits from the middle of the result. How
    many digits you select will depend on your table size and the size of your hash key.

    Addition

    The sum of the digits of the preconditioned key is divided by the table size. The remainder is
    the hash key.

    Digit Analysis

    Certain digits of the preconditioned key are selected in a certain predetermined and
    consistent pattern.




    3.0 Collision



    Collision:

    A failed attempt to insert an item in a table because there is already an item in the slot the
    new item has hashed to.

    Once a space in the table is occupied, no other item can be placed there unless the existing
    item is deleted first. Think of a collision as "interference." A practical example might help
    solidify this idea.

    You have a class in a classroom that seats 20 (20 would thus be the 'table size'). Today is the day of
    your midterm, so everyone shows up. The first person to show up could choose a seat at random
    because they would all be empty. This random seating could be done for the first 12 people or so. Then,
    the free seats would gradually diminish. Out of the 20 people in the class, you are the last person to
    show up. Since you cannot sit in a seat that someone else is sitting in, you have to find the empty chair
    (it happens to be in the back row). So, you sequentially pass all the full seats until you get to the empty
    seat. By the time this happens, the test is over :)

    Collision-resolution techniques are used to compensate for the occurrence of collisions.



    Collision Resolution Techniques

    Collision-resolution techniques are used to offset the effects of collisions. Should a collision
    occur, these techniques look for another place in the table for an item to be stored (similar to
    you finding an empty chair). There are different methods of doing this: linear probing, open
    addressing, chaining and double hashing.




    3.1 Linear Probing



    After first being located to an already occupied table position, the item is sent sequentially
    down the table until an empty space is found. If m is the number of possible positions in the
    table, then the linear probe continues until either an empty spot is found, or until m-1
    locations have been searched.

    One thing that happens with linear probing however is clustering, or bunching together.
    Primary clustering occurs when items are always hashed to the same spot in the table, and
    the lookup requires searching through 4 or 5 buckets before finding the empty spot. A
    graphic might help explain this:



    The opposite of clustering is uniform distribution. Simply, this is when the hash function
    uniformly distributes the items over the table.

    3.2 Chaining



    A collision has occurred. The position resolved by the resolution technique is already
    full, therefore the element you wish to insert into the hash table cannot go there
    without losing data. Seperate chaining simply makes a backup list of elements that
    would have otherwise been inserted into a given position had that position been
    empty.

    For example, let us assume that position x in the table has been previously filled by
    a hash insertion. Now, let us assume that the element we wish to insert into the
    hash table resolves to position x as well. We cannot simply overwrite the element
    already in position x. Each hash table entry requires a pointer to a linked list of
    records that can hold the element data. This list is only used when a collision
    occurs. The element that we are inserting, E, is then appended onto the list
    associated with the target position, x. The result is a linked list associated with
    each table position, each list containing only those elements that collided on that
    table position with an element already in that position.

    3.3 Double Hashing



    Double Hashing is another method of collision resolution, but unlike the linear
    collision resolution, double hashing uses a second hashing function which normally
    limits multiple collisions. The idea is that if two values hash to the same spot in
    the table, a constant can be calculated from the initial value using the second
    hashing function which can then be used to change the sequence of locations in the
    table, but still have access to the entire table.

    3.4 Open Addressing



    Using the open addressing technique, after a collision, new locations are examined until an
    empty one is found. The item to be inserted is placed here. If an item is deleted, then it is
    replaced with the word "DELETED" . The sequence in which other locations are examined can
    be determined in many different ways. For example, after collision, new locations could be
    searched from bottom to top, top to bottom, or by every 2 places.

Share This Page