Senior Fitness - Exercise and Nutrition for Aging Men and Women
FREE Article Feed for your website.
Home Ownership Magazine
Party Planning Information
Article Marketing Resources
Bio-Medical Research Article Database
Informative Articles on Life, Love and Happiness
Tutorials on Business to Writing
Famous Quotes from Famous People
Song Lyric Information
New US Patent Information
Comprehensive List of Content by Category
Online Auctions and Shopping Related Articles
Article Search
Most Recent Articles
 

Trading Online Trading India Internet Trading Net Trading e Trad...
Category:
Finance / Investment  

Protect Your Home with Spy Camera
Category:
Home And Family  

How to Make a Free Web Site
Category:
Business  

Starting an Ebook Online Business in Just 3 Easy Steps
Category:
Business  

Give a man six inches and he ll want a
Category:
Health / Fitness  

Double Your Dish Network Affiliate Check
Category:
Marketing  

Going to the Beach Lose Up to 20 Pounds In Less Than 2 Weeks
Category:
Health / Fitness  

Tips On Getting A Suntan
Category:
Health / Fitness  

CHOOSING A LABEL PRINTER
Category:
Business  

Adverse Credit Credit Cards
Category:
Business  

mouth watering lobster recipes
Category:
Health / Fitness  

importance of food elements
Category:
Health / Fitness  

Blood Test To Predict Risk of Heart Disease For Diabetics
Category:
Health / Fitness  

How to Create a Money Magnet E commerce Web Site
Category:
Marketing  

10 Offline Tightwad Marketing Strategies to Help You Get More Cl...
Category:
Business  

Decent Acne Medicines
Category:
Health / Fitness  

Role play with added sex appeal
Category:
Health / Fitness  

Grow a Healthy Lawn You Can Do That
Category:
Home And Family  

Stock Images The Indispensable Tool For Designers And Webmasters...
Category:
Marketing  

Easy Work From Home Ideas Quickstarts For Everyone
Category:
Business  

Tips for Your Walking Program
Category:
Health / Fitness  

Everything About Arthritis
Category:
Health / Fitness  

A Gentle Warning To All Webmasters About RSS
Category:
Marketing  

15 Ways To Sell Yourself Effectively In A Job Interview Part Thr...
Category:
Business  

2 Ways Online Web Conferencing Can Save Your Business Money
Category:
Business  

Lighting Your Way to Outdoor Living
Category:
Home And Family  

7 Rules Every Salesman Should Follow
Category:
Business  

Give a man six inches and he ll want a
Category:
Health / Fitness  

Nurses Wanted Incredible Career Opportunities in Nursing Today
Category:
Health / Fitness  

Baby Wont Sleep Here s some helpful advice
Category:
Home And Family  

Why Cotoneaster Makes a Good Bonsai Candidate
Category:
Home And Family  

Home Hair Care Tips for Dry Hair
Category:
Health / Fitness  

A Home Gym and Walking a Great Exercise Program
Category:
Health / Fitness  

Preparing For Cosmetic Plastic Surgery
Category:
Health / Fitness  

Avoiding Razor Burn
Category:
Health / Fitness  

Curcumin An Anti Aging Herbal
Category:
Health / Fitness  

Take You Russian Fiance to an American Wedding Before You Get Ma...
Category:
Travel  

How and Why to Get an Awesome X Box 360 Skin for your XBOX Conso...
Category:
Entertainment / Television  

Where Are All of The Best Job Search Engines
Category:
Business  

The Power of Intention
Category:
Health / Fitness  

Traditional Therapies Can Prevent Heart Disease Too
Category:
Health / Fitness  

Handling devil Boss II
Category:
Home And Family  

10 Tips when using electronic forms
Category:
Business  

Mens Jewellery Snap Style Guide on Wearing Jewellery
Category:
Home And Family  

6 Things to Consider When Naming Your Baby
Category:
Home And Family  

Give a man six inches and he ll want a
Category:
Health / Fitness  

Stevie Wonder Challenges Memphis and the World
Category:
Entertainment / Television  

Writing the Resource Box so it Makes People click
Category:
Marketing  

Weight Loss Psychology
Category:
Health / Fitness  

Australia Visa Services Free Online Australian Immigration Asses...
Category:
Travel  

The Truth About Passive Income
Category:
Finance / Investment  

A New Way of Looking at NJ Divorce
Category:
Finance / Investment  

Can Stress Play a Role In Hair Loss
Category:
Health / Fitness  

Tips to Selecting an RSS News Aggregator
Category:
Computers  

WHY LABEL PRINTERS STAY SO BUSY
Category:
Business  

No Win No Fee Compensation Claims No Risk No Costs
Category:
Finance / Investment  

Why Heart Fails
Category:
Health / Fitness  

Find The Best Compensation Claim Specialist
Category:
Business  

Starting up a business in the 21st century
Category:
Business  

The Benefits of Press Releases
Category:
Business  

Tips on Improving the Positioning of your site on the Major
Category:
Computers  

Cheap Christmas Present
Category:
Home And Family  

How can a piece of article boost your marketing efforts
Category:
Marketing  

Philadelphia s Four Seasons Hotel For Business Vacations Or Wedd...
Category:
Travel  

7 Skin Care Tips Look Stunning in Your 50s
Category:
Health / Fitness  

Exercise Why Bother
Category:
Health / Fitness  

Frugal Living Money Making Ideas for Stay at Home Moms
Category:
Home And Family  

Internet marketing tips to help your business grow
Category:
Marketing  

Teen and Adolescence Acne
Category:
Health / Fitness  

Why Top Level Domain Names Mean Better Search Engine Rankings
Category:
Business  

How TO Start Making Money With Adsense
Category:
Business  

An Unplanned Pregnancy Raises Questions
Category:
Home And Family  

Why Do You Have Asthma
Category:
Health / Fitness  

Do Cosmetics Causes Acne
Category:
Health / Fitness  

What is Body Acne
Category:
Health / Fitness

System and method for detecting process and network failures in a distributed system Number:6,820,221 from the United States Patent and Trademark Office (PTO) owispatent

Home    Author Login    Submit Article    Article Search    Add Your Link    Edit Your Link    Contact Us    Advertising    Disclaimer

   

 
Web LinkGrinder.com

Top Breaking News
     Greek, Cypriot Leaders Resume Unification Talks in Nicosia by Nathan Morley
     Indonesia Tobacco Sales Grow, Raising Health Fears
     South Korea Allows Top Defector to Travel Overseas by VOA News

Title: System and method for detecting process and network failures in a distributed system

Abstract: The present invention provides a system and method of detecting a process failure and a network failure in a distributed system. The distributed system includes a plurality of processes, each executing on a host, operable to transmit messages (i.e., heartbeats) to each other on a network. A process in the system is operable to execute a process failure algorithm for detecting failure of a process in the system. The process failure algorithm includes calculating a difference in the period of time to receive a heartbeat from a first processes and a period of time to receive a heartbeat from a second process in the system. If the difference exceeds a process failure threshold, the second process is suspected of failing. A process in the system is also operable to execute a network failure algorithm for detecting failure of a network connecting a plurality of hosts in the system. The network failure algorithm includes detecting receipt of a heartbeat from any one of a plurality of processes in the system within a network failure time limit. If a heartbeat is not received prior to the expiration of the network failure time limit, the network in the system is suspected of failing.

Patent Number: 6,820,221 Issued on 11/16/2004 to Fleming


Inventors: Fleming; Roger A. (Fort Collins, CO)
Assignee: Hewlett-Packard Development Company, L.P. (Houston, TX)
Appl. No.: 833650
Filed: April 13, 2001


Current U.S. Class: 714/31 ; 714/55
Current International Class: G06F 11/00 (20060101)
Field of Search: 714/31,47,48,55,56


References Cited [Referenced By]

U.S. Patent Documents
4811200 March 1989 Wagner et al.
5884018 March 1999 Jardine et al.
6088330 July 2000 Bruck et al.
6321344 November 2001 Fenchel
6363496 March 2002 Kwiat
6647508 November 2003 Zalewski et al.
6678840 January 2004 Kessler et al.
Primary Examiner: Beausoliel; Robert

Parent Case Text



The following applications containing related subject matter and filed concurrently with the present application on Apr. 13, 2001 are hereby incorporated by reference: Ser. No. 09/833,771, entitled System and Method for Detecting Process and Network Failures in a Distributed System Having Multiple Independent Networks, Publication No. US 2002/0152432 A1; Ser. No. 09/833,573, entitled Probationary Members, Publication No. US 2002/0161849 A1; and Ser. No. 09/833,572 and entitled Adaptive Heartbeats, Publication No. U.S. 2002/0152446 A1.
Claims



What is claimed is:

1. A method of detecting a process failure in a distributed system having at least one network, the method comprising steps of: (1) measuring a first period of time between an instance a last heartbeat was received over a network from a first process executing on a first host and a later instance in time; (2) measuring a second period of time between an instance a last heartbeat was received over said network from a second process executing on a second host and said later instance in time; (3) comparing said first and second periods of time with a predetermined threshold; and (4) determining whether a process failure, and not a network failure, occurred in response to said comparison in step (3).

2. The method of claim 1, wherein step (3) further comprises steps of: calculating a difference between said first period of time and said second period of time; and comparing said difference to said predetermined threshold.

3. The method of claim 2, wherein said step (4) further comprises steps of: detecting a failure of said second process in response to said difference equaling or exceeding said predetermined threshold.

4. The method of claim 1, wherein said steps are performed as computer-executable instructions on a computer-readable medium.

5. A method of detecting a network failure in a distributed system, the method comprising steps of: (1) arranging for at least two processes executing respectively on first and second hosts to generate heartbeats and to apply them to a network; (2) determining whether a heartbeat is received over said network from at least one of said processes in the distributed system prior to an expiration of a heartbeat timeout; (3) detecting a failure of said network, as opposed to failure of a process, in response to not receiving a heartbeat from at least one of said process prior to said expiration of said heartbeat timeout; (4) if at least one heartbeat is received, determining whether heartbeats are received from all of said at least two processes; and (5) detecting a failure of at least one of said processes in response to not receiving heartbeats from all of said at least two processes.

6. The method of claim 5, wherein said steps are preformed as computer-executable instructions on a computer-readable medium.

7. A distributed system including a plurality of hosts connected via at least one network, wherein each host executes at least one process in said distributed system, said system comprising: first, second, and third hosts of said plurality of hosts executing respectively first, second, and third processes and interconnected by a network, at least said second and third processes generating heartbeats; wherein said first process on said first host is operable to detect one of failure of said second process executing on said second host and failure of said network, detection of failure of said network being based on expiration of a period of time without reception of any heartbeats transmitted over said network from either of said second and third processes on said second and third host, and failure of said second process being based on expiration of a period of time with reception of at least one heartbeat transmitted from said third process over said network but without reception of any heartbeats transmitted from said second process over said network.

8. The system of claim 7, wherein: said first process on said first host is operable to measure a first period of time between an instance when a last heartbeat was received from said third process on said third host on said network and a later instance in time, and to measure a second period of time between an instance when a last heartbeat was received from said second process on said second host on said network and said later instance in time; and said first process on said first host is further operable to compare said first and second periods of time with a predetermined threshold, and detect a failure of said second process in response to said comparison.

9. The system of claim 8, wherein said first process on said first host is further operable to calculate a difference between said first period of time and said second period of time, and to compare said difference to said predetermined threshold.

10. The system of claim 9, wherein said first process on said first host is operable to detect said failure of said second process in response to said difference exceeding said predetermined threshold.

11. The system of claim 10, wherein said first process on said first host is operable to remove said second process from a view in response to detecting said failure of said second process.

12. The system of claim 7, wherein said first process on said first host is operable to determine whether a heartbeat is received from at least one of said second and third processes over said network in said system prior to an expiration of a heartbeat timeout.

13. The system of claim 12, wherein said first process on said first host is further operable to detect said failure of said network in response to not receiving a heartbeat from said at least one of said second and third processes over said network prior to said expiration of said heartbeat timeout.
Description



FIELD OF THE INVENTION

The present invention is generally related to monitoring computer processes in a distributed system. More particularly, the present invention is related to detecting process and network failures in a distributed system.

BACKGROUND OF THE INVENTION

In recent years, reliable, high performance computer systems have been, and still are, in great demand. Users have also demanded the introduction and propagation of multi-processor distributed computer systems to support their computing processes (e.g. simulations, parallel processing, etc.). A distributed computer system generally includes a collection of processes and a collection of execution platforms (i.e., hosts). Each process may be capable of executing on a different host, and collectively, the processes function to provide a computer service. A failure of a critical process in a distributed system may result in the service halting. Therefore, techniques have been implemented for detecting a failure of a process in a timely manner, such that an appropriate action can be taken.

A conventional technique for detecting failure of a process includes the use of heartbeats, which are messages sent between processes at regular intervals of time. According to the heartbeat technique, if a process does not receive a heartbeat from a remote process prior to the expiration of a predetermined length of time, i.e., the heartbeat timeout, the remote process is suspected to have failed. Corrective action, such as eliminating the suspected process, may thus be taken.

A remote process not transmitting a heartbeat may not be an indication of a failure in the remote process. Instead, a network failure may have prevented a process from receiving a heartbeat from the remote process, especially when multiple processes in a distributed system are communicating over a common network. For example, a network failure may include a network pause (i.e., a temporary condition that prevents communication on a network) or a less temporary network failure, such as a hardware failure for hardware facilitating transmission on the network. A network pause, for example, can be the result of heavy, high-priority traffic over a network link, sometimes caused by other processes (e.g., remote machine backups). If the network pause endures for a period of time greater than the heartbeat timeout or if a network failure occurs, each process waiting for a heartbeat transmitted over the network in the distributed system may suspect the other processes of failing. Then, each process may take unnecessary corrective actions, such as eliminating and/or replacing the suspected processes from the distributed system, which can cause each service provided by the processes in the distributed system to be halted. If network conditions can be detected, appropriate corrective action could be taken, such as establishing connections between the distributed system processes using alternative paths.

SUMMARY OF THE INVENTION

An aspect of the present invention is to provide a system and method for detecting and distinguishing between a process failure and a network failure in a distributed system.

In one respect, the present invention includes a system and method for detecting a process failure in a distributed system. A process in the distributed system is connected to a plurality of other processes in the distributed system via a network. If the difference in the period of time to receive a heartbeat from a first of the plurality of processes and a period of time to receive a heartbeat from a second process of the plurality of processes exceeds a process failure threshold, the second process is suspected of failing.

In another respect, the present invention includes a system and method for detecting a network failure in the distributed system. A process in the distributed system monitors a plurality of other processes in the distributed system via a network. If the process fails to receive a heartbeat from any one of the plurality of processes within a network failure time limit, the network in the distributed system is suspected of failing.

The methods of the present invention include steps that may be performed by computer-executable instructions recorded on a computer-readable medium.

The present invention provides low cost simplistic techniques for detecting network and process failures in a distributed system. Accordingly, corrective action may be taken when failures are detected. Therefore, down-time for a service provided by the processes in the distributed system may be minimized. Those skilled in the art will appreciate these and other advantages and benefits of various embodiments of the invention upon reading the following detailed description of a preferred embodiment with reference to the below-listed drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the accompanying figures in which like numeral references refer to like elements, and wherein:

FIG. 1 illustrates an exemplary block diagram of a distributed system employing the principles of the present invention;

FIG. 2 illustrates a flow-diagram of an exemplary embodiment of a method employing the principles of the present invention;

FIG. 3 illustrates a flow-diagram of another exemplary embodiment of a method employing the principles of the present invention; and

FIG. 4 illustrates a flow-diagram of another exemplary embodiment of a method employing the principles of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that these specific details need not be used to practice the present invention. In other instances, well known structures, interfaces, and processes have not been shown in detail in order not to unnecessarily obscure the present invention.

FIG. 1 shows a distributed system 100 employing the principles of the present invention. The distributed system 100 includes host 1, host 2 and host 3 executing process A, process B and process C, respectively. Processes A-C function to provide a service to a plurality of users via distributed system 100. Hosts 1-3 are connected via bi-directional communication paths 110, 120 and 130. Communication paths 110, 120 and 130 include network links in one network 150. Hosts 1-3 are typical nodes in a distributed system and can include a data processing system, memory and network interface, all of which are not specifically shown. It will be apparent to those of ordinary skill in the art that an arbitrary number of hosts in distributed system 100 may be supported in an arbitrary configuration. Furthermore, each host may execute one or more processes.

An administration function performed by distributed system 100 can include detecting failure of one or more of processes A-C, such that corrective action (e.g., eliminating and/or replacing a failed process) can be taken when a process fails. For example, a failed process may be removed from a "view", when a consensus is reached that the process has failed. Accordingly, processes A-C, executing on hosts 1-3 respectively, transmit heartbeats on communication paths 110-130 in network 150 to detect a process failure. Process A may utilize a process failure algorithm for detecting a failure of a process in system 100. The process failure algorithm includes comparing the difference between a period of time to receive a heartbeat from a first process and period of time to receive a heartbeat from a second process to a process failure threshold. For example, process A monitors processes B and C by monitoring heartbeats transmitted on communication paths 110 and 130 from processes B and C, respectively. If the difference between a period of time to receive a heartbeat from process B and a period of time to receive a heartbeat from process C exceeds a process failure threshold, process B is suspected of failing. The process failure threshold may be a predetermined threshold or a threshold that can automatically adapt to varying network conditions. It will be apparent to one of ordinary skill in the art that the threshold may be determined based upon the network configuration, average network traffic and/or other factors relevant to network transmission.

System 100 may also detect failure of network 150 using a network failure algorithm, such as determining whether a heartbeat is received from any process in system 100 prior to expiration of a network failure time limit. For example, process A monitors processes B and C by monitoring heartbeats transmitted on communication paths 110 and 130 from processes B and C, respectively. If process A fails to receive a heartbeat from any one of processes B and C within a network failure time limit, network 150 is suspected of failing. Similarly to the process failure threshold, the network failure time limit may be predetermined or adaptive. It will be apparent to one of ordinary skill in the art that the time limit may be determined based upon the network configuration, average network traffic and other factors relevant to network transmission.

A network failure may include a network condition that prevents communication on the network for a predetermined period of time. For example, a network failure may include a network pause (i.e., a temporary condition that prevents communication on a network) or a less temporary network failure, such as a hardware failure for hardware facilitating transmission on the network. A network pause, for example, can be the result of heavy, high priority traffic over a network link, sometimes caused by other processes (e.g., remote machine backups).

Based on the monitoring of processes B and C, process A may take appropriate corrective action. For example, when process A determines that process B has failed, process A can eliminate and/or replace process B. Alternatively, when process A determines that a network failure may have occurred, process A may take a different action, such as waiting for a condition causing a network pause to clear or attempting to establish new communication path(s) over a different network or alternative paths within network 150.

A flow-diagram, shown in FIG. 2, illustrates an exemplary embodiment of a method 200 for implementing the network failure algorithm of the present invention. The steps shown in FIG. 2 are described with respect to processes A-C in distributed system 100. It will be apparent to one of ordinary skill in the art, however, that the method shown in FIG. 2 is applicable to distributed systems having a variety of configurations and having a process monitoring more than two processes.

In step 210, process A determines whether a heartbeat is received from any process (e.g., process B or process C) in network 150 prior to the expiration of the network failure time limit. If a heartbeat is not received prior to the expiration of the network failure time limit, network 150 is suspected to have failed and appropriate corrective action may be taken (step 215). If a heartbeat is received prior to the expiration of the network failure time limit, the network failure time limit is reset (step 220). Then, method 200 is repeated.

A flow-diagram, shown in FIG. 3, illustrates an exemplary embodiment of a method 300 including the steps of the process failure algorithm of the present invention. The steps shown in FIG. 3 are described with respect to processes A-C in distributed system 100. It will be apparent to one of ordinary skill in the art that the method shown in FIG. 3 is applicable to distributed systems having a variety of configurations and having a process monitoring more than two processes. Also, it will be apparent to one of ordinary skill in the art that the process failure algorithms may be implemented using a plurality of techniques.

In step 305, a first period of time between an instance a last heartbeat was received from a first process (e.g., process B) and a later instance in time is measured. In step 310, a second period of time between an instance a last heartbeat was received from a second process (e.g., process C) and the later instance in time is measured. In step 320, the difference between the first and second periods of time is calculated. In step 330, the difference is compared to the process failure threshold. If the difference exceeds the process failure threshold, the second process is suspected of failing (step 340), and appropriate corrective action may be taken. If the difference does not exceed the process failure threshold, a failure of the second process is not suspected (step 350).

A flow-diagram, shown in FIG. 4, illustrates an exemplary embodiment of a method 400 implementing the process failure algorithm of the present invention in a distributed system. The steps shown in FIG. 4 are described with respect to processes A-C in distributed system 100. It will be apparent to one of ordinary skill in the art, however, that the method shown in FIG. 4 is applicable to distributed systems having a variety of configurations and having a process monitoring more than two processes.

In step 405, process A receives a heartbeat from a first process (e.g., process B in system 100). In step 410, a timer is started for detecting a heartbeat timeout of a second process (e.g., process C) in distributed system 100 that is monitored by process A. In step 415, process A determines whether a heartbeat is received from process C. If a heartbeat is received from process C, the timer is cancelled (step 420). If a heartbeat is not received from process C, process A determines whether the heartbeat timeout for process C is expired (step 425). The heartbeat timeout may be predetermined or adaptive, similar to the process failure threshold. An adaptive heartbeat timeout technique is described in co-pending U.S. Pat. Application Ser. No. 09/833,572 , entitled Adaptive Heartbeats and incorporated by reference herein. It will be apparent to one of ordinary skill in the art that a predetermined heartbeat timeout may be determined based upon the network configuration, average network traffic arid other factors relevant to network transmission.

If the heartbeat timeout is expired, process A suspects a failure of process C (step 430), and process A may take appropriate corrective action. If the heartbeat timeout is not expired, process A determines whether a heartbeat is received from another process (step 415).

The methods shown in FIGS. 2-4 detect process and network failures. Accordingly, corrective actions tailored to the type of failure detected can be taken to reach a timely solution. Thus, down-time is limited for service(s) facilitated by processes executing in a distributed system.

The methods shown in FIGS. 2-4 and described above can be performed by a computer program. The computer program can exist in a variety of forms both active and inactive. For example, the computer program can exist as software possessing program instructions or statements in source code, object code, executable code or other formats; firmware program(s); or hardware description language (HDL) files. Any of the above can be embodied on a computer readable medium, which include storage devices and signals, in compressed or uncompressed form. Exemplary computer readable storage devices include conventional computer system RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Exemplary computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running the computer program can be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of executable software program(s) of the computer program on a CD ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general.

Also, the methods shown in FIGS. 2-4 and described above may be performed by a process facilitating a service, such as process A in distributed system 100, or performed by a separate process executed on a host in a distributed system.

While this invention has been described in conjunction with the specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. There are changes that may be made without departing from the spirit and scope of the invention.

*


Free Web Sudoku Puzzles.
Solve with your browser.
        3        
    5 6     9    
8   6 5 4     7  
  2           5  
3       8       6
  7           1  
  9     6 3 2   1
    2     8 7    
        1        
What is it?



Add Your Site · Terms Of Service · Privacy Policy


DISCLAIMER
Linkgrinder is a free service that searches the Internet and indexes all files found so that you may search quickly and easily for shared files. These files are created and made available individually by users whose identity we are not aware of and who we have no control over. In essence we function like a search engine tool; these files ARE NOT STORED OR SERVED BY OUR NETWORK. We are not responsible for any materials obtained by using our service. We do not monitor any of the contents of these files. These files may contain viruses, illegal materials, materials inappropriate for minors, offensive files and the like. BY USING OUR SERVICE, YOU ASSUME FULL RESPONSIBILITY FOR DOWNLOADING THESE MATERIALS AND WILL INDEMNIFY US FOR ANY DAMAGES THAT MAY BE INCURRED.

For More Specific Information VIEW OUR TERMS OF SERVICE.

Thank you and Enjoy!