Title: Nucleotide sequence of the Haemophilus influenzae Rd genome, fragments thereof, and uses thereof
Abstract: The present invention provides the sequencing of the entire genome of Haemophilus influenzae Rd, SEQ ID NO:1. The present invention further provides the sequence information stored on computer readable media, and computer-based systems and methods which facilitate its use. In addition to the entire genomic sequence, the present invention identifies over 1700 protein encoding fragments of the genome and identifies, by position relative to a unique Not I restriction endonuclease site, any regulatory elements which modulate the expression of the protein encoding fragments of the Haemophilus genome.
Patent Number: 6,846,651 Issued on 01/25/2005 to Fleischmann, et al.
Inventors:
Fleischmann; Robert D. (Gaithersburg, MD);
Adams; Mark D. (Rockville, MD);
White; Owen (Gaithersburg, MD);
Smith; Hamilton O. (Reisterstown, MD);
Venter; J. Craig (Queenstown, MD)
Assignee:
Human Genome Sciences, Inc. (Rockville, MD);
Johns Hopkins University (Baltimore, MD)
Appl. No.: 158865
Filed: June 3, 2002
Current U.S. Class: 435/69.1; 435/252.3; 435/320.1; 536/23.7
Intern'l Class: C12N 015/63; C12N 001/21; C12N 015/31
Field of Search: 435/69.1, 320.1, 252.3; 536/23.7
References Cited [Referenced By]
U.S. Patent Documents
6528289 | Mar., 2003 | Fleischmann et al. | 435/91.
Other References
Schein, Production of soluble recombinant proteins in bacteria, Biotechnology
vol. 7, pp. 1141-1147 (1989).
Primary Examiner: Brusca; John S.
Attorney, Agent or Firm: Human Genome Sciences, Inc.
Government Interests
STATEMENT REGARDING FED SPONSORED R & D
Part of the work performed during development of this invention utilized
U.S. Government funds. The government may have certain rights in this
invention. NIH-5R01GM48251
Parent Case Text
This appln is a DIV of Ser. No. 09/557,884, filed Apr. 25, 2000, now U.S.
Pat. No. 6,506,881 which is a con of Ser. No. 08/476,102 filed Jun. 7,
1995, now U.S. Pat. No. 6,355,450 which is a CIP of Ser. No. 08/426,787
filed Apr. 21, 1995, abandoned.
Claims
What is claimed is:
1. An isolated polynucleotide comprising a nucleic acid sequence encoding
an amino acid sequence encoded by ORF HI0270, represented by nucleotides
301245-302267 of SEQ ID NO:1.
2. The isolated polynucleotide of claim 1, wherein said polynucleotide
comprises a heterologous polynucleotide sequence.
3. The isolated polynucleotide of claim 2, wherein said heterologous
polynucleotide sequence encodes a heterologous polypeptide.
4. A nucleic acid sequence complementary to the polynucleotide of claim 1.
5. A method for making a recombinant vector comprising inserting the
isolated polynucleotide of claim 1 into a vector.
6. A recombinant vector comprising the isolated polynucleotide of claim 1.
7. The recombinant vector of claim 6, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
8. A recombinant host cell comprising the isolated polynucleotide of claim
1.
9. The recombinant host cell of claim 8, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
10. A method for producing a polypeptide, comprising:
(a) culturing a cell under conditions suitable to produce a polypeptide
encoded by the polynucleotide of claim 1; and
(b) recovering the polypeptide.
11. An isolated polynucleotide comprising a nucleic acid sequence encoding
a fragment of the amino acid sequence encoded by ORF HI0270, represented
by nucleotides 301245-302267 of SEQ ID NO:1, wherein said fragment
specifically binds an antibody which specifically binds a polypeptide
consisting of the amino acid sequence of HI0270.
12. The isolated polynucleotide of claim 11, wherein said polynucleotide
comprises a heterologous polynucleotide sequence.
13. The isolated polynucleotide of claim 12, wherein said heterologous
polynucleotide sequence encodes a heterologous polypeptide.
14. An isolated polynucleotide complementary to the polynucleotide of claim
11.
15. A method for making a recombinant vector comprising inserting the
isolated polynucleotide of claim 11 into a vector.
16. A recombinant vector comprising the isolated polynucleotide of claim
11.
17. The recombinant vector of claim 16, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
18. A recombinant host cell comprising the isolated polynucleotide of claim
11.
19. The recombinant host cell of claim 18, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
20. A method for producing a polypeptide, comprising:
(a) culturing a host cell under conditions suitable to produce a
polypeptide encoded by the polynucleotide of claim 11; and
(b) recovering the polypeptide from the cell culture.
21. An isolated polynucleotide fragment comprising a nucleic acid sequence
which hybridizes under hybridization conditions, comprising hybridization
in 5×SSC and 50% formamide at 50-65°C and washing in a wash
buffer consisting of 0.5×SSC at 50-65°C, to the
complementary strand of ORF HI0270, represented by nucleotides
301245-302267 of SEQ ID NO:1.
22. The isolated polynucleotide of claim 21, wherein said polynucleotide
comprises a heterologous polynucleotide sequence.
23. The isolated polynucleotide of claim 22, wherein said heterologous
polynucleotide sequence encodes a heterologous polypeptide.
24. An isolated polynucleotide complementary to the polynucleotide of claim
21.
25. A method for making a recombinant vector comprising inserting the
isolated polynucleotide of claim 21 into a vector.
26. A recombinant vector comprising the isolated polynucleotide of claim
21.
27. The recombinant vector of claim 26, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
28. A recombinant host cell comprising the isolated polynucleotide of claim
21.
29. The recombinant host cell of claim 28, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
30. A method for producing a polypeptide, comprising:
(a) culturing a host cell under conditions suitable to produce a
polypeptide encoded by the polynucleotide of claim 21; and
(b) recovering the polypeptide from the cell culture.
31. An isolated polynucleotide comprising a nucleic acid sequence encoding
a polypeptide fragment consisting of at least 10 contiguous amino acid
residues and no more than 100 amino acid residues of the amino acid
sequence encoded by ORF HI0270, represented by nucleotides 301245-302267
of SEQ ID NO:1.
32. The isolated polynucleotide of claim 31, wherein said polynucleotide
comprises a heterologous polynucleotide sequence.
33. The isolated polynucleotide of claim 32, wherein said heterologous
polynucleotide sequence encodes a heterologous polypeptide.
34. An isolated polynucleotide complementary to the polynucleotide of claim
31.
35. A method for making a recombinant vector comprising inserting the
isolated polynucleotide of claim 31 into a vector.
36. A recombinant vector comprising the isolated polynucleotide of claim
31.
37. The recombinant vector of claim 36, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
38. A recombinant host cell comprising the isolated polynucleotide of claim
31.
39. The recombinant host cell of claim 38, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
40. A method for producing a polypeptide, comprising:
(a) culturing a host cell under conditions suitable to produce a
polypeptide encoded by the polynucleotide of claim 31; and
(b) recovering the polypeptide from the cell culture.
41. An isolated polynucleotide fragment comprising a nucleic acid sequence
consisting of at least 30 contiguous nucleotide residues and no more than
300 contiguous nucleotide residues of an ORF HI0270, represented by
nucleotides 301245-302267 of SEQ ID NO:1.
42. The isolated polynucleotide of claim 41, wherein said polynucleotide
comprises a heterologous polynucleotide sequence.
43. The isolated polynucleotide of claim 42, wherein said heterologous
polynucleotide sequence encodes a heterologous polypeptide.
44. An isolated polynucleotide complementary to the polynucleotide of claim
41.
45. A method for making a recombinant vector comprising inserting the
isolated polynucleotide of claim 41 into a vector.
46. A recombinant vector comprising the isolated polynucleotide of claim
41.
47. The recombinant vector of claim 46, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
48. A recombinant host cell comprising the isolated polynucleotide of claim
41.
49. The recombinant host cell of claim 48, wherein said polynucleotide is
operably associated with a heterologous regulatory sequence that controls
gene expression.
50. A method for producing a polypeptide, comprising:
(a) culturing a host cell under conditions suitable to produce a
polypeptide encoded by the polynucleotide of claim 41; and
(b) recovering the polypeptide from the cell culture.
Description
REFERENCE TO SEQUENCE LISTING, A TABLE, OR A COMPUTER LISTING APPENDIX
This application refers to a "Sequence Listing" listed below, which is
provided as an electronic document on two identical compact discs (CD-R),
labeled "Copy 1" and "Copy 2." These compact discs each contain the file
"PB186P2C1D1.ST25.txt" (2,385,030 bytes, created on May 31, 2002), which
is hereby incorporated in its entirety herein.
1. Field of the Invention
The present invention relates to the field of molecular biology. The
present invention discloses compositions comprising the nucleotide
sequence of Haemophilus influenzae, fragments thereof and usage in
industrial fermentation and pharmaceutical development.
2. Background of the Invention
The complete genome sequence from a free living cellular organism has never
been determined. The first mycobacterium sequence should be completed by
1996, while E. coli and S. cerevisiae are expected to be completed before
1998. These are being done by random and/or directed sequencing of
overlapping cosmid clones. No one has attempted to determine sequences of
the order of a megabase or more by a random shotgun approach.
H. influenzae is a small (approximately 0.4 × 1 micron) non-motile,
non-spore forming gram-negative bacterium whose only natural host is
human. It is a resident of the upper respiratory mucosa of children and
adults and causes otitis media and respiratory tract infections mostly in
children. The most serious complication is meningitis, which produces
neurological sequelae in up to 50% of affected children. Six H. influenzae
serotypes (a through f) have been identified based on immunologically
distinct capsular polysaccharide antigens. A number of non-typeable
strains are also known. Serotype b accounts for the majority of human
disease.
Interest in the medically important aspects of H. influenzae biology has
focused particularly on those genes which determine virulence
characteristics of the organism. A number of the genes responsible for the
capsular polysaccharide have been mapped and sequenced (Kroll et al., Mol.
Microbiol. 5(6):1549-1560 (1991)). Several outer membrane protein (OMP)
genes have been identified and sequenced (Langford et al., J. Gen.
Microbiol. 138:155-159 (1992)). The lipooligosaccharide (LOS) component of
the outer membrane and the genes of its synthetic pathway are under
intensive study (Weiser et al., J. Bacteriol. 172:3304-3309 (1990)). While
a vaccine has been available since 1984, the study of outer membrane
components is motivated to some extent by the need for improved vaccines.
Recently, the catalase gene was characterized and sequenced as a possible
virulence-related gene (Bishni et al., in press). Elucidation of the H.
influenzae genome will enhance the understanding of how H. influenzae
causes invasive disease and how best to combat infection.
H. influenzae possesses a highly efficient natural DNA transformation
system which has been intensively studied in the non-encapsulated (R),
serotype d strain (Kahn and Smith, J. Membrane Biology 81:89-103 (1984)).
At least 16 transformation-specific genes have been identified and
sequenced. Of these, four are regulatory (Redfield, J. Bacteriol.
173:5612-5618 (1991), and Chandler, Proc. Natl. Acad. Sci. USA
89:1626-1630 (1992)), at least two are involved in recombination processes
(Barouki and Smith, J. Bacteriol 163(2):629-634 (1985)), and at least
seven are targeted to the membranes and periplasmic space (Tomb et al.,
Gene 104:1-10 (1991), and Tomb, Proc. Natl. Acad. Sci. USA 89:10252-10256
(1992)), where they appear to function as structural components or in the
assembly of the DNA transport machinery. H. influenzae Rd transformation
shows a number of interesting features including sequence-specific DNA
uptake, rapid uptake of several double-stranded DNA molecules per
competent cell into a membrane compartment called the transformasome,
linear translocation of a single strand of the donor DNA into the
cytoplasm, and synapsis and recombination of the strand with the
chromosome by a single-strand displacement mechanism. The H. influenzae Rd
transformation system is the most thoroughly studied of the gram-negative
systems and distinct in a number of ways from the gram-positive systems.
The size of H. influenzae Rd genome has been determined by pulsed-field
agarose gel electrophoresis of restriction digests to be approximately 1.9
Mb, making its genome approximately 40% the size of E. coli (Lee and
Smith, J. Bacteriol. 170:4402-4405 (1988)). The restriction map of H.
influenzae is circular (Lee et al., J. Bacteriol. 171:3016-3024 (1989),
and Redfield and Lee, "Haemophilus influenzae Rd", pp. 2110-2112, In
O'Brien, S. J. (ed), Genetic Maps: Locus Maps of Complex Genomes, Cold
Spring Harbor Press, New York). Various genes have been mapped to
restriction fragments by Southern hybridization probing of restriction
digest DNA bands. This map will be valuable in verification of the
assembly of a complete genome sequence from randomly sequenced fragments.
GenBank currently contains about 100 kb of non-redundant H. influenzae DNA
sequences. About half are from serotype b and half from Rd.
SUMMARY OF THE INVENTION
The present invention is based on the sequencing of the Haemophilus
influenzae Rd genome. The primary nucleotide sequence which was generated
is provided in SEQ ID NO:1.
The present invention provides the generated nucleotide sequence of the
Haemophilus influenzae Rd genome, or a representative fragment thereof, in
a form which can be readily used, analyzed, and interpreted by a skilled
artisan. In one embodiment, the present invention is provided as a contiguous
string of primary sequence information corresponding to the nucleotide
sequence depicted in SEQ ID NO:1.
The present invention further provides nucleotide sequences which are at
least 99.9% identical to the nucleotide sequence of SEQ ID NO:1.
The nucleotide sequence of SEQ ID NO:1, a representative fragment thereof,
or a nucleotide sequence which is at least 99.9% identical to the
nucleotide sequence of SEQ ID NO:1 may be provided in a variety of mediums
to facilitate its use. In one application of this embodiment, the
sequences of the present invention are recorded on computer readable
media. Such media include, but are not limited to: magnetic storage media,
such as floppy discs, hard disc storage medium, and magnetic tape; optical
storage media such as CD-ROM; electrical storage media such as RAM and
ROM; and hybrids of these categories such as magnetic/optical storage
media.
The present invention further provides systems, particularly computer-based
systems which contain the sequence information herein described stored in
a data storage means. Such systems are designed to identify commercially
important fragments of the Haemophilus influenzae Rd genome.
Another embodiment of the present invention is directed to isolated
fragments of the Haemophilus influenzae Rd genome. The fragments of the
Haemophilus influenzae Rd genome of the present invention include, but are
not limited to, fragments which encode peptides, hereinafter open reading
frames (ORFs), fragments which modulate the expression of an operably
linked ORF, hereinafter expression modulating fragments (EMFs), fragments
which mediate the uptake of a linked DNA fragment into a cell, hereinafter
uptake modulating fragments (UMFs), and fragments which can be used to
diagnose the presence of Haemophilus influenzae Rd in a sample,
hereinafter, diagnostic fragments (DFs).
Each of the ORF fragments of the Haemophilus influenzae Rd genome disclosed
in Tables 1(a) and 2, and the EMF found 5' to the ORF, can be used in
numerous ways as polynucleotide reagents. The sequences can be used as
diagnostic probes or diagnostic amplification primers for the presence of
a specific microbe in a sample, for the production of commercially
important pharmaceutical agents, and to selectively control gene
expression.
The present invention further includes recombinant constructs comprising
one or more fragments of the Haemophilus influenzae Rd genome of the
present invention. The recombinant constructs of the present invention
comprise vectors, such as a plasmid or viral vector, into which a fragment
of the Haemophilus influenzae Rd genome has been inserted.
The present invention further provides host cells containing any one of the
isolated fragments of the Haemophilus influenzae Rd genome of the present
invention. The host cells can be a higher eukaryotic host such as a
mammalian cell, a lower eukaryotic cell such as a yeast cell, or can be a
procaryotic cell such as a bacterial cell.
The present invention is further directed to isolated proteins encoded by
the ORFs of the present invention. A variety of methodologies known in the
art can be utilized to obtain any one of the proteins of the present
invention. At the simplest level, the amino acid sequence can be
synthesized using commercially available peptide synthesizers. In an
alternative method, the protein is purified from bacterial cells which
naturally produce the protein. Lastly, the proteins of the present
invention can alternatively be purified from cells which have been altered
to express the desired protein.
The invention further provides methods of obtaining homologs of the
fragments of the Haemophilus influenzae Rd genome of the present invention
and homologs of the proteins encoded by the ORFs of the present invention.
Specifically, by using the nucleotide and amino acid sequences disclosed
herein as a probe or as primers, and techniques such as PCR cloning and
colony/plaque hybridization, one skilled in the art can obtain homologs.
The invention further provides antibodies which selectively bind one of the
proteins of the present invention. Such antibodies include both monoclonal
and polyclonal antibodies.
The invention further provides hybridomas which produce the above-described
antibodies. A hybridoma is an immortalized cell line which is capable of
secreting a specific monoclonal antibody.
The present invention further provides methods of identifying test samples
derived from cells which express one of the ORFs of the present invention,
or a homolog thereof. Such methods comprise incubating a test sample with
one or more of the antibodies of the present invention, or one or more of
the DFs of the present invention, under conditions which allow a skilled
artisan to determine if the sample contains the ORF or product produced
therefrom.
In another embodiment of the present invention, kits are provided which
contain the necessary reagents to carry out the above-described assays.
Specifically, the invention provides a compartmentalized kit to receive, in
close confinement, one or more containers which comprises: (a) a first
container comprising one of the antibodies, or one of the DFs of the
present invention; and (b) one or more other containers comprising one or
more of the following: wash reagents, reagents capable of detecting
presence of bound antibodies or hybridized DFs.
Using the isolated proteins of the present invention, the present invention
further provides methods of obtaining and identifying agents capable of
binding to a protein encoded by one of the ORFs of the present invention.
Specifically, such agents include antibodies (described above), peptides,
carbohydrates, pharmaceutical agents and the like. Such methods comprise
the steps of:
(a) contacting an agent with an isolated protein encoded by one of the ORFs
of the present invention; and
(b) determining whether the agent binds to said protein.
The complete genomic sequence of H. influenzae will be of great value to
all laboratories working with this organism and for a variety of
commercial purposes. Many fragments of the Haemophilus influenzae Rd
genome will be immediately identified by similarity searches against
GenBank or protein databases and will be of immediate value to Haemophilus
researchers and for immediate commercial value for the production of
proteins or to control gene expression. A specific example concerns PHA
synthase. It has been reported that polyhydroxybutyrate is present in the
membranes of H. influenzae Rd and that the amount correlates with the
level of competence for transformation. The PHA synthase that synthesizes
this polymer has been identified and sequenced in a number of bacteria,
none of which are evolutionarily close to H. influenzae. This gene has yet
to be isolated from H. influenzae by use of hybridization probes or PCR
techniques. However, the genomic sequence of the present invention allows
the identification of the gene by utilizing search means described below.
Developing the methodology and technology for elucidating the entire
genomic sequence of bacterial and other small genomes has enhanced, and will
continue to greatly enhance, the ability to analyze and understand chromosomal organization. In
particular, sequenced genomes will provide the models for developing tools
for the analysis of chromosome structure and function, including the
ability to identify genes within large segments of genomic DNA, the
structure, position, and spacing of regulatory elements, the
identification of genes with potential industrial applications, and the
ability to perform comparative genomics and molecular phylogeny.
DESCRIPTION OF THE FIGURES
FIG. 1--Restriction map of the Haemophilus influenzae Rd genome.
FIG. 2--Block diagram of a computer system 102 that can be used to
implement the computer-based systems of the present invention.
FIG. 3--A comparison of experimental coverage of up to approximately 4000
random sequence fragments assembled with AutoAssembler (squares) as
compared to Lander-Waterman prediction for a 2.5 Mb genome (triangles) and
a 1.6 Mb genome (circles) with a 460 bp average sequence length and a 25
bp overlap.
FIG. 4--Data flow and computer programs used to manage, assemble, edit, and
annotate the H. influenzae genome. Both Macintosh and Unix platforms are
used to handle the AB 373 sequence data files (Kerlavage et al.,
Proceedings of the Twenty-Sixth Annual Hawaii International Conference on
System Sciences, IEEE Computer Society Press, Washington D.C., 585
(1993)). Factura (AB) is a Macintosh program designed for automatic vector
sequence removal and end trimming of sequence files. The program esp runs
on a Macintosh platform and parses the feature data extracted from the
sequence files by Factura to the Unix based H. influenzae relational
database. Assembly is accomplished by retrieving a specific set of
sequence files and their associated features using stp, an X-windows
graphical interface and control program which can retrieve sequences from
the H. influenzae database using user-defined or standard SQL queries. The
sequence files were assembled using TIGR Assembler, an assembly engine
designed at TIGR for rapid and accurate assembly of thousands of sequence
fragments. TIGR Editor is a graphical interface which can parse the
aligned sequence files from TIGR Assembler output and display the
alignment and associated electropherograms for contig editing.
Identification of putative coding regions was performed with Genemark
(Borodovsky and McIninch, Computers Chem. 17(2):123 (1993)), a Markov and
Bayes modeled program for predicting gene locations, and trained on a H.
influenzae sequence data set. Peptide searches were performed against the
three reading frames of each Genemark predicted coding region using blaze
(Brutlag et al., Computers Chem. 17:203 (1993)) run on a Maspar MP-2
massively parallel computer with 4096 microprocessors. Results from each
frame were combined into a single output file by mblzt. Optimal protein
alignments were obtained using the program praze which extends alignments
across potential frameshifts. The output was inspected using a custom
graphic viewing program, gbyob, that interacts directly with the H.
influenzae database. The alignments were further used to identify
potential frameshift errors and were targeted for additional editing.
FIG. 5--A circular representation of the H. influenzae Rd chromosome
illustrating the location of each predicted coding region containing a
database match as well as selected global features of the genome. Outer
perimeter: The location of the unique NotI restriction site (designated as
nucleotide 1), the RsrII sites, and the SmaI sites. Outer concentric
circle: The location of each identified coding region for which a gene
identification was made. Second concentric circle: Regions of high G/C
content and high A/T content. High G/C content regions are specifically
associated with the 6 ribosomal operons and the mu-like prophage. Third
concentric circle: Coverage by lambda clones. Over 300 lambda clones were
sequenced from each end to confirm the overall structure of the genome and
identify the 6 ribosomal operons. Fourth concentric circle: The locations
of the 6 ribosomal operons, the tRNAs and the cryptic mu-like prophage.
Fifth concentric circle: Simple tandem repeats. The locations of the
following repeats are shown: CTGGCT, GTCT, ATT, AATGGC, TTGA, TTGG, TTTA,
TTATC, TGAC, TCGTC, AACC, TTGC, CAAT, CCAA. The putative origin of
replication is illustrated by the outward pointing arrows originating near
base 603,000. Two potential termination sequences are shown near the
opposite midpoint of the circle.
FIGS. 6(A) to 6(AN)--Complete map of the H. influenzae Rd genome. Predicted
coding regions are shown on each strand. rRNA and tRNA genes are shown as
lines and triangles, respectively. GeneID numbers correspond to those in
Tables 1(a), 1(b) and 2. Where possible, three-letter designations are
also provided.
FIG. 7--A comparison of the region of the H. influenzae chromosome
containing the 8 genes of the fimbrial gene cluster present in H.
influenzae type b and the same region in H. influenzae Rd. The region is
flanked by the pepN and purE genes in both organisms. However, in the
non-infectious Rd strain the 8 genes of the fimbrial gene cluster have
been excised. A 172 bp spacer region is located in this region in the Rd
strain and continues to be flanked by the pepN and purE genes.
FIG. 8--Hydrophobicity analysis of five predicted channel proteins. The
amino acid sequences of five predicted coding regions that do not display
homology with known peptide sequences (GenBank release 87) each exhibit
multiple hydrophobic domains that are characteristic of channel-forming
proteins. The predicted coding region sequences were analyzed by the
Kyte-Doolittle algorithm (Kyte and Doolittle, J. Mol. Biol. 157:105
(1982)) (with a range of 11 residues) using the GeneWorks software package
(Intelligenetics).
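By way of illustration only, the Kyte-Doolittle analysis described for FIG. 8 can be sketched as a sliding-window average of per-residue hydropathy values. The Python sketch below assumes that the "range of 11 residues" corresponds to a window of 11 residues. The function name hydropathy_profile and the toy peptide are hypothetical; they are not taken from the GeneWorks package or from any predicted coding region of SEQ ID NO:1.

```python
from typing import List

# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 11) -> List[float]:
    """Mean hydropathy of every full window along the protein.

    The window of 11 is an assumed reading of "a range of 11 residues".
    """
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    scores = [KD[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical toy peptide: a hydrophobic stretch flanked by polar residues.
toy = "MKT" + "LIVALLIVGAL" + "DEKRS"
profile = hydropathy_profile(toy)
peak_index = max(range(len(profile)), key=profile.__getitem__)
print(len(profile), peak_index)
```

Windows with a high mean value mark candidate hydrophobic domains; the cutoff used to call a domain is a separate choice that this sketch does not make.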
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is based on the sequencing of the Haemophilus
influenzae Rd genome. The primary nucleotide sequence which was generated
is provided in SEQ ID NO:1. As used herein, the "primary sequence" refers
to the nucleotide sequence represented by the IUPAC nomenclature system.
The sequence provided in SEQ ID NO:1 is oriented relative to a unique Not I
restriction endonuclease site found in the Haemophilus influenzae Rd
genome. A skilled artisan will readily recognize that this start/stop
point was chosen for convenience and does not reflect a structural
significance.
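By way of illustration only, because numbering simply begins at the Not I site of a circular genome, a fragment can be addressed by 1-based inclusive coordinates, and a span that crosses the chosen start/stop point wraps around. The Python sketch below shows one such convention. The function names fragment and span_length and the 12-nucleotide toy sequence are hypothetical and are not part of the Sequence Listing.

```python
def fragment(genome: str, start: int, end: int) -> str:
    """Return nucleotides start..end (1-based, inclusive) of a circular sequence.

    If start > end, the span is taken to cross the numbering origin
    (the arbitrary start/stop point) and wraps around.
    """
    n = len(genome)
    if not (1 <= start <= n and 1 <= end <= n):
        raise ValueError("coordinates must lie within 1..len(genome)")
    if start <= end:
        return genome[start - 1:end]
    return genome[start - 1:] + genome[:end]


def span_length(start: int, end: int) -> int:
    """Length of a non-wrapping 1-based inclusive span."""
    return end - start + 1


toy = "GCGGCCGCAATT"  # hypothetical 12-nt circular sequence
print(fragment(toy, 3, 6))          # GGCC
print(fragment(toy, 11, 2))         # TTGC (wraps past the origin)
print(span_length(301245, 302267))  # 1023
```

The last line gives 1023, a multiple of three, which is consistent with the span recited for ORF HI0270 being an open reading frame.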
The present invention provides the nucleotide sequence of SEQ ID NO:1, or a
representative fragment thereof, in a form which can be readily used,
analyzed, and interpreted by a skilled artisan. In one embodiment, the
sequence is provided as a contiguous string of primary sequence
information corresponding to the nucleotide sequence provided in SEQ ID
NO:1.
As used herein, a "representative fragment of the nucleotide sequence
depicted in SEQ ID NO:1" refers to any portion of SEQ ID NO:1 which is not
presently represented within a publicly available database. Preferred
representative fragments of the present invention are Haemophilus
influenzae open reading frames, expression modulating fragments, uptake
modulating fragments, and fragments which can be used to diagnose the
presence of Haemophilus influenzae Rd in a sample. A non-limiting
identification of such preferred representative fragments is provided in
Tables 1(a) and 2.
The nucleotide sequence information provided in SEQ ID NO:1 was obtained by
sequencing the Haemophilus influenzae Rd genome using a megabase shotgun
sequencing method. Using three parameters of accuracy discussed in the
Examples below, the present inventors have calculated that the sequence in
SEQ ID NO:1 has a maximum accuracy of 99.98%. Thus, the nucleotide
sequence provided in SEQ ID NO:1 is a highly accurate, although not
necessarily a 100% perfect, representation of the nucleotide sequence of
the Haemophilus influenzae Rd genome.
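By way of illustration only, the expected progress of a random shotgun project can be estimated with the Lander-Waterman model referred to in the description of FIG. 3. The Python sketch below assumes the standard results of that model: coverage c = NL/G for N fragments of average length L on a genome of length G, an unsequenced fraction of about exp(-c), and an expected number of contigs of N exp(-c(1 - T/L)) for a minimum detectable overlap T. The function names are hypothetical. The read length (460 bp), overlap (25 bp), and genome sizes (1.6 Mb and 2.5 Mb) are those given for FIG. 3.

```python
import math


def coverage(n_reads: int, read_len: float, genome_len: float) -> float:
    """Average redundancy c = N * L / G."""
    return n_reads * read_len / genome_len


def unsequenced_fraction(n_reads: int, read_len: float, genome_len: float) -> float:
    """Expected fraction of the genome not covered by any read, exp(-c)."""
    return math.exp(-coverage(n_reads, read_len, genome_len))


def expected_contigs(n_reads: int, read_len: float, genome_len: float,
                     min_overlap: float) -> float:
    """Expected number of contigs, N * exp(-c * sigma), sigma = 1 - T / L."""
    sigma = 1.0 - min_overlap / read_len
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c * sigma)


# Parameters from the FIG. 3 description; read counts chosen for illustration.
for genome_len in (1.6e6, 2.5e6):
    for n in (1000, 2000, 4000):
        print(
            f"G={genome_len:.1e} N={n} "
            f"c={coverage(n, 460, genome_len):.2f} "
            f"contigs={expected_contigs(n, 460, genome_len, 25):.0f} "
            f"unsequenced={unsequenced_fraction(n, 460, genome_len):.3f}"
        )
```

For a fixed number of fragments, the smaller assumed genome yields higher coverage and fewer expected contigs, which is the kind of comparison the two predicted curves in FIG. 3 allow.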
As discussed in detail below, using the information provided in SEQ ID NO:1
and in Tables 1(a) and 2 together with routine cloning and sequencing
methods, one of ordinary skill in the art will be able to clone and
sequence all "representative fragments" of interest including open reading
frames (ORFs) encoding a large variety of Haemophilus influenzae proteins.
In very rare instances, this may reveal a nucleotide sequence error
present in the nucleotide sequence disclosed in SEQ ID NO: 1. Thus, once
the present invention is made available (i.e., once the information in SEQ
ID NO:1 and Tables 1(a) and 2 has been made available), resolving a rare
sequencing error in SEQ ID NO:1 will be well within the skill of the art.
Nucleotide sequence editing software is publicly available. For example,
Applied Biosystems' (AB) AutoAssembler™ can be used as an aid during
visual inspection of nucleotide sequences.
Even if all of the very rare sequencing errors in SEQ ID NO:1 were
corrected, the resulting nucleotide sequence would still be at least 99.9%
identical to the nucleotide sequence in SEQ ID NO:1.
The nucleotide sequences of the genomes from different strains of
Haemophilus influenzae differ slightly. However, the nucleotide sequence
of the genomes of all Haemophilus influenzae strains will be at least
99.9% identical to the nucleotide sequence provided in SEQ ID NO:1.
Thus, the present invention further provides nucleotide sequences which are
at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 in a
form which can be readily used, analyzed and interpreted by the skilled
artisan. Methods for determining whether a nucleotide sequence is at least
99.9% identical to the nucleotide sequence of SEQ ID NO:1 are routine and
readily available to the skilled artisan. For example, the well-known
FASTA algorithm (Pearson and Lipman, Proc. Natl. Acad. Sci. USA 85:2444
(1988)) can be used to generate the percent identity of nucleotide
sequences.
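By way of illustration only, the percent identity test can be sketched in
Python as follows. The sketch is not the FASTA algorithm: it assumes the
two sequences have already been aligned to equal length (for example by a
FASTA-style program), and the function names and the choice to count gap
characters as mismatches are assumptions made for this example.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns holding the same residue in both sequences.

    Assumption: inputs are pre-aligned, equal-length strings in which '-'
    marks a gap; a column containing a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("aligned sequences must be non-empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float = 99.9) -> bool:
    """True when the aligned sequences are at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Under these assumptions, a 1000-column alignment with a single mismatch
satisfies the 99.9% criterion, while one with two mismatches does not.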
Computer Related Embodiments
The nucleotide sequence provided in SEQ ID NO:1, a representative fragment
thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1
may be "provided" in a variety of mediums to facilitate use thereof. As
used herein, "provided" refers to a manufacture, other than an isolated
nucleic acid molecule, which contains a nucleotide sequence of the present
invention, i.e., the nucleotide sequence provided in SEQ ID NO:1, a
representative fragment thereof, or a nucleotide sequence at least 99.9%
identical to SEQ ID NO:1. Such a manufacture provides the Haemophilus
influenzae Rd genome or a subset thereof (e.g., a Haemophilus influenzae
Rd open reading frame (ORF)) in a form which allows a skilled artisan to
examine the manufacture using means not directly applicable to examining
the Haemophilus influenzae Rd genome or a subset thereof as it exists in
nature or in purified form.
In one application of this embodiment, a nucleotide sequence of the present
invention can be recorded on computer readable media. As used herein,
"computer readable media" refers to any medium which can be read and
accessed directly by a computer. Such media include, but are not limited
to: magnetic storage media, such as floppy discs, hard disc storage
medium, and magnetic tape; optical storage media such as CD-ROM;
electrical storage media such as RAM and ROM; and hybrids of these
categories such as magnetic/optical storage media. A skilled artisan can
readily appreciate how any of the presently known computer readable
mediums can be used to create a manufacture comprising computer readable
medium having recorded thereon a nucleotide sequence of the present
invention.
As used herein, "recorded" refers to a process for storing information on
computer readable medium. A skilled artisan can readily adopt any of the
presently known methods for recording information on computer readable
medium to generate manufactures comprising the nucleotide sequence
information of the present invention.
A variety of data storage structures are available to a skilled artisan for
creating a computer readable medium having recorded thereon a nucleotide
sequence of the present invention. The choice of the data storage
structure will generally be based on the means chosen to access the stored
information. In addition, a variety of data processor programs and formats
can be used to store the nucleotide sequence information of the present
invention on computer readable medium. The sequence information can be
represented in a word processing text file, formatted in
commercially-available software such as WordPerfect and Microsoft Word, or
represented in the form of an ASCII file, or stored in a database
application, such as DB2, Sybase, Oracle, or the like. A skilled artisan
can readily adapt any number of data processor structuring formats (e.g.,
text file or database) in order to obtain computer readable medium having
recorded thereon the nucleotide sequence information of the present
invention.
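As one hedged illustration of the ASCII file option, the following Python
sketch records a sequence in a FASTA-style text file and reads it back.
The FASTA layout, the 60-character line width, the record name "demo" and
all function names are assumptions chosen for the example; the text above
does not prescribe any particular file layout.

```python
import os
import tempfile


def write_fasta(path: str, name: str, sequence: str, width: int = 60) -> None:
    """Record one sequence as an ASCII file: a '>' header line, then wrapped lines."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(f">{name}\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


def read_fasta(path: str) -> tuple[str, str]:
    """Read back a single-record file written by write_fasta."""
    name = None
    chunks = []
    with open(path, encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    raise ValueError("only single-record files are supported")
                name = line[1:]
            else:
                chunks.append(line)
    if name is None:
        raise ValueError("missing header line")
    return name, "".join(chunks)


def round_trip(sequence: str) -> tuple[str, str]:
    """Write `sequence` to a temporary file, read it back, and clean up."""
    fd, path = tempfile.mkstemp(suffix=".fasta")
    os.close(fd)
    try:
        write_fasta(path, "demo", sequence)
        return read_fasta(path)
    finally:
        os.remove(path)
```

A plain text layout of this kind can be read by most sequence analysis
programs, whereas a database application would be the more natural choice
when the stored information is to be queried by position or annotation.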
By providing the nucleotide sequence of SEQ ID NO:1, a representative
fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ
ID NO:1 in computer readable form, a skilled artisan can routinely access
the sequence information for a variety of purposes. Computer software is
publicly available which allows a skilled artisan to access sequence
information provided in a computer readable medium. The examples which
follow demonstrate how software which implements the BLAST (Altschul et
al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE (Brutlag et al., Comp.
Chem. 17:203-207 (1993)) search algorithms on a Sybase system was used to
identify open reading frames (ORFs) within the Haemophilus influenzae Rd
genome which contain homology to ORFs or proteins from other organisms.
Such ORFs are protein encoding fragments within the Haemophilus influenzae
Rd genome and are useful in producing commercially important proteins such
as enzymes used in fermentation reactions and in the production of
commercially useful metabolites.
The present invention further provides systems, particularly computer-based
systems, which contain the sequence information described herein. Such
systems are designed to identify commercially important fragments of the
Haemophilus influenzae Rd genome.
As used herein, "a computer-based system" refers to the hardware means,
software means, and data storage means used to analyze the nucleotide
sequence information of the present invention. The minimum hardware means
of the computer-based systems of the present invention comprises a central
processing unit (CPU), input means, output means, and data storage means.
A skilled artisan can readily appreciate that any one of the currently
available computer-based systems is suitable for use in the present
invention.
As stated above, the computer-based systems of the present invention
comprise a data storage means having stored therein a nucleotide sequence
of the present invention and the necessary hardware means and software
means for supporting and implementing a search means. As used herein,
"data storage means" refers to memory which can store nucleotide sequence
information of the present invention, or a memory access means which can
access manufactures having recorded thereon the nucleotide sequence
information of the present invention.
As used herein, "search means" refers to one or more programs which are
implemented on the computer-based system to compare a target sequence or
target structural motif with the sequence information stored within the
data storage means. Search means are used to identify fragments or regions
of the Haemophilus influenzae Rd genome which match a particular target
sequence or target motif. A variety of algorithms are publicly known, and
a variety of software for conducting search means is commercially
available; both can be used in the computer-based systems of the
present invention. Examples of such software include, but are not limited
to, MacPattern (EMBL), BLASTN and BLASTX (NCBI). A skilled artisan can
readily recognize that any one of the available algorithms or implementing
software packages for conducting homology searches can be adapted for use
in the present computer-based systems.
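The simplest possible search means can be sketched in Python as an exact
match search over both strands of a stored sequence. This is an
illustration only and is not the BLASTN, BLASTX or MacPattern
implementation; the function names and the 0-based position convention are
assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_target(genome: str, target: str) -> list[tuple[int, str]]:
    """Return (0-based position, strand) for every exact match of `target`.

    Positions always refer to the stored (plus) strand. A '-' hit means the
    reverse complement of the target was found at that position. A target
    equal to its own reverse complement is reported on '+' only.
    """
    genome, target = genome.upper(), target.upper()
    if not target:
        raise ValueError("target must be non-empty")
    queries = [("+", target)]
    rc = reverse_complement(target)
    if rc != target:
        queries.append(("-", rc))
    hits = []
    for strand, query in queries:
        start = genome.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(query, start + 1)  # allow overlapping matches
    return sorted(hits)
```

Homology search programs extend this idea by tolerating mismatches and
gaps and by scoring each candidate match, which an exact search cannot do.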
As used herein, a "target sequence" can be any DNA or amino acid sequence
of six or more nucleotides or two or more amino acids. A skilled artisan
can readily recognize that the longer a target sequence is, the less
likely a target sequence will be present as a random occurrence in the
database. The most preferred sequence length of a target sequence is from
about 10 to 100 amino acids or from about 30 to 300 nucleotide residues.
However, it is well recognized that searches for commercially important
fragments of the Haemophilus influenzae Rd genome, such as sequence
fragments involved in gene expression and protein processing, may be of
shorter length.
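The relationship between target length and chance matches can be made
concrete with a short Python sketch. It assumes, purely for illustration,
that residues occur independently and with equal frequency, and it
considers a single strand; real genomes deviate from both assumptions. The
function name and the database length used in the usage note are
hypothetical.

```python
def expected_random_hits(target_length: int, database_length: int,
                         alphabet_size: int = 4) -> float:
    """Expected number of chance exact matches of one target in a database.

    Assumes a uniform, independent residue model: each of the
    (database_length - target_length + 1) start positions matches with
    probability (1 / alphabet_size) ** target_length.
    """
    if target_length < 1 or database_length < 0 or alphabet_size < 2:
        raise ValueError("invalid arguments")
    positions = max(database_length - target_length + 1, 0)
    return positions * (1.0 / alphabet_size) ** target_length
```

For a hypothetical database of about 1.8 million nucleotides, calling
expected_random_hits(6, 1_800_000) gives several hundred chance matches
for a six-nucleotide target, whereas a thirty-nucleotide target is
expected to match by chance far less than once.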
As used herein, "a target structural motif," or "target motif," refers to
any rationally selected sequence or combination of sequences in which the
sequence(s) are chosen based on a three-dimensional configuration which is
formed upon the folding of the target motif. There are a variety of target
motifs known in the art. Protein target motifs include, but are not
limited to, enzymic active sites and signal sequences. Nucleic acid target
motifs include, but are not limited to, promoter sequences, hairpin
structures and inducible expression elements (protein binding sequences).
A variety of structural formats for the input and output means can be used
to input and output the information in the computer-based systems of the
present invention. A preferred format for an output means ranks fragments
of the Haemophilus influenzae Rd genome possessing varying degrees of
homology to the target sequence or target motif. Such presentation
provides a skilled artisan with a ranking of sequences which contain
various amounts of the target sequence or target motif and identifies the
degree of homology contained in the identified fragment.
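A ranked output of this kind can be sketched in Python as follows. The
sketch scores every window of the stored sequence by ungapped percent
identity to the target, which is a deliberately simple stand-in for the
scoring used by homology search programs; the function name, the tuple
layout and the tie-breaking rule are assumptions made for the example.

```python
def rank_windows(genome: str, target: str, top: int = 5) -> list[tuple[float, int, str]]:
    """Rank genome windows by ungapped percent identity to `target`.

    Returns up to `top` tuples of (percent identity, 0-based start, window),
    best first; ties are broken by the smaller start position.
    """
    genome, target = genome.upper(), target.upper()
    k = len(target)
    if k == 0 or k > len(genome):
        return []
    scored = []
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        matches = sum(a == b for a, b in zip(window, target))
        scored.append((100.0 * matches / k, start, window))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]
```

Each returned tuple identifies both the fragment and its degree of
identity to the target, so the list can be presented directly as a ranking.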
A variety of comparing means can be used to compare a target sequence or
target motif with the data storage means to identify sequence fragments of
the Haemophilus influenzae Rd genome. In the present examples, software
which implements the BLAST and BLAZE algorithms
(Altschul et al., J. Mol. Biol. 215:403-410 (1990)) was used to identify
open reading frames within the Haemophilus influenzae Rd genome. A skilled
artisan can readily recognize that any one of the publicly available
homology search programs can be used as the search means for the
computer-based systems of the present invention.
One application of this embodiment is provided in FIG. 2. FIG. 2 provides a
block diagram of a computer system 102 that can be used to implement the
present invention. The computer system 102 includes a processor 106
connected to a bus 104. Also connected to the bus 104 are a main memory
108 (preferably implemented as random access memory, RAM) and a variety of
secondary storage devices 110, such as a hard drive 112 and a removable
medium storage device 114. The removable medium storage device 114 may
represent, for example, a floppy disk drive, a CD-ROM drive, a magnetic
tape drive, etc. A removable storage medium 116 (such as a floppy disk, a
compact disk, a magnetic tape, etc.) containing control logic and/or data
recorded therein may be inserted into the removable medium storage device
114. The computer system 102 includes appropriate software for reading the
control logic and/or the data from the removable medium storage device 114
once inserted in the removable medium storage device 114.
A nucleotide sequence of the present invention may be stored in a well
known manner in the main memory 108, any of the secondary storage devices
110, and/or a removable storage medium 116. Software for accessing and
processing the genomic sequence (such as search tools, comparing tools,
etc.) resides in main memory 108 during execution.
Biochemical Embodiments
Another embodiment of the present invention is directed to isolated
fragments of the Haemophilus influenzae Rd genome. The fragments of the
Haemophilus influenzae Rd genome of the present invention include, but are
not limited to fragments which encode peptides, hereinafter open reading
frames (ORFs), fragments which modulate the expression of an operably
linked ORF, hereinafter expression modulating fragments (EMFs), fragments
which mediate the uptake of a linked DNA fragment into a cell, hereinafter
uptake modulating fragments (UMFs), and fragments which can be used to
diagnose the presence of Haemophilus influenzae Rd in a sample,
hereinafter diagnostic fragments (DFs).
As used herein, an "isolated nucleic acid molecule" or an "isolated
fragment of the Haemophilus influenzae Rd genome" refers to a nucleic acid
molecule possessing a specific nucleotide sequence which has been
subjected to purification means to reduce, from the composition, the
number of compounds which are normally associated with the composition. A
variety of purification means can be used to generate the isolated
fragments of the present invention. These include, but are not limited to
methods which separate constituents of a solution based on charge,
solubility, or size.
In one embodiment, Haemophilus influenzae Rd DNA can be mechanically
sheared to produce fragments of 15-20 kb in length. These fragments can
then be used to generate a Haemophilus influenzae Rd library by inserting
them into lambda clones as described in the Examples below. Primers
flanking, for example, an ORF provided in Table 1(a) can then be generated
using nucleotide sequence information provided in SEQ ID NO:1. PCR cloning
can then be used to isolate the ORF from the lambda DNA library. PCR
cloning is well known in the art. Thus, given the availability of SEQ ID
NO:1, Table 1(a) and Table 2, it would be routine to isolate any ORF or
other nucleic acid fragment of the present invention.
The isolated nucleic acid molecules of the present invention include, but
are not limited to single stranded and double stranded DNA, and single
stranded RNA.
As used herein, an "open reading frame," ORF, means a series of triplets
coding for amino acids without any termination codons and is a sequence
translatable into protein. Tables 1(a), 1(b) and 2 identify ORFs in the
Haemophilus influenzae Rd genome. In particular, Table 1(a) indicates the
location of ORFs within the Haemophilus influenzae genome which encode the
recited protein based on homology matching with protein sequences from the
organism appearing in parentheticals (see the fourth column of Table
1(a)).
The first column of Table 1(a) provides the "GeneID" of a particular ORF.
This information is useful for two reasons. First, the complete map of the
Haemophilus influenzae Rd genome provided in FIGS. 6(A)-6(AN) refers to
the ORFs according to their GeneID numbers. Second, Table 1(b) uses the
GeneID numbers to indicate which ORFs were provided previously in a public
database.
The second and third columns in Table 1(a) indicate an ORF's position in the
nucleotide sequence provided in SEQ ID NO:1. One of ordinary skill will
recognize that ORFs may be oriented in opposite directions in the
Haemophilus influenzae genome. This is reflected in columns 2 and 3.
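One way of retrieving an ORF from its two coordinates can be sketched in
Python as follows. The sketch assumes that the coordinates are 1-based and
inclusive, and that a start coordinate greater than the stop coordinate
denotes an ORF on the opposite strand; this convention and the function
names are assumptions made for the example and should be checked against
the tables themselves.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_orf(genome: str, start: int, stop: int) -> str:
    """Return the ORF read 5' to 3' on its own coding strand.

    Assumed convention: 1-based inclusive coordinates; start <= stop means
    the ORF lies on the stored strand, start > stop means it lies on the
    opposite strand and must be reverse complemented.
    """
    low, high = min(start, stop), max(start, stop)
    if low < 1 or high > len(genome):
        raise ValueError("coordinates fall outside the sequence")
    segment = genome[low - 1:high]
    return segment.upper() if start <= stop else reverse_complement(segment)
```

With this convention the same function serves ORFs in either orientation,
and the returned sequence always begins with the first codon of the ORF.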
The fifth column of Table 1(a) indicates the percent identity of the
protein encoded for by an ORF to the corresponding protein from the
organism appearing in parentheticals in the fourth column.
The sixth column of Table 1(a) indicates the percent similarity of the
protein encoded for by an ORF to the corresponding protein from the
organism appearing in parentheticals in the fourth column. The concepts of
percent identity and percent similarity of two polypeptide sequences are
well understood in the art. For example, two polypeptides 10 amino acids
in length which differ at three amino acid positions (e.g., at positions
1, 3 and 5) are said to have a percent identity of 70%. However, the same
two polypeptides would be deemed to have a percent similarity of 80% if,
for example at position 5, the amino acid moieties, although not
identical, were "similar" (i.e., possessed similar biochemical
characteristics).
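The worked example above can be reproduced with a short Python sketch.
The two peptide sequences and the grouping of "similar" amino acids are
hypothetical and were chosen only so that the peptides differ at positions
1, 3 and 5 with a conservative difference at position 5; practical
similarity scoring typically relies on a substitution matrix instead.

```python
# Hypothetical groups of biochemically similar residues (one-letter codes).
SIMILAR_GROUPS = ("ILVM", "KRH", "DE", "ST", "FYW", "NQ")


def is_similar(x: str, y: str) -> bool:
    """True when two residues are identical or share a similarity group."""
    return x == y or any(x in group and y in group for group in SIMILAR_GROUPS)


def identity_and_similarity(a: str, b: str) -> tuple[float, float]:
    """Percent identity and percent similarity of two equal-length peptides."""
    if len(a) != len(b) or not a:
        raise ValueError("peptides must be non-empty and of equal length")
    n = len(a)
    identical = sum(x == y for x, y in zip(a, b))
    similar = sum(is_similar(x, y) for x, y in zip(a, b))
    return 100.0 * identical / n, 100.0 * similar / n


# Positions 1 (M/G) and 3 (T/W) differ and are not similar;
# position 5 (I/L) differs but both residues fall in the same group.
PEPTIDE_A = "MKTAIGSPWY"
PEPTIDE_B = "GKWALGSPWY"
```

Calling identity_and_similarity(PEPTIDE_A, PEPTIDE_B) yields 70% identity
and 80% similarity, matching the figures given in the text.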
The seventh column in Table 1(a) indicates the length of the amino acid
homology match.
Table 2 provides ORFs of the Haemophilus influenzae Rd genome which encode
polypeptide sequences which did not elicit a "homology match" with a known
protein sequence from another organism. Further details concerning the
algorithms and criteria used for homology searches are provided in the
Examples below.
A skilled artisan can readily identify ORFs in the Haemophilus influenzae
Rd genome other than those listed in Tables 1(a), 1(b) and 2, such as ORFs
which are overlapping or encoded by the opposite strand of an identified
ORF in addition to those ascertainable using the computer-based systems of
the present invention.
As used herein, an "expression modulating fragment," EMF, means a series of
nucleotide molecules which modulates the expression of an operably linked
ORF or EMF.
As used herein, a sequence is said to "modulate the expression of an
operably linked sequence" when the expression of the sequence is altered
by the presence of the EMF. EMFs include, but are not limited to,
promoters, and promoter modulating sequences (inducible elements). One
class of EMFs consists of fragments which induce the expression of an
operably linked ORF in response to a specific regulatory factor or
physiological event. Known EMFs from Haemophilus are described by Tomb et
al., Gene 104:1-10 (1991), and Chandler, M. S., Proc. Natl. Acad. Sci. USA
89:1626-1630 (1992).
EMF sequences can be identified within the Haemophilus influenzae Rd genome
by their proximity to the ORFs provided in Tables 1(a), 1(b) and 2. An
intergenic segment, or a fragment of the intergenic segment, from about 10
to 200 nucleotides in length, taken 5' from any one of the ORFs of Tables
1(a), 1(b), or 2 will modulate the expression of an operably linked 3' ORF
in a fashion similar to that found with the naturally linked ORF sequence.
As used herein, an "intergenic segment" refers to the fragments of the
Haemophilus genome which are between two ORF(s) herein described.
Alternatively, EMFs can be identified using known EMFs as a target
sequence or target motif in the computer-based systems of the present
invention.
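Retrieval of a candidate EMF region 5' of an ORF can be sketched in Python
as follows. The sketch assumes 1-based inclusive ORF coordinates with a
start greater than the stop denoting the opposite strand, which is an
assumption made for the example. It does not check for a neighboring ORF,
so the caller would still need to trim the result to the true intergenic
segment.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def upstream_segment(genome: str, orf_start: int, orf_stop: int,
                     length: int = 200) -> str:
    """Return up to `length` nucleotides immediately 5' of an ORF.

    The segment is returned 5' to 3' on the ORF's own strand and is
    truncated at the end of a linear sequence.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if not (1 <= orf_start <= len(genome) and 1 <= orf_stop <= len(genome)):
        raise ValueError("coordinates fall outside the sequence")
    if orf_start <= orf_stop:
        begin = max(orf_start - 1 - length, 0)
        return genome[begin:orf_start - 1].upper()
    # Opposite strand: the 5' flank lies at higher stored-strand coordinates.
    end = min(orf_start + length, len(genome))
    return reverse_complement(genome[orf_start:end])
```

A segment obtained in this way is only a candidate; its activity would be
confirmed experimentally, for example with a trap vector.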
The presence and activity of an EMF can be confirmed using an EMF trap
vector. An EMF trap vector contains a cloning site 5' to a marker
sequence. A marker sequence encodes an identifiable phenotype, such as
antibiotic resistance or a complementing nutrition auxotrophic factor,
which can be identified or assayed when the EMF trap vector is placed
within an appropriate host under appropriate conditions. As described
above, an EMF will modulate the expression of an operably linked marker
sequence. A more detailed discussion of various marker sequences is
provided below.
A sequence which is suspected as being an EMF is cloned in all three reading
frames in one or more restriction sites upstream from the marker sequence
in the EMF trap vector. The vector is then transformed into an appropriate
host using known procedures and the phenotype of the transformed host is
examined under appropriate conditions. As described above, an EMF will
modulate the expression of an operably linked marker sequence.
As used herein, an "uptake modulating fragment," UMF, means a series of
nucleotide molecules which mediate the uptake of a linked DNA fragment
into a cell. UMFs can be readily identified using known UMFs as a target
sequence or target motif with the computer-based systems described above.
The presence and activity of a UMF can be confirmed by attaching the
suspected UMF to a marker sequence. The resulting nucleic acid molecule is
then incubated with an appropriate host under appropriate conditions and
the uptake of the marker sequence is determined. As described above, a UMF
will increase the frequency of uptake of a linked marker sequence. A
review of DNA uptake in Haemophilus is provided by Goodgal, S. H., et
al., J. Bact. 172:5924-5928 (1990).
As used herein, a "diagnostic fragment," DF, means a series of nucleotide
molecules which selectively hybridize to Haemophilus influenzae sequences.
DFs can be readily identified by identifying unique sequences within the
Haemophilus influenzae Rd genome, or by generating and testing probes or
amplification primers consisting of the DF sequence in an appropriate
diagnostic format which determines amplification or hybridization
selectivity.
The sequences falling within the scope of the present invention are not
limited to the specific sequences herein described, but also include
allelic and species variations thereof. Allelic and species variations can
be routinely determined by comparing the sequence provided in SEQ ID NO:1,
a representative fragment thereof, or a nucleotide sequence at least 99.9%
identical to SEQ ID NO:1 with a sequence from another isolate of the same
species. Furthermore, to accommodate codon variability, the invention
includes nucleic acid molecules coding for the same amino acid sequences
as do the specific ORFs disclosed herein. In other words, in the coding
region of an ORF, substitution of one codon for another which encodes the
same amino acid is expressly contemplated.
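Codon substitution of this kind can be illustrated with a short Python
sketch that translates two different nucleotide sequences into the same
amino acid sequence. The standard genetic code is used; the two example
sequences are hypothetical and are not taken from SEQ ID NO:1.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon, keeping '*' for stops."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Hypothetical example: AAA/AAG both encode lysine, GCT/GCA both encode alanine.
ORIGINAL = "ATGAAAGCTTAA"
VARIANT = "ATGAAGGCATAA"
```

Here translate(ORIGINAL) and translate(VARIANT) return the same string,
although the two nucleotide sequences differ at two positions.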
Any specific sequence disclosed herein can be readily screened for errors
by resequencing a particular fragment, such as an ORF, in both directions
(i.e., sequence both strands). Alternatively, error screening can be
performed by sequencing corresponding polynucleotides of Haemophilus
influenzae origin isolated by using part or all of the fragments in
question as a probe or primer. Each of the ORFs of the Haemophilus
influenzae Rd genome disclosed in Tables 1(a), 1(b) and 2, and the EMF
found 5' to the ORF, can be used in numerous ways as polynucleotide
reagents. The sequences can be used as diagnostic probes or diagnostic
amplification primers to detect the presence of a specific microbe, such
as Haemophilus influenzae Rd, in a sample. This is especially the case
with the fragments or ORFs of Table 2, which will be highly selective for
Haemophilus influenzae.
In addition, the fragments of the present invention, as broadly described,
can be used to control gene expression through triple helix formation or
antisense DNA or RNA, both of which methods are based on the binding of a
polynucleotide sequence to DNA or RNA. Polynucleotides suitable for use in
these methods are usually 20 to 40 bases in length and are designed to be
complementary to a region of the gene involved in transcription (triple
helix--see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al.,
Science 241:456 (1988); and Dervan et al., Science 251:1360 (1991)) or to
the mRNA itself (antisense--Okano, J. Neurochem. 56:560 (1991);
Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC
Press, Boca Raton, Fla. (1988)). Triple helix formation optimally results
in a shut-off of RNA transcription from DNA, while antisense RNA
hybridization blocks translation of an mRNA molecule into polypeptide.
Both techniques have been demonstrated to be effective in model systems.
Information contained in the sequences of the present invention is
necessary for the design of an antisense or triple helix oligonucleotide.
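The sequence-dependent step of antisense design can be sketched in Python
as follows: the oligonucleotide is the reverse complement of a chosen
region of the coding (mRNA-sense) sequence. The sketch covers only that
step and enforces the 20 to 40 base length range mentioned above; the
function names and the 0-based offset are assumptions, and choosing an
effective target region requires considerations not modeled here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def antisense_oligo(sense_dna: str, offset: int, length: int = 20) -> str:
    """Return a DNA oligonucleotide complementary to part of the sense strand.

    `offset` is the 0-based start of the target region within `sense_dna`,
    which is written 5' to 3' like the mRNA (with T in place of U).
    The result is also written 5' to 3'.
    """
    if not 20 <= length <= 40:
        raise ValueError("length must be between 20 and 40 bases")
    if offset < 0 or offset + length > len(sense_dna):
        raise ValueError("target region falls outside the sequence")
    return reverse_complement(sense_dna[offset:offset + length])
```

Because the oligonucleotide is the exact reverse complement of the target
region, it pairs with that region base for base under the usual
Watson-Crick rules.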
The present invention further provides recombinant constructs comprising
one or more fragments of the Haemophilus influenzae Rd genome of the
present invention. The recombinant constructs of the present invention
comprise a vector, such as a plasmid or viral vector, into which a
fragment of the Haemophilus influenzae Rd has been inserted, in a forward
or reverse orientation. In the case of a vector comprising one of the ORFs
of the present invention, the vector may further comprise regulatory
sequences, including for example, a promoter, operably linked to the ORF.
For vectors comprising the EMFs and UMFs of the present invention, the
vector may further comprise a marker sequence or heterologous ORF operably
linked to the EMF or UMF. Large numbers of suitable vectors and promoters
are known to those of skill in the art and are commercially available for
generating the recombinant constructs of the present invention. The
following vectors are provided by way of example. Bacterial: pBs,
phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a,
pNH46a (Stratagene); pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5
(Pharmacia). Eukaryotic: pWLneo, pSV2cat, pOG44, pXT1, pSG (Stratagene);
pSVK3, pBPV, pMSG, pSVL (Pharmacia).
Promoter regions can be selected from any desired gene using CAT
(chloramphenicol acetyltransferase) vectors or other vectors with selectable
markers. Two appropriate vectors are pKK232-8 and pCM7. Particular named
bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda P.sub.R, and
trc. Eukaryotic