Title: System and method for developing and interpreting e-commerce metrics by utilizing a list of rules wherein each rule contain at least one of entity-specific criteria
Abstract: A system, method and computer program product for developing and interpreting e-commerce metrics is disclosed. The method involves collecting pages that are commonly transmitted over a computer network (e.g., the Internet, an institutional intranet, etc.), where the pages are relevant to the business operations of an entity, collecting external data, which may or may not be available on the computer network, but that is highly relevant to the entity, processing the collected pages with additional information such as contact information, routing tables, financial information, and other data which does not need to be collected more than once, and scoring the pages based on all the information collected to determine statistics. The statistics are analyzed for business information which may be important to the operations of the entity. The method then produces a report to deliver a continuous stream of e-commerce intelligence for the entity.
Patent Number: 7,013,323 Issued on 03/14/2006 to Thomas, et al.
Inventors:
Thomas; Jason B. (Arlington, VA);
Bildner; Mark J. (Alexandria, VA);
Thomas; Brandy M. (Arlington, VA);
Young; Christopher D. (Washington, DC);
Moore; Richard P. (Potomac Falls, VA);
Biro; Ross A. (Alexandria, VA);
Pemberton; Alissa S. (Washington, DC);
Perlman; Diane B. (Silver Spring, MD)
Assignee: Cyveillance, Inc. (Arlington, VA)
Appl. No.: 576896
Filed: May 23, 2000
Current U.S. Class: 709/203; 709/224; 709/219; 707/4; 707/5; 707/6; 707/7; 707/10
Current Intern'l Class: G06F 15/16 (20060101)
Field of Search: 709/218,223,224,203,217,219; 707/1,3-7,10
References Cited
U.S. Patent Documents
| 5659732 | Aug., 1997 | Kirsch.
| 5931907 | Aug., 1999 | Davies et al.
| 5933822 | Aug., 1999 | Braden-Harder et al.
| 5963965 | Oct., 1999 | Vogel.
| 6289341 | Sep., 2001 | Barney.
| 6321228 | Nov., 2001 | Crandall et al.
| 6377961 | Apr., 2002 | Ryu.
| 6442606 | Aug., 2002 | Subbaroyan et al.
| 6480835 | Nov., 2002 | Light.
| 6480837 | Nov., 2002 | Dutta.
| 6519586 | Feb., 2003 | Anick et al.
| 2001/0044795 | Nov., 2001 | Cohen et al.
| 2002/0147880 | Oct., 2002 | Wang Baldonado.
| 2002/0169694 | Nov., 2002 | Stone et al.
| 2003/0149684 | Aug., 2003 | Brown et al.
Primary Examiner: Najjar; Saleh
Assistant Examiner: Duong; Oanh
Attorney, Agent or Firm: DLA Piper Rudnick Gray Cary US LLP
Claims
What is claimed is:
1. A method for developing and interpreting e-commerce metrics of an entity,
comprising the steps of:
(1) collecting pages that are commonly transmitted over a computer network;
(2) receiving a list of predetermined, entity-specific criteria defining information
relevant to the entity;
(3) receiving a first set of rules related to entity-specific criteria defining
information relevant to an entity;
(4) determining whether each of said pages satisfies each of said first set of
rules therefore obtaining a first subset of said pages;
(5) parsing content of said first subset of said pages using a second set of
rules inclusive of said first set and adding rules related to searching for at
least one key word in at least one predetermined category of key words, thereby
obtaining a second subset of said pages;
(6) scoring said second subset of said pages utilizing a third set of rules incorporating
analyzed statistics based on said first and said second set of rules and incorporating
additional information; and
(7) generating a report utilizing a fourth set of rules prioritizing results
of said second and third set of rules, including said analyzed statistics and said
additional information;
such that said report is utilized to aid an entity in doing business over said
computer network.
2. The method of claim 1, wherein said computer network is the global Internet.
3. The method of claim 1, wherein said computer network is an intranet.
4. The method of claim 1, wherein said computer network is an extranet.
5. The method of claim 2, further comprising the steps of:
(8) obtaining contact information for said report.
6. The method of claim 2, further comprising the step of:
(8) generating said report listing scores.
7. The method of claim 1, wherein step (6) comprises the steps of:
(a) compiling statistics from said second subset of said pages;
(b) storing said statistics; and
(c) analyzing said statistics by combining said statistics, said second subset
of said pages and said additional information.
8. The method of claim 7, further comprising the steps of:
performing steps (a)-(c) for a plurality of entities.
9. A system for developing and interpreting e-commerce metrics of an entity, comprising:
a downloader for searching a computer network, wherein said computer network
contains pages;
a page processing module for receiving said pages downloaded from said search
of said computer network, said page processing module utilizing a first set of
rules related to entity-specific criteria defining information relevant to an entity
and forming a first subset of pages;
an archive for storing said first subset of said pages, said pages being downloaded
to said archive by said page processing module; and
a database for allowing said page processing module to perform queries of said
pages from said first subset of said pages, stored on said archive, in order to
produce a report, said report comprising:
parsed content of said pages generated utilizing a second set of rules inclusive
of said first set and adding rules related to searching for at least one key word,
whereby the parsed content is parsed with at least one predetermined category;
scored pages generated utilizing a third set of rules incorporating analyzed
statistics based on said first and said second set of rules and incorporating additional
information; and
pages prioritized utilizing a fourth set of rules prioritizing contents of said
report utilizing results of said second and said third set of rules including said
analyzed statistics and said additional information;
such that said report is utilized to aid an entity in doing business over said
computer network.
10. The system of claim 9, wherein said computer network is the global Internet.
11. The system of claim 9, wherein said computer network is an intranet.
12. The system of claim 9, wherein said computer network is an extranet.
13. The system of claim 10, further comprising:
a plurality of Web clients that provides a graphical user interface for a user
to enter search criteria and communicate with said downloader, thereby controlling
said page processing module.
14. A computer program product comprising a computer usable medium having computer
readable program code means embodied in said medium for causing an application
program to execute on a computer that develops and interprets e-commerce metrics
of an entity, said computer readable program code means comprising:
first computer readable program code means for causing the computer to collect
pages that are commonly transmitted over a computer network;
second computer readable program code means for causing the computer to receive
a first set of rules related to entity specific criteria defining information relevant
to the entity;
third computer readable program code means for causing the computer to determine
whether each of said pages satisfies each of said first set of rules therefore
obtaining a first subset of said pages,
fourth computer readable program code means for parsing content of said first
subset of said pages using a second set of rules inclusive of the first set and
adding rules related to searching for at least one key word in at least one predetermined
category of key words, thereby obtaining a second subset of said pages;
fifth computer readable program code means for scoring said second subset of
said pages utilizing a third set of rules incorporating analyzed statistics based
on the first and the second set of rules and incorporating additional information; and
sixth computer readable program code means for causing the computer to generate
a report utilizing a fourth set of rules;
contents of the report utilizing results of the second and third set of rules
including the analyzed statistics and said additional information;
such that said report is utilized to aid an entity in doing business over said
computer network.
15. The computer program product of claim 14, wherein said computer network is
the global Internet.
16. The computer program product of claim 15, further comprising:
seventh computer readable program code means for causing the computer to obtain
contact information for said report.
17. The computer program product of claim 15, further comprising:
seventh computer readable program code means for causing the computer to generate
said report listing said scores of said subset of pages.
18. The computer program product of claim 17, wherein said fifth computer readable
program code means comprises:
seventh computer readable program code means for causing the computer to compile
statistics from said pages;
eighth computer readable program code means for causing the computer to store
said statistics; and
ninth computer readable program code means for causing computer to analyze said
statistics by combining said statistics, said pages and said additional information.
19. The computer program product of claim 18, further comprising tenth computer
readable program code means for causing the computer to perform the seventh, eighth,
and ninth computer readable program code for a plurality of entities.
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application is related to the following commonly owned, co-pending applications:
- "System, Method and Computer Program Product for an Online Monitoring
Search Engine", by Thomas, having application Ser. No. 09/133,374, filed on Aug.
13, 1998, which is incorporated herein by reference in its entirety; and
- "System, Method and Computer Program Product for Analyzing E-Commerce
Competition", by Thomas et al., having application Ser. No. 09/576,895,
filed concurrently herewith, which is incorporated herein by reference in its entirety.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to computer network search engines, and more
particularly to search engines for performing online monitoring activities.
2. Related Art
Over the past several years, there has been a large growth in the number of
computers, and thus people, connected to the global Internet and the World-Wide
Web (WWW). This collective expansion allows computer users to access various types
of information, disseminate information, and be exposed to electronic commerce
(e-commerce) activities, all with a great degree of freedom. E-commerce includes
large corporations, small businesses, individual entrepreneurs, organizations,
and the like who offer their information, products, and/or services to people all
over the world via the Internet.
The rise in use of the Internet, however, also has a negative side. Given the
Internet's vastness and freedom, many unscrupulous companies, organizations and
individuals have taken the opportunity to profit by diverting customer traffic,
misusing product information, and mis-associating their product or company with
others. For example, it has been estimated that millions of pages employ tags and
text designed to divert searchers to their sites when the Internet users actually
searched for something else. These diversions and incidents of misinformation cause
a loss of business. Also, an individual, company, organization, or the like may
be concerned with other violations such as the illegal sale of their products,
or the sale of inferior products using their brand names. Furthermore, an individual,
a company, an organization, or the like may be concerned with false information
(i.e., "rumors") that originate and spread quickly over the Internet, resulting
in the disparagement of the entity. Such entities may also be interested in gathering
data about how they and their products and/or services are perceived on the Internet
(i.e., a form of market research).
In order to compete with the above-described aspects of the Internet, entities
are currently forced to search Internet resources (e.g., Web sites, File Transfer
Protocol (FTP) sites, newsgroups, chat rooms, etc.), visiting thousands
of sites in order to discern activities relevant to their business operations.
Such searching is currently done either by hand or using commercial search engines.
Each of these methods is costly because a great amount of time is required to do
such searching, time that detracts from positive, profit-earning activities.
Adding to the frustration of discerning relevant activity is the fact that commercial
search engines are updated infrequently and typically limit the resulting number
of sites (i.e., "hits") that any given search request returns. Furthermore, the
task of visiting each site to determine whether there is indeed relevant activity
and if so, the extent and character of it, also demands a great deal of time.
Therefore, in view of the above, what is needed is a system, method and
computer program product for developing and interpreting e-commerce metrics. Such
e-commerce metrics can provide relevant market information and feedback to an entity
so that it may detect and prioritize its online business efforts. Further, what
is needed is a system, method and computer program product that searches the Internet's
vast resources for data relevant to the entity's activities and its associates
and produces a detailed, customized report of relevant activity affecting the entity.
SUMMARY OF THE INVENTION
The invention is directed to a system, method and computer program product for
developing and interpreting e-commerce metrics that meets the identified needs.
The method and computer program product involve collecting documents that are commonly
transmitted over a computer network (e.g., the Internet, an institutional intranet,
etc.), where the documents are relevant to the business operations of an entity.
The method and computer program product also collect external data, which may or
may not be available on the computer network, but that is highly relevant to the
entity. A list of predetermined, entity-specific criteria is obtained from the
external data. A list of rules is generated, where each rule contains at least
one of the entity-specific criteria. The method and computer program product determine
whether any of the collected pages satisfies any of the listed rules. Matching
pages are gathered into a subset for further processing. Additional information
is added to the subset of pages. The additional information can be contact information,
routing tables, financial information, and other data which does not need to be
collected more than once.
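By way of illustration only, the rule-matching step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `Rule` class, the `select_pages` function, the sample pages, and the choice that a rule is satisfied when all of its terms occur in the page text are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a name plus at least one entity-specific criterion."""
    name: str
    criteria: tuple  # entity-specific terms, e.g. brand or product names

    def matches(self, page_text: str) -> bool:
        # Assumption: a rule is satisfied when every term occurs in the page.
        text = page_text.lower()
        return all(term.lower() in text for term in self.criteria)


def select_pages(pages: dict, rules: list) -> dict:
    """Return {url: [names of satisfied rules]} for pages satisfying any rule."""
    subset = {}
    for url, text in pages.items():
        hits = [rule.name for rule in rules if rule.matches(text)]
        if hits:
            subset[url] = hits
    return subset


# Made-up sample data for illustration.
pages = {
    "http://example.com/a": "Buy genuine Acme widgets here",
    "http://example.com/b": "Weather report for Tuesday",
    "http://example.com/c": "Acme is our official partner",
}
rules = [
    Rule("brand-sale", ("acme", "buy")),
    Rule("affiliation", ("acme", "partner")),
]
subset = select_pages(pages, rules)
```

In this sketch the matching pages and the names of the rules they satisfy are kept together, so that later processing can attach additional information to the subset without re-evaluating the rules.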
The method and computer program product score the pages based on all the information
collected to determine statistics. The statistics are analyzed for business information
which may be important to the operations of the entity. The method and computer
program product then produce a report to deliver a continuous stream of e-commerce
intelligence for the entity. Depending on the entity-specific criteria, the method
and computer program product can determine and report whether others are diverting
the entity's buyers or computer network traffic by using metatags and other browser
magnets; selling or distributing the entity's goods without authorization; using
or misusing the entity's intellectual property; claiming false affiliations with
the entity; associating the entity with objectionable material, such as hate sites
or other rogue sites, or with pornographic content; or engaging in other relevant
activity affecting the entity or its goodwill. The method and computer program
product can also be used to help identify potential partners, affiliates and other
sources of unrealized revenue and to identify newsgroup commentary that may be
impacting the entity's reputation and/or value.
The e-commerce metrics system of the invention includes a downloader for searching
a computer network (e.g., the Internet), a page processing module for receiving
the pages downloaded from the search of the computer network, the page processing
module forming a list of pages. In one embodiment, the system contains numerous
downloaders for searching the entire computer network, searching specific locations,
and searching specific formats (e.g., newsgroups or chat sites). The system also
contains an archive for storing the listed pages, the pages being downloaded to
the archive by the page processing module, and a database for allowing the page
processing module to perform higher order operations on the pages on the list in
order to produce a report to be utilized by users of the system. Entities use the
system to search for information about themselves or other entities. In one embodiment,
the system also includes a plurality of Internet clients (e.g., Web, e-mail, Wireless
Application Protocol (WAP), etc.) that provide a graphical user interface (GUI)
for users to enter search criteria, communicate with the downloader and page processing
module, and view pages with scoring information, entity statistics, and page contents.
One advantage of the invention is that users may quickly and efficiently search
and find relevant information contained on Web, FTP, and File Service Protocol
(FSP) sites, as well as chat rooms and newsgroups within the Internet.
Another advantage of the invention is that detailed and customizable reports
listing overall statistics and associated metrics are produced allowing entities
to focus their business efforts.
Another advantage of the invention is that its back-end (page processing
module) and front-end (user interface) are designed to operate independently of
each other, thus allowing greater throughput and availability of the system as
a whole.
Yet another advantage of the invention is that lists of relevant pages may be
grouped and prioritized, both in an automated and manual fashion, in order to arrive
at a manageable set of data.
Further features and advantages of the invention as well as the structure
and operation of various embodiments of the invention are described in detail
below with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE FIGURES
The accompanying drawings, which are incorporated herein and form a part of the
specification, illustrate the invention and, together with the description, further
serve to explain the principles of the invention and to enable a person skilled
in the pertinent art(s) to make and use the invention.
In the drawings:
FIG. 1A is a block diagram illustrating the system architecture of an embodiment
of the invention, showing network connectivity among the various components;
FIG. 1B is a block diagram illustrating the global Internet, showing the different
components which may be present;
FIG. 2 is a block diagram illustrating the software architecture of an embodiment
of the invention, showing communications among the various components;
FIG. 3 is a flowchart showing the overall operation of an embodiment of the invention;
FIG. 4 is a block diagram illustrating the software architecture of a page processing
module according to an embodiment of the invention;
FIG. 5 is a flowchart showing the operation of scoring pages, according to an
embodiment of the invention;
FIGS. 6, 7, 8A and 8B are exemplary scoring input pages
according to an embodiment of the invention;
FIGS. 9 and 10 are exemplary output report pages according to an embodiment
of the invention; and
FIG. 11 is a block diagram of an exemplary computer system useful for implementing
the invention.
The invention will now be described with reference to the accompanying drawings.
In the drawings, like reference numbers indicate identical or functionally similar
elements. Additionally, the left-most digit(s) of a reference number identifies
the drawing in which the reference number first appears.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Table of Contents
I. Overview
II. System Architecture
III. Software Architecture
IV. Overall E-Commerce Metrics System Operation
V. Graphical User Interface (Front-End)
VI. Page Processing Module (Back-End)
VII. Output Reports
VIII. Front-End and Back-End Severability
IX. Environment
X. Conclusion
I. Overview
The present invention is directed to a system, method, and computer program product
for developing and interpreting e-commerce metrics. In one embodiment of the invention,
users are entities who are interested in maximizing their return on investment
and e-commerce objectives with a continuous stream of relevant market feedback
from the Internet. Such entities can employ an intelligent search engine that spans
the entirety of the Internet's vast resources and returns links to Internet sites
that, with a high probability of certainty, contain relevant information affecting
the entity. The input of the system's search engine can be customized for each
entity based on, for example, their products, services, business activity, and/or
the types of intellectual property owned. The system's search engine can also provide
detailed reports, customized to fit each entity's monitoring needs, so that the
entity's personnel may prioritize their activities. In one embodiment, the system
also provides a Web server so that entities may remotely utilize the search engine.
While the invention is described in terms of the above example, this is for
convenience only and is not intended to limit its application. In fact, after reading
the following description, it will be apparent to one skilled in the relevant art(s)
how to implement the following invention in alternative embodiments (e.g., providing
online monitoring for a corporate intranet or extranet).
Furthermore, while the following description focuses on the monitoring
of Web sites, newsgroups, and FTP sites, and thus employs such terms as Universal
Resource Locators (URLs), address, Web pages, and content, it is not intended to
limit the application of the invention. It will be apparent to one skilled in the
relevant art(s) based on the teachings contained herein how to implement the following
invention, where appropriate, in alternative embodiments. For example, the invention
may be applied to monitoring chat rooms, forums, or mailing lists, etc.
II. System Architecture
Referring to FIG. 1A, a block diagram illustrating the physical architecture
of an e-commerce metrics system 100, according to an embodiment of the invention,
showing the network connectivity among the various components, is shown. It should
be understood that the particular e-commerce metrics system 100 in FIG.
1A is shown for illustrative purposes only and does not limit the invention. As
will be apparent to one skilled in the relevant art(s) based at least on the teachings
described herein, all of the components "inside" (not shown) of the e-commerce metrics
system 100 are connected directly or via computer network 103.
The e-commerce metrics system 100 includes a Web downloader 108
and news downloader 109. These downloaders are configured according to the
nature of the pages that they search. The system includes a page processing module
110 that serves as the "back-end" of the invention. Page processing module
110 connects to the downloaders 108 and 109 to receive downloaded
pages. Connected to the page processing module 110, is a database 120
and an archive 115. Page processing module 110 performs various counting
and scoring operations on the downloaded pages and forwards the resulting metadata
to database 120. Metadata includes various high order results from processing
the data contained on collected pages, for example, the total number of pages containing
links to a certain Web site and/or the average number of external links
on each Web page of a Web site. Complete copies of the pages are stored on archive 115.
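The two example metadata values named above can be computed roughly as shown in the following sketch. It is illustrative only and assumes pages are held as HTML strings keyed by URL; the helper names (`LinkCollector`, `pages_linking_to`, `average_external_links`) and the sample pages are hypothetical, and relative links are treated as internal.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def host(url):
    # Relative links have no hostname; they are reported as "".
    return urlsplit(url).hostname or ""


def pages_linking_to(pages, target_host):
    """Total number of pages containing at least one link to target_host."""
    return sum(
        1 for html in pages.values()
        if any(host(link) == target_host for link in extract_links(html))
    )


def average_external_links(pages, site_host):
    """Average number of external links per page of the given site."""
    site_pages = [html for url, html in pages.items() if host(url) == site_host]
    if not site_pages:
        return 0.0
    external = sum(
        1 for html in site_pages
        for link in extract_links(html)
        if host(link) not in ("", site_host)
    )
    return external / len(site_pages)


# Made-up sample pages for illustration.
pages = {
    "http://shop.example/index.html":
        '<a href="http://brand.example/">Brand</a> <a href="/about">About</a>',
    "http://shop.example/deals.html":
        '<a href="http://brand.example/x">X</a> <a href="http://other.example/">O</a>',
    "http://blog.example/post": "<p>No links here</p>",
}
inbound = pages_linking_to(pages, "brand.example")
avg_external = average_external_links(pages, "shop.example")
```

Only the resulting numbers would need to be forwarded to the database in such an arrangement; the full page contents stay in the archive.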
In one embodiment of the invention, directed page processing module 150
gathers pages from specific locations on computer network 103. Thus, directed
page processing module 150 contains control logic similar to downloaders
108, 109 and page processing module 110, but only as necessary
for limited (specific) page retrieval. Directed page processing module 150
forwards these pages to archive 115 after processing the information on
the downloaded pages. Similarly, metadata generated from the downloaded pages is
sent to database 120.
Client/analyst Web server 125 provides clients 140 and
analysts 130 with access to the metadata stored in database 120 and
the pages stored in archive 115. Analysts 130 are users of the invention
who can review the metadata and pages and alter the focus of the searches conducted
by the downloaders 108, 109, and directed page processing module
150. This feedback measure allows the invention to fully cover areas of
the computer network 103 which contain desired information. Client Web server
135 is connected to archive 115. Client Web server 135 provides
clients 140 with access to the stored pages used to develop metadata, which
forms the basis for conclusions arrived at by the scoring processes
of the invention.
As is well-known in the relevant art(s), a Web server is a server process running
at a Web site which sends out Web pages in response to Hypertext Transfer Protocol
(HTTP) requests from remote browsers. The Web servers 125 and 135
serve as "front ends" of the invention. That is, the Web servers 125 and
135 provide the graphical user interface (GUI) to users of the e-commerce
metrics system 100 in the form of Web pages. Such users may access Web servers
125 and 135 either directly or via a connection to computer network
103 (e.g., the Internet).
While only one database 120, archive 115, page processing module
110, and directed page processing module 150 are shown in FIG. 1A,
it will be apparent to one skilled in the relevant art(s) that e-commerce metrics
system 100 may be run in a distributed fashion over a plurality of the above-mentioned
network elements connected via computer network 103. For example, both the
page processing module 110 "back-end" application and the Web servers 125
and 135 "front-end" may be distributed over several computers thereby increasing
the overall execution speed and/or reliability of the e-commerce metrics system
100. More detailed descriptions of the e-commerce metrics system 100
components, as well as their functionality, are provided below.
Referring to FIG. 1B, the global Internet, depicted by computer network
103, is shown. It includes a plurality of FTP sites 104 (shown as sites
104a-n) and the WWW. Within the WWW are a plurality of Web
sites 106 (shown as sites 106a-n). The search space for the
page processing module 110 includes the Web sites 106 and the plurality
of FTP sites 104. Within the Usenet are a plurality of newsgroups 105.
As mentioned above, it will be apparent to one skilled in the relevant art(s),
that the search space (i.e., computer network 103) of the e-commerce metrics
system 100, although not shown, will also include chat rooms, mailing lists,
FSP sites, etc.
As will be apparent to one skilled in the relevant art(s), audio-visual content
can be parsed for analysis by using technologies such as optical character recognition
(OCR) and/or watermark technologies.
III. Software Architecture
Referring to FIG. 2, a block diagram illustrating a software architecture
200 according to an embodiment of e-commerce metrics system 100,
showing communications among the various components, is shown. The software architecture
200 of e-commerce metrics system 100 includes software code that
implements the page processing module 110 and directed page processing module
150 (hereinafter "processing modules 201") in a high level programming
language such as the C++ programming language. Further, in an embodiment, the processing
modules 201 software code is an application running on an IBM™ (or
compatible) personal computer (PC) in the Windows NT™ operating system environment.
In one embodiment of the invention, the database 120 is implemented using
a high-end relational database product (e.g., Microsoft™ SQL Server, IBM™
DB2, ORACLE™, INGRES™, etc.). As is well-known in the relevant art(s),
relational databases allow the definition of data structures, storage and retrieval
operations, and integrity constraints, where data and relations between them are
organized in tables.
In one embodiment of the invention, the processing modules 201 application
communicates with the database 120 using the Open Database Connectivity
(ODBC) interface. As is well-known in the relevant art(s), ODBC is a standard for
accessing different database systems from high level programming language applications.
It enables these applications to submit statements to ODBC using an ODBC structured
query language (SQL) and then translates these to the particular SQL commands the
underlying database product employs.
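As a rough illustration of how page metadata might be held in relational tables with an integrity constraint and retrieved with SQL, consider the sketch below. It makes several assumptions: Python's built-in `sqlite3` module is used as a stand-in for the ODBC connection and high-end database product named above, and the `page_metadata` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for database 120.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page_metadata (
        url    TEXT PRIMARY KEY,
        entity TEXT NOT NULL,
        score  INTEGER NOT NULL CHECK (score >= 0)  -- integrity constraint
    )
    """
)

# Made-up rows for illustration.
rows = [
    ("http://example.com/a", "Acme", 7),
    ("http://example.com/c", "Acme", 3),
    ("http://example.org/z", "Other", 5),
]
conn.executemany(
    "INSERT INTO page_metadata (url, entity, score) VALUES (?, ?, ?)", rows
)
conn.commit()


def top_pages(connection, entity, limit=10):
    """Return (url, score) pairs for one entity, highest score first."""
    cursor = connection.execute(
        "SELECT url, score FROM page_metadata "
        "WHERE entity = ? ORDER BY score DESC LIMIT ?",
        (entity, limit),
    )
    return cursor.fetchall()


acme_pages = top_pages(conn, "Acme")
```

Parameterized statements are used here so that the same query text can be submitted for any entity; with an ODBC binding the statements would pass through the ODBC layer instead of going directly to the database engine.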
The archive 115, in one embodiment of the invention, is any physical memory
device that includes a storage media and a cache (e.g., the hard drive and primary
cache, respectively, of the same PC that runs the page processing module 110
application). In an alternative embodiment, the archive 115 may be a memory
device external to the PC hosting the processing modules 201 application.
In yet another alternative embodiment, the archive 115 may encompass a storage
media physically separate from the cache, where the storage media may also be distributed
over several elements within or connected to the computer network. Further, in one
embodiment of the invention, the archive 115 communicates with the processing
modules 201 application and Web servers 125 and 135 using
the operating system's native file commands (e.g., Windows NT™).
The Web servers 125 and 135 provide the GUI "front-end" for e-commerce
metrics system 100. In one embodiment of the invention, they are implemented
using the Active Server Pages (ASP), Visual Basic (VB) script, Extensible Mark-up
Language (XML), and JavaScript™ server-side scripting environments that allow
the creation of dynamic Web pages. The Web servers 125 and 135 communicate
with the plurality of clients 140 and analysts 130 (hereinafter,
collectively shown as "users 202") using HTTP. The users 202 employ
a browser (or other GUI) using Java, JavaScript™, and Dynamic Hypertext
Markup Language (DHTML). In one embodiment, users can connect to e-commerce metrics
system 100 via a WAP phone or facsimile machine. In an embodiment of the
invention, as will be described in detail below in Section VIII, users 202
may also communicate directly with the processing modules 201 application
via HTTP.
IV. Overall E-Commerce Metrics System Operation
Referring to FIG. 3, a flowchart 300 showing the overall operation
of the e-commerce metrics system 100, according to an embodiment of the
invention, is shown. Flowchart 300 begins at step 302 with control
passing immediately to both steps 304 and 310. Step 304 takes
place in the directed page processing module 150. In step 304, a
user defines a search criteria. The search criteria, as explained in detail below
in Section V, are customized according to a particular user's concerns. In step
306, a search of the computer network 103 is performed. This search
returns a list of probable uniform resource locators (URL's). As is well-known
in the relevant art(s), a URL is the standard for specifying the location of an
object on the computer network 103. The URL standard addressing scheme is
specified as "protocol://hostname" (e.g., "http://www.a_company.com", "ftp://organization/pub/files"
or "news:alt.topic"). A URL beginning with "http" specifies a Web site 106,
a URL beginning with "ftp" specifies an FTP site 104, and a URL beginning
with "nntp" specifies a newsgroup. The probable URL's indicate a first (preliminary)
set of locations (i.e., addresses) on the computer network 103, based on
the search criteria, where pages containing information relevant to the entity's
operations may be found. The search in step 306 is described in detail
below in Section V.
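The mapping from URL prefix to type of location described above can be sketched as follows. This is only a minimal illustration, not the implementation of the invention; the function name classify_url and the table SCHEME_KINDS are hypothetical, and treating the "news" prefix the same as "nntp" is an assumption made so that the "news:alt.topic" example is covered.

```python
from urllib.parse import urlsplit

# Hypothetical table: URL scheme -> kind of location on the computer network.
# Treating "news" like "nntp" is an assumption for the "news:alt.topic" example.
SCHEME_KINDS = {
    "http": "web site",
    "ftp": "ftp site",
    "nntp": "newsgroup",
    "news": "newsgroup",
}


def classify_url(url: str) -> str:
    """Return the kind of location a probable URL points to, or "unknown"."""
    scheme = urlsplit(url).scheme.lower()
    return SCHEME_KINDS.get(scheme, "unknown")
```

For example, classify_url("ftp://organization/pub/files") falls into the "ftp site" group, so a downloader could route it to the FTP crawling steps rather than the Web crawling steps.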
A separate process is also initiated from step 302. From step 302
control also immediately passes to step 310 in downloaders 108 and
109. The page searching and retrieval process is substantially similar to
that in steps 306-308. Step 310, however, does not work from a
predetermined list of locations or addresses on computer network 103. Downloaders
108 and 109 download everything available on computer network 103.
In step 312, the retrieved pages are filtered for information that is minimally
relevant for users 202. Minimally relevant pages are downloaded to page
processing module 110 in step 314.
In steps 308 and 314, each of the URLs is visited and the contents
downloaded locally to processing modules 201. The aim of the download steps
308 and 314 is to allow subsequent processing steps of the e-commerce
metrics system 100 to be performed on preserved copies of the visited URL's.
This eliminates the need for re-visiting (and thus, re-establishing a connection
to) each of the Web sites 106, FTP sites 104, etc. specified
by the URLs, thus increasing the overall performance of the e-commerce metrics
system 100.
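The benefit of working from preserved copies can be illustrated with a small sketch. It assumes a caller-supplied fetch function (hypothetical, standing in for whatever download mechanism is used) and keeps the copies in memory; the class name LocalPageStore is likewise an illustrative assumption, not a component named in the specification.

```python
import hashlib
from typing import Callable, Dict


class LocalPageStore:
    """Keeps one preserved copy per URL so later steps never re-visit the site."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                 # hypothetical download function
        self._pages: Dict[str, bytes] = {}  # key: hash of URL, value: contents

    def get(self, url: str) -> bytes:
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        if key not in self._pages:
            # First request: visit the URL once and preserve the contents.
            self._pages[key] = self._fetch(url)
        # Later requests (scoring, archiving, grouping) use the local copy.
        return self._pages[key]
```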
If any of the URLs within the preliminary set contains files, those files may
contain potentially relevant material (e.g., a "*.mp3" music file, or a "*.gif"
or "*.jpg" image file). This is in contrast to actual text located on a Web page
of a particular Web site 106. The files may be located: (1) on a different
Web site 106 accessible via a hyperlink on the Web page the e-commerce metrics
system 100 is currently accessing; (2) on a different Web page of the same
Web site 106 the e-commerce metrics system 100 is currently accessing;
or (3) in a different directory of the FTP site 104 than the e-commerce
metrics system 100 is currently accessing. In these instances, the e-commerce
metrics system 100 employs a Web crawling technique in order to locate the files.
The Web crawling technique of the present invention discussed herein includes
the use of URL address variations. After the original URL is visited and the link
to the file is identified, the e-commerce metrics system 100 truncates the
link URL at the rightmost slash ("/"), thus generating a new link URL. This process
is repeated until a reachable domain is generated. This technique takes advantage
of the fact that most designers of Web sites 106 allow "default" documents
to be returned by their Web servers in response to such URL (via HTTP) requests.
An example of the directed page processing module 150 and downloaders 108
and 109 Web crawling technique is shown in Table 1 below.
| TABLE 1 |
| EXAMPLE OF WEB CRAWLING TECHNIQUE |
| http://www.links-to-interesting-files-all-over-the-net.com |
| Interesting Links Found on the Original Web Page Identified by Search Criteria: |
| http://www.really-good-music-not-yet-released.com/future-hit.mp3 |
| ftp://www.company-trades-secrets.com/july/tradeseceret.doc |
| http://www.really-good-music-not-yet-released.com/ |
| ftp://www.company-trades-secrets.com/july/ |
| ftp://www.company-trades-secrets.com/ |
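The URL address variation technique (truncating the link URL at the rightmost slash and repeating) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name url_variations is hypothetical, and the sketch simply lists every truncated candidate down to the host rather than testing which of them is reachable.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def url_variations(link_url: str) -> List[str]:
    """List the URLs obtained by repeatedly truncating at the rightmost slash."""
    parts = urlsplit(link_url)
    path = parts.path
    variations: List[str] = []
    while path not in ("", "/"):
        # Ignore a trailing slash so each pass removes one path segment.
        trimmed = path.rstrip("/")
        path = trimmed[: trimmed.rfind("/") + 1]
        variations.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return variations
```

Applied to the two links found in Table 1, the sketch yields the three additional URLs listed at the end of that table; a crawler would then request each candidate and rely on the server returning its "default" document.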
For any Web site 106 where the site's server is not currently responding
(i.e., "down" or "off-line"), directed page processing module 150 and downloaders
108 and 109 applications, before removing the URL corresponding to
the site from the preliminary set, implement a "re-try" timer and mechanism.
When any of the URLs within the preliminary set is an FTP site 104 (or
FSP site), the normal steps of visiting and downloading the sites are not practical
and thus, not used. Therefore, the invention contemplates a method for "FTP crawling"
in order to accomplish steps 308 and 314 for such URLs.
First, the directed page processing module 150 and downloaders 108
and 109 applications attempt to log into the FTP site 104 specified
by the URL. As is well known in the relevant art(s), there are two types of FTP
sites 104—password protected sites and anonymous sites. If the site
104 is password protected and the password is not published in a reference
linked page, it is passed over and the URL is removed from the preliminary set.
If the FTP site 104 has a published password, the applications attempt to
log in using that password. If the FTP site 104 is an anonymous site, the
applications attempt to log in. As is well known in the relevant art(s), an anonymous
FTP site allows a user to log in using a user name such as "ftp" or "anonymous"
and then use their electronic mail address as the password.
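The login decision described above can be summarized in a short sketch. The function name choose_ftp_credentials and its parameters are hypothetical; the sketch only selects the credentials to try and does not open a connection. In Python, the returned pair could be passed to the standard library's ftplib.FTP.login(user, passwd).

```python
from typing import Optional, Tuple


def choose_ftp_credentials(
    password_protected: bool,
    published_login: Optional[Tuple[str, str]],
    email_address: str,
) -> Optional[Tuple[str, str]]:
    """Return (user, password) to try, or None if the site should be passed over."""
    if not password_protected:
        # Anonymous site: conventional user name, e-mail address as password.
        return ("anonymous", email_address)
    if published_login is not None:
        # Password published on a referencing page: try that login.
        return published_login
    # Protected with no published password: remove the URL from the set.
    return None
```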
If a connection can be established, the applications have access to the directory
hierarchy containing the publicly accessible files (e.g., a "pub" subdirectory).
The applications may then "nicely" crawl the relevant portions of the FTP site
104 by mapping the directory structure and then visiting certain directories
based on keywords derived from the defined search criteria (steps 306 and 310).
The purpose of nice FTP crawling is to capture the relevant contents of the FTP
site 104 as it relates to the entity without burdening the host's resources
by crawling the entire FTP site 104. This is especially important due to the
large size of a typical FTP site 104 (e.g., a university's site or someone's
entire PC hard disk drive), and due to the lack of crawl restriction standards
like the "robots.txt" file commonly found on Web sites 106.
Consider the example where the directed page processing module 150
and downloaders 108 and 109 are searching for the directory:
"ftp://ftp.stuff.com/~user/music/famous_artist" in the context of a search
for information related to an entity's music product. First, the nice FTP crawling
technique involves establishing a single connection to the FTP site 104
(even if multiple items of content are needed from the site) and then going to the
root directory. Second, a counter is set to zero and a directory listing and snapshot of the
current directory is taken. For each directory, if the directory name is "interesting,"
then the directed page processing module 150 and downloaders 108
and 109 enter the directory, set the counter to a positive number (e.g.,
C=2), then repeat the listing and snapshot step. If the counter is greater than
zero or the directory is on the way to the destination directory, then the directory
is entered and then the listing and snapshot step is repeated.
To simulate human behavior, it is best if the directed page processing module
150 and downloaders 108 and 109 perform a depth first search,
and introduce slight pauses between directory listings. "Interesting" directory
listings are those containing terms related to the search criteria. For example,
keywords for this search may include "songs," "sound," "album," "artist," "mp3,"
music_type, famous_artist, etc., and the destination directory (in the example,
it can be "/famous-artist"), and other hard-coded directories that are usually
of interest (e.g., "/incoming").
In an alternative embodiment, user 202 could also specify that uninteresting
directories be crawled as well. The purpose of the counter (C) is to set the number
of levels (depth) of sub-directories that the directed page processing module 150,
as well as downloaders 108 and 109, will crawl in order to find "interesting"
files. In one embodiment of the invention, to ease the burden on FTP site 104
servers, the total number of directories that can be crawled in a single FTP session
may be limited.
An example of the nice FTP crawling technique of the directed page processing
module 150 and downloaders 108 and 109 is presented in Table
2 below. Table 2 illustrates a depth-first (from top to bottom) traversal of the
directory structure of an FTP site 104.
| TABLE 2 |
| EXAMPLE OF NICE FTP CRAWLING TECHNIQUE |
| ftp://ftp.stuff.com/~user |
| ftp://ftp.stuff.com/~user/homework |
| C ftp://ftp.stuff.com/~user/music |
| C- ftp://ftp.stuff.com/~user/music/famous_artist1 |
| *C- ftp://ftp.stuff.com/~user/music/famous_artist |
| C- ftp://ftp.stuff.com/~user/music/famous_artist2 |
| C- ftp://ftp.stuff.com/~user/music/famous_artist3 |
| ftp://ftp.stuff.com/~user/poetry |
| ftp://ftp.stuff.com/~user2 |
| ftp://ftp.stuff.com/~user3 |
| C ftp://ftp.stuff.com/incoming |
| ... |
| C = directory judged to be "interesting" in context of the search and counter set to C |
| C- = counter decremented at this level of the directory tree |
| * = destination directory |
| ... = the page processing module 110 crawls every subdirectory up to the depth of C under the directory |
The above-described "nice FTP crawling" allows users 202 to obtain reports
with both the URL and contents of any interesting FTP site 104.
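The counter-based traversal described above can be sketched over an in-memory directory tree. This is a minimal illustration, not the implementation of the invention: the function name nice_crawl, the nested-dictionary representation of the FTP site, and the substring test for "interesting" names are all assumptions, and the pauses between directory listings are omitted.

```python
from typing import Dict, List

Tree = Dict[str, "Tree"]  # directory name -> sub-directories (assumed layout)


def nice_crawl(tree: Tree, destination: str, keywords: List[str],
               depth_budget: int = 2, max_dirs: int = 100) -> List[str]:
    """Depth-first crawl that enters only interesting or en-route directories."""
    visited: List[str] = []
    dest_parts = [p for p in destination.strip("/").split("/") if p]

    def is_interesting(name: str) -> bool:
        lowered = name.lower()
        return any(k.lower() in lowered for k in keywords)

    def walk(node: Tree, path_parts: List[str], counter: int) -> None:
        for name in sorted(node):
            if len(visited) >= max_dirs:      # per-session directory limit
                return
            child_parts = path_parts + [name]
            on_the_way = dest_parts[: len(child_parts)] == child_parts
            if is_interesting(name):
                child_counter = depth_budget  # counter set to C
            elif counter > 0:
                child_counter = counter - 1   # counter decremented (C-)
            elif on_the_way:
                child_counter = 0             # entered only to reach destination
            else:
                continue                      # uninteresting: not entered
            visited.append("/" + "/".join(child_parts))
            walk(node[name], child_parts, child_counter)

    walk(tree, [], 0)
    return visited
```

With keywords such as "music" and "incoming" and the destination "/~user/music/famous_artist", the sketch enters "~user" only because it lies on the way to the destination, resets the counter at "music", and skips directories such as "homework" and "~user2", in the manner of Table 2.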
For any FTP site 104 where the password failed, the site is passed over and
the URL is removed from the preliminary set. If the site's server is not currently
responding (i.e., "down" or "off-line"), has too many users already logged in,
or is otherwise unavailable for connection, the directed page processing module 150
and downloaders 108 and 109 applications, before removing the URL
corresponding to those sites from the preliminary set, implement a "re-try" timer
and mechanism.
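A "re-try" timer and mechanism of this kind might look like the following sketch. The specification does not state the number of attempts or the delay, so the defaults below are placeholders; fetch is a hypothetical download function, and the function name fetch_with_retry is illustrative only.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[str], bytes], url: str,
                     attempts: int = 3, delay_seconds: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[bytes]:
    """Try a URL several times before it is dropped from the preliminary set."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # Server down, off-line, or refusing connections: wait, then retry.
            if attempt < attempts - 1:
                sleep(delay_seconds)
    # All attempts failed; the caller may now remove the URL from the set.
    return None
```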
In step 316, the locally downloaded pages are scored (i.e., ranked). The
scoring of the individual pages is based on the inputs specified in the search
criteria (step 304). Each page is given a score based on a text search of
keywords from the search criteria and statistics accumulated from analyzing the
pages. The applications of processing modules 201 possess inference code
logic that allows anything resident on a page or in the underlying HTML code (i.e.,
tags) that formats the page to be numerically weighted. The scoring may be based
on the separate regions of the page such as the title or information within a tag
(e.g., meta-tags, anchor tags, etc.). Also, scoring may be based on such information
as the URL of the page itself, dimensions of pictures on the page, the presence
of a specific picture file, the number of a certain type of file, length of sound
files, watermarks, embedded source information, as well as information about a
page provided by another page. During this process, the e-commerce metrics system
100 possesses logic to also recognize exact duplicates of an entity's graphics
files (i.e., pictures, logos, etc.), without the need for digital watermarking.
This additional logic further contributes to the scoring process of step 316.
The numbers, figures, and statistics generated by the scoring process are collectively
referred to as metadata. Metadata is stored in database 120 in step 318.
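Region-based weighting of keyword matches can be illustrated with the sketch below. The weights in REGION_WEIGHTS and the names score_page and _RegionCollector are hypothetical; the specification does not disclose actual weights, and the sketch covers only the title, meta-tag, anchor and body text regions, not the image, sound or URL signals.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Hypothetical weights per page region; real weights are not given in the text.
REGION_WEIGHTS: Dict[str, float] = {"title": 5.0, "meta": 3.0, "anchor": 2.0, "body": 1.0}


class _RegionCollector(HTMLParser):
    """Collects page text separately for each weighted region."""

    def __init__(self) -> None:
        super().__init__()
        self.regions: Dict[str, List[str]] = {name: [] for name in REGION_WEIGHTS}
        self._open: List[str] = []  # currently open <title> / <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.regions["meta"].append(dict(attrs).get("content") or "")
        elif tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        current = self._open[-1] if self._open else ""
        region = {"title": "title", "a": "anchor"}.get(current, "body")
        self.regions[region].append(data)


def score_page(html: str, keywords: List[str]) -> float:
    """Sum keyword hits in each region, multiplied by that region's weight."""
    collector = _RegionCollector()
    collector.feed(html)
    collector.close()
    score = 0.0
    for region, chunks in collector.regions.items():
        text = " ".join(chunks).lower()
        hits = sum(text.count(k.lower()) for k in keywords)
        score += REGION_WEIGHTS[region] * hits
    return score
```

Under these assumed weights, a keyword in the page title contributes five times as much as the same keyword in ordinary body text.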
The scoring of pages may also involve whether any offending URLs contain advertising.
This is useful information to clients because those sites are considered commercial
and not fan or personal (i.e., non-commercial) sites. Advertisement recognition
is accomplished by parsing an image located within a URL and capturing the alt
text, click-through URL, click-through resolved URL, and URL of the image. (Alt
text is an HTML attribute that displays a block of text as an alternative
to an image, for text-based browsers. It is used inside the <IMG> tag;
the format is <IMG SRC="URL" ALT="TEXT">.) Then, if any of the following three rules are
met, the e-commerce metrics system 100 identifies the probable presence
of an advertisement: (1) the alt text or URL of the advertisement image contains
keywords common to those around known advertisements; (2) the click-through URL
and the resolved click-through URL specify different domains; or (3) the image
is an exact match of a known advertisement.
During this process, the e-commerce metrics system 100 develops a table
of advertisement dimensions that are common to each Web site 106 encountered.
Thus, in an alternative embodiment, a fourth rule is used to recognize advertisements.
That is, if the dimensions of the image fit the tolerances of the dimensions in
the table for a Web site 106, the image is probably an advertisement. The
data for the table of advertisement dimensions are kept in archive 115
and queried via the database 120. Accordingly, the score for each page is
adjusted (i.e., increased) if the e-commerce metrics system 100 identifies the presence
of a probable advertisement.
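The advertisement rules described above, including the dimension rule of the alternative embodiment, can be sketched as follows. The names ImageInfo and looks_like_advertisement, the use of a content hash for the exact-match rule, the comparison of host names as a stand-in for "domains", and the pixel tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple
from urllib.parse import urlsplit


@dataclass
class ImageInfo:
    alt_text: str
    image_url: str
    click_through_url: str
    resolved_click_through_url: str
    content_hash: str = ""   # assumed fingerprint used for exact matching
    width: int = 0
    height: int = 0


def looks_like_advertisement(image: ImageInfo, ad_keywords: Iterable[str],
                             known_ad_hashes: Set[str],
                             site_ad_dimensions: Iterable[Tuple[int, int]] = (),
                             tolerance: int = 2) -> bool:
    # Rule 1: alt text or image URL contains advertisement-related keywords.
    haystack = (image.alt_text + " " + image.image_url).lower()
    if any(k.lower() in haystack for k in ad_keywords):
        return True
    # Rule 2: click-through and resolved click-through URLs differ in host.
    click_host = urlsplit(image.click_through_url).hostname
    resolved_host = urlsplit(image.resolved_click_through_url).hostname
    if click_host and resolved_host and click_host != resolved_host:
        return True
    # Rule 3: the image is an exact match of a known advertisement.
    if image.content_hash and image.content_hash in known_ad_hashes:
        return True
    # Rule 4 (alternative embodiment): dimensions fit the site's table.
    return any(abs(image.width - w) <= tolerance and abs(image.height - h) <= tolerance
               for w, h in site_ad_dimensions)
```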
In step 320, an archive of the pages is written to the storage media of archive
115. In order to archive each Web page, the "inline" contents of the page
must be separated from the non-inline contents. Inline contents include any text,
sounds, and images found directly on the Web page that automatically play
or are displayed when the page is browsed. In contrast, non-inline contents include
the links that Web pages contain to other Web sites 106. In order to obtain
a "self-sustaining" local copy of the Web page, only the inline contents of each
Web page of the preliminary list of URLs are stored in archive 115. In an
alternative embodiment, a client may want included in their final report (step
330 described below) properties or metrics associated with non-inline contents
of relevant pages. Thus, in such an embodiment, step 320 can also include
the non-inline contents of each Web page (i.e., a "complete" archive). In yet another
embodiment, the system 100 in step 320 could generate a snapshot
of the page and store this snapshot as a single graphical image.
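The separation of inline from non-inline contents can be sketched with a simple HTML parse. The set of tags treated as inline (INLINE_SRC_TAGS) and the names ContentSplitter and split_contents are assumptions for illustration; a fuller archive step would also capture the page text and resolve relative addresses.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class ContentSplitter(HTMLParser):
    """Separates inline resources from hyperlinks to other locations."""

    # Assumed set of tags whose "src" is played or displayed with the page.
    INLINE_SRC_TAGS = {"img", "embed", "bgsound", "audio"}

    def __init__(self) -> None:
        super().__init__()
        self.inline_resources: List[str] = []
        self.non_inline_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.INLINE_SRC_TAGS and attributes.get("src"):
            self.inline_resources.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self.non_inline_links.append(attributes["href"])


def split_contents(html: str) -> Tuple[List[str], List[str]]:
    """Return (inline resource addresses, non-inline link addresses)."""
    splitter = ContentSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.inline_resources, splitter.non_inline_links
```

Only the first list would be stored for a self-sustaining copy; the second list would be added for a "complete" archive.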
As indicated in FIG. 3, step 320 is optional. That is, a user may desire
not to perform a complete archive (and thus, not create self-sustaining local copies
of the Web pages). Thus, the operation of e-commerce metrics system 100 may
proceed directly to step 322 after the pages are scored in step 316.
In an alternative embodiment, step 320 may perform a summary archive where,
for example, only the headers and/or titles of the pages are archived.
In step 322, the preliminary set of URLs is grouped into "actual sites."
Most people equate Web sites 106 with either domain names or host names.
For example, a URL of "http://www.a_company.com" and all the pages under it are
typically viewed as one Web site 106. However, as Web designers develop
schemes to partition their sites among distinct users, they divide their name space
to create sub-sites. Examples are "community sites" which are companies or organizations
that provide free homepages to individual consumers, and university servers that
host student homepages. In these examples, each user or student with a homepage
is an "actual site." For example, the directed page processing module 150
application may obtain a preliminary list (from step 306) of probable URLs
containing the URLs shown in Table 3 below.
| TABLE 3 |
|
| PRELIMINARY LIST OF URLs |
|
| |
| http://www.university_with_many_students.edu/students/b/joe_smith/ |
| main.html |
| http://www.university_with_many_students.edu/students/b/joe_smith/ |
| pics/me.jpg |
| http://www.university_with_many_students.edu/students/c/jane_hacker/ |
| main.html |
In the example of Table 3, the first two URLs are one actual site, whereas the
third is a separate actual site. In one embodiment of the invention, the directed page
processing module 150 application may recognize which URLs to group into one actual
site based both on: (1) patterns such as ~username, /students/?/<?>,
/users/?/<?>, /homepages/?/<?>—where "?" is a single
character wildcard and "<?>" is an optional single character wildcard;
and (2) hard-coded rules for known sites which follow no discernable patterns (e.g.,
the GeoCities™ community site). The grouping step aids in arriving at a
manageable but informative number of URLs that will be included in a user's final
report. In one embodiment of the invention, the above-described grouping technique
may be used, in conjunction with the score pages step 316, to present the
user with the "best" (i.e., highest scoring) page within an actual site. This removes
information clutter from the final report and further aids in arriving at a manageable
number of URLs to report.
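Pattern-based grouping into actual sites, combined with selection of the highest scoring page, can be sketched as follows. The regular expressions are one possible reading of the ~username and /students/?/<?> style patterns and are an assumption, as are the function names actual_site and best_page_per_site; hard-coded rules for known community sites are not modeled.

```python
import re
from typing import Dict, Tuple
from urllib.parse import urlsplit

# Assumed reading of the patterns: a one-character directory, an optional
# second one-character directory, then the user's own directory.
_USER_DIR = re.compile(r"^/(?:students|users|homepages)/[^/]/(?:[^/]/)?[^/]+")
_TILDE_USER = re.compile(r"^/~[^/]+")


def actual_site(url: str) -> str:
    """Return the prefix identifying the actual site a URL belongs to."""
    parts = urlsplit(url)
    for pattern in (_TILDE_USER, _USER_DIR):
        match = pattern.match(parts.path)
        if match:
            return f"{parts.scheme}://{parts.netloc}{match.group(0)}"
    # No sub-site pattern applies: fall back to the host name.
    return f"{parts.scheme}://{parts.netloc}"


def best_page_per_site(scored_urls: Dict[str, float]) -> Dict[str, Tuple[str, float]]:
    """Keep only the highest scoring page within each actual site."""
    best: Dict[str, Tuple[str, float]] = {}
    for url, score in scored_urls.items():
        site = actual_site(url)
        if site not in best or score > best[site][1]:
            best[site] = (url, score)
    return best
```

For the three URLs of Table 3, the sketch produces two actual sites, one for joe_smith and one for jane_hacker, each represented by its best page.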
In step 322, the e-commerce metrics system 100 groups pages into
preliminary set(s) of URLs to be selected by users 202 in step 324.
This optional human intervention step allows a second (refined and smaller) set
of probable URLs to be defined, where likely infringements or disparagements of
the entity's intellectual property occur. The selection step 324 is essentially
a feedback option for expanding on the preliminary list of URLs. This refinement
allows for more selectivity than what is produced from the search criteria (step
304) or general filtering (step 312).
The e-commerce metrics system 100 automates the information gathering
process in order to minimize the time required by human users and maximize their
effectiveness. It is advisable, however, to have humans review and prioritize the
set of probable URLs because no presently existing software has the ability to
discern the intent of the use of content on a Web page. For example, the e-commerce
metrics system may identify a page with an image of a famous professional athlete.
The e-commerce metrics system, however, may not be able to identify whether the
image is one where the athlete is pictured, without authorization, in his or her
team uniform. Another example includes a page with a probable advertisement identified
by the e-commerce metrics system 100 which is verified by a human user during
step 324.
In one embodiment of the invention, the directed page processing module 150
application allows several users to visit, prioritize, and add analysis data to
the preliminary set of URLs. As a user on any of the plurality of workstations
130 or workstations 140 visits and prioritizes a Web site 106
corresponding to a URL on the preliminary list, it is marked so no duplication
of effort occurs. Further, the e-commerce metrics system 100 is also capable
of logging, for record keeping purposes, which user has analyzed a page including
a time stamp of when the analysis took place.
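The marking and logging behavior described above might be modeled as in the sketch below. The class name ReviewLog and its methods are hypothetical, and the in-memory dictionary stands in for whatever shared storage (e.g., database 120) an actual embodiment would use.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


class ReviewLog:
    """Marks reviewed URLs and records who analyzed each one, and when."""

    def __init__(self) -> None:
        self._reviews: Dict[str, Tuple[str, datetime]] = {}

    def claim(self, url: str, user: str, when: Optional[datetime] = None) -> bool:
        """Mark a URL as reviewed; return False if another user already did."""
        if url in self._reviews:
            return False  # already marked, so no duplication of effort
        self._reviews[url] = (user, when or datetime.now(timezone.utc))
        return True

    def reviewed_by(self, url: str) -> Optional[Tuple[str, datetime]]:
        """Return (user, time stamp) for a reviewed URL, or None."""
        return self._reviews.get(url)
```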
It should be noted that in alternative embodiments of the invention, the score
pages step 316, full archive step 320, group pages step 322,
and select groups step 324 may be performed in an order different than that
presented herein without departing from the spirit and scope of the invention.
In step 328, the e-commerce metrics system 100 obtains additional
information for each URL in the second refined set. This additional information
is used to provide contact, routing and other information which does not need to
be repeatedly determined (e.g., via searching) or is expensive in terms of the
time required to gather, the monetary cost, and/or other resources. In one embodiment,
this configuration is a result of the time required to operate on a subset of pages.
For instance, the e-commerce metrics system 100, in an automated fashion,
obtains the contact information from the Internet. The sources for this information
include the Network Information Center (InterNIC). As is well-known in the relevant
art(s), InterNIC is a consortium originated by the National Science Foundation
to coordinate information services, directory and database services, and registration
services within the Internet (i.e., computer network 103).
In step 330, a report is generated for the user. The report may be customized
for a particular entity and typically includes the refined list of URLs, the contact
information for each URL, the score for each URL, metadata provided by the processing
modules 201, data provided by users of the e-commerce metrics system 100
(i.e., during step 324), as well as charts and graphs containing any metrics
the user may request. Database 120 is utilized to query the archived metadata
in generating reports, using the tables. Reports may relay information, for example,
on how downloaded pages have changed over time. A more detailed description of
output reports and examples are presented in Section VII below.
In step 332, the user, using the report, may then take action in accordance
with the information presented in the report. In one embodiment of the invention,
the information contained in the output report may be used by the e-commerce metrics
system 100 to be input directly into an entity's business model. For
example, the output report may be used to automatically generate: (1) Cease and
desist letters (customized for each entity) to each offending Web site 106
operator; (2) Re