jeelani sofi

Saturday, 22 December 2012

Precision and Recall of Five Search Engines for Retrieval of Scholarly Information in the Field of Biotechnology

S. M. Shafi
Department of Library and Information Science, University of Kashmir, Srinagar-India 190006

Rafiq A. Rather
Department of Library and Information Science, University of Kashmir, Srinagar-India 190006

Received May 10, 2005; Accepted August 9, 2005

Abstract

This paper presents the results of a study of five search engines (AltaVista, Google, HotBot, Scirus and Bioweb) for retrieving scholarly information using Biotechnology-related search terms. The search engines are evaluated on the first ten results of each query, classified for 'scholarly information', to estimate precision and recall. The results show that Scirus is the most comprehensive in retrieving 'scholarly information', followed by Google and HotBot. They also reveal that the search engines (except Bioweb) perform well on structured queries, while Bioweb performs better on unstructured queries.

Keywords

Search engine, Precision and recall, Scholarly information, Structured and unstructured queries, World Wide Web

Introduction

The Web is growing as the fastest communication medium. This technology, in combination with the latest electronic storage devices, enables us to keep track of the enormous amount of information available to the information society (Schlichting & Nilsen, 1996). In less than ten years, it has grown from an esoteric system used by a small community of researchers into the de facto method of obtaining information for millions of individuals, many of whom have never encountered, and have no interest in, the issues of retrieving information from databases (Oppenheim et al., 2000). A plethora of search engines, ranging from general to subject-specific, are the chief resource discoverers on the Web. These engines search an enormous volume of information at apparently impressive speed but have been widely criticized for retrieving duplicate, irrelevant and non-scholarly information. One reason is that their comprehensive databases hold information of many kinds, such as media, marketing, entertainment and advertising. For the most part, they do not sift information from a scholar's point of view, though some search engines, such as Google, have developed separate applications for disseminating scholarly information, such as 'Google Scholar' (this tool was introduced by Google after the study had started). The number of search engines now available has also made them a popular and important subject for research (Clarke & Willett, 1997; Modi, 1996).

Related Literature

The growing body of literature on web search engine evaluation is purely descriptive in nature and has little consistency. Scoville (1996) surveyed a wide range of web search engines to examine the relevance of the documents retrievable through them. The first ten hits, evaluated for precision, showed Excite, Infoseek and Lycos to be superior. Leighton (1996) evaluated the precision of Infoseek, Lycos, WebCrawler and WWW Worm using eight reference questions and rated Lycos and Infoseek higher. Ding and Marchionini (1996) investigated Infoseek, Lycos and Open Text for precision, duplication and degree of overlap using five complex queries. The first twenty hits, assessed for precision, showed that the best results were obtained from Lycos and Open Text. Leighton and Srivastava (1997) searched fifteen queries on AltaVista, Excite, HotBot, Infoseek and Lycos, taking the first twenty hits for evaluation of precision. Chu and Rosenthal (1996) investigated AltaVista, Excite and Lycos for their search capabilities and precision. They used ten search queries of varying complexity, evaluated the first ten results for relevance, and found that AltaVista outperformed Excite and Lycos in both search facilities and retrieval performance. Clarke and Willett (1997) searched thirty queries of varying nature on AltaVista, Excite and Lycos and obtained the best results in terms of precision, recall and coverage from AltaVista. Bar-Ilan (1998) investigated six search engines using the single query "Erdos". All 6,681 retrieved documents were examined for precision, overlap and estimated recall, and the study reports that no search engine has high recall.

Objectives

The following objectives are laid down for the study:

Identification of search engines for retrieval of scholarly information in the field of Biotechnology.
Assessment of recall and precision of the select search engines.
Understanding the effect of nature and types of queries on precision and recall of the select search engines.

Method

The study was carried out in three stages. In the first stage, related material available in print and electronic format was collected. In the second stage, the search engines were selected and the search terms were then drawn. In the third stage, the search engines were queried with the select terms from 25th March to 25th April, 2004. AltaVista and HotBot, however, were revisited during June 2005 in view of changes in their algorithmic policy. Finally, the data were analyzed for results.

I. Search Engines for the Study

The search engines investigated are:

AltaVista (General)
Google (General)
HotBot (General)
Scirus (Science & Technology)
Bioweb (Biotechnology)

II. Sample Search Queries

Twenty search terms were drawn from a sample of 140 terms compiled with the help of the "LC List of Subject Headings" (LCSH, 2003). These were classified into three groups: single, compound and complex terms (Appendix 1), to investigate how search engines control and handle single and phrased terms. Single terms were submitted in natural form, compound terms in the form suggested by the respective search engine, and complex terms with suitable Boolean operators 'AND' and 'OR' between the terms to perform special searches. Five separate queries were constructed for each term, one in accordance with the syntax of each select search engine.

III. Test Environment

The select search engines offer two modes of searching: simple and advanced. The advanced mode was used throughout the study to make use of the available features for refining searches and producing a more precise set of results. In AltaVista and Google, "match all of the words" was chosen for single and complex terms and "exact phrase" for compound queries. HotBot and Scirus offer these options through pull-down menus. Each search was restricted to the title field (i.e. all of the words in the title) and to documents published from 2002 to 2004. All the search engines (except Scirus and Bioweb) were set to retrieve results in English. Bioweb, on the other hand, offered rather different limiting options, among which "relevance then date" and the hidden Boolean 'OR' were preferred during the search.

Each query was submitted to the select engines, which retrieved a large number of results, but only the first ten results were evaluated. This limit reflects the fact that most users look only at the first ten hits of a query. Each query was run on all five select search engines on the same day in order to avoid variation caused by system updating (Clarke & Willett, 1997). The first ten hits retrieved for each query were classified as scholarly documents or as other categories.

IV. Estimation of Precision and Recall

Precision is the fraction of a search output that is relevant to a particular query. Its calculation, hence, requires knowledge of the relevant and non-relevant hits in the evaluated set of documents (Clarke & Willett, 1997). It is thus possible to calculate the absolute precision of search engines, which provides an indication of the relevance of the system. In the context of the present study, precision is defined as:

$$\text{Precision} = \frac{\text{Sum of the scores of scholarly documents retrieved by a search engine}}{\text{Total number of results evaluated}}$$

To determine the relevance of each page, a four-point scale was used, which enabled us to calculate precision. The criteria employed for the purpose are as follows:

A page representing full text of research paper, seminar/conference proceedings or a patent is given a score of three.
A page corresponding to an abstract of a research paper, seminar/conference proceedings or a patent is given a score of two.
A page corresponding to a book or a database is given a score of one.
A page representing other than the above (i.e. company web pages, dictionaries, encyclopedia, organization, etc.) is given a score of zero.
A page occurring more than once under different URL is assigned a score of zero.
A page whose server fails to respond on three successive attempts is assigned a score of zero.
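
The scoring scheme above can be sketched in Python as follows. This is an illustrative sketch, not the authors' tooling: the category labels and the function name `precision` are hypothetical, and the formula is applied exactly as stated (sum of scores divided by the number of results evaluated). The paper does not say whether the scores are further normalized, so values above 1 are possible under this literal reading.

```python
# Scores for each page category, following the scale described above.
# The category labels are hypothetical names chosen for this sketch.
SCORES = {
    "full_text": 3,         # full text of a paper, proceedings or patent
    "abstract": 2,          # abstract of a paper, proceedings or patent
    "book_or_database": 1,  # a book or a database
    "other": 0,             # company pages, dictionaries, encyclopedias, etc.
    "duplicate": 0,         # same page under a different URL
    "no_response": 0,       # server did not respond
}


def precision(categories, evaluated=10):
    """Sum of scores divided by the number of results evaluated."""
    if evaluated <= 0:
        raise ValueError("evaluated must be positive")
    return sum(SCORES[c] for c in categories) / evaluated


# Hypothetical first ten hits for one query on one engine.
hits = ["full_text", "abstract", "book_or_database"] + ["other"] * 7
print(precision(hits))
```

The `hits` list is made-up data used only to show how the categories map to a precision value.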

Recall, on the other hand, is the ability of a retrieval system to obtain all or most of the relevant documents in the collection. It thus requires knowledge not just of the relevant documents that were retrieved but also of those that were not retrieved (Clarke & Willett, 1997). There is no proper method of calculating the absolute recall of search engines, as it is impossible to know the total number of relevant documents in huge databases. However, Clarke and Willett (1997) adapted the traditional recall measurement for use in the Web environment by giving it a relative flavour. This study followed the method used by Clarke and Willett, pooling the relevant results (corresponding here to scholarly documents) of the individual searches to form the denominator of the calculation. The relative recall value is thus defined as:

$$\text{Relative Recall} = \frac{\text{Total number of scholarly documents retrieved by a search engine}}{\text{Sum of scholarly documents retrieved by all five search engines}}$$

Overlap between the results of the search engines affects the pooling. Consider five search engines (say a, b, c, d and e) which retrieve a1, b1, c1, d1 and e1 scholarly documents respectively. Where there is no overlap between search engines (i.e. a ∩ b, a ∩ c, a ∩ d and a ∩ e are all empty), the relative recall of search engine 'a' is calculated as a1/(a1+b1+c1+d1+e1). Where overlap exists, i.e. a ∩ b = b2, a ∩ c = c2, a ∩ d = d2 and a ∩ e = e2, only the overlapped results are included in the pool, and the relative recall of engine 'a' is a1/(a1+b2+c2+d2+e2). The relative recall is therefore higher when search engines overlap. The mean values for precision and relative recall are obtained by micro-averaging (Clarke & Willett, 1997; Tague, 1992), i.e. the average score for each engine against a query is summed over all twenty queries and the mean value is calculated from these totals for single, compound and complex terms separately.
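
The two relative recall cases can be illustrated with a short Python sketch. It follows the formulas in the text literally; the function name `relative_recall`, its parameters and the sample counts are hypothetical. The paper does not say what to do when only some of the overlaps are empty, so the sketch assumes the overlap formula applies as soon as any overlap is non-zero.

```python
def relative_recall(own, others, overlaps=None):
    """Relative recall of one engine, read literally from the text.

    own      -- scholarly documents retrieved by the engine (a1)
    others   -- counts for the other engines (b1, c1, d1, e1)
    overlaps -- sizes of the overlaps with the other engines
                (b2, c2, d2, e2), or None when there is no overlap
    """
    if overlaps is None or not any(overlaps):
        pool = own + sum(others)    # a1 + b1 + c1 + d1 + e1
    else:
        pool = own + sum(overlaps)  # a1 + b2 + c2 + d2 + e2
    # Guard against an empty pool (no scholarly documents at all).
    return own / pool if pool else 0.0


# Made-up counts for one query.
print(relative_recall(4, [3, 2, 1, 0]))                # no overlap
print(relative_recall(4, [3, 2, 1, 0], [1, 1, 0, 0]))  # with overlap
```

With these made-up counts the second value is larger than the first, which matches the statement that relative recall is higher when engines overlap.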

Engines Revisited

Two search engines, AltaVista and HotBot, were revisited during June 2005 to investigate the effect of their changed algorithm policy on precision and recall. The mean precision and recall for AltaVista show a slight increase, whereas HotBot shows a marginal increase in precision but a decrease in its recall value (Table 2).

Results and Discussion

The mean precision and relative recall of select search engines for retrieving scholarly information are presented in Table 1.

Table 1. Mean Precision and Relative Recall of search engines during 2004

Measure	AltaVista	Google	HotBot	Scirus	Bioweb
Precision	0.27	0.29	0.28	0.57	0.14
Recall	0.18	0.20	0.29	0.32	0.05

Table 2. Comparison of mean Precision and mean Recall of AltaVista and HotBot Search engines between 2004 and 2005

Search Engine	Mean Precision 2004	Mean Precision 2005	Mean Recall 2004	Mean Recall 2005
AltaVista	0.27	0.29	0.18	0.21
HotBot	0.28	0.33	0.29	0.27
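
The year-on-year changes implied by Table 2 can be worked out with a few lines of Python. This is only an arithmetic illustration using the values copied from the table; the variable names are hypothetical.

```python
# Mean values copied from Table 2, as (2004, 2005) pairs.
table2 = {
    "AltaVista": {"precision": (0.27, 0.29), "recall": (0.18, 0.21)},
    "HotBot": {"precision": (0.28, 0.33), "recall": (0.29, 0.27)},
}

# Change from 2004 to 2005, rounded to the two decimals used in the table.
changes = {
    engine: {
        measure: round(after - before, 2)
        for measure, (before, after) in measures.items()
    }
    for engine, measures in table2.items()
}

for engine, delta in changes.items():
    print(engine, delta)
```

A positive value is an increase from 2004 to 2005 and a negative value is a decrease.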

Comparing the mean precision, Scirus ranked highest (0.57), followed by Google (0.29) and HotBot (0.28). AltaVista obtained 0.27, while Bioweb received the lowest precision (0.14). The mean precision obtained for single, compound and complex queries shows Scirus with the highest precision (0.83) for complex queries, followed by compound queries (0.63). AltaVista scored its highest precision (0.50) for complex queries, followed by compound queries (0.24). Google and HotBot performed better with complex and compound queries, while Bioweb performed better with single queries (Figure 1).

Figure 1. Precision of five search engines for single, compound and complex terms

Comparing the corresponding mean relative recall values, Scirus has the highest recall (0.32), followed by HotBot (0.29) and Google (0.20). AltaVista scored a relative recall of 0.18 and Bioweb the least (0.05). Scirus performed better on complex queries (0.39), followed by compound queries (0.37), while HotBot did better on single and compound queries (0.31). Google attained its highest recall on compound queries (0.22), followed by complex queries (0.21). AltaVista's performance is better on complex queries (0.28), whereas Bioweb performed better on single queries (0.11) (Figure 2).

Figure 2. Relative recall of search engines for single, compound and complex terms

Conclusion

The results show that Scirus performs better in retrieving scholarly documents, and it is the best choice for those who have access to various online journals or databases like Biomednet, Medline plus, etc. Google is the best alternative for getting web-based scholarly documents, and its recent introduction of 'Google Scholar', in beta test, for accessing scholarly information offers better dividends for researchers. Scirus acquired the highest recall and precision because it includes journal citations along with web resources; otherwise Google would rank first. HotBot offers a good combination of recall and precision but has a larger overlap with other search engines, which enhances its relative recall over that of Google. AltaVista, once prominent on the Web, has lagged behind, and Bioweb is the weakest among the select search engines in all respects. Further, the results reveal that structured queries (i.e. phrased and Boolean) contribute to achieving better precision and recall. The findings also support the case that precision is inversely proportional to recall, i.e. if precision increases, recall decreases, and vice versa.

References

Bar-Ilan, J. (1998). On the overlap, the precision and estimated recall of search engines: A case study of the query "Erdos". Scientometrics, 42 (2), 207-208.
Chu, H., & Rosenthal, M. (1996). Search engines for the World Wide Web: A comparative study and evaluation methodology. In: Proceedings of the ASIS 1996 Annual Conference, October, 33, 127-135. Retrieved August 19, 2003 from http://www.asis.org/annual-96/ElectronicProceedings/chu.html
Clarke, S., & Willett, P. (1997). Estimating the recall performance of search engines. ASLIB Proceedings, 49 (7), 184-189.
Ding, W., & Marchionini, G. (1996). A comparative study of the Web search service performance. In: Proceedings of the ASIS 1996 Annual Conference, October, 33, 136-142.
Leighton, H. (1996, June 25). Performance of four WWW index services, Lycos, Infoseek, Webcrawler and WWW Worm. Retrieved June 10, 2005 from http://www.winona.edu/library/webind.htm
Leighton, H., & Srivastava, J. (1997). Precision among WWW search services (search engines): AltaVista, Excite, HotBot, Infoseek and Lycos. Retrieved June 11, 2005 from http://www.winona.edu/library/webind2.htm
Library of Congress (2003). Library of Congress Subject Headings (Vols. 1-5). Washington: Library of Congress, Cataloging Distribution Service.
Modi, G. (1996). Searching the Web for gigabucks. New Scientist, 150 (2024), 36-40.
Oppenheim, C., Morris, A., McKnight, C., & Lowley, S. (2000). The evaluation of WWW search engines. Journal of Documentation, 56 (2), 190-211.
Schlichting, C., & Nilsen, E. (1996). Signal detection analysis of WWW search engines. Retrieved September 15, 2003 from http://www.microsoft.com/usability/webconf/schlichting/schlichting.htm
Scoville, R. (1996). Find it on the Net. PC World, January, 14(1), 125-130. Retrieved June 6, 2003 from http://www.pcworld.com/reprints/lycos.htm
Tague, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information retrieval experiment, 14, 59-102. Retrieved June 11, 2005 from http://portal.acm.org/citation.cfm?id=149514

Monday, 10 December 2012

SOUL ......A library software

Software for University Libraries (SOUL) is a state-of-the-art integrated library management software designed and developed by the INFLIBNET Centre based on the requirements of college and university libraries. It is user-friendly software developed to work in a client-server environment. The software is compliant with international standards for bibliographic formats, networking and circulation protocols. After a comprehensive study, discussions and deliberations with senior professionals of the country, the software was designed to automate all housekeeping operations in a library. The software is suitable not only for academic libraries but also for libraries of all types and sizes, even school libraries. The first version of the software, SOUL 1.0, was released during CALIBER 2000.

The latest version of the software, SOUL 2.0, was released in January 2009. The database for the new version of SOUL is designed for the latest versions of MS-SQL and MySQL (or any other popular RDBMS). SOUL 2.0 is compliant with international standards such as the MARC 21 bibliographic format, Unicode-based Universal Character Sets for multilingual bibliographic records, and NCIP 2.0 and SIP 2 based protocols for electronic surveillance and control.

Major Features and Functionalities

UNICODE based multilingual support for Indian and foreign languages;
Compliant with international standards such as MARC21, AACR-2, MARCXML;
Compliant with the NCIP 2.0 protocol for RFID and other related applications, especially for electronic surveillance and self check-out and check-in;
Client-server based architecture, user-friendly interface that does not require extensive training;
Supports multiple platforms for the bibliographic database, such as MySQL, MS-SQL or any other RDBMS;
Supports cataloguing of electronic resources such as e-journals, e-books and virtually any other type of material;
Supports the requirements of a digital library and facilitates links to full-text articles and other digital objects;
Supports online copy cataloguing from MARC21-supported bibliographic databases;
Provides default templates for data entry of different types of documents. Users can also customize their own data entry templates for different types of documents;
Provides freedom to users for generating reports of their choice and format along with template and query parameters;
Supports ground-level practical requirements of the libraries such as stock verification, book bank, vigorous maintenance functions, transaction level enhanced security, etc.;
Provides facility to send reports through e-mail, allows users to save the reports in various formats such as Word, PDF, Excel, MARCXML, etc.;
Highly versatile and user-friendly OPAC with simple and advanced search. OPAC users can export their search results to PDF, MS Excel and MARCXML formats;
Supports authority files of personal name, corporate body, subject headings and series name;
Supports data exchange through ISO-2709 standard;
Provides simple budgeting system and single window operation for all major circulation functions;
Strong region-wise support for maintenance through regional coordinators. Strong online and offline support by e-mail, chat and through dedicated telephone line during office hours; and
Available at an affordable cost with strong institutional support.

The latest version is 2.0. To download it, click the link below:
Download SOUL 2.0 demo

To download the earlier version of SOUL, click the two links below (two setup files):
Download SOUL 1.0 setup
Download SOUL 1.0 database

ACRONYMS AND ABBREVIATIONS

ABET	Adult Basic Education and Training
ACE	Advanced Certificate in Education
ACNL	Advisory Committee on the National Libraries
ACRL	Association of College and Research Libraries
ACTAG	Arts and Culture Task Group
AGM	Annual General Meeting
ALASA	African Library Association of South Africa
ANC	African National Congress
BLINDLIB	South African Library for the Blind
CALICO	Cape Library Consortium
CEO	chief executive officer
CEPD	Centre for Education Policy Development
CEPD	Continuing Education and Professional Development
CHE	Council on Higher Education
CHEC	Cape Higher Education Consortium
CHELSA	Committee for Higher Education Librarians in South Africa
CICD	Centre for Information Career Development
CNENSA	Cape Non-European Night Schools Association
CNLSA	Conference of National Librarians of Southern Africa
COLIS	Community Library and Information Services
COMLA	Commonwealth Library Association
COSALC	Coalition of South African Library Consortia
DAC	Department of Arts and Culture
DACST	Department of Arts, Culture, Science and Technology
EAC	East African Community
ECHEA	Eastern Cape Higher Education Association
EFA	Education for All
ELITS	Education Library Information and Technology Services
ESAL	Eastern Seaboard Association of Libraries
ESATI	Eastern Seaboard Association of Tertiary Institutions
ESI	Ecole des Sciences de l’Information
ETD	electronic theses and dissertations
FOTIM	Foundation of Tertiary Institutions of the Northern Metropolis
FPASA	Fire Protection Association of Southern Africa
FRELICO	Free State Library and Information Consortium
FSHETT	Free State Higher and Further Education and Training Trust
FULSA	Forum for University Librarians in South Africa
GAELIC	Gauteng and Environs Library Consortium
GCIS	Government Communication and Information System
GIS	geographic information system
HAI	historically advantaged institution
HDI	historically disadvantaged institution
HEI	higher education institution
HEQC	Higher Education Quality Committee
HELIG	Higher Education Libraries Interest Group
HSRC	Human Sciences Research Council
IASL	International Association of School Librarianship
ICT	information and communication technology
ICTLIG	Information and Communications Technology in Libraries
IDRC	International Development Research Centre
IFLA	International Federation of Library Associations and Institutions
IGBIS	Interest Group for Bibliographic Standards
ILE	information literacy education
INASP	International Network for the Availability of Scientific Publications
IRC	information resource centre
ISAP	Index to South African Periodicals
ISASA	Independent Schools Association of South Africa
ISBN	International Standard Book Number
ITU	International Telecommunication Union
IULC	Inter-University Library Committee
IWG	Interministerial Working Group
JCM	Joint Catalogue of Monographs
KZN	KwaZulu-Natal
LACIG	Library Acquisitions Interest Group
LIASA	Library and Information Association of South Africa
LIS	library and information science/services/studies
LISDESA	Libraries and Information Services in Developing South Africa
LISSCO	Library and Information Services of Science Councils
LIWO	Library and Information Workers Organisation
MEDLIG	Medical Libraries Interest Group
MiET	Media in Education Trust
MPCC	Multipurpose Community Centre
NAC	National Arts Council
NACLI	National Advisory Council for Libraries and Information
NARS	National Archives and Records Service
NCLIS	National Council for Library and Information Services
NECC	National Education Coordinating Committee
NEPAD	New Partnership for Africa’s Development
NEPI	National Education Policy Investigation
NGO	non-governmental organisation
NHC	National Heritage Council
NITF	National Information Technology Forum
NLAC	National Library Advisory Council
NLSA	National Library of South Africa
NOHIM	National Oral History and Indigenous Music
NRF	National Research Foundation
NTCA	National Telecommunications Cooperative Association
OBE	outcomes-based education
OCLC	Online Computer Library Centre
OPAC	Online Public Access Catalogue
OPD	Official Publications Depository
OSALL	Organisation of South African Law Libraries
PACLIG	Public and Community Libraries Interest Group
PaCLISA	Public and Community Libraries Inventory of South Africa
PASA	Publishers’ Association of South Africa
PICC	Print Industries Cluster Council
PISAL	Periodicals in Southern African Libraries
PiT	Public Information Terminal
RDP	Reconstruction and Development Programme
RETIG	Research, Education and Training Interest Group
RFID	radio frequency identification device
RNCS	Revised National Curriculum Statement
SABA	South African Booksellers’ Association
SABEC	Southern African Book Exchange Centre
SABIB	South African Bibliography
SABINET	South African Bibliographic and Information Network
SaCAT	South African National Catalogue
SACMEQ	Southern and Eastern African Consortium for Monitoring Educational Quality
SADC	Southern African Development Community
SAHRA	South African Heritage Resources Agency
SAILIS	South African Institute for Library and Information Science
SAIS	Southern African Interlending Scheme
SALA	South African Library Association
SALC	South African Library Conference
SALLP	South African Library Leadership Programme
SANB	South African National Bibliography
SANCB	South African National Council for the Blind
SANLIC	South African National Library Consortium
SANRIC	South African National Research Information Consortium
SAOUG	Southern African Online User Group
SASLI	South African Site Licensing Initiative
SCECSAL	Standing Conference of East, Central and Southern Africa
SCHELIS	Standing Committee of Heads of Education Library and Information Services
SEALS	South Eastern Alliance of Library Systems
SLA	Special Libraries Association
SLIG	Special Libraries Interest Group
SLIS	Special Libraries and Information Services
SLYSIG	School Library and Youth Services Interest Group
SOMAFCO	Solomon Mahlangu Freedom College
TAAA	Together with Africa and Asia Association
TEC	Transitional Executive Committee
UCT	University of Cape Town
UCTD	Union Catalogue of Theses and Dissertations
ULIS	Unification of Library and Information Stakeholders
UNESCO	United Nations Educational Scientific and Cultural Organisation
UNISA	University of South Africa
UP	University of Pretoria
USA	United States of America
USA	Universal Service Agency
USAASA	Universal Service and Access Agency of South Africa
UWC	University of the Western Cape
UKZN	University of KwaZulu-Natal
WGNL	Working Group on the National Libraries of South Africa
WLIC	World Library and Information Congress