This error is called under-stemming, as the stemmer strips less than expected. On the other hand, a stemmer makes an over-stemming error when it strips more terminations than it should, removing parts of the word that belong to the morphological root. When this error occurs, semantic information is lost because part of the morphological root is deleted.
Porter goes beyond these definitions and divides this type of error into two cases (Porter): over-stemming, when the stripped suffix changes the meaning of the stem (that is, the removed termination is a suffix, but one belonging to the root), and mis-stemming, when the removed termination is not a suffix but part of the root.
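The three error types can be illustrated with a deliberately naive suffix stripper. This is only a sketch: the rule lists `LIGHT` and `HEAVY` and the function `strip_suffixes` are hypothetical and do not reproduce any of the published stemmers.

```python
def strip_suffixes(word, suffixes):
    """Remove the longest matching suffix, with no further conditions (toy rule)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


# Hypothetical rule sets, chosen only to provoke the errors discussed above.
LIGHT = ["s"]                            # weak: handles plurals only
HEAVY = ["s", "al", "ity", "ing", "e"]   # strong: strips aggressively

# Under-stemming: related words keep different stems.
under = {w: strip_suffixes(w, LIGHT) for w in ["connect", "connections"]}

# Over-stemming: words with different meanings collapse to the same stem.
over = {w: strip_suffixes(w, HEAVY) for w in ["universal", "university", "universe"]}

# Mis-stemming: "ing" in "sing" is part of the root, not a suffix.
mis = strip_suffixes("sing", HEAVY)
```

With the light rule set, "connect" and "connections" end up with different stems, whereas the heavy rule set maps "universal", "university" and "universe" to one stem and reduces "sing" to a single letter.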
Both errors imply a decrease in the performance of the stemmer, but in different ways. When a word is under-stemmed, it becomes more difficult to identify whether two words are related based on their morphology, as their stems are mostly equal, but not totally, because of a suffix that has not been deleted. The opposite case takes place when over-stemming occurs.
In this case the problem is that it becomes more likely that two words that are not related but share part of their morphological root are wrongly detected as related because their stems are equal. Some solutions have been proposed to reduce these errors as much as possible. For over-stemming, most of the proposed stemmers set a minimum size constraint on the resulting stem.
This means that a term is stemmed only if its stem has a minimum length, typically two or three letters. By doing this, the algorithm avoids deleting or modifying the termination of a term that matches one of the suffixes of the list but is not actually a suffix. It is therefore essential to define the stemming rules as carefully as possible, so that they cover most cases and prevent uncontrolled situations (Dawson; Porter; Paice). In the other case, the under-conflation provoked by under-stemming can be mitigated by applying partial-match algorithms (Dawson; Lovins), which determine that two stems are related if the similarity between their morphology is over a defined threshold.
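Both remedies can be sketched in a few lines. The code below is an illustration under stated assumptions, not the procedure of Dawson or Lovins: the function names, the default minimum length of three letters and the shared-prefix similarity with a threshold of 0.8 are all hypothetical choices.

```python
def stem(word, suffixes, min_stem_length=3):
    """Strip the longest matching suffix only if the remaining stem is long enough."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        candidate = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(candidate) >= min_stem_length:
            return candidate
    return word  # no rule applies, or every candidate stem would be too short


def partial_match(stem_a, stem_b, threshold=0.8):
    """Treat two stems as related when their shared prefix covers enough of the
    longer stem. The prefix ratio is an assumed similarity measure."""
    if not stem_a or not stem_b:
        return False
    shared = 0
    for a, b in zip(stem_a, stem_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(stem_a), len(stem_b)) >= threshold
```

Here `stem("sing", ["ing"])` leaves the word untouched because the candidate stem would have a single letter, while `partial_match("comput", "compute")` relates two stems that an under-stemming rule set failed to merge. Lowering the threshold conflates more pairs at the risk of introducing new over-stemming errors.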
Even if these solutions are useful only in some cases and can sometimes introduce new errors, overall they yield good results. Their adoption depends on the needs of the task: if conflation is an important aspect, even at the expense of the thematic precision of the resulting groups, then solutions against under-stemming will be adopted; conversely, if it is preferable to have groups that are perfectly classified according to their topic, even if they are small and lack some related documents, then over-stemming errors should especially be addressed.
Typically, a trade-off between the two approaches is desired, even if it increases the difficulty of the process. Based on the errors defined above, a classification has been proposed to label a stemmer according to the type of error it most commonly makes. The strength of a stemmer is defined as the aggressiveness with which the stemmer removes the terminations of the terms, and it depends on the rates of over-stemming and under-stemming of the stemmer.
A strong or heavy stemmer has a high rate of over-stemming errors, while a weak or light stemmer is characterized by a high under-stemming rate. Paice proposes some metrics to evaluate a stemmer regardless of the task carried out: the under-stemming index (UI), the over-stemming index (OI), the stemming weight (SW) and an error rate relative to truncation (ERRT).
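As an illustration, UI, OI and SW can be computed from a list of concept groups, that is, sets of words that a perfect stemmer would conflate. The sketch below assumes the commonly cited pair-counting formulation (unachieved merges over desired merges for UI, wrong merges over desired non-merges for OI, and SW as the ratio OI/UI); the function name and the input format are hypothetical, each word is assumed to belong to exactly one group, and ERRT is omitted.

```python
from collections import Counter, defaultdict


def paice_indices(concept_groups, stemmer):
    """Return (UI, OI, SW) for a stemmer over groups of words that should conflate."""
    total_words = sum(len(group) for group in concept_groups)
    desired_merges = desired_non_merges = unachieved_merges = 0.0
    stem_groups = defaultdict(Counter)  # stem -> how many words of each concept group

    for group_id, group in enumerate(concept_groups):
        n = len(group)
        desired_merges += 0.5 * n * (n - 1)
        desired_non_merges += 0.5 * n * (total_words - n)
        stems = Counter(stemmer(word) for word in group)
        # Pairs inside the concept group that ended up with different stems.
        unachieved_merges += 0.5 * sum(u * (n - u) for u in stems.values())
        for s, count in stems.items():
            stem_groups[s][group_id] += count

    wrong_merges = 0.0
    for members in stem_groups.values():
        n_s = sum(members.values())
        # Pairs sharing a stem although they come from different concept groups.
        wrong_merges += 0.5 * sum(v * (n_s - v) for v in members.values())

    ui = unachieved_merges / desired_merges if desired_merges else 0.0
    oi = wrong_merges / desired_non_merges if desired_non_merges else 0.0
    sw = oi / ui if ui else float("inf")
    return ui, oi, sw


groups = [["divide", "division", "divided"], ["divine", "divinity"]]
no_stemming = paice_indices(groups, lambda w: w)         # weakest possible stemmer
truncate_3 = paice_indices(groups, lambda w: w[:3])      # very strong truncation
```

With no stemming at all every desired merge is missed and nothing is wrongly merged, whereas truncation to three letters merges every word in this sample, which matches the intuition that light stemmers sit at the under-stemming end and heavy stemmers at the over-stemming end.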
They also compute the strength of the "S" stemmer, which, as expected, is much weaker than Porter. The "S" stemmer deals only with plural forms and has been proposed by Harman as a baseline for the evaluation and comparison of stemmers. The simple truncation of a fixed number of letters has also been used in many cases as a baseline algorithm for comparisons (Braschler and Ripplinger; Paice). Table 1 summarizes the main features of the presented stemmers and their strength.
However, these metrics do not measure the accuracy of a stemmer; they only classify it according to its typical errors. To evaluate the accuracy of the stemming process, two metrics introduced by Kent et al. are used: recall and precision. These metrics have been widely adopted in information retrieval as the standard metrics to evaluate the accuracy of the process.
The first metric, recall, reflects the rate of relevant documents that are obtained as an answer to a query, while precision represents the rate of the retrieved documents that are relevant to the query. In other words, when the stemmer tends to group as many related documents as possible, even if other non-related documents are also included in the same group, the recall is high, while if the stemmer builds groups with as few non-related documents within a group as possible, even if some related documents are possibly not included in it, the precision is high.
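The two definitions translate directly into set operations. The following is a minimal sketch; the function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant in the collection, two in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
```

In this example half of the retrieved documents are relevant (precision) and two of the three relevant documents are found (recall). Retrieving more documents can only keep or raise recall, but it usually lowers precision.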
These measures have allowed some authors to link performance in terms of recall and precision to the strength of the stemmer, and then to over-stemming and under-stemming (Frakes and Fox; Paice; Harman). Their conclusions are that strong stemmers will, in general, increase the recall of the results, as they make it more probable that two words belonging to the same concept get conflated together, but they also decrease the precision, as a higher number of non-related words also get conflated together because of over-stemming. Accordingly, they confirm that weak stemmers are better at conflating only related words, thus increasing the precision, but are more likely to fail to conflate related words because of under-stemming, therefore decreasing the recall.
Another way to evaluate an algorithmic stemmer is through its conflation rate, also called the Index Compression Factor (ICF), which defines how much the stemming process compresses the input vocabulary. It therefore indicates how much stemming reduces the storage capacity needed and increases the efficiency of the information retrieval system, which has to deal with a smaller dictionary. Porter states that his stemmer reduces the initial vocabulary by a third.
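The compression can be sketched as the fraction of the vocabulary removed by stemming, (N - S) / N, where N is the number of distinct words and S the number of distinct stems. The function name, the truncating stemmer and the word list below are hypothetical.

```python
def index_compression_factor(words, stemmer):
    """Fraction by which stemming shrinks the vocabulary: (N - S) / N."""
    vocabulary = set(words)
    if not vocabulary:
        return 0.0
    stems = {stemmer(word) for word in vocabulary}
    return (len(vocabulary) - len(stems)) / len(vocabulary)


sample = ["connect", "connected", "connection", "universe", "university", "unit"]
# Truncation to four letters stands in for a real stemmer here.
icf = index_compression_factor(sample, lambda w: w[:4])
```

Six distinct words are reduced to three distinct stems, so half of the vocabulary is compressed away in this sample. A reduction of the vocabulary by a third corresponds to an ICF of about 0.33.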
Many experiments have proven that the vocabulary compression depends on the strength of the stemmer, as can be seen in Table 2. Lennon, et al.
Many experiments have been carried out to rank the stemmers according to their performance. Harman compares, in terms of recall and precision, the behaviour of an "S" stemmer, the Porter stemmer and the SMART system (Salton), which implements an enhanced version of the Lovins stemmer. In the experiment, an information need is posed through a query and the system returns a ranked list of documents related to the query.
Stemming is applied both to the documents and the query, and the evaluation is done with respect to a simple classic ranking technique (which returns an ordered list of documents whose vocabulary best matches the terms of the query) and a more sophisticated one using term weighting and query expansion. Frakes also carries out two experiments to evaluate the accuracy of a system based on the Lovins stemmer in terms of a metric called E (Van Rijsbergen), which considers both recall and precision, and concludes that the performance is mostly the same whether the conflation is done manually by experts or automatically using a stemmer.
Another experiment is undertaken by Hull over five stemmers: "S", Porter, Lovins, and the Xerox inflectional and derivational analysers (Xerox). In fact, Hull bases this conclusion on the results of three measures: the average precision at eleven points of recall (APR11), the average precision at five to ten documents examined (AP) and the average recall at fifty to one hundred and fifty documents examined (AR). The first measure is calculated by averaging the precision obtained for each of the defined levels of recall that is when recall is 0. This metric is one of the best known in information retrieval and allows the interdependence between recall and precision to be analysed.
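The eleven-point measure can be sketched as follows. Two assumptions are made that the text does not state: the recall levels are taken at the conventional values 0.0, 0.1, up to 1.0, and the precision at each level is interpolated as the highest precision observed at any recall greater than or equal to that level. The function name and the sample ranking are hypothetical.

```python
def eleven_point_average_precision(ranking, relevant):
    """Average of interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0

    # (recall, precision) observed after each position of the ranked list.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))

    levels = [i / 10 for i in range(11)]  # assumed standard recall levels
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


# Relevant documents appear at ranks 1 and 3 of a four-document ranking.
apr11 = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

A ranking that places every relevant document first obtains a score of 1.0, and the score drops as relevant documents appear lower in the list, which is how the measure ties precision to each level of recall.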
The other two metrics calculate respectively the average precision and the average recall applying stemming to increasing numbers of documents. In his experimentation, Hull notes that both APR11 and AP measures remain mostly the same with or without stemming, while AR increases when stemming is applied. Krovetz also confirms that the use of stemming brings a big enhancement in terms of performance and especially of recall when the documents are short.
In addition, Frakes and Fox propose a measure of the similarity between two stemmers, obtained by comparing the results they return. This measure allows a new stemmer to be classified easily according to the known stemmer it most resembles. The similarity is calculated on the basis of the inverse modified Hamming distance. Analysing the similarity measures of all the possible pairs of stemmers, the authors conclude that the similarity between algorithms depends on the similarity of their strength. Figure 1 sums up the features presented above for four of the main classical approaches to stemming.
This summary reinforces the idea that all the features depend on the strength of the stemmers and then on the rate of over-stemming and under-stemming errors. To evaluate stemmers and obtain these measures, a large collection of documents is needed. Over recent years, many collections have been proposed and many of them have become reference sets to test new stemmers and, more generally, to evaluate new approaches to any step of the information retrieval process.
Table 3 shows which corpora of documents have been used in the bibliography cited in this paper. In turn, Table 4 lists the most widely known and used collections, and their main features. Some researchers (Krovetz; Hull; Harman) relate the performance of a stemmer to the collection used as input. In fact, they all confirm after their experimentation that stemmers perform better, mainly in terms of precision, when the documents are short (less than words per document).
This can be explained because the probability that terms used in the query appear in the same form in the documents decreases as the vocabulary is reduced, and then the conflation obtained by the stemming allows detecting more related terms. This hypothesis also applies for the length of the query. Although many of the stemmers are aimed at and are evaluated with the English language, some researchers have modified existing approaches or proposed new ones to handle other languages. According to linguistics, languages can be divided into two main categories depending on their morphological structure: analytic languages and synthetic languages.
The second category can be divided, in turn, into three subcategories, namely agglutinative languages, fusional or inflecting languages and polysynthetic languages (Comrie; Pirkola; Aikhenvald et al.). However, this classification is an idealization, as all languages belong to more than one category, even if they generally fit better into one of them. In fact, two continuous variables have been proposed to indicate to what extent a language belongs to one type or another: the index of synthesis and the index of fusion (Whaley). The index of synthesis describes the extent of morphological synthesis, that is, how much the words of a language are affixed.
At one extreme are the most analytic languages (also called isolating languages), where all morphemes are free morphemes (the language tends to have no morphology at all), while at the other extreme, the most polysynthetic languages tend to have sentences consisting of a single complex word formed by many morphemes. The second variable, the index of fusion, describes how easy it is to split the words into morphemes.
On this scale, one extreme would be the agglutinative languages, whose words can be easily separated into clearly identifiable morphemes, while at the other extreme would be the fusional languages, where words are formed by morphemes that are not clearly identifiable. Using these indexes, Table 5 provides the prevailing morphological type of some of the languages for which stemmers have been developed. This diversity of languages and types implies there is also diversity in the problems and challenges, which sometimes need to be handled by new non-classical approaches, as the classical ones lead to results that are not as accurate and efficient as in English (Kettunen). In fact, even if modern English can be considered an analytic language, it is also a weakly inflective language due to its heritage of Old English, which was a fusional language (Meyer). This means it is much easier to obtain the stems of the words than in other very isolating languages like Mandarin Chinese or Indonesian.
In fact, while stemming in English is only concerned with suffix stripping, stemming in Indonesian must consider and remove a wider range of affixes, like prefixes, infixes and confixes (combinations of prefixes and suffixes), as well as suffixes.
Many of the proposed algorithms for stemming Indonesian words use well-known elements of classical stemmers, like dictionaries, lists of rules and recoding (Asian, Williams and Tahaghoghi). In particular, Turkish has approximately 23, stems, and words are formed depending on their grammatical function, which is indicated by adding suffixes to stems. This results in a theoretically infinite number of words that can be created (Hankamer).