All of this has already been posted on the koha-devel list. To share our work on a new indexing engine (SolR) for Koha, and our expectations for it, with a wider audience, we thought it would be good to gather those thoughts here. Thanks to the libraries that are supporting our approach.
Why doesn’t Zebra suit us anymore?
- There is no community around it; it is open source software from Indexdata.
- no real-time indexing: relying on a crontab is poor. When you add an authority while creating a biblio, you have to wait a few minutes before you can finish your biblio. (This could be solved, since Zebra has a way to index biblios via Z39.50 extended services, but it is hard and would need testing.)
- no way to access/process/delete data easily. If someday you have to reindex everything and there is a problem in your data (encoding, for instance), nothing will be indexed. If a problem in the data occurs while a file is being indexed, zebraidx just fails silently, and this is NOT safe. You also have no way to know WHICH biblio made the process crash.
- we use a deprecated way to define indexes for biblios (GRS-1), and the tool developed by Indexdata to switch to DOM has many flaws. It would also require some changes in the indexing method to get it to work. But if anyone has something done and working right away, it would be great.
- The ISO 2709 standard defines a size limit for a record. The Zebra indexing process uses ISO 2709 to serialize records before indexing. XML can be used instead, but it causes encoding issues. So serials with a lot of items cannot be indexed in Zebra.
- CCL, CQL and PQF are not intuitive.
- Facets do not exist: they are calculated by Koha, but they are wrong (based on the current page only). A patch was sent that computes them over 500 records, but these are still not true facets. If we wanted to use Zebra facets, there would be problems with diacritics, and this can’t be solved as of today.
- When using Zebra with ICU (used a lot in France) we lose 2 important features: left truncation and fuzzy search.
- All in all, many features are missing at the moment, namely fuzzy search (not working at all with ICU enabled and, without ICU, still limited to a single mistyped letter), true facet search, stemming, synonym search, metaphone search… Solr can do all of these.
- Other problems:
  - Zebra config files are a nightmare. You can’t drive the configuration easily: you can’t edit indexes via HTTP or through configuration. Everything is hardcoded in files on disk, so you can’t list, change or edit indexes, and you can’t simply choose whether an index is shown in the OPAC or the intranet. (It could be done by scraping ccl.properties, then record.abs and bib1.att… but what a HELL.) So you cannot easily customize the configuration to define the indexes you want.
  - Memory problems with ZOOM and Zebra on persistent connections
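To make the ISO 2709 size limit concrete, here is a minimal Python sketch. The limit itself comes from the format: the record length is stored in the first five characters of the leader as decimal digits, so a record cannot exceed 99999 bytes. The byte sizes used below (`BASE_RECORD_BYTES`, `BYTES_PER_ITEM`) are invented for illustration and are not measured from real Koha data.

```python
# ISO 2709 stores the total record length in leader positions 0-4,
# as five decimal digits, so a record cannot exceed 99999 bytes.
MAX_ISO2709_LENGTH = 99_999

# Assumed sizes, for illustration only (not measured on real records).
BASE_RECORD_BYTES = 2_000  # hypothetical bibliographic part of the record
BYTES_PER_ITEM = 150       # hypothetical size of one embedded item


def fits_in_iso2709(item_count: int) -> bool:
    """Return True if a record carrying `item_count` items stays within the limit."""
    total = BASE_RECORD_BYTES + item_count * BYTES_PER_ITEM
    return total <= MAX_ISO2709_LENGTH


def max_items() -> int:
    """Largest number of items that still fits, under the assumed sizes."""
    return (MAX_ISO2709_LENGTH - BASE_RECORD_BYTES) // BYTES_PER_ITEM


print(max_items())           # item count at which the record stops fitting
print(fits_in_iso2709(100))  # a small serial fits
print(fits_in_iso2709(1000)) # a long-running serial does not
```

With these assumed sizes, a serial stops fitting after a few hundred items, which is why long-running serials are the records that hit the limit.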
What does SolR bring?
- Software status:
  - Big community, open source, free, based on Lucene, widely used, an Apache-driven project
  - Very well documented and supported by the community
- Types configuration: a standard Solr installation provides analyzers and tokenizers out of the box. Indexing with tokenizers: when a document is indexed, fields are analyzed and tokenized to transform and normalize the data. Querying with the help of analyzers: an analyzer pre-processes text at index or search time so that text can be found (finding words regardless of case, finding synonyms).
- Index names are clear and customizable
- Dynamic fields make it easier to drive the configuration from the database
- Can index non-MARC resources: full-text PDF and document indexing.
- SolR wrappers exist in a lot of languages, so interconnecting applications is made easier
- No more zebra queue cronjob, thanks to the ability to update documents on the fly. This could surely be done with Zebra extended services, but some problems were raised when it was first developed and tested (with concurrent access), and they would still have to be solved.
- Faceted search provides a better user search experience.
- Fuzzy search
- The administration web console makes it easy to test searches
- The query syntax is intuitive and, if you really want to, you can write your own
- Advanced language searching: stemming (a process that finds words sharing the same stem) and metaphone search (an algorithm for indexing and finding words by pronunciation).
- Wildcards (* or ?) for right or left truncation
- Proximity search
- Keyword boosting
- Synonym search
- Spelling suggestions
- Search documents by similarity (“related items”)
- Relevance ranking gives more relevant results
- Diacritics and UTF-8 search problems are solved out of the box.
- Ability to use a multi-core install: one SolR instance, multiple databases
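Several of the search features listed above map directly onto the standard Solr/Lucene query syntax. The Python sketch below only builds query URLs; it does not contact a server. The host, port, core name (`biblio`) and field names (`title`, `author`) are hypothetical and would depend on the actual installation and index configuration.

```python
from urllib.parse import urlencode

# Assumed URL of a Solr core; host, port and core name are hypothetical.
SOLR_SELECT_URL = "http://localhost:8983/solr/biblio/select"

# Field names ("title", "author") are hypothetical index names.
EXAMPLE_QUERIES = {
    "fuzzy": "title:tolkein~",                  # tolerate misspellings
    "right_truncation": "title:librar*",        # any number of trailing characters
    "single_char_wildcard": "author:sm?th",     # exactly one unknown character
    "proximity": 'title:"open library"~5',      # both words within 5 positions
    "boosting": "title:koha^4 OR author:koha",  # title matches weigh more
}


def build_select_url(query, rows=10, facet_field=None):
    """Build a select URL for `query`, optionally asking for facets on one field."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field is not None:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return SOLR_SELECT_URL + "?" + urlencode(params)


for name, query in EXAMPLE_QUERIES.items():
    print(name, "->", build_select_url(query, facet_field="author"))
```

Because the search is a plain HTTP request, any language with an HTTP client can talk to the index, which is the point made above about wrappers and interconnecting applications.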
Status of the SolR development at Biblibre:
I. State of the Proof Of Concept
1/ Replace record indexing and querying, handled by Zebra today, with Solr, using simple types (string, text, date, integer) and basic analyzers and tokenizers
2/ Manage biblio and authority indexes and mappings in an administration page (the configuration is stored in the database)
- 08/2010 development began and the first search results appeared
- code search base (Search.pm etc.) is melding as of 70%
- 08/2010 new admin pages “indexes.pl” and “indexmappings.pl”
- 08/2010 support for functional indexing plugins (real-time availability indexing, for example)
- 09/2010 WIP: throughout the development, we are refactoring the Koha code related to Search
- 10/2010 a proof of concept (POC) was deployed and demoed at KohaCon 2010 for feedback
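To illustrate what the "basic analyzers and tokenizers" of point 1/ do to a field, here is a toy analysis chain in plain Python. It is not Solr code and not the proof of concept itself: in Solr these steps are configured per field type rather than written by hand. The synonym table and the function names are invented for the example.

```python
import re
import unicodedata

# Hypothetical synonym table, applied at index time only.
SYNONYMS = {"periodical": ["serial"]}


def fold_diacritics(text):
    """Remove accents: 'Bibliothèque' becomes 'Bibliotheque'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, expand_synonyms=False):
    """Fold diacritics, lowercase, split into word tokens, optionally add synonyms."""
    tokens = []
    for token in re.findall(r"\w+", fold_diacritics(text).lower()):
        tokens.append(token)
        if expand_synonyms:
            tokens.extend(SYNONYMS.get(token, []))
    return tokens


def matches(query, field_value):
    """True if every analyzed query token is among the indexed tokens."""
    indexed = set(analyze(field_value, expand_synonyms=True))
    return all(token in indexed for token in analyze(query))


print(analyze("Bibliothèque Française"))
print(matches("bibliotheque", "La Bibliothèque nationale"))
print(matches("serial", "A Periodical about Koha"))
```

Running the same normalization on both the indexed value and the query is what lets a search succeed regardless of case or accents.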
II. Refining needs and development enhancements
- 11/2010 types management enhancements (fine tuning still in progress)
- 11/2010 Solr allows a multi-core install, which makes it possible to manage multiple sites with one Solr installation
- Work in progress 11/2010 searching on different forms, such as rejected, associated and heading
- Work in progress 11/2010 new Z39.50 server development on top of SolR, using the Net::Z3950::SimpleServer module from Indexdata. Simple search is already done. Complex queries are a work in progress.
- Work in progress 12/2010 updating installation scripts (database, Solr configuration, scripts…)
- improving indexing time (it’s too long)
- refining types configuration
- dead code cleaning
- development of a test suite (functional scenarios, unit tests, regression tests zebra vs solr)
- updating xslt files
- providing an abstraction layer so that Zebra support can be maintained
- updating advanced search forms support
- displaying a tag cloud to represent facets and weights
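As a rough idea of how a tag cloud could be derived from facets, the Python sketch below turns facet counts into font sizes with a simple linear scale. It assumes the counts arrive as a flat `[value, count, value, count, ...]` list, as in Solr's default JSON output for field facets. The sample data, the size range and the function names are made up for illustration.

```python
# Made-up facet counts for a hypothetical "subject" facet, in the flat
# [value, count, value, count, ...] layout assumed above.
SAMPLE_FACET = ["history", 120, "science", 60, "poetry", 20]


def pair_up(flat):
    """Turn [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[::2], flat[1::2]))


def tag_cloud_sizes(facets, min_size=10, max_size=30):
    """Map each facet value to a font size, scaled linearly by its count."""
    if not facets:
        return {}
    counts = [count for _, count in facets]
    lowest, highest = min(counts), max(counts)
    span = highest - lowest
    sizes = {}
    for value, count in facets:
        # When all counts are equal, every tag gets the maximum size.
        ratio = (count - lowest) / span if span else 1.0
        sizes[value] = round(min_size + ratio * (max_size - min_size))
    return sizes


print(tag_cloud_sizes(pair_up(SAMPLE_FACET)))
```

A logarithmic scale may look better when a few facet values dominate; the linear version is only the simplest choice.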
We really want to work with you on this topic, and hdl should organize something as soon as possible. Stay tuned!