Edit setup

Name or notes

Optional; used only for management purposes

Notes

User interface

Grammar (stemming)

You can enable all document-language-dependent grammars, because only the grammar of each document's own language is applied to it. If your documents are only or mainly in one language, you should force that grammar, so it is applied even if the document language is not recognized correctly.

Use grammar (dependent on document language)

Use grammar heuristic (stemming) if the document language was autodetected

Force grammar (independent of document language)

Force grammar(s) / stemmers for all documents (even those in another language)
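The difference between the two options above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the application's code; the function name `select_stemmers` and its parameters are hypothetical.

```python
def select_stemmers(doc_language, enabled, forced):
    """Return the languages whose stemmer would run for one document.

    enabled: languages applied only when they match the detected language.
    forced:  languages applied to every document, whatever was detected.
    (Hypothetical helper, assumed for illustration only.)
    """
    selected = list(forced)
    if doc_language in enabled and doc_language not in selected:
        selected.append(doc_language)
    return selected


# A German document with English and German enabled, nothing forced.
print(select_stemmers("de", enabled={"en", "de"}, forced=[]))
# Language detection failed (None), but English is forced.
print(select_stemmers(None, enabled={"en", "de"}, forced=["en"]))
```

In the second call no language was detected, so only the forced grammar applies, which is why forcing is recommended for collections that are mostly in one language.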

In addition (which results in a bigger index) or as an alternative, you can enable the Hunspell stemmer, which stems using grammar rules and a dictionary:

Use Hunspell stemmer (dependent on document language)

Use Hunspell rules & dictionary if the document language was autodetected

Force Hunspell stemmer (independent of document language)

Force Hunspell language rules & dictionaries for all documents (even those in another language)

OCR dictionaries

Dictionaries to use for optical character recognition of scanned documents

OCR of scanned documents and images

Extract & find text from scanned or photographed documents by optical character recognition (OCR) of images

OCR

Extract text from images by optical character recognition

OCR of images in PDF documents

Extract text from images embedded in PDF documents (e.g. scanned documents)

Deskew

Additionally deskew images before OCR, which sometimes improves OCR results for poor-quality scans.

Warning: This multiplies the time needed for OCR, but for some poor-quality scans more text can be recognized.

Priority

OCR takes up most of the indexing time, often for only a little additional information. You can therefore index all documents without OCR first to get a usable index many times faster, and run OCR later by reindexing with OCR enabled. This way you can search and find most documents much earlier, while the OCR analysis runs later in the background.
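The two-pass order described above can be sketched as follows. This is a minimal illustration, not the application's indexer; `index_in_two_passes` and the `index_document(doc, ocr=...)` callback are hypothetical names assumed for the example.

```python
def index_in_two_passes(documents, index_document):
    """Index everything without OCR first, then reindex with OCR enabled.

    index_document is a hypothetical callback taking (doc, ocr=bool).
    """
    # Pass 1: fast, makes most documents searchable early.
    for doc in documents:
        index_document(doc, ocr=False)
    # Pass 2: slow, adds text recognized in scans and images.
    for doc in documents:
        index_document(doc, ocr=True)


# Record the call order instead of doing real indexing.
calls = []
index_in_two_passes(["a.pdf", "b.pdf"], lambda doc, ocr: calls.append((doc, ocr)))
print(calls)
```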

OCR dictionaries

Set up the language-specific OCR dictionaries in the tab Document language(s)

Named Entity Recognition (NER) by machine learning

Named Entity Recognition needs a lot of CPU resources while indexing documents, but it automatically extracts many entities such as persons, organizations or places (that are not configured in your thesaurus or ontologies) for interactive filters & enhanced analysis.

SpaCy NER

Extract named entities by SpaCy NER

Stanford NER

Alternatively (or additionally, to recognize some entities that SpaCy did not recognize) you can use the NER models of another NER framework, Stanford NER

Warning: Slows down the indexing of documents!

Segmentation to pages

Segmenting PDF files into single pages lets you find and analyze PDF documents at a finer granularity, page by page.

Warning: Will double indexing time and index size!

Segmentation to sentences

Segmenting the document text into single sentences lets you find and analyze entities at a finer granularity, by searching and filtering for entities that occur together in the same sentence(s).

Warning: Will double indexing time and index size!
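What "entities that occur together in the same sentence" means can be shown with a small Python sketch. It assumes a naive regex sentence splitter and plain substring matching; both are simplifications chosen for illustration, and the function names are hypothetical, not part of the application.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurring_entities(text, entities):
    """Return pairs of entities that appear together in at least one sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        # Simple substring match; real entity matching is more involved.
        found = sorted(e for e in entities if e in sentence)
        pairs.update(combinations(found, 2))
    return pairs


text = "Alice met Bob in Berlin. Carol stayed in Paris."
print(cooccurring_entities(text, ["Alice", "Bob", "Berlin", "Carol", "Paris"]))
```

Without sentence segmentation all five entities would count as occurring together in the same document; with it, Alice and Paris are no longer linked.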

Preview (Thumbnails)

PDF preview

Additionally splitting PDF files into single-page PDF files for preview saves download bandwidth, because only the previewed pages are downloaded instead of the full PDF. This can be useful when you provide search in very large PDF documents over the internet, outside your intranet.

Warning: Needs additional disk space (more than the size of all indexed PDF files) in a separate thumbnail directory!

Graph database (Neo4j)

Export entities in documents and connections from/to documents to a graph database for analysis and exploration of indirect connections.

Export extracted entities and connections

Export entity connections to the Neo4j graph database

If you don't need to analyze indirect connections, because the direct connections available through faceted search are enough, you can disable the plugin to improve performance.
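To illustrate what an indirect connection is, the following sketch finds a path between two entities that never appear in the same document. It uses a plain Python dictionary and breadth-first search rather than Neo4j or Cypher; the function name and the sample data are hypothetical.

```python
from collections import deque


def shortest_connection(doc_entities, start, goal):
    """Breadth-first search over a graph linking entities via documents."""
    # Build an undirected graph: entity <-> document.
    graph = {}
    for doc, entities in doc_entities.items():
        for entity in entities:
            graph.setdefault(doc, set()).add(entity)
            graph.setdefault(entity, set()).add(doc)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorted only to make the traversal order deterministic.
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Made-up sample data: Alice and Bob share no document.
docs = {
    "report.pdf": ["Alice", "Acme Corp"],
    "invoice.pdf": ["Acme Corp", "Bob"],
}
print(shortest_connection(docs, "Alice", "Bob"))
# → ['Alice', 'report.pdf', 'Acme Corp', 'invoice.pdf', 'Bob']
```

A faceted search for Alice would only surface report.pdf and Acme Corp; the link to Bob appears only when connections are followed across documents, which is the kind of exploration the graph export is meant for.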

Neo4j server

Neo4j browser

Links from the search UI to the Neo4j browser for exploration and visualization of the graph

Servers (URLs)


Save changes

Cancel