A Search Engine in Perl

Max Maischein

Frankfurt.pm

Overview

  • Motivation

  • Structure of a search engine

  • Ingredients

  • Demo

  • Future improvements

Who am I?

  • Max Maischein

  • Frankfurt.pm

  • Perl since 2000

  • Financial regulatory regimes since 2013 (EMIR, EinSiG, MiFID II, ...)

  • Smart Data + data mining since 2016

Too much information

  • Too many different kinds of information

  • Too little time to organize the information

  • Different from Google

  • Keep data local

  • Don't become part of someone else's result set

Existing approaches

  • Google Search Appliance (too expensive)

  • Windows Desktop Search/Cortana (Only Windows shares, no mail etc.)

  • Siri+Sherlock (Mac) (I have no Mac)

  • Beagle for Linux or Ubuntu (discontinued in 2009)

Do it yourself

  • Otherwise I would have no talk to give

  • Little time

  • Reuse many available building blocks

Splitting the task

Scraper / Crawler

  • Find documents

  • Find linked documents

  • Extract text

  • Import text

  • Metadata: Text language / URL / Creation time stamp

Search Index

  • Optimized data structure

  • Quick retrieval

  • Stemming (Find "Programs" and "Programming" when searching for "Program")

  • Synonyms
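
To make the index idea concrete, here is a minimal sketch of an inverted index with a toy stemmer in plain Perl. It is not how Elasticsearch works internally; the `stem`, `build_index` and `lookup` helpers and the sample URLs are made up for illustration, and the suffix rule is far cruder than a real stemmer.

```perl
use strict;
use warnings;

# Toy stemmer (assumption: a handful of English suffixes is enough for a demo).
# Real analyzers use much more careful rules.
sub stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ming|ing|s)$// if length $word > 4;
    return $word;
}

# Build an inverted index: stemmed term => { document id => occurrence count }.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $id (sort keys %docs) {
        for my $word ($docs{$id} =~ /(\w+)/g) {
            $index{ stem($word) }{$id}++;
        }
    }
    return \%index;
}

# Look up a single query word; returns the matching document ids, sorted.
sub lookup {
    my ($index, $query) = @_;
    my $postings = $index->{ stem($query) } || {};
    my @ids = sort keys %$postings;
    return @ids;
}

# Hypothetical sample documents
my $index = build_index(
    'file:///notes/a.txt' => 'Programs for indexing mail',
    'file:///notes/b.txt' => 'Programming in Perl',
    'file:///notes/c.txt' => 'Calendar of Perl workshops',
);

print join(", ", lookup($index, 'Program')), "\n";
# prints: file:///notes/a.txt, file:///notes/b.txt
```

The lookup is a single hash access, which is the point of the optimized data structure: the cost of tokenizing and stemming is paid once at import time, not at query time.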

Search

  • Query entry

  • Quick (!) response

  • Ranking

  • Preview of document
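
The search side can be sketched in a few lines as well. This is a toy, assuming a ranking by plain term frequency and a preview cut out around the first hit; Elasticsearch computes its own relevance score and has its own highlighting. The helper names (`score`, `preview`, `search`) and the sample documents are hypothetical.

```perl
use strict;
use warnings;

# Count how often the query term occurs in the text (case-insensitive).
sub score {
    my ($text, $term) = @_;
    my $count = () = $text =~ /\b\Q$term\E\b/gi;
    return $count;
}

# Return a short excerpt around the first occurrence of the term.
sub preview {
    my ($text, $term, $context) = @_;
    $context = 20 unless defined $context;
    return '' unless $text =~ /\b\Q$term\E\b/i;
    my $start = $-[0] - $context;          # $-[0] / $+[0]: match start / end
    $start = 0 if $start < 0;
    my $length = ($+[0] - $start) + $context;
    return substr($text, $start, $length);
}

# Score every document, drop non-matches, sort best hit first.
sub search {
    my ($docs, $term) = @_;
    my @hits;
    for my $url (keys %$docs) {
        my $score = score($docs->{$url}, $term);
        next unless $score;
        push @hits, {
            url     => $url,
            score   => $score,
            preview => preview($docs->{$url}, $term),
        };
    }
    my @ranked = sort { $b->{score} <=> $a->{score} or $a->{url} cmp $b->{url} } @hits;
    return @ranked;
}

# Hypothetical sample documents
my %docs = (
    'file:///talks/search.txt' => 'Perl search engine: a search index plus a crawler',
    'mail://inbox/42'          => 'Please search the archive for the invoice',
    'file:///music/list.txt'   => 'Playlist for the workshop',
);

printf "%d  %s  [%s]\n", $_->{score}, $_->{url}, $_->{preview}
    for search(\%docs, 'search');
```

Scanning every document per query is exactly what the index avoids, so this only illustrates the ranking and preview steps, not the quick response.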

Parts

  • Crawler / Extractor (Perl+Apache Tika)

  • Index (Elasticsearch, Search::Elasticsearch)

  • Search (Dancer)

Live Demo

    cpanm --look Dancer::SearchApp
    plackup -Ilib -p 5001 --host 127.0.0.1 -a bin/app.pl &

    perl -Ilib -w bin/index-filesystem.pl t/documents

    # Search: open http://127.0.0.1:5001/ in a browser

ES Schema

  • URL / id ( file:// or mail:// )

  • title

  • body (HTML)

  • author

  • type (file or mail)
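
As an illustration, one document with these fields could look like the following Perl hash, serialized to JSON the way it would travel to Elasticsearch. The field names come from the list above; the `make_document` helper, the sample values and the rule for deriving `type` from the URL scheme are assumptions, not the actual Dancer::SearchApp code.

```perl
use strict;
use warnings;
use JSON::PP;

# Build one index document. Assumption: the type follows from the URL scheme.
sub make_document {
    my (%args) = @_;
    my $type = $args{url} =~ m{^mail://} ? 'mail' : 'file';
    return {
        url    => $args{url},
        title  => $args{title},
        body   => $args{body},      # extracted content as HTML
        author => $args{author},
        type   => $type,
    };
}

# Hypothetical sample values
my $doc = make_document(
    url    => 'file:///home/user/docs/report.pdf',
    title  => 'Quarterly report',
    body   => '<p>Extracted text as HTML</p>',
    author => 'Example Author',
);

# canonical() sorts the keys so the output is stable
print JSON::PP->new->canonical->pretty->encode($doc);
```

A hash like this could serve as the body of an index request, with the URL doubling as the document id.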

Crawlers / Extractors

  • File system (PDF, text, audio, via Apache::Tika::Async)

  • IMAP

  • ICal

  • HTTP (also Plack)

Comparison with Google

  • Pagerank vs. Elasticsearch rank

  • Pagerank recognizes "Hub" pages

  • Every document on MY laptop is "interesting"

Search Results

We can display local content in local formats

  • PDF (as HTML)

  • Mail (link to Thunderbird)

  • Music (direct link)

Future improvements

  • Extraction from online content (Intranet, HTML::ContentExtractor::FTR)

  • More extractors (video, ...)

  • Metasearch across Elasticsearch instances (Laptop in home network)

Installation

Apache Tika

https://tika.apache.org/download.html

    http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.13.jar

Elasticsearch

https://www.elastic.co/downloads/elasticsearch

    https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.2.0/elasticsearch-2.2.0.zip

Thanks

Questions?

Dancer::SearchApp

corion@cpan.org

Credits

    "Hitman" by Kevin MacLeod (incompetech.com)
    Licensed under Creative Commons: By Attribution 3.0 License
    http://creativecommons.org/licenses/by/3.0/

Google Search Appliance image by Google Inc.

Cortana image by Microsoft Corporation

Apple Siri logo by Apple Inc.

Beagle logo by Fornax / Beagle Project

    https://de.wikipedia.org/wiki/Datei:Beagle_Logo.svg

Bonus section

Content / Testdata

  • My own mails

  • Trip back to 2000

  • No good for public consumption

  • EU/ESMA produces many PDFs

  • I produce many Perl programs

  • YAPC / Act produces many calendars

Crawlers

  • The heart of the search engine

  • Content extraction

  • Much existing code

Development

Filesystem Crawler

  • File::Find

  • Apache::Tika::Async for text extraction

  • Special extraction for mp3 and images

  • Done
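
The walk itself can be sketched with File::Find alone. In this sketch `extract_text` is only a stand-in that reads plain text; in the real crawler that step goes through Apache::Tika::Async, whose calls are not shown here. The `crawl` helper, the temporary directory and the file names are made up so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Stand-in for the real extractor (assumption): read the file as plain text.
sub extract_text {
    my ($path) = @_;
    open my $fh, '<', $path or return '';
    local $/;                         # slurp mode
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Walk a directory tree and turn every plain file into a document hash.
sub crawl {
    my ($root) = @_;
    my @docs;
    find({
        no_chdir => 1,                # keep full paths in $File::Find::name
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;
            push @docs, {
                # Simplified URL: no escaping, POSIX-style paths assumed
                url   => 'file://' . File::Spec->rel2abs($path),
                title => (File::Spec->splitpath($path))[2],
                body  => extract_text($path),
                type  => 'file',
            };
        },
    }, $root);
    my @sorted = sort { $a->{url} cmp $b->{url} } @docs;
    return @sorted;
}

# Demo on a throwaway directory with two hypothetical files
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt)) {
    my $file = File::Spec->catfile($dir, $name);
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "Contents of $name\n";
    close $fh;
}

my @docs = crawl($dir);
printf "%s (%d characters)\n", $_->{url}, length $_->{body} for @docs;
```

Each resulting hash has the shape of an index document, so the remaining work per file is one request to the index.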

IMAP / Mail crawler

  • Not good for presentation

  • Trip back to 2001

  • Start with file import

  • index-imap.pl