Micro Search

A self-hostable search engine API for personal websites


About

A self-hostable search engine for sites supporting microformats2.

FAQ

The frequently asked questions and the information every site owner needs to know.

What is this for?

This self-hostable search engine is an attempt to create an API for searching websites that support microformats2.

Who is this for?

This is best suited to people who have personal websites and want to add search functionality. For a page to be indexed, the following criteria must be met:

  • The robots.txt must not disallow this bot from crawling.
  • The website must use h-entry for posts; pages without an h-entry are skipped during indexing.
  • The pages must be listed in the sitemap file, and that file must be referenced in the robots.txt.
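For reference, a minimal h-entry post might be marked up like this (a generic microformats2 sketch, not markup taken from any particular site):

```html
<!-- A minimal microformats2 h-entry; pages without one are skipped -->
<article class="h-entry">
  <h1 class="p-name">My first post</h1>
  <a class="u-url" href="https://example.com/posts/first">permalink</a>
  <time class="dt-published" datetime="2024-01-15">15 Jan 2024</time>
  <div class="e-content">
    <p>Hello, world!</p>
  </div>
</article>
```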

Is this also a bot?

Yes, the crawling bot is integrated with the search engine. Whoever hosts the search engine must manually enter the websites to crawl.

How do I stop this bot?

You can add the following to your robots.txt file:

User-Agent: IW-microsearch
Disallow: /

I want to avoid indexing specific pages

This bot only visits pages that are within the sitemap.xml, and will NEVER crawl through other links on your web pages. If you do not want a page that is in your sitemap.xml file to be indexed, do one of the following:

  • Add <meta name="robots" content="noindex"> tag.
  • Add <meta name="IW-microsearch" content="noindex"> tag. (IW-microsearch is our user agent)
  • Add X-Robots-Tag: noindex in the response header.
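On the crawler side, these opt-out checks can be sketched as follows (a hypothetical helper based on the conventions above, not the project's actual code):

```python
from html.parser import HTMLParser


class NoindexMetaParser(HTMLParser):
    """Look for <meta name="robots"> or <meta name="IW-microsearch"> with noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # HTMLParser lowercases attribute names for us
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name in ("robots", "iw-microsearch") and "noindex" in content:
            self.noindex = True


def should_skip(headers: dict, html: str) -> bool:
    """Return True if a page opts out of indexing via header or meta tag."""
    # Real HTTP headers are case-insensitive; a real crawler would
    # normalise the header names before this lookup.
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    parser = NoindexMetaParser()
    parser.feed(html)
    return parser.noindex
```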

Does this bot respect delays?

Yes. The crawl delay used by this bot is the maximum of the delay specified in the robots.txt and the bot's default delay. The default delay is 2 seconds per URL, and it is configurable by anyone hosting the bot.

For example, if you specify a crawl delay of 5 seconds as below, and the bot is configured with a 10-second delay, the bot will crawl every 10 seconds instead of every 5.

User-agent: IW-microsearch
Crawl-delay: 5
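The "maximum of the two delays" rule can be sketched with Python's standard robots.txt parser (an illustrative helper, not the project's actual implementation):

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 2  # seconds per URL; configurable by whoever hosts the bot


def effective_delay(robots_txt: str, user_agent: str = "IW-microsearch",
                    default_delay: float = DEFAULT_DELAY) -> float:
    """Return the larger of the site's Crawl-delay and the bot's default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    site_delay = parser.crawl_delay(user_agent) or 0  # None when unspecified
    return max(site_delay, default_delay)
```

With the robots.txt above, this returns 5 under the default 2-second configuration, and 10 when the host raises the default delay to 10 seconds.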

How to run this project?

  1. Clone this repository.
  2. Copy the env.sample to a file named .env.
  3. Edit the newly created .env file and add the sites you want to index in the SITES variable.
  4. If you want to crawl multiple sites, separate them with commas (SITES=https://example.com,https://anotherexample.com).
  5. Create a new python virtual environment and activate it (python3 -m venv .venv && source .venv/bin/activate).
  6. Install all requirements (pip install -r requirements.txt).
  7. Run the program with the command flask run.
  8. profit!
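For example, a minimal .env might look like this (SITES is from steps 3 and 4; the FLASK_RUN_* names match the variables used in the gunicorn command for production, but your env.sample may differ):

```shell
# Sites to crawl, comma-separated
SITES=https://example.com,https://anotherexample.com

# Host and port picked up by `flask run` and the gunicorn bind
FLASK_RUN_HOST=127.0.0.1
FLASK_RUN_PORT=5000
```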

Running in production

You can use gunicorn for this; the commands below run the app with four workers.

# Binding to a port
gunicorn --bind "$FLASK_RUN_HOST:$FLASK_RUN_PORT" -w 4 wsgi:app --preload

# Binding to a unix socket
gunicorn --bind unix:~/micro_search.sock -w 4 wsgi:app --preload

Or just run it with Docker:

docker compose up -d

How do I use the project to search web pages?

This service currently exposes only one API endpoint:

  • / -> This is a GET request that takes two parameters, query and site.
    • The query parameter is mandatory; omitting it returns an empty result.
    • The site parameter is the site URL. You can also provide arbitrary text here, which will match all URLs containing that text.

Examples

# Search for the text "example"
curl "http://localhost:5000/?query=example"

# Search for the text "example" in all URLs containing "example.com"
curl "http://localhost:5000/?query=example&site=example.com"

# Search for the text "example" in all URLs containing "demo"
curl "http://localhost:5000/?query=example&site=demo"

Note that the URLs are quoted; otherwise the shell would treat & as the background operator and drop the site parameter.


About me


I'm Rahul Sivananda (he/him), the Internet knows me by the name Coding Otaku (they/them). I used to work as a Full-Stack Developer in London, now actively looking for a job after being laid off.

I care about Accessibility, Minimalism, and good user experiences. Sometimes I write stories and draw things.

Get my cURL card with curl -sL https://codingotaku.com/cc
You can find me on Mastodon, Codeberg, or Peertube.

