Micro Search
A self-hostable search engine API for personal websites
- Source: https://codeberg.org/IndieWemblates/micro_search
About
A self-hostable search engine for sites supporting microformats2.
FAQ
The frequently asked questions and the information every site owner needs to know.
What is this for?
This self-hostable search engine is an attempt to create an API for searching microformats2-supported websites.
Who is this for?
This is best used by people who have personal websites and want to add search functionality. For a page to be indexed, the following criteria must be met:
- The robots.txt must not disallow this bot from crawling.
- The website must use `h-entry` for posts; pages without `h-entry` are skipped during indexing.
- The pages must be listed in the sitemap file, and that file must be referenced in the `robots.txt`.
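As a rough illustration of the `h-entry` criterion, a crawler might detect the root class with the standard library like this (a sketch with a hypothetical `has_h_entry` helper, not the project's actual code; a real crawler would likely use a full microformats2 parser):

```python
# Sketch only: check whether a page contains an h-entry root class.
from html.parser import HTMLParser

class _HEntryDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent; h-entry must be a whole token.
        classes = (dict(attrs).get("class") or "").split()
        if "h-entry" in classes:
            self.found = True

def has_h_entry(html: str) -> bool:
    parser = _HEntryDetector()
    parser.feed(html)
    return parser.found

print(has_h_entry('<article class="h-entry"><p class="e-content">Hi</p></article>'))  # True
print(has_h_entry('<div class="container">no posts here</div>'))  # False
```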
Is this also a bot?
Yes, the crawling bot is integrated with the search engine; whoever hosts this search engine must manually enter the websites to crawl.
How do I stop this bot?
You can add the following to your `robots.txt` file:
User-Agent: IW-microsearch
Disallow: /
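Python's standard `urllib.robotparser` illustrates how a crawler interprets this rule (a sketch; the project's own robots.txt handling may differ):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from above, blocking IW-microsearch entirely.
robots = """User-Agent: IW-microsearch
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

# Every URL on the site is now off-limits for this user agent.
print(rp.can_fetch("IW-microsearch", "https://example.com/posts/1"))  # False
```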
I want to avoid indexing specific pages
This bot finds all pages that are within the `sitemap.xml`, and will NEVER crawl through other links on your webpage. If you do not want a page that is in your `sitemap.xml` file to be indexed, do one of the following:
- Add a `<meta name="robots" content="noindex">` tag.
- Add a `<meta name="IW-microsearch" content="noindex">` tag (`IW-microsearch` is our user agent).
- Add `X-Robots-Tag: noindex` to the response header.
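The three opt-out signals above could be checked like this before indexing a page (a sketch with a hypothetical `should_skip` helper, not the project's actual code):

```python
# Sketch only: check the robots meta tags and the X-Robots-Tag header.
from html.parser import HTMLParser

class _NoindexMeta(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        content = (a.get("content") or "").lower()
        # Signals 1 and 2: a robots or bot-specific meta tag saying "noindex".
        if tag == "meta" and name in ("robots", "iw-microsearch") and "noindex" in content:
            self.noindex = True

def should_skip(html: str, headers: dict) -> bool:
    # Signal 3: X-Robots-Tag response header (a real crawler would
    # match header names case-insensitively).
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    parser = _NoindexMeta()
    parser.feed(html)
    return parser.noindex

print(should_skip('<meta name="robots" content="noindex">', {}))  # True
print(should_skip('<p>regular page</p>', {}))                     # False
```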
Does this bot respect delays?
Yes. The crawl delay used by this bot is the maximum of the delay specified in the `robots.txt` and the bot's default delay. The default delay is 2 seconds per URL, and it is configurable by whoever hosts the bot.
For example, if you specify a crawl delay of 5 seconds as below, and the bot is configured with a 10-second delay, the bot will crawl every 10 seconds instead of every 5.
User-agent: IW-microsearch
Crawl-delay: 5
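The "larger of the two delays" rule above can be sketched with the standard library (`DEFAULT_DELAY` here is a hypothetical name; the project's actual configuration variable may differ):

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 10  # the bot's configured delay, as in the example above

# The robots.txt snippet from the example.
robots_txt = """User-agent: IW-microsearch
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

site_delay = rp.crawl_delay("IW-microsearch") or 0  # 5, from robots.txt
effective = max(site_delay, DEFAULT_DELAY)          # the larger value wins
print(effective)  # 10
```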
How do I run this project?
- Clone this repository.
- Copy `env.sample` to a file named `.env`.
- Edit the newly created `.env` file and add the sites you want to index in the `SITES` variable.
- If you want to crawl multiple sites, separate them with commas (`SITES=https://example.com,https://anotherexample.com`).
- Create a new Python virtual environment and activate it (`python3 -m venv .venv && source .venv/bin/activate`).
- Install all requirements (`pip install -r requirements.txt`).
- Run the program with the command `flask run`.
- Profit!
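Putting the steps above together, a minimal `.env` might look like this (only `SITES` is described above; `FLASK_RUN_HOST` and `FLASK_RUN_PORT` appear in the gunicorn commands below, and the values shown here are the usual Flask defaults, not values mandated by this project — keep whatever else ships in `env.sample`):

```
# Sites to crawl, comma-separated
SITES=https://example.com,https://anotherexample.com
# Host and port used by `flask run` and the gunicorn examples
FLASK_RUN_HOST=127.0.0.1
FLASK_RUN_PORT=5000
```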
Running in production
You can use gunicorn for this; the commands below run the app with four workers.
# Binding to a port
gunicorn --bind "$FLASK_RUN_HOST:$FLASK_RUN_PORT" -w 4 wsgi:app --preload
# Binding to a unix socket
gunicorn --bind unix:~/micro_search.sock -w 4 wsgi:app --preload
Or, just run it using Docker:
docker compose up -d
How do I use this project to search webpages?
This service currently exposes only one API endpoint:
- `/` -> A `GET` request that takes two parameters, `query` and `site`.
- The parameter `query` is mandatory; not providing a query will return an empty result.
- The parameter `site` is the site URL; you can also provide arbitrary text here, which will match all URLs containing the provided text.
Examples
# Search for the text "example"
curl "http://localhost:5000/?query=example"
# Search for the text "example" in all URLs containing "example.com"
curl "http://localhost:5000/?query=example&site=example.com"
# Search for the text "example" in all URLs containing "demo"
curl "http://localhost:5000/?query=example&site=demo"