biolds/sosse

By biolds

Updated 4 days ago

Sosse - Open-source, enterprise-grade web search & crawling.


biolds/sosse repository overview

Sosse 🦦

Discover Sosse, the Selenium Open Source Search Engine built for powerful web archiving, crawling, and search. Explore all its features and capabilities on the official website.

Whether you're a developer, researcher, or data enthusiast, Sosse is ready to support your projects. Join the community on GitHub or GitLab to submit feature requests, report bugs, contribute code, or start a discussion.

Key Features

  • 🌍 Web Page Search: Search the content of web pages, including dynamically rendered ones, with advanced queries. (doc⁠)

  • πŸ•‘ Recurring Crawling: Crawl pages at fixed intervals or adapt the rate based on content changes. (doc⁠)

  • πŸ”– Web Page Archiving: Archive HTML content, adjust links for local use, download required assets, and support dynamic content. (doc⁠)

  • 🏷️ Tags: Organize and filter crawled or archived pages using tags for better search and management. (doc⁠)

  • πŸ“‚ File Downloads: Batch download binary files from web pages. (doc⁠)

  • πŸ“‘ Webhooks: Integrate with external services using highly flexible webhooks. Connect to proprietary AI platforms (doc⁠) or locally hosted solutions (doc⁠) to enable advanced data extraction, summarization, auto-tagging, notifications, and more.

  • πŸ”” Atom Feeds: Generate content feeds for websites that don’t have them, or receive updates when a new page containing a keyword is published. (doc⁠)

  • πŸ”’ Authentication: The crawler can authenticate to access private pages and retrieve content. (doc⁠)

  • 👥 Permissions: Admins can configure crawlers and view statistics, while search can be limited to authenticated users or left open to anonymous visitors. (doc)

  • 👀 Search Features: Includes private search history (doc), external search engine shortcuts (doc), and more.

Explore the 📚 documentation and check out some 📷 screenshots.
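To make the "adapt the rate based on content changes" idea from the Recurring Crawling feature more concrete, here is a minimal sketch of one common approach: shorten the recrawl delay when a page changed since the last visit, and lengthen it when it did not. This is an illustration only, not Sosse's actual code; the function names, the halve/double rule, and the one-hour and thirty-day bounds are all assumptions chosen for the example.

```python
from datetime import timedelta
import hashlib


def content_fingerprint(html: str) -> str:
    """Hash the page body so two crawls can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def next_interval(current: timedelta, changed: bool,
                  minimum: timedelta = timedelta(hours=1),
                  maximum: timedelta = timedelta(days=30)) -> timedelta:
    """Halve the delay after a change, double it otherwise, within bounds.

    The factor of 2 and the default bounds are hypothetical values.
    """
    proposed = current / 2 if changed else current * 2
    return max(minimum, min(maximum, proposed))


# Example: a page crawled daily that stays unchanged twice, then changes.
interval = timedelta(days=1)
previous = content_fingerprint("<p>hello</p>")
for html in ["<p>hello</p>", "<p>hello</p>", "<p>hello, world</p>"]:
    current = content_fingerprint(html)
    interval = next_interval(interval, changed=(current != previous))
    previous = current
    print(interval)
```

With this rule, pages that rarely change drift toward the maximum delay, while frequently updated pages are revisited more often. Refer to the recurring crawling documentation for how Sosse itself is configured.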

Sosse is written in Python and is distributed under the GNU AGPLv3 license. It uses browser-based crawling with Mozilla Firefox or Google Chromium alongside Selenium to index pages that rely on JavaScript. For faster crawling, Requests can also be used. Sosse uses PostgreSQL for data storage.

Try It Out

To quickly try the latest version with Docker:

docker run -p 8005:80 biolds/sosse:stable

Then, open http://127.0.0.1:8005/ and log in with the username admin and password admin.
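The container may need a little time after starting before the web interface responds. If you want to script the check rather than refresh the browser, a small polling helper like the sketch below can do it. It only assumes that the address given above answers HTTP once the service is ready; the function names, retry count, and delay are hypothetical choices, not part of Sosse.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is accepting requests.
        return True
    except OSError:
        # Connection refused, reset, or timed out: not ready yet.
        return False


def wait_until_up(url: str, attempts: int = 30, delay: float = 2.0,
                  probe=is_up) -> bool:
    """Poll `url` until `probe` succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Usage, after starting the container with the docker command above:
#   wait_until_up("http://127.0.0.1:8005/")
```

The `probe` parameter exists only so the polling loop can be exercised without a running container.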

For persistence of Docker data or alternative installation methods, please refer to the installation guide.

Stay Connected

Join the Discord server to get help, share ideas, or discuss Sosse!

Tag summary

  • Content type: Image
  • Digest: sha256:d8b49d2f2…
  • Size: 1.7 GB
  • Last updated: 4 days ago

Requires Docker Desktop 4.37.1 or later.