# search-intelligence-skill
Use `search-intelligence-skill` to give an AI agent the search techniques of an OSINT analyst, SEO engineer, and security researcher. All searches go through your own SearXNG instance: no API keys are needed, queries stay on infrastructure you control, and 90+ engines are available.
The skill generates optimized dork queries, selects intelligent multi-step search strategies, translates operators across engines, routes queries to the best SearXNG engines, scores results by multi-signal relevance, and learns from results to refine searches automatically.
## Setup (once)
**Install the package**
```bash
# From source (recommended)
git clone https://github.com/mouaad-ops/search-intelligence-skill.git
cd search-intelligence-skill
pip install -e .
# Or from PyPI (not working yet; use the source install above)
pip install search-intelligence-skill
```
**Start a SearXNG instance (if you don't have one)**
```bash
# Docker (quickest)
docker run -d \
  --name searxng \
  -p 8888:8080 \
  -e SEARXNG_SECRET=your-secret-key \
  searxng/searxng:latest
# Verify it's running
curl http://localhost:8888/healthz
```
**Enable JSON API in SearXNG settings**
```yaml
# In searxng/settings.yml — ensure search formats include json
search:
  formats:
    - html
    - json
```
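Once JSON output is enabled, the instance can be queried over plain HTTP. The sketch below only builds the request URL, using the standard SearXNG query parameters `q`, `format`, `engines`, and `pageno`. The helper name `build_search_url` is hypothetical and is not part of this package.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, engines=None, pageno=1):
    """Build a SearXNG /search URL that asks for JSON output."""
    params = {"q": query, "format": "json", "pageno": pageno}
    if engines:
        # SearXNG takes engine names as one comma-separated value
        params["engines"] = ",".join(engines)
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


url = build_search_url(
    "http://localhost:8888", "test query", engines=["google", "bing"]
)
print(url)
```

Fetching that URL with any HTTP client should return a JSON document if the format is enabled, or an error response if it is not.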
**Initialize in code**
```python
from search_intelligence_skill import SearchSkill
# Default — localhost:8888
skill = SearchSkill()
# Custom instance
skill = SearchSkill(
searxng_url="http://localhost:8888",
timeout=30.0,
max_retries=2,
rate_limit=0.5,
verify_ssl=True,
auto_refine=True,
max_refine_rounds=1,
)
# Verify connection
if skill.health_check():
    print("✓ SearXNG is reachable")
else:
    print("✗ Cannot reach SearXNG: check URL and port")
```
## Common Commands
**Natural language search (the main interface)**
```python
from search_intelligence_skill import SearchSkill
skill = SearchSkill(searxng_url="http://localhost:8888")
# Just describe what you want — the skill handles everything:
# intent detection, dork generation, engine selection, scoring
report = skill.search("find exposed .env files on example.com")
# Print LLM-ready formatted output
print(report.to_context())
# Access structured results
for r in report.top(5):
    print(f"[{r.relevance:.1f}] {r.title}")
    print(f" {r.url}")
    print(f" {r.snippet[:200]}")
```
**Control search depth**
```python
from search_intelligence_skill import Depth
# Quick — 1-2 queries, single step, fast lookup
report = skill.search("what is CORS", depth="quick")
# Standard — 3-6 queries, multi-engine, good default
report = skill.search("python async frameworks comparison", depth="standard")
# Deep — 6-12 queries, multi-step strategies, thorough research
report = skill.search("security audit of target.com", depth="deep")
# Exhaustive — 12+ queries, full OSINT chains, complete sweep
report = skill.search("full recon on suspect-domain.com", depth="exhaustive")
```
**Security scanning — exposed files and panels**
```python
report = skill.search(
"find exposed .env files, admin panels, and directory listings on example.com",
depth="deep",
)
print(f"Intent: {report.intent.category.value}/{report.intent.subcategory}")
# → Intent: security/exposed_files
print(f"Strategy: {report.strategy.name}")
# → Strategy: multi_angle
print(f"Results: {len(report.results)}")
for r in report.top(10):
    print(f" [{r.relevance:.1f}] {r.title} - {r.url}")
```
**Security scanning — vulnerability research**
```python
# CVE research
report = skill.search("CVE-2024-3094 xz backdoor exploit details", depth="deep")
# Technology-specific vulnerabilities
report = skill.search(
"Apache Struts remote code execution vulnerabilities 2024",
depth="standard",
)
# Exposed API endpoints
report = skill.search(
"find exposed swagger API docs on target.com",
depth="deep",
)
# Git repository exposure
report = skill.search(
"exposed .git directories on example.com",
depth="deep",
)
```
**OSINT investigation — people**
```python
# By name
report = skill.search(
'OSINT investigation on "John Doe" — social media, email, profiles',
depth="deep",
)
# By email
report = skill.search(
"investigate john.doe@example.com — find all accounts and mentions",
depth="exhaustive",
)
# By username
report = skill.search(
"find all accounts for username @johndoe42",
depth="deep",
)
# By phone number
report = skill.search(
"lookup phone number +1-555-123-4567",
depth="standard",
)
```
**OSINT investigation — domains and companies**
```python
# Domain reconnaissance
report = skill.search(
"full domain recon on target.com — subdomains, DNS, certificates, technology stack",
depth="exhaustive",
)
# Company investigation
report = skill.search(
'investigate company "Acme Corp" — employees, filings, data breaches',
depth="deep",
)
# IP address lookup
report = skill.search(
"investigate IP 203.0.113.10: open ports, services, abuse reports",
depth="standard",
)
```
**SEO analysis**
```python
# Site indexation check
report = skill.search(
"SEO indexation analysis of example.com",
depth="standard",
)
# Backlink research
report = skill.search(
"find backlinks pointing to example.com",
depth="deep",
)
# Competitor analysis
report = skill.search(
"SEO competitor analysis for example.com — related sites, ranking keywords",
depth="deep",
)
# Technical SEO audit
report = skill.search(
"technical SEO check on example.com — sitemap, robots.txt, canonical, hreflang",
depth="deep",
)
```
**Academic research**
```python
# Find papers
report = skill.search(
"latest research papers on transformer architecture scaling laws 2024",
depth="standard",
)
# Find datasets
report = skill.search(
"download dataset for sentiment analysis benchmark CSV",
depth="standard",
)
# Find authors and their work
report = skill.search(
'research publications by author "Yann LeCun" on deep learning',
depth="deep",
)
```
**Code and developer search**
```python
# Find repositories
report = skill.search(
"python library for PDF text extraction with OCR support",
depth="standard",
)
# Find packages
report = skill.search(
"npm package for real-time WebSocket pub/sub",
depth="standard",
)
# Debug errors
report = skill.search(
"RuntimeError: CUDA out of memory pytorch solution",
depth="standard",
)
# Find documentation
report = skill.search(
"FastAPI dependency injection documentation examples",
depth="quick",
)
```
**File hunting**
```python
# Find specific file types
report = skill.search(
"machine learning cheat sheet filetype:pdf",
depth="standard",
)
# Find datasets
report = skill.search(
"US census data 2023 download CSV",
depth="standard",
)
# Find configuration files
report = skill.search(
"docker-compose example microservices filetype:yaml",
depth="standard",
)
```
**News search**
```python
# Recent news
report = skill.search(
"latest news on AI regulation this week",
depth="standard",
)
# Breaking news
report = skill.search(
"breaking news today cybersecurity",
depth="quick",
)
# News analysis
report = skill.search(
"analysis of EU AI Act implications for startups",
depth="standard",
)
```
**Image and video search**
```python
# Images
report = skill.search(
"high resolution photos of Mars surface NASA",
depth="standard",
)
# Videos
report = skill.search(
"video tutorial on Kubernetes deployment strategies",
depth="standard",
)
```
**Social media search**
```python
# Reddit discussions
report = skill.search(
"reddit discussion about best self-hosted alternatives to Google Photos",
depth="standard",
)
# Forum threads
report = skill.search(
"forum thread comparing Proxmox vs ESXi for home lab",
depth="standard",
)
```
**Direct dork query (no intent parsing)**
```python
# Execute a raw dork you've written yourself
report = skill.search_dork(
'site:github.com "API_KEY" filetype:env',
engines=["google", "bing"],
)
print(report.to_context())
```
**Preview queries without executing them**
```python
# See what dork queries would be generated
dorks = skill.suggest_queries(
"find SQL injection vulnerabilities on target.com"
)
for d in dorks:
    print(f" Query: {d.query}")
    print(f" Operators: {d.operators_used}")
    print(f" Purpose: {d.purpose}")
    print()
```
**Build a custom dork from parameters**
```python
dork = skill.build_dork(
keyword="confidential",
domain="example.com",
filetype="pdf",
intitle="report",
exclude=["public", "template"],
exact_match=True,
)
print(f"Generated: {dork.query}")
# → site:example.com filetype:pdf intitle:"report" -public -template "confidential"
# Execute it
report = skill.search_dork(dork.query)
```
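To make the parameter-to-operator mapping concrete, here is a minimal sketch of how such a dork string could be assembled. It is not the package's implementation: `build_dork_sketch` is a hypothetical helper, and the operator order is simply copied from the example output above.

```python
def build_dork_sketch(keyword, domain=None, filetype=None, intitle=None,
                      exclude=(), exact_match=False):
    """Assemble a Google-style dork string from optional parts."""
    parts = []
    if domain:
        parts.append(f"site:{domain}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    # Each excluded term becomes a -term operator
    parts.extend(f"-{term}" for term in exclude)
    # The keyword goes last, quoted when an exact match is requested
    parts.append(f'"{keyword}"' if exact_match else keyword)
    return " ".join(parts)


query = build_dork_sketch(
    keyword="confidential",
    domain="example.com",
    filetype="pdf",
    intitle="report",
    exclude=["public", "template"],
    exact_match=True,
)
print(query)
```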
**Execute a named strategy against a target**
```python
# Full OSINT chain
report = skill.execute_strategy(
strategy_name="osint_chain",
target="suspect-domain.com",
depth="exhaustive",
)
# Deep security dive
report = skill.execute_strategy(
strategy_name="deep_dive",
target="target.com",
depth="deep",
)
# File hunting
report = skill.execute_strategy(
strategy_name="file_hunt",
target="example.com",
depth="deep",
)
# Temporal trend analysis
report = skill.execute_strategy(
strategy_name="temporal",
target="AI regulation news",
depth="deep",
)
```
**Batch search — multiple queries at once**
```python
queries = [
"python FastAPI vs Flask performance",
"rust web frameworks comparison 2024",
"go gin framework documentation",
]
reports = skill.search_batch(queries, depth="quick")
for report in reports:
    print(f"Query: {report.query}")
    print(f" Results: {len(report.results)}")
    print(f" Best: {report.top(1)[0].title if report.results else 'None'}")
    print()
```
**Override engine and category selection**
```python
# Force specific engines
report = skill.search(
"quantum computing breakthroughs",
engines=["google_scholar", "arxiv", "semantic_scholar"],
)
# Force specific categories
report = skill.search(
"kubernetes tutorial",
categories=["it", "general"],
)
# Force time range
report = skill.search(
"zero-day vulnerabilities",
time_range="week",
)
# Force language
report = skill.search(
"machine learning tutorials",
language="en",
)
```
**Working with the SearchReport object**
```python
report = skill.search("advanced persistent threats 2024", depth="standard")
# LLM-ready text (for injecting into AI agent context)
context = report.to_context(max_results=20)
# Top N results sorted by relevance
top5 = report.top(5)
# Full result list
all_results = report.results
# What was detected
print(f"Intent: {report.intent.category.value}") # e.g. "security"
print(f"Subcategory: {report.intent.subcategory}") # e.g. "general"
print(f"Entities: {report.intent.entities}") # e.g. {"year": "2024"}
print(f"Keywords: {report.intent.keywords}") # e.g. ["advanced", "persistent", "threats"]
print(f"Confidence: {report.intent.confidence:.0%}") # e.g. "80%"
# What strategy ran
print(f"Strategy: {report.strategy.name}") # e.g. "multi_angle"
print(f"Steps: {len(report.strategy.steps)}") # e.g. 2
# Performance metrics
print(f"Total found: {report.total_found}") # before dedup
print(f"Final results: {len(report.results)}") # after dedup+scoring
print(f"Time: {report.timing_seconds:.2f}s")
print(f"Engines used: {report.engines_used}")
# Suggested refinements
print(f"Suggestions: {report.suggestions}")
# Errors (if any)
print(f"Errors: {report.errors}")
```
**Working with individual SearchResult objects**
```python
for r in report.top(10):
    print(f"Title: {r.title}")
    print(f"URL: {r.url}")
    print(f"Snippet: {r.snippet[:300]}")
    print(f"Relevance: {r.relevance:.2f} / 10.0")
    print(f"Engines: {r.engines}")  # which SearXNG engines returned this
    print(f"Score: {r.score}")  # raw SearXNG score
    print(f"Category: {r.category}")  # SearXNG result category
    print(f"Positions: {r.positions}")  # rank positions across engines
    print(f"Metadata: {r.metadata}")  # publishedDate, thumbnail, etc.
    print()
```
## AI Agent Integration
**Basic tool handler**
```python
from search_intelligence_skill import SearchSkill
skill = SearchSkill(searxng_url="http://localhost:8888")
def handle_search_tool(user_query: str) -> str:
    """Called by the AI agent when it needs to search the web."""
    report = skill.search(user_query, depth="standard")
    return report.to_context()
```
**With depth control from agent**
```python
def handle_search_tool(user_query: str, depth: str = "standard") -> str:
    report = skill.search(user_query, depth=depth)
    return report.to_context()
```
**Returning structured data to agent**
```python
def handle_search_tool(user_query: str) -> dict:
    report = skill.search(user_query, depth="standard")
    return {
        "query": report.query,
        "intent": f"{report.intent.category.value}/{report.intent.subcategory}",
        "confidence": report.intent.confidence,
        "result_count": len(report.results),
        "results": [
            {
                "title": r.title,
                "url": r.url,
                "snippet": r.snippet[:500],
                "relevance": round(r.relevance, 2),
                "engines": r.engines,
            }
            for r in report.top(10)
        ],
        "suggestions": report.suggestions,
        "engines_used": report.engines_used,
        "time_seconds": round(report.timing_seconds, 2),
    }
```
**OpenAI function calling / tool definition**
```python
search_tool_schema = {
"type": "function",
"function": {
"name": "web_search",
"description": (
"Search the internet using advanced dork queries and multi-engine strategies. "
"Supports security scanning, OSINT, SEO analysis, academic research, "
"code search, file hunting, and general web search. "
"Describe what you want to find in natural language."
),
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Natural language search query describing what to find",
},
"depth": {
"type": "string",
"enum": ["quick", "standard", "deep", "exhaustive"],
"description": "Search thoroughness: quick (1-2 queries), standard (3-6), deep (6-12), exhaustive (12+)",
"default": "standard",
},
},
"required": ["query"],
},
},
}
```
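When the model invokes this tool, OpenAI-style tool calls deliver the function name and the arguments as a JSON string. The sketch below shows one way to route such a call to a search handler. `dispatch_tool_call` and the `search_fn` callback are hypothetical names; in practice `search_fn` would wrap `skill.search(...).to_context()`.

```python
import json

VALID_DEPTHS = {"quick", "standard", "deep", "exhaustive"}


def dispatch_tool_call(name, arguments_json, search_fn):
    """Decode a tool call and forward it to search_fn(query, depth)."""
    if name != "web_search":
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    depth = args.get("depth", "standard")
    if depth not in VALID_DEPTHS:
        # Fall back to the schema default rather than failing the call
        depth = "standard"
    return search_fn(args["query"], depth)


# Stub handler standing in for the real search call
def fake_search(query, depth):
    return f"{depth}:{query}"


print(dispatch_tool_call("web_search", '{"query": "what is CORS", "depth": "quick"}', fake_search))
```

Validating `depth` before calling the skill keeps a malformed model output from turning into an exception inside the agent loop.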
**LangChain tool wrapper**
```python
from langchain.tools import Tool
from search_intelligence_skill import SearchSkill
skill = SearchSkill(searxng_url="http://localhost:8888")
search_tool = Tool(
name="web_search",
description=(
"Advanced web search with dork generation and multi-engine strategies. "
"Input a natural language query. Supports security, OSINT, SEO, academic, "
"code, file, and general searches."
),
func=lambda q: skill.search(q, depth="standard").to_context(),
)
```
**Context manager for clean resource handling**
```python
from search_intelligence_skill import SearchSkill
with SearchSkill(searxng_url="http://localhost:8888") as skill:
    report = skill.search("find open redirects on example.com")
    print(report.to_context())
# HTTP client is automatically closed
```
## Using Individual Components Directly
**IntentParser — analyze queries without searching**
```python
from search_intelligence_skill import IntentParser
parser = IntentParser()
intent = parser.parse("find exposed .env files on example.com")
print(f"Category: {intent.category.value}") # security
print(f"Subcategory: {intent.subcategory}") # exposed_files
print(f"Entities: {intent.entities}") # {"domain": "example.com"}
print(f"Keywords: {intent.keywords}") # ["exposed", "env", "files"]
print(f"Depth: {intent.depth.value}") # standard
print(f"Time range: {intent.time_range}") # ""
print(f"Confidence: {intent.confidence:.0%}") # 95%
print(f"Constraints: {intent.constraints}") # {}
```
**DorkGenerator — generate queries without searching**
```python
from search_intelligence_skill import DorkGenerator, IntentParser
parser = IntentParser()
gen = DorkGenerator()
intent = parser.parse("OSINT investigation on john@example.com")
dorks = gen.generate(intent)
for d in dorks:
    print(f" [{', '.join(d.operators_used)}] {d.query}")
    print(f" Purpose: {d.purpose}")
# Build a custom dork manually
custom = gen.generate_custom(
keyword="secret",
domain="example.com",
filetype="env",
intitle="config",
exclude=["test", "demo"],
exact_match=True,
)
print(f"Custom: {custom.query}")
# Translate a Google dork to Yandex syntax
yandex_dork = gen.translate(custom, target_engine="yandex")
print(f"Yandex: {yandex_dork.query}")
# Translate to Bing
bing_dork = gen.translate(custom, target_engine="bing")
print(f"Bing: {bing_dork.query}")
```
**ResultAnalyzer — score and analyze results**
```python
from search_intelligence_skill import ResultAnalyzer, IntentParser, SearXNGClient
client = SearXNGClient(base_url="http://localhost:8888")
parser = IntentParser()
analyzer = ResultAnalyzer()
intent = parser.parse("python web frameworks comparison")
raw = client.search("python web frameworks comparison", engines=["google", "bing"])
results = client.parse_results(raw)
# Full analysis pipeline: deduplicate → score → sort
analyzed = analyzer.analyze(results, intent)
for r in analyzed[:5]:
    print(f"[{r.relevance:.2f}] {r.title}")
# Generate refinement suggestions
suggestions = analyzer.generate_refinements(analyzed, intent)
print(f"Suggestions: {suggestions}")
# Get a text summary
summary = analyzer.summarize(analyzed, intent)
print(summary)
client.close()
```
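The deduplication stage of that pipeline can be pictured with a small self-contained sketch. It assumes results are plain dicts with `url` and `engines` keys and that two URLs count as duplicates when they differ only by scheme, a `www.` prefix, or a trailing slash. The real analyzer may use different rules, and `normalize_url` and `deduplicate` are hypothetical names.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path (+ query) for duplicate detection."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return f"{host}{path}" + (f"?{parts.query}" if parts.query else "")


def deduplicate(results):
    """Merge duplicate results, keeping first-seen order and all engines."""
    merged = {}
    for r in results:
        key = normalize_url(r["url"])
        if key in merged:
            for engine in r["engines"]:
                if engine not in merged[key]["engines"]:
                    merged[key]["engines"].append(engine)
        else:
            # Copy the engine list so the input is not mutated
            merged[key] = {**r, "engines": list(r["engines"])}
    return list(merged.values())


sample = [
    {"url": "https://www.example.com/page/", "engines": ["google"]},
    {"url": "http://example.com/page", "engines": ["bing"]},
    {"url": "https://example.com/other", "engines": ["google"]},
]
unique = deduplicate(sample)
print(len(unique), unique[0]["engines"])
```

Merging the engine lists at this stage is what makes a multi-engine agreement signal possible during scoring.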
**SearXNGClient — direct API access**
```python
from search_intelligence_skill import SearXNGClient
client = SearXNGClient(base_url="http://localhost:8888")
# Single search
raw = client.search(
query='site:github.com "fastapi" filetype:py',
engines=["google", "bing", "duckduckgo"],
categories=["general"],
time_range="month",
language="en",
pageno=1,
safesearch=0,
)
# Parse results into SearchResult objects
results = client.parse_results(raw)
# Get SearXNG suggestions
suggestions = client.get_suggestions(raw)
# Get spelling corrections
corrections = client.get_corrections(raw)
# See which engines failed
unresponsive = client.get_unresponsive(raw)
# Batch search
responses = client.search_batch(
queries=["query 1", "query 2", "query 3"],
engines=["google"],
)
# Health check
if client.health_check():
    print("SearXNG is online")
client.close()
```
## Quick Reference
**Search Depths**
```python
from search_intelligence_skill import Depth
Depth.QUICK # 1-2 queries, single step, fast lookups
Depth.STANDARD # 3-6 queries, multi-engine, general searching
Depth.DEEP # 6-12 queries, multi-step, thorough research
Depth.EXHAUSTIVE # 12+ queries, full sweep, complete investigations
```
**Intent Categories (auto-detected)**
```python
from search_intelligence_skill import IntentCategory
IntentCategory.GENERAL # General web search
IntentCategory.SECURITY # Vulnerabilities, exposed files, pentesting
IntentCategory.SEO # Indexation, backlinks, competitors, technical SEO
IntentCategory.OSINT # People, emails, usernames, domains, companies
IntentCategory.ACADEMIC # Papers, datasets, authors, journals
IntentCategory.CODE # Repositories, packages, docs, bugs
IntentCategory.FILES # Documents, data files, archives, media
IntentCategory.NEWS # Breaking news, analysis, trends
IntentCategory.IMAGES # Image search
IntentCategory.VIDEOS # Video search
IntentCategory.SOCIAL # Reddit, forums, discussions
IntentCategory.SHOPPING # Products, prices, comparisons
IntentCategory.LEGAL # Law, regulations, patents
IntentCategory.MEDICAL # Health, diseases, clinical research
```
**Search Strategies (auto-selected by depth + intent)**
```python
# Strategies are selected automatically, but you can also invoke them directly:
skill.execute_strategy("quick", target="example.com") # 1 step, top engines
skill.execute_strategy("broad_to_narrow", target="example.com") # Wide then focused
skill.execute_strategy("multi_angle", target="example.com") # Same topic, different formulations
skill.execute_strategy("deep_dive", target="example.com") # Exhaustive dork coverage
skill.execute_strategy("osint_chain", target="example.com") # Progressive recon
skill.execute_strategy("verify", target="some claim") # Cross-reference sources
skill.execute_strategy("file_hunt", target="example.com") # Targeted file search
skill.execute_strategy("temporal", target="AI news") # Across time periods
```
**Supported SearXNG Engines (90+)**
```python
# General: google, bing, duckduckgo, brave, qwant, startpage, mojeek,
# yandex, yahoo, presearch, wiby, stract, yep, baidu, naver ...
#
# IT/Dev: github, stackoverflow, gitlab, npm, pypi, dockerhub,
# arch_linux_wiki, crates_io, packagist, pkg_go_dev ...
#
# Science: arxiv, google_scholar, semantic_scholar, crossref, pubmed,
# base, openalex, core, wolfram_alpha ...
#
# News: google_news, bing_news, yahoo_news, brave_news, wikinews ...
#
# Social: reddit, lemmy, mastodon, hacker_news, lobsters ...
#
# Images: google_images, bing_images, flickr, unsplash, openverse ...
#
# Videos: youtube, google_videos, dailymotion, vimeo, piped, odysee ...
#
# Files: piratebay, 1337x, annas_archive, z_library ...
#
# Music: bandcamp, genius, soundcloud, youtube_music ...
#
# Maps: openstreetmap, photon ...
#
# Wikis: wikipedia, wikidata, wikimedia_commons ...
```
**Dork Operators (auto-translated across engines)**
```python
# Google operators:
# site: filetype: intitle: allintitle: inurl: allinurl:
# intext: allintext: inanchor: cache: related: info: define:
# before: after: AROUND(N) "exact" -exclude OR * N..M
#
# Bing operators:
# site: filetype: intitle: inurl: inbody: contains: ip:
# language: loc: prefer: feed: "exact" -exclude OR NEAR:N
#
# DuckDuckGo operators:
# site: filetype: intitle: inurl: "exact" -exclude OR
#
# Yandex operators:
# site: mime: title: inurl: host: domain: lang: date:
# "exact" -exclude |
#
# Brave operators:
# site: filetype: intitle: inurl: "exact" -exclude OR
#
# The skill auto-translates between engines:
# filetype: → mime: (Yandex)
# intitle: → title: (Yandex)
# intext: → inbody: (Bing)
```
## Dork Template Library
**Security dorks available (by subcategory)**
```
exposed_files — .env, .log, .sql, .bak, .conf, .pem, .key, .json
directory_listing — "index of", "directory listing", "parent directory"
admin_panels — /admin, /login, /dashboard, wp-admin, phpmyadmin, cpanel
sensitive_data — passwords, RSA keys, AWS keys, database URLs, SMTP creds
exposed_apis — /api/, swagger, api-docs, graphql, openapi
subdomains — site:*.domain, external references, inurl:domain
git_exposed — .git, .git/config, .svn, .hg
technology_stack — "powered by", wp-content, X-Powered-By
general — CVE, exploit, PoC, security advisory
```
**OSINT dorks available (by subcategory)**
```
person — LinkedIn, Twitter/X, Facebook, Instagram, GitHub, Medium, resume, CV
email — email mentions, cross-site, leaks, LinkedIn, GitHub
username — GitHub, Reddit, Twitter, Instagram, YouTube, Keybase, StackOverflow
domain — site:, subdomains, whois, Shodan, DNS, SSL, Censys, crt.sh
company — LinkedIn company, Crunchbase, Glassdoor, SEC filings, employees
phone — whitepages, truecaller, Facebook, name/address
ip — Shodan, abuse/blacklist, open ports, whois
```
**SEO dorks available (by subcategory)**
```
indexation — site:, sitemap, blog, tag/category pages
backlinks — external mentions, anchor text, link:
competitors — related:, same-keyword competitors
content_audit — intitle/inurl/intext keyword matching
technical_seo — sitemap XML, robots.txt, noindex, canonical, hreflang, schema
```
**Academic dorks available (by subcategory)**
```
papers — arxiv, ResearchGate, academia.edu, DOI, .edu PDFs
datasets — CSV, JSON, Kaggle, HuggingFace, Zenodo
authors — Google Scholar, ORCID, ResearchGate, publication lists
```
**Code dorks available (by subcategory)**
```
repositories — GitHub, GitLab, Bitbucket, Codeberg, Sourcehut
packages — npm, PyPI, crates.io, RubyGems, Packagist, pkg.go.dev
documentation — ReadTheDocs, README, API references
issues_bugs — GitHub issues, StackOverflow errors
```
## Advanced Usage
**Cross-engine dork translation**
```python
from search_intelligence_skill import DorkGenerator
gen = DorkGenerator()
# Build a Google dork
dork = gen.generate_custom(
keyword="secret",
domain="example.com",
filetype="env",
intitle="config",
)
print(f"Google: {dork.query}")
# → site:example.com filetype:env intitle:"config" secret
# Translate to Yandex (filetype → mime, intitle → title)
yandex = gen.translate(dork, "yandex")
print(f"Yandex: {yandex.query}")
# → site:example.com mime:env title:"config" secret
# Translate to Bing
bing = gen.translate(dork, "bing")
print(f"Bing: {bing.query}")
# Translate to DuckDuckGo (drops unsupported operators)
ddg = gen.translate(dork, "duckduckgo")
print(f"DDG: {ddg.query}")
# Translate to an engine without operator support (strips all operators)
plain = gen.translate(dork, "wikipedia")
print(f"Plain: {plain.query}")
```
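The translation step can be approximated with a rename table and one regular expression. This is a sketch under assumptions, not the package's code: the renames come from the operator mapping listed in this document, and engines missing from the table are assumed to get every operator prefix stripped while keeping the bare value. `translate_dork` is a hypothetical name.

```python
import re

# Renames per target engine; an empty dict means "same syntax as Google"
OPERATOR_MAP = {
    "google": {},
    "yandex": {"filetype": "mime", "intitle": "title"},
    "bing": {"intext": "inbody"},
}

# operator:value, where value is either a quoted phrase or a bare token
OPERATOR_RE = re.compile(r'\b([a-z]+):("[^"]*"|\S+)')


def translate_dork(query, target_engine):
    """Rewrite operator names for target_engine, or strip them if unknown."""
    renames = OPERATOR_MAP.get(target_engine)

    def swap(match):
        op, value = match.group(1), match.group(2)
        if renames is None:
            # Unknown engine (assumption): drop the operator, keep the value
            return value.strip('"')
        return f"{renames.get(op, op)}:{value}"

    return OPERATOR_RE.sub(swap, query)


google_query = 'site:example.com filetype:env intitle:"config" secret'
print(translate_dork(google_query, "yandex"))
print(translate_dork(google_query, "wikipedia"))
```

A table-driven design keeps per-engine differences in data, so adding an engine means adding one dictionary entry.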
**Result scoring details**
```python
# Each result is scored on 7 signals (0-10 scale):
#
# 1. SearXNG base score (normalized) — weight: 2.0
# 2. Keyword match in title + snippet — weight: 3.0
# 3. Multi-engine agreement (appeared in N) — weight: 0.5/engine, max 2.0
# 4. Position rank (lower = better) — weight: 1.5
# 5. Source credibility (.gov +1.5, .edu +1.4, arxiv +1.4, etc.)
# 6. Content quality (snippet length, HTTPS, URL sanity)
# 7. Intent-specific boost (arxiv for academic, github for code, etc.)
#
# Credibility penalties: spam (-0.7), "click here" (-0.5), "free download" (-0.4)
```
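As a rough illustration of how weighted signals like these combine, the sketch below implements five of them with the weights listed above. How the package normalizes the base score and turns rank into a position score is not documented here, so both are assumptions: the base score is taken as already in 0..1, and position decays as 1/rank. `score_result` is a hypothetical function.

```python
from urllib.parse import urlparse

# Credibility bonuses taken from the list above
CREDIBILITY_BONUS = {".gov": 1.5, ".edu": 1.4}


def score_result(title, snippet, url, keywords, base_score, engines, best_position):
    """Combine several relevance signals into a 0-10 score."""
    text = f"{title} {snippet}".lower()
    # Keyword match: fraction of keywords present, weight 3.0
    hits = sum(1 for kw in keywords if kw.lower() in text)
    keyword_part = 3.0 * hits / len(keywords) if keywords else 0.0
    # Base score: assumed normalized to 0..1, weight 2.0
    base_part = 2.0 * min(max(base_score, 0.0), 1.0)
    # Engine agreement: 0.5 per engine, capped at 2.0
    engine_part = min(0.5 * len(engines), 2.0)
    # Position: rank 1 earns the full 1.5 (assumed 1/rank decay)
    position_part = 1.5 / best_position if best_position >= 1 else 0.0
    # Credibility: bonus chosen by host suffix
    host = urlparse(url).hostname or ""
    cred_part = max(
        (bonus for suffix, bonus in CREDIBILITY_BONUS.items() if host.endswith(suffix)),
        default=0.0,
    )
    total = base_part + keyword_part + engine_part + position_part + cred_part
    return min(total, 10.0)


score = score_result(
    title="Python web frameworks",
    snippet="A comparison of frameworks",
    url="https://example.edu/x",
    keywords=["python", "frameworks"],
    base_score=0.5,
    engines=["google", "bing"],
    best_position=1,
)
print(f"{score:.1f}")
```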
**Auto-refinement behavior**
```python
# When auto_refine=True (default) and results < 5:
# 1. Analyzer generates refined queries (broader, different keywords)
# 2. Skill executes up to 3 refinement queries
# 3. New results are merged with originals
# 4. Full dedup + re-scoring runs
# 5. Process repeats up to max_refine_rounds
skill = SearchSkill(
searxng_url="http://localhost:8888",
auto_refine=True,
max_refine_rounds=2, # Try refining up to 2 times
)
# Disable auto-refinement for speed-critical paths
skill_fast = SearchSkill(
searxng_url="http://localhost:8888",
auto_refine=False,
)
```
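The refinement loop described in the comments above can be sketched with two injected callables, so it runs without a SearXNG instance. The threshold of 5 results and the cap of 3 refinement queries come from the comments; everything else, including the names `search_with_refinement`, `run_search`, and `refine`, is hypothetical.

```python
def search_with_refinement(query, run_search, refine, min_results=5,
                           max_rounds=1, max_queries_per_round=3):
    """run_search(q) returns a list of URLs; refine(q) returns alternative queries."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    results = list(dict.fromkeys(run_search(query)))
    tried = [query]
    for _ in range(max_rounds):
        if len(results) >= min_results:
            break  # enough results, no refinement needed
        for alt in refine(query)[:max_queries_per_round]:
            tried.append(alt)
            results.extend(run_search(alt))
        # Merge new results with the originals and deduplicate again
        results = list(dict.fromkeys(results))
    return results, tried


# Stub search backend: a narrow query with one hit, broader ones with more
INDEX = {"narrow": ["a"], "broad 1": ["a", "b", "c"], "broad 2": ["d", "e"]}
results, tried = search_with_refinement(
    "narrow",
    run_search=lambda q: INDEX.get(q, []),
    refine=lambda q: ["broad 1", "broad 2"],
)
print(results, tried)
```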
**Entity extraction capabilities**
```python
from search_intelligence_skill import IntentParser
parser = IntentParser()
# Domains
intent = parser.parse("scan example.com for vulnerabilities")
# entities: {"domain": "example.com"}
# Emails
intent = parser.parse("investigate user@company.com")
# entities: {"email": "user@company.com", "email_domain": "company.com"}
# IPs
intent = parser.parse("lookup 192.168.1.1")
# entities: {"ip": "192.168.1.1"}
# CVEs
intent = parser.parse("details on CVE-2024-3094")
# entities: {"cve": "CVE-2024-3094"}
# Phone numbers
intent = parser.parse("find owner of +1-555-123-4567")
# entities: {"phone": "+1-555-123-4567"}
# Usernames
intent = parser.parse("find accounts for @johndoe42")
# entities: {"username": "johndoe42"}
# Names (quoted)
intent = parser.parse('investigate "John Smith"')
# entities: {"name": "John Smith"}
# Names (capitalized pattern)
intent = parser.parse("find information about Jane Doe")
# entities: {"name": "Jane Doe"}
# File types
intent = parser.parse("find documents filetype:pdf")
# entities: {"filetype": "pdf"}
# Years
intent = parser.parse("research papers from 2024")
# entities: {"year": "2024"}
# Multiple entities combined
intent = parser.parse('CVE-2024-3094 on example.com "John Doe"')
# entities: {"cve": "CVE-2024-3094", "domain": "example.com", "name": "John Doe"}
```
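Entity extraction of this kind is typically pattern-based. The sketch below covers five of the entity types with regular expressions. It is an approximation for illustration, not the parser's actual rules: the patterns and the `extract_entities` name are assumptions, and only the first match of each type is kept.

```python
import re

# Order matters: more specific patterns run first and consume their match
PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def extract_entities(text):
    """Return the first match for each entity type found in text."""
    entities = {}
    remaining = text
    for name, pattern in PATTERNS.items():
        match = pattern.search(remaining)
        if match:
            entities[name] = match.group(0)
            # Blank out the match so a CVE's year is not re-read as a year
            # and an email's host is not re-read as a domain
            remaining = remaining.replace(match.group(0), " ")
    if "email" in entities:
        entities["email_domain"] = entities["email"].split("@", 1)[1]
    return entities


print(extract_entities("CVE-2024-3094 on example.com"))
```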
**Time range detection**
```python
from search_intelligence_skill import IntentParser
parser = IntentParser()
parser.parse("news today").time_range # "day"
parser.parse("what happened this week").time_range # "week"
parser.parse("articles from last month").time_range # "month"
parser.parse("publications this year").time_range # "year"
parser.parse("latest updates on AI").time_range # "month" (heuristic)
parser.parse("history of computing").time_range # "" (no time constraint)
```
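A phrase table is enough to reproduce the mappings shown above. The sketch below is a guess at the general approach, not the parser's real rule set; `TIME_HINTS` and `detect_time_range` are hypothetical names, and the values are the SearXNG `time_range` strings used in this document.

```python
# (phrase, time_range) pairs, checked in order; first hit wins
TIME_HINTS = [
    ("today", "day"),
    ("this week", "week"),
    ("last month", "month"),
    ("this year", "year"),
    ("latest", "month"),  # heuristic: "latest" implies recent content
]


def detect_time_range(query):
    """Map time phrases in a query to a time_range value, or "" if none."""
    lowered = query.lower()
    for phrase, time_range in TIME_HINTS:
        if phrase in lowered:
            return time_range
    return ""


for q in ["news today", "latest updates on AI", "history of computing"]:
    print(repr(q), "->", repr(detect_time_range(q)))
```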
**Constraint extraction**
```python
from search_intelligence_skill import IntentParser
parser = IntentParser()
# Language constraints
intent = parser.parse("machine learning tutorials in spanish")
# constraints: {"language": "es"}
# Exhaustive hints
intent = parser.parse("find everything about this vulnerability")
# constraints: {"exhaustive": True}
# Result limits
intent = parser.parse("top 20 python frameworks")
# constraints: {"limit": 20}
# Exclusion hints
intent = parser.parse("web frameworks except Django without Flask")
# constraints: {"exclude": ["django", "flask"]}
```
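Three of these constraints can be approximated with short regular expressions. This is a sketch under assumptions: the trigger words (`in <language>`, `top N`, `except`/`without`) are inferred from the examples above, the language table is deliberately tiny, and `extract_constraints` is a hypothetical name.

```python
import re

# ISO 639-1 codes for a few languages; extend as needed
LANGUAGE_CODES = {"spanish": "es", "french": "fr", "german": "de"}


def extract_constraints(query):
    """Pull language, limit, and exclusion hints out of a query."""
    lowered = query.lower()
    constraints = {}
    # "in <language>": check every "in X" pair, keep the first known language
    for word in re.findall(r"\bin (\w+)", lowered):
        if word in LANGUAGE_CODES:
            constraints["language"] = LANGUAGE_CODES[word]
            break
    # "top N": a result limit
    limit = re.search(r"\btop (\d+)\b", lowered)
    if limit:
        constraints["limit"] = int(limit.group(1))
    # "except X" / "without X": terms to exclude
    excluded = re.findall(r"\b(?:except|without) (\w+)", lowered)
    if excluded:
        constraints["exclude"] = excluded
    return constraints


print(extract_constraints("web frameworks except Django without Flask"))
```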
**Pagination**
```python
from search_intelligence_skill import SearXNGClient
client = SearXNGClient(base_url="http://localhost:8888")
# Fetch multiple pages
all_results = []
for page in range(1, 4):
    raw = client.search("python frameworks", pageno=page)
    results = client.parse_results(raw)
    if not results:
        break  # no more pages
    all_results.extend(results)
print(f"Total across up to 3 pages: {len(all_results)}")
client.close()
```
**Rate limiting and retries**
```python
# Built-in rate limiting between requests
skill = SearchSkill(
searxng_url="http://localhost:8888",
rate_limit=1.0, # 1 second minimum between requests
max_retries=3, # Retry failed requests up to 3 times
timeout=30.0, # 30 second timeout per request
)
# Rate limiting is automatic — no manual sleep() needed
# Retries use increasing delays on 429 (Too Many Requests)
```
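For readers curious how a minimum interval and 429 retries fit together, here is a minimal sketch. It is not the client's implementation: the doubling delay schedule is an assumption (the text above only says delays increase), and `RateLimitedCaller` is a hypothetical class. The clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time


class RateLimitedCaller:
    """Enforce a minimum gap between requests and retry on HTTP 429."""

    def __init__(self, min_interval=1.0, max_retries=3,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.max_retries = max_retries
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def _wait_for_slot(self):
        # Sleep only for whatever part of the interval has not yet elapsed
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()

    def call(self, request):
        """request() must return a (status_code, body) tuple."""
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            status, body = request()
            if status != 429:
                return status, body
            if attempt < self.max_retries:
                # Assumed schedule: 1s, 2s, 4s, ...
                self._sleep(2 ** attempt)
        return status, body
```

After the final retry the last 429 response is returned to the caller rather than raised, which leaves the error-handling policy to the calling code.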
**Logging for debugging**
```python
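import logging
from search_intelligence_skill import SearchSkill
# See everything the skill does
logging.basicConfig(level=logging.DEBUG)
# Or just info-level (use instead of the line above; a second basicConfig call is ignored)
# logging.basicConfig(level=logging.INFO)
skill = SearchSkill(searxng_url="http://localhost:8888")
report = skill.search("find exposed .env files on example.com", depth="standard")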
# Logs will show:
# INFO — Intent: security/exposed_files (confidence=0.95) — entities: {"domain": "..."}
# INFO — Strategy: multi_angle — 2 steps
# DEBUG — Executing step 1: Search angle 1
# DEBUG — Search 'site:... filetype:env' returned 12 results
# DEBUG — Executing step 2: Search angle 2
# INFO — Search complete: 23 results, 4.21s, 4 engines
```
## API Methods
| Method | Purpose | Returns |
|---|---|---|
| `skill.search(query, depth, ...)` | Full intelligent search pipeline | `SearchReport` |
| `skill.search_dork(dork, ...)` | Execute raw dork query directly | `SearchReport` |
| `skill.suggest_queries(query)` | Preview dorks without executing | `list[DorkQuery]` |
| `skill.build_dork(keyword, ...)` | Build custom dork from parameters | `DorkQuery` |
| `skill.execute_strategy(name, target)` | Run named strategy against target | `SearchReport` |
| `skill.search_batch(queries, ...)` | Execute multiple searches | `list[SearchReport]` |
| `skill.health_check()` | Check SearXNG connectivity | `bool` |
| `skill.close()` | Close HTTP client | `None` |
## SearchReport Properties
| Property | Type | Description |
|---|---|---|
| `.query` | `str` | Original natural language query |
| `.intent` | `SearchIntent` | Parsed intent with category, entities, keywords |
| `.strategy` | `SearchStrategy` | Strategy that was used (name, steps) |
| `.results` | `list[SearchResult]` | Scored and deduplicated results |
| `.total_found` | `int` | Total results before deduplication |
| `.suggestions` | `list[str]` | Refinement suggestions |
| `.refined_queries` | `list[str]` | Auto-refinement queries used |
| `.errors` | `list[str]` | Errors encountered during search |
| `.timing_seconds` | `float` | Total wall-clock time |
| `.engines_used` | `list[str]` | Engines that returned results |
| `.to_context(max_results)` | `str` | LLM-formatted text output |
| `.top(n)` | `list[SearchResult]` | Top N by relevance score |
## SearchResult Properties
| Property | Type | Description |
|---|---|---|
| `.title` | `str` | Result title |
| `.url` | `str` | Result URL |
| `.snippet` | `str` | Content snippet / description |
| `.engines` | `list[str]` | Which SearXNG engines returned it |
| `.score` | `float` | Raw SearXNG score |
| `.relevance` | `float` | Computed multi-signal relevance (0-10) |
| `.category` | `str` | SearXNG result category |
| `.positions` | `list[int]` | Rank positions across engines |
| `.metadata` | `dict` | Extra fields: publishedDate, thumbnail, img_src |
## Troubleshooting
**SearXNG not reachable**
```bash
# Check the instance is running
curl http://localhost:8888/healthz
# Check JSON API is enabled
curl "http://localhost:8888/search?q=test&format=json"
# Common fixes:
# 1. Ensure port mapping is correct (docker: -p 8888:8080)
# 2. Ensure search.formats includes "json" in settings.yml
# 3. Check firewall rules
```
```python
if not skill.health_check():
    print("SearXNG unreachable: check URL, port, and settings")
```
**No results returned**
```python
report = skill.search("very specific obscure query")
if not report.results:
    print("No results. Try:")
    print(" 1. Broader keywords")
    print(" 2. Different depth: depth='deep'")
    print(" 3. Check suggestions:", report.suggestions)
    print(" 4. Check errors:", report.errors)
    print(" 5. Try different engines:", report.engines_used)
# Manual broader search
report2 = skill.search("broader version of query", depth="deep")
```
**Timeout errors**
```python
# Increase timeout for complex queries
skill = SearchSkill(
    searxng_url="http://localhost:8888",
    timeout=60.0,   # 60 seconds
    max_retries=3,  # More retries
)
```
**Rate limiting (429 errors)**
```python
# Increase delay between requests
skill = SearchSkill(
    searxng_url="http://localhost:8888",
    rate_limit=2.0,  # 2 seconds between requests
)
```
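To illustrate what a per-request delay means in practice, here is a minimal sketch of a minimum-interval limiter. It is not the package's implementation, which the documentation does not show; `MinIntervalLimiter` is a hypothetical class, and its clock and sleep functions are injectable so the demo runs without actually waiting.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


limiter = MinIntervalLimiter(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
limiter.wait()        # first call: no delay
fake_now[0] += 0.5    # 0.5 s of other work happens
limiter.wait()        # must wait the remaining 1.5 s
print(slept)
```

With a real clock, calling `limiter.wait()` before each request would space requests at least `min_interval` seconds apart, which is the behavior a larger `rate_limit` value is meant to produce.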
**SSL errors (local development only)**
```python
skill = SearchSkill(
    searxng_url="https://localhost:8888",
    verify_ssl=False,  # ONLY for local dev; never in production
)
```
**Wrong intent detected**
```python
# If auto-detection picks the wrong category, run a dork directly:
report = skill.search_dork(
    'site:example.com filetype:pdf "annual report"',
    engines=["google", "bing"],
)
# Or force engines/categories:
report = skill.search(
    "some ambiguous query",
    engines=["google_scholar", "arxiv"],
    categories=["science"],
)
```
**Memory usage with large result sets**
```python
# Limit results to control memory
report = skill.search("broad query", depth="exhaustive", max_results=50)
# Handle results one at a time instead of building additional copies
for r in report.results:
    print(r.relevance, r.url)  # replace with your own per-result handling
```
## How It All Works Together
```
    User Query
         │
         ▼
┌─────────────────┐
│  IntentParser   │──→ category, subcategory, entities, keywords
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  DorkGenerator  │──→ 5-20 optimized dork queries with operators
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ StrategyPlanner │──→ multi-step plan (which dorks, which engines, what order)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  SearXNGClient  │──→ executes queries against your instance (retries, rate limit)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ ResultAnalyzer  │──→ dedup, score, rank, credibility check
└────────┬────────┘
         │
 (if results poor)
         │
         ▼
┌─────────────────┐
│   Auto-Refine   │──→ generate new queries, re-search, re-analyze
└────────┬────────┘
         │
         ▼
   SearchReport
     .to_context() → LLM-ready text
     .top(n)       → best results
     .results      → full list
```
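The "dedup, score, rank" stage can be pictured with a small sketch. The code below is not the package's `ResultAnalyzer`; it is a simplified illustration under stated assumptions: raw hits are plain dicts with `url`, `title`, and `engine` keys (a hypothetical shape), duplicates are detected by a normalized URL that ignores scheme, `www.`, query string, and fragment, and the only ranking signal is how many engines returned a page.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparison key (simplification: query and fragment are dropped)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_duplicates(raw_hits):
    """Merge hits that point at the same page, collecting the engines that found it."""
    merged = {}
    for hit in raw_hits:
        key = normalize_url(hit["url"])
        entry = merged.setdefault(
            key, {"url": hit["url"], "title": hit["title"], "engines": []}
        )
        if hit["engine"] not in entry["engines"]:
            entry["engines"].append(hit["engine"])
    # Pages confirmed by more engines rank first; ties keep first-seen order.
    return sorted(merged.values(), key=lambda e: len(e["engines"]), reverse=True)


hits = [
    {"url": "https://example.org/blog", "title": "Blog", "engine": "google"},
    {"url": "https://www.example.com/docs/", "title": "Docs", "engine": "google"},
    {"url": "http://example.com/docs", "title": "Docs", "engine": "bing"},
]

pages = merge_duplicates(hits)
for page in pages:
    print(len(page["engines"]), page["url"], page["engines"])
```

A real analyzer would combine more signals (the tables above mention a 0-10 relevance score and per-engine positions); engine agreement is used here only because it is the easiest one to show in a few lines.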
## Notes
**Privacy**
- All searches route through YOUR SearXNG instance
- Zero API keys required for any engine
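- Queries reach third parties only through the requests SearXNG itself makes to upstream engines
- Upstream engines see your SearXNG instance's IP address rather than the client's, and SearXNG's tracker URL remover plugin can strip tracking parameters from result URLs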
**Performance tips**
- Reuse the `SearchSkill` instance across searches (connection pooling)
- Use `depth="quick"` for simple lookups, reserve `"deep"` / `"exhaustive"` for research
- Set `auto_refine=False` for speed-critical paths
- Use `skill.suggest_queries()` to preview before executing expensive searches
- Batch independent queries with `skill.search_batch()`
**Accuracy tips**
- Include specific entities in your query (domains, emails, CVEs, names)
- Use quoted phrases for exact matching: `'find "exact phrase"'`
- Specify time ranges when freshness matters: `"latest news this week"`
- Use `depth="deep"` or `"exhaustive"` for comprehensive coverage
- Check `report.suggestions` for refinement ideas
- Check `report.intent` to verify the skill understood your query correctly
**Extending the skill**
- Add new dork templates in `config.py` → `DORK_TEMPLATES`
- Add new intent signals in `config.py` → `INTENT_SIGNALS`
- Add new engines in `config.py` → `ENGINE_CATEGORIES`
- Add new operator translations in `config.py` → `OPERATOR_SUPPORT`
- Add new strategies in `config.py` → `STRATEGY_DEFINITIONS`
- Add new subcategory detection in `intent.py` → `SUBCATEGORY_PATTERNS`
**Confirm before sensitive operations**
- Security scanning dorks may trigger alerts on target domains
- OSINT queries may involve personal information — use responsibly
- Always validate that the target domain/entity is authorized for testing
- This tool is for legitimate research, authorized security testing, and SEO analysis