DeepResearch Part 3: Getting the best web data for your research
Summary
This post details building a robust web data pipeline using SmolAgents. We’ll create tools to retrieve content from various web endpoints, convert it to a consistent format (Markdown), store it efficiently, and then evaluate its relevance and quality using Large Language Models (LLMs). This pipeline is crucial for building a knowledge base for LLM applications.
Web Data Converter (MarkdownConverter)
We leverage the MarkdownConverter class, inspired by the one in autogen, to handle the diverse formats encountered on the web. This ensures consistency for downstream processing.
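To make the idea of "convert everything to Markdown" concrete, here is a minimal sketch of an HTML-to-Markdown pass built only on Python's standard library. It is not the MarkdownConverter class from this post, nor the one in autogen. The names SimpleMarkdownConverter and html_to_markdown are hypothetical, and the sketch handles only headings, paragraphs, links, and flat lists.

```python
import re
from html.parser import HTMLParser


class SimpleMarkdownConverter(HTMLParser):
    """Hypothetical, minimal HTML-to-Markdown converter (illustration only)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # Markdown fragments in document order
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.href = None     # target of the <a> tag currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse whitespace runs but keep single spaces between words.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown text."""
    parser = SimpleMarkdownConverter()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    # Limit consecutive blank lines to one and trim the document edges.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


if __name__ == "__main__":
    sample = (
        "<h1>Title</h1><p>See <a href='https://example.com'>docs</a> now.</p>"
        "<ul><li>one</li><li>two</li></ul><script>x = 1</script>"
    )
    print(html_to_markdown(sample))
```

A real converter would also need to handle tables, code blocks, nested lists, and non-HTML inputs such as PDFs. The point here is only that every source ends up as the same plain Markdown text before it is stored or scored.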