DeepResearch Part 3: Getting the best web data for your research

Summary

This post details how to build a robust web data pipeline using SmolAgents. We’ll create tools to retrieve content from various web endpoints, convert it to a consistent format (Markdown), store it efficiently, and then evaluate its relevance and quality using Large Language Models (LLMs). This pipeline is crucial for building a knowledge base for LLM applications.

Web Data Converter (MarkdownConverter)

We use the MarkdownConverter class, inspired by the one in autogen, to handle the diverse formats encountered on the web.
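To make the idea of converting mixed web content into one consistent format concrete, here is a minimal sketch that dispatches on the content type and emits Markdown. It is not the MarkdownConverter class from this post or the autogen converter; the names `to_markdown` and `_HTMLToMarkdown` are hypothetical, it uses only the Python standard library, and it covers only a small subset of HTML (headings, paragraphs, links, lists).

```python
import json
import re
from html.parser import HTMLParser


class _HTMLToMarkdown(HTMLParser):
    """Hypothetical helper: renders a small subset of HTML as Markdown."""

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.href = None  # target of the <a> tag currently open, if any
        self.skip = 0     # depth inside <script>/<style>, whose text is dropped

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip:
            # Collapse runs of whitespace but keep single separating spaces.
            self.parts.append(re.sub(r"\s+", " ", data))


def to_markdown(content: str, content_type: str = "text/html") -> str:
    """Convert a response body to Markdown based on its MIME type (assumed API)."""
    mime = content_type.split(";")[0].strip().lower()
    if mime == "text/html":
        parser = _HTMLToMarkdown()
        parser.feed(content)
        parser.close()
        lines = [line.strip() for line in "".join(parser.parts).splitlines()]
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    if mime == "application/json":
        fence = "`" * 3  # build the fence marker without writing it literally
        pretty = json.dumps(json.loads(content), indent=2)
        return f"{fence}json\n{pretty}\n{fence}"
    # Fallback: treat anything else as plain text.
    return content.strip()


if __name__ == "__main__":
    page = (
        "<h1>Title</h1><p>See <a href='https://example.com'>the docs</a> now.</p>"
        "<ul><li>one</li><li>two</li></ul><script>x=1</script>"
    )
    print(to_markdown(page))
```

Dispatching on the MIME type keeps each format's handling isolated, so support for further formats (PDF, DOCX, and so on) could be added as extra branches or converter classes without touching the existing ones.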