SkywalkerDarren/chatWeb
A system that crawls web pages and documents, extracts text, stores embeddings in a vector database, and uses GPT-3.5 to answer questions based on retrieved content.

ChatWeb crawls any webpage or extracts text from PDF, DOCX, and TXT files. It generates embeddings of the extracted text with the OpenAI embedding API, stores the vector-to-text mappings in a vector database (FAISS or pgvector), and retrieves the text chunks most similar to the query via nearest-neighbor search; GPT-3.5 then generates the answer from those chunks. To improve retrieval accuracy, the query vector is generated from keywords extracted from the question rather than from the raw question. Because only the relevant chunks are passed to the model, large texts can be queried without exceeding the model's token limit.
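The retrieval flow described above can be sketched in a few lines of Python. This is a minimal illustration, not ChatWeb's actual code: `toy_embed`, `chunk`, `TinyVectorStore`, and `build_prompt` are hypothetical names, the bag-of-words vector is a stand-in for a real embedding API call, and the in-memory list stands in for FAISS or pgvector.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding API call (assumption): sparse word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text, max_words=40):
    """Split long text into fixed-size word windows so each fits a prompt."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


class TinyVectorStore:
    """In-memory stand-in for a vector database such as FAISS or pgvector."""

    def __init__(self):
        self.items = []  # list of (vector, original text chunk)

    def add(self, chunks):
        for c in chunks:
            self.items.append((toy_embed(c), c))

    def search(self, query, k=2):
        """Brute-force nearest-neighbor search by cosine similarity."""
        q = toy_embed(query)
        scored = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in scored[:k]]


def build_prompt(question, keywords, store, k=2):
    """Retrieve with keywords (not the raw question), then assemble a prompt."""
    context = "\n---\n".join(store.search(keywords, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    store = TinyVectorStore()
    store.add([
        "FAISS is a library for efficient similarity search over dense vectors.",
        "Bananas are rich in potassium and grow in tropical climates.",
        "PostgreSQL is a relational database and pgvector adds vector columns.",
    ])
    # The keywords would normally be extracted from the question by the model.
    print(build_prompt("What is FAISS used for?", "FAISS similarity search", store, k=1))
```

Only the top `k` chunks reach the prompt, which is what keeps the request within the model's token limit no matter how large the source text is. In a real deployment the prompt would be sent to the chat model, and the brute-force scan would be replaced by the vector database's index.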