Anton Smirnov

Computational Linguist/NLP Engineer

About Me

Master’s student at Freie Universität Berlin (Interdisciplinary Studies of the Middle East), focusing on computational linguistics and digital text analysis. My work involves processing historical Arabic, Persian, and Chinese corpora using OCR, NLP, and vector-based methods. I build tools for semantic retrieval, layout processing, and long-range comparison of textual structures. Interested in open-source, reproducible pipelines and in applying machine learning to classical texts.

Portfolio

AskSunna — Semantic Engine for Sunni Hadith Corpora

Research Project — Berlin, Germany | June 2025 – Present

Try it out

Search and analysis system over 50,000+ hadiths from major Sunni collections. Enables semantic retrieval, metadata filtering, and cross-collection comparison.

• Sources: al-Bukhari, Muslim, Abu Dawood, Tirmidhi, Nasai, Ibn Majah, Muwatta, Darimi, Ahmad, and others
• Data: Parquet corpus with full metadata (book, chapter, uid_chunked, source)
• Indexing: One FAISS vector index per collection using OpenAI text-embedding-3-large (Arabic)
• Retrieval: Fast semantic search across canonical hadith texts with structured output
• Focus: Tooling for vector-based exploration of Sunni hadith tradition

al-Bukhari RAG — Deployed Semantic QA Web App

Personal Research Project — Berlin, Germany | May 2025 – Present

Try it out

Fully deployed Retrieval-Augmented Generation (RAG) app for semantic search over the Sahih al-Bukhari corpus.

• LLM: DeepSeek (via OpenRouter, free-tier API)
• UI: Streamlit interface with real-time query execution
• Backend: FAISS vector index with OpenAI embeddings (text-embedding-ada-002), LangChain integration
• Deployment: Render (Free Web Service Plan) with persistent vector storage
• Infrastructure: Lightweight, cost-free deployment without runtime OpenAI usage

HadithView — Metadata-Based Corpus Explorer

Independent Project — Berlin, Germany | May 2025 – Present

Tool for manual browsing of hadith corpora by metadata fields. Used for inspecting QA system outputs and supporting error analysis.

• Data: Parquet + pandas
• UI: Streamlit with filter/search capabilities
• Purpose: Assist in debugging, dataset inspection, and manual verification

HadithRAG — Scalable QA for Multi-Volume Hadith Collections

Independent Project — Berlin, Germany | May 2025 – Present

Extension of the Bukhari pipeline to larger multi-volume hadith collections (e.g., Muslim, Tirmidhi). Focus on efficient indexing and multilingual support.

• Data: Parquet format with structured metadata
• Retrieval: ChromaDB + OpenAI embeddings
• UI: Streamlit with language filters and search
• Focus: Performance, modularity, and extensibility for future corpora

JLProj — Johnson–Lindenstrauss Projection Toolkit

Python Package — Berlin, Germany | May 2025 – Present

PyPI Package

Lightweight Python library for fast Johnson–Lindenstrauss (JL) dimensionality reduction. Designed to project high-dimensional text embeddings into fewer dimensions while approximately preserving pairwise distances.

• CLI & Python API: Apply JL projections to vectors or corpora
• Integration: Compatible with FAISS for scalable similarity search
• Use Case: Clustering, semantic analysis, fast search in high-dimensional spaces
• Deployment: Available via PyPI, includes CLI and API docs

OCRmyPDF-RAG Assistant

Independent Project — Berlin, Germany | May 2025 – Present

Pipeline for question answering over scanned academic PDFs. Includes OCR, layout extraction, semantic search, and source-linked output.

• OCR & Layout: OCRmyPDF, Tesseract, PyMuPDF
• Retrieval: ChromaDB, OpenAI embeddings, LangChain RetrievalQA
• Task: Connect user queries to relevant content in scanned texts
• Output: Page-level traceability and structured answers

Experience

ASR & NLP Pipeline Assistant

Subcontract for PhosAgro (via third party)

2023

Built a simple tool to transcribe and clean up audio interviews for an internal project. Used a Hugging Face wav2vec2 model for Russian speech recognition, followed by a language model pass to improve readability (punctuation restoration, minor corrections, and layout structuring). The tool processed a specific batch of interview recordings for internal documentation.

Education

Freie Universität Berlin

MA Interdisciplinary Studies of the Middle East

2023 – Present

Focus on computational methods for historical Arabic and Persian texts. Coursework and research include OCR preprocessing, token normalization, embedding-based analysis of semantic change, and vector comparison across time periods. Participated in the FU–Hebrew University Summer School on Digital Humanities.

Higher School of Economics, Moscow

BA Asian and African Studies

2017 – 2022

Studied Arabic, Persian, and Chinese linguistics with an emphasis on language change and policy. Bachelor’s thesis involved frequency analysis of language reform terms in Persian using historical corpora and basic NLP techniques.

Soft Skills

Ownership & Initiative, Analytical Clarity, Resilience in Complexity, Direct Communication, Learning Autonomy, Independent Problem Solving, Focused Execution

Skills & Languages

Programming Languages & Tools:
Python, JavaScript, SQL, HTML/CSS, Git, Docker

NLP & Semantic Retrieval:
RAG pipelines (LangChain RetrievalQA), OpenAI embeddings, FAISS, ChromaDB, tokenization, lemmatization, semantic filtering, multilingual input handling

OCR & Text Structuring:
OCRmyPDF, Tesseract, PyMuPDF for layout-aware extraction; structured output generation for scanned documents

Speech & Postprocessing:
Hugging Face wav2vec2 (Russian), LLM-based text cleanup, punctuation restoration, layout formatting

Data Handling & Analysis:
pandas, NumPy, Parquet, SQLite; embedding similarity, UMAP, PCA, frequency statistics

Languages:
Russian (native), English (C2), Chinese (fluent), Arabic (fluent reading), German (B2), Persian (academic reading)

Keywords

Languages & Tools: Python, JavaScript, SQL, HTML, CSS, Git, Docker

Libraries & Frameworks: pandas, NumPy, scikit-learn, matplotlib, Hugging Face, Streamlit, LangChain, PyMuPDF

NLP & Semantic Retrieval: Retrieval-Augmented Generation (RAG), LangChain RetrievalQA, OpenAI embeddings, FAISS, ChromaDB, UMAP, PCA, lemmatization, tokenization, frequency analysis

OCR & Layout: OCRmyPDF, Tesseract, PyMuPDF layout extraction, text structuring from scanned sources

Dimensionality Reduction & Indexing: JLProj (JL Projection), UMAP, PCA, FAISS

ASR & Postprocessing: Hugging Face wav2vec2, LLM-based transcript cleanup, punctuation and layout formatting

Corpus Work: Arabic and Persian corpora, historical text normalization, semantic shift tracking