About Me
Master’s student at Freie Universität Berlin (Interdisciplinary Studies of the Middle East), focusing on computational linguistics and digital text analysis. My work involves processing historical Arabic, Persian, and Chinese corpora using OCR, NLP, and vector-based methods. I build tools for semantic retrieval, layout processing, and long-range comparison of textual structures. Interested in open-source, reproducible pipelines and in applying machine learning to classical texts.
Portfolio
AskSunna — Semantic Engine for Sunni Hadith Corpora
Research Project — Berlin, Germany | June 2025 – Present
Search and analysis system spanning 50,000+ hadiths from the major Sunni collections. Enables semantic retrieval, metadata filtering, and cross-collection comparison. A minimal indexing sketch follows the list below.
• Sources: al-Bukhari, Muslim, Abu Dawood, Tirmidhi, Nasai, Ibn Majah, Muwatta, Darimi, Ahmad, and others
• Data: Parquet corpus with full metadata (book, chapter, uid_chunked, source)
• Indexing: One FAISS vector index per collection using OpenAI text-embedding-3-large (Arabic)
• Retrieval: Fast semantic search across canonical hadith texts with structured output
• Focus: Tooling for vector-based exploration of Sunni hadith tradition
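For illustration, a minimal sketch of the per-collection indexing step described above, assuming the metadata schema listed there; the text column name and file path are hypothetical:

# Build one FAISS index per hadith collection from a Parquet corpus.
# Assumes a 'source' column naming the collection and a hypothetical
# 'text' column holding the Arabic text; both names are illustrative.
import faiss
import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()  # OPENAI_API_KEY is read from the environment

def embed(texts: list[str]) -> np.ndarray:
    # In practice the corpus is sent in batches; shown unbatched for brevity.
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

corpus = pd.read_parquet("hadiths.parquet")        # illustrative path
for source, group in corpus.groupby("source"):
    vecs = embed(group["text"].tolist())
    faiss.normalize_L2(vecs)                       # cosine similarity via inner product
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    faiss.write_index(index, f"{source}.faiss")    # one index per collection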
al-Bukhari RAG — Deployed Semantic QA Web App
Personal Research Project — Berlin, Germany | May 2025 – Present
Fully deployed Retrieval-Augmented Generation (RAG) app for semantic search over the Sahih al-Bukhari corpus. A minimal sketch of the retrieval chain follows the list below.
• LLM: DeepSeek (via OpenRouter, free-tier API)
• UI: Streamlit interface with real-time query execution
• Backend: FAISS vector index with OpenAI embeddings (ada-002), LangChain integration
• Deployment: Render (Free Web Service Plan) with persistent vector storage
• Infrastructure: Lightweight, cost-free deployment without runtime OpenAI usage
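A minimal sketch of how such a chain can be wired, assuming the langchain-community and langchain-openai packages; the index path, OpenRouter model id, and API key are illustrative:

# Load a prebuilt FAISS index and answer questions with a DeepSeek model
# served through OpenRouter. Paths and model identifiers are illustrative.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
store = FAISS.load_local("bukhari_index", embeddings,
                         allow_dangerous_deserialization=True)

llm = ChatOpenAI(
    model="deepseek/deepseek-chat",              # assumed OpenRouter model id
    base_url="https://openrouter.ai/api/v1",
    api_key="...",                               # OpenRouter key, not an OpenAI key
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,                # keep hadith sources with each answer
)
print(qa.invoke({"query": "..."}))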
HadithView — Metadata-Based Corpus Explorer
Independent Project — Berlin, Germany | May 2025 – Present
Tool for manual browsing of hadith corpora by metadata fields. Used for inspecting QA system outputs and supporting error analysis.
• Data: Parquet + pandas
• UI: Streamlit with filter/search capabilities
• Purpose: Assist in debugging, dataset inspection, and manual verification
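A minimal sketch of the filter pattern, assuming the metadata columns listed above; the text column name and file path are hypothetical:

# Browse a hadith Parquet file by metadata fields in Streamlit.
# Column names follow the schema above; the file path is illustrative.
import pandas as pd
import streamlit as st

@st.cache_data
def load_corpus() -> pd.DataFrame:
    return pd.read_parquet("hadiths.parquet")

df = load_corpus()
source = st.sidebar.selectbox("Collection", sorted(df["source"].unique()))
book = st.sidebar.selectbox("Book", sorted(df.loc[df["source"] == source, "book"].unique()))
query = st.sidebar.text_input("Substring search")

view = df[(df["source"] == source) & (df["book"] == book)]
if query:
    view = view[view["text"].str.contains(query, na=False)]   # 'text' column is assumed
st.dataframe(view)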
HadithRAG — Scalable QA for Multi-Volume Hadith Collections
Independent Project — Berlin, Germany | May 2025 – Present
Extension of the Bukhari pipeline to larger multi-volume hadith collections (e.g., Muslim, Tirmidhi). Focus on efficient indexing and multilingual support.
• Data: Parquet format with structured metadata
• Retrieval: ChromaDB + OpenAI embeddings
• UI: Streamlit with language filters and search
• Focus: Performance, modularity, and extensibility for future corpora
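A minimal sketch of the ChromaDB indexing side, using Chroma's built-in OpenAI embedding function; the collection name, file path, and text column are illustrative:

# Index a multi-volume hadith collection into a persistent ChromaDB store.
# Collection name, file path, and metadata keys are illustrative.
import chromadb
from chromadb.utils import embedding_functions
import pandas as pd

ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="...", model_name="text-embedding-3-large")

client = chromadb.PersistentClient(path="chroma_store")
col = client.get_or_create_collection("hadith", embedding_function=ef)

df = pd.read_parquet("muslim.parquet")
col.add(
    ids=df["uid_chunked"].astype(str).tolist(),
    documents=df["text"].tolist(),                   # 'text' column is assumed
    metadatas=df[["book", "chapter", "source"]].to_dict("records"),
)
results = col.query(query_texts=["..."], n_results=5, where={"source": "Muslim"})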
JLProj — Johnson–Lindenstrauss Projection Toolkit
Python Package — Berlin, Germany | May 2025 – Present
Lightweight Python library for fast Johnson–Lindenstrauss (JL) dimensionality reduction, designed for projecting high-dimensional text embeddings with minimal distortion. A minimal sketch of the underlying transform follows the list below.
• CLI & Python API: Apply JL projections to vectors or corpora
• Integration: Compatible with FAISS for scalable similarity search
• Use Case: Clustering, semantic analysis, fast search in high-dimensional spaces
• Deployment: Available via PyPI, includes CLI and API docs
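For context, the transform itself in a few lines: multiply by a random Gaussian matrix scaled by 1/sqrt(k), which preserves pairwise distances with high probability. This is a generic illustration, not the JLProj API:

# Johnson–Lindenstrauss projection: reduce dimensionality with a random
# Gaussian matrix while approximately preserving pairwise distances.
# Generic illustration only, not the JLProj package interface.
import numpy as np

def jl_project(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Project rows of X with shape (n, d) down to shape (n, k)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # scaling keeps norms unbiased
    return X @ R

X = np.random.default_rng(1).standard_normal((1000, 3072))   # e.g. text embeddings
Y = jl_project(X, k=256)

# Pairwise distances are preserved up to a small relative error.
i, j = 3, 7
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))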
Pipeline for question answering over scanned academic PDFs. Includes OCR, layout extraction, semantic search, and source-linked output.
• OCR & Layout: OCRmyPDF, Tesseract, PyMuPDF
• Retrieval: ChromaDB, OpenAI embeddings, LangChain RetrievalQA
• Task: Connect user queries to relevant content in scanned texts
• Output: Page-level traceability and structured answers
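A minimal sketch of the OCR and page-level extraction step, assuming the ocrmypdf CLI is installed; file paths are illustrative, and the extracted per-page records are what the retrieval step indexes:

# OCR a scanned PDF and extract text per page so answers can be traced
# back to page numbers. File paths are illustrative.
import subprocess
import fitz  # PyMuPDF

subprocess.run(
    ["ocrmypdf", "--skip-text", "scan.pdf", "scan_ocr.pdf"],
    check=True,   # --skip-text leaves pages that already contain text untouched
)

pages = []
with fitz.open("scan_ocr.pdf") as doc:
    for page in doc:
        pages.append({"page": page.number + 1, "text": page.get_text("text")})

# Keeping the page number with each record is what enables
# source-linked, page-level traceable answers downstream.
print(pages[0]["page"], pages[0]["text"][:200])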
Experience
ASR & NLP Pipeline Assistant
Subcontract for FosAgro (via third party)
2023
Built a simple tool to transcribe and clean audio interviews for an internal project. Used a Hugging Face wav2vec2 model for Russian speech recognition, followed by a language model to improve readability: punctuation, minor corrections, and layout structuring. The tool was used to process a specific batch of interview recordings for internal documentation purposes.
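A minimal sketch of the transcription step, using a public Russian wav2vec2 checkpoint as an example (the project's exact model and file names are not reproduced here):

# Transcribe a Russian interview recording with a wav2vec2 CTC model.
# The checkpoint name and audio path are illustrative.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"   # example checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

speech, _ = librosa.load("interview.wav", sr=16_000)         # model expects 16 kHz audio
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

ids = torch.argmax(logits, dim=-1)
raw_text = processor.batch_decode(ids)[0]
# raw_text is then passed to an LLM for punctuation and light cleanup.
print(raw_text)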
Education
Freie Universität Berlin
MA Interdisciplinary Studies of the Middle East
2023 – present
Focus on computational methods for historical Arabic and Persian texts. Coursework and research include OCR preprocessing, token normalization, embedding-based analysis of semantic change, and vector comparison across time periods. Participated in the FU–Hebrew University Summer School on Digital Humanities.
Higher School of Economics, Moscow
BA Asian and African Studies
2017 – 2022
Studied Arabic, Persian, and Chinese linguistics with an emphasis on language change and policy. Bachelor’s thesis involved frequency analysis of language reform terms in Persian using historical corpora and basic NLP techniques.
Soft Skills
Ownership & Initiative, Analytical Clarity, Resilience in Complexity, Direct Communication, Learning Autonomy, Independent Problem Solving, Focused Execution
Skills & Languages
Languages & Tools:
Python, JavaScript, SQL, HTML/CSS, Git, Docker
NLP & Semantic Retrieval:
RAG pipelines (LangChain RetrievalQA), OpenAI embeddings, FAISS, ChromaDB, tokenization, lemmatization, semantic filtering, multilingual input handling
OCR & Text Structuring:
OCRmyPDF, Tesseract, PyMuPDF for layout-aware extraction; structured output generation for scanned documents
Speech & Postprocessing:
Hugging Face wav2vec2 (Russian), LLM-based text cleanup, punctuation restoration, layout formatting
Data Handling & Analysis:
pandas, NumPy, Parquet, SQLite; embedding similarity, UMAP, PCA, frequency statistics
Languages:
Russian (native), English (C2), Chinese (fluent), Arabic (fluent reading), German (B2), Persian (academic reading)
Keywords
Languages & Tools: Python, JavaScript, SQL, HTML, CSS, Git, Docker
Libraries & Frameworks: pandas, NumPy, scikit-learn, matplotlib, Hugging Face, Streamlit, LangChain, PyMuPDF
NLP & Semantic Retrieval: Retrieval-Augmented Generation (RAG), LangChain RetrievalQA, OpenAI embeddings, FAISS, ChromaDB, UMAP, PCA, lemmatization, tokenization, frequency analysis
OCR & Layout: OCRmyPDF, Tesseract, PyMuPDF layout extraction, text structuring from scanned sources
Dimensionality Reduction & Indexing: JLProj (JL Projection), UMAP, PCA, FAISS
ASR & Postprocessing: Hugging Face wav2vec2, LLM-based transcript cleanup, punctuation and layout formatting
Corpus Work: Arabic and Persian corpora, historical text normalization, semantic shift tracking