From local AI infrastructure to everyday apps — everything we create runs on your terms, not ours. Privacy-first software for real problems.
Tools for developers, professionals, teachers, and curious kids. One shared value: your data stays yours.
RAG, Data Generation, and Fine-Tuning — all running on your machine. No cloud dependency. No data leaving your hardware. 10 LLM providers, 17 document formats, 6 vector stores.
Track items across shared spaces. AES-256-GCM encrypted, offline-first, stored in your own cloud.
Safe, fun, COPPA-compliant AI adventures for ages 5–15. 6 worlds, parent controls, teacher tools, zero ads.
50 levels, 9 knowledge layers, 500+ curated questions — the professional quiz platform for AI/ML engineers. Built with FSRS spaced repetition, adaptive learning, and a 2-LLM architecture.
Drag-and-drop viewer for 17+ file formats — TSX, Markdown, JSON, PDF, images and more. Cross-window tab dragging, 100% offline. Free download for Windows, macOS, and Linux.
We don't build to grow. We build because we have a real problem and a better answer. Every product ships when it's right — not when it's convenient.
We don't collect your data. Our tools run locally, storing what you choose, where you choose — your cloud, your device, your call.
Every product we ship solves a real problem we faced ourselves. No vanity projects. No feature bloat. Just the right tool for the job.
From AI infrastructure to kid-friendly apps — the same quality and values across everything we make. Different audiences, same standard.
We tell you exactly what our tools do, what they collect, and how they work. No dark patterns, no hidden terms.
Three powerful tools — RAG, DataGen, and Fine-Tuning — bundled into one local desktop app. No cloud subscriptions. No data exposure. Full control.
# About MokingBird

## We Build Technology That Works for You

MokingBird — registered as **MokingBird Oy** (Business ID: 3615646-1) — is a Finnish technology company with a simple and stubborn belief: software should make your life easier, not harder.

We call ourselves **The Everything Lab** because we build across categories — AI tools, productivity apps, educational software — always guided by the same principles: privacy first, genuine usefulness, and no compromises.

---

## Who We Are

We are a small, focused team based in Finland. We started with a frustration that many developers and power users share: the tools that exist are either too complex, too invasive, or too expensive. Cloud AI products that require accounts, subscriptions, and send your data to servers you don't control. Mobile apps stuffed with ads and tracking. Productivity software that locks you in.

MokingBird exists to build the alternative.

We registered MokingBird Oy in Finland under Business ID **3615646-1**, operating under EU law and GDPR — not because compliance is good marketing, but because we genuinely believe that privacy is a human right that should be baked into software from the start.

---

## Our Philosophy

> **Technology should work for people, not against them.**

**Privacy is not a feature. It's the foundation.**

Everything we build starts with the question: does this respect the person using it?

That means:

- Our AI tools run 100% on your machine. Your documents, your models, your queries — none of it reaches our servers, because we don't have servers watching your activity.
- Our mobile apps contain no advertising. None. We make money when you choose to upgrade, not by selling your attention or your data.
- We follow EU GDPR not as a legal checkbox but as a design principle. Collect the minimum. Store nothing you don't need. Be transparent.

---

## What We Build

We build across four areas:

- **Privacy-first AI tooling** — desktop AI infrastructure that runs on your hardware
- **Utility mobile applications** — practical tools for everyday organization and tracking
- **Education-focused mobile applications** — learning apps that don't compromise on safety or privacy
- **Offline-first desktop tools** — professional tools that work without cloud dependency

### AI Tools (Desktop)

**MokingBird AI** is our flagship desktop ecosystem — a suite of three powerful local AI tools bundled inside the MokingBird Node:

- **mbRAG** — A production-grade Retrieval-Augmented Generation framework. Feed it your documents, connect it to any LLM (local or cloud), and get accurate, context-aware answers. 17 document formats, 10 LLM providers, 6-level contextual retrieval. Everything runs locally.
- **mbDataGen** — A synthetic dataset generation platform for AI/ML training. Generates high-quality, validated training data using our GPRO-Hybrid RL approach. Built for researchers and ML engineers who need clean, domain-specific datasets without the noise.
- **mbFT** — A universal fine-tuning platform supporting 16 techniques across 6 SFT methods, 5 RL methods, and 5 multimodal approaches. Seven frameworks, VRAM pre-simulation so you know before you run, and our original Hybrid GRPO method.

**MB Viewer** is our offline-first desktop file viewer — open TSX, JSX, HTML, Markdown, JSON, CSS, SVG, images, and more. No npm. No build tools. Drop a file, see it rendered. Used by 800+ developers and designers.

### Apps (Mobile)

**Sortify** is a privacy-first inventory tracking app for shared spaces — homes, labs, offices, teams. Items, locations, and history stay on your device. User-owned cloud sync (Google Drive, OneDrive, Dropbox, iCloud). AES-256-GCM encryption. No central database.

**Jogg** is an AI-powered learning app that turns complex topics into engaging quizzes. 9 layers of curriculum from foundational to advanced, XP system, leaderboard, 6 quiz modes. Learn AI — by doing.

**Jogg Mini** is Jogg's younger sibling — AI education for curious kids aged 5 to 15. Six themed worlds (Robot Valley, Data Valley, Pattern Mountain, Smart City, Future Lab, AI Ethics Star), parent dashboard, COPPA compliant, zero ads, zero tracking.

---

## Our Principles

### Privacy First

Your data is yours. We build tools that keep it that way — on your device, under your control.

### Human-Centered Design

We build for real people solving real problems, not to impress investors or chase trends.

### Built to Last

We ship software that works. No endless beta. No rug-pulls. No "we're shutting down in 30 days."

### Accessible to All

Every product has a free tier. Advanced features may require a subscription, but the core is always free to download and use.

### Clarity and Control

Users should clearly understand what a product does, what it stores, and how to upgrade or leave. No dark patterns. No hidden behavior.

### Reliable Fundamentals

Stability and long-term maintainability over trend-driven complexity. We prioritize software that keeps working.

---

## Our Commitment

We are committed to building software that remains useful long after launch:

- Continuous improvement of existing products
- Honest release communication — we say what changed and why
- Better documentation and onboarding quality over time
- Long-term privacy-conscious architecture decisions
- No rug-pulls, no sudden service shutdowns

---

## Company Information

| | |
|---|---|
| **Company Name** | MokingBird Oy |
| **Business ID** | 3615646-1 |
| **Country** | Finland |
| **Email** | [email protected] |
| **Website** | mokingbird.xyz |

---

## Connect With Us

We'll be getting active on social media and genuinely enjoy hearing from users — bug reports, feature ideas, or just to say hello.

- **LinkedIn**: [linkedin.com/company/mokingbird](https://linkedin.com/company/mokingbird)
- **X (Twitter)**: [x.com/mokingbirdxyz](https://x.com/mokingbirdxyz)
- **Instagram**: [instagram.com/mokingbird](https://instagram.com/mokingbird)
- **General**: [[email protected]](mailto:[email protected])
- **Support**: [[email protected]](mailto:[email protected])
# Privacy Policy
**MokingBird Oy** | Business ID: 3615646-1 | Finland
_Last updated: April 2026_ | _Effective date: April 12, 2026_
---
## Introduction
MokingBird Oy ("MokingBird," "we," "our," or "us") is committed to protecting your privacy. This Privacy Policy explains how we collect, use, disclose, and safeguard information across our website (mokingbird.xyz) and our products: MokingBird AI (including mbRAG, mbDataGen, and mbFT), MB Viewer, Sortify, Jogg, and Jogg Mini.
We operate under Finnish law and comply fully with the **European Union General Data Protection Regulation (GDPR)**.
Our core commitment is simple: **we do not sell your data. We do not run ads. We build products designed to keep your information on your device.**
If you have any questions about this policy, contact us at [[email protected]](mailto:[email protected]).
---
## Scope
This policy applies to:
- `mokingbird.xyz` (main website) and all contact/communication flows initiated through it
- General company-level privacy commitments applicable across all MokingBird products
Product-specific privacy behavior is documented in separate product policies (MokingBird AI, MB Viewer, Sortify, Jogg, Jogg Mini). Where a product-specific policy exists, it controls for that product. This policy sets the company baseline.
---
## 1. What Data We Collect
### 1.1 Our Website (mokingbird.xyz and subdomains)
When you visit our websites, we may collect:
- **Server logs** — IP address, browser type, page visited, timestamp. This is standard web server behavior. Logs are retained for up to 30 days for security purposes and then deleted.
- **Contact form submissions** — If you fill out a contact form, we collect your name, email address, and message. We use this solely to respond to you.
- **Newsletter sign-ups** — If you subscribe to our newsletter, we collect your email address. You can unsubscribe at any time.
We aim to use **privacy-friendly, cookieless analytics** (such as Plausible or Cloudflare Web Analytics) to understand traffic patterns without tracking individuals. If our analytics setup changes, we will update this policy.
### 1.2 Desktop Applications (MokingBird AI, MB Viewer)
Our desktop applications collect **no data whatsoever during normal operation**.
- No telemetry
- No crash reports sent to us
- No usage analytics
- No registration required
- All files you open, process, or generate remain on your machine
The only exception is **optional update checks** — if you choose to check for updates, the app makes a network request to our release server to compare version numbers. No personal data is included in this request.
### 1.3 Mobile Applications (Sortify, Jogg, Jogg Mini)
Our mobile apps collect the minimum data required to provide the service:
- **Sortify** — No account required for local use. If you choose to use cloud sync, your sync data is stored in your own cloud storage (Google Drive, OneDrive, Dropbox, or iCloud) — not on MokingBird servers.
- **Jogg** — May collect anonymized progress data for the leaderboard and XP system. No behavioral tracking or profiling.
- **Jogg Mini** — Designed for children aged 5–15. Complies with COPPA and GDPR. See Section 10 for full details on children's data.
---
## 2. How We Use Your Data
We use the data we collect only for the following purposes:
- To respond to inquiries and support requests
- To send you newsletters you have opted into
- To improve our products based on aggregate (non-personal) usage patterns
- To detect and prevent security incidents on our websites
- To comply with legal obligations
We **do not** use your data for:
- Targeted advertising
- Selling or sharing with third parties for commercial purposes
- Building behavioral profiles
- Training AI models on your data or content
---
## 3. Data We Never Collect or Do
To be explicit:
- **We do not sell your data.** Ever. To anyone.
- **We do not run ads** in any of our products.
- **We do not track your behavior** across websites or apps.
- **We do not access the documents or files** you open in our desktop apps.
- **We do not use your data to train AI models.**
- **We do not require an account** to use our core products.
---
## 4. Third-Party Services
We use a small number of third-party services to operate:
| Service | Purpose | Data shared |
|---------|---------|-------------|
| Cloudflare | Website hosting | Server logs (standard) |
| GitHub | Software distribution (download releases) | Download events (anonymized) |
| Cloud storage providers (Google Drive, OneDrive, etc.) | User-initiated sync in Sortify only | Your sync data — governed by the provider's own privacy policy |
When you use cloud LLM providers (OpenAI, Anthropic, etc.) with MokingBird AI, those queries are governed by those providers' privacy policies. MokingBird has no access to your API keys or queries.
---
## 5. Legal Bases for Processing (GDPR)
Where GDPR applies, we process personal data on the following lawful bases:
- **Consent** — newsletter sign-ups, optional analytics
- **Contract or pre-contract communication** — responding to support or business inquiries
- **Legitimate interests** — security monitoring, abuse prevention, service reliability
- **Legal obligation** — where applicable law requires retention or disclosure
---
## 6. Your Rights Under GDPR
As a resident of the EU (or anyone using our services), you have the following rights:
- **Right of access** — Request a copy of any personal data we hold about you.
- **Right to erasure** — Request that we delete your personal data.
- **Right to rectification** — Request that we correct inaccurate data.
- **Right to portability** — Receive your data in a machine-readable format.
- **Right to object** — Object to processing of your data.
- **Right to withdraw consent** — Where processing is based on consent, withdraw it at any time (e.g., unsubscribe from newsletters).
To exercise any of these rights, contact us at [[email protected]](mailto:[email protected]). We will respond within 30 days. You may also file a complaint with your local supervisory authority.
---
## 7. Data Retention
| Data type | Retention period |
|-----------|-----------------|
| Server logs | 30 days |
| Contact form submissions | Until request is resolved, then deleted within 90 days |
| Newsletter subscriptions | Until you unsubscribe |
| Mobile app account data (if any) | Until you delete your account |
---
## 8. International Transfers
MokingBird Oy is based in Finland, within the EU. If infrastructure services we use process data outside the EU/EEA, we apply appropriate safeguards as required by GDPR (including standard contractual clauses or reliance on adequacy decisions where applicable).
---
## 9. Cookies and Similar Technologies
Our websites aim to minimize cookie usage. Where cookies are used, they are for:
- Technical operation and session management
- Security (e.g., CSRF protection)
We do not use third-party advertising cookies. If optional analytics cookies are introduced in the future, we will update this policy and provide cookie controls. Our current direction is to use cookieless, privacy-friendly analytics tools where analytics are needed at all.
---
## 10. Children's Privacy
**Jogg Mini** is specifically designed for children aged 5 to 15.
We take children's privacy extremely seriously. In compliance with **COPPA** (Children's Online Privacy Protection Act) and **GDPR**:
- We do not collect personal information from children under 13 without verifiable parental consent.
- Jogg Mini contains no advertising of any kind.
- Jogg Mini does not connect children to social features that could expose them to strangers.
- Parents can contact us at [[email protected]](mailto:[email protected]) to request deletion of any data related to their child.
- A dedicated **Parent Dashboard** allows parents to monitor usage, control settings, and manage their child's account.
For other products (Sortify, Jogg, MokingBird AI, MB Viewer), our services are not intended for users under 13. If we become aware that a child under 13 has provided us with personal information without parental consent, we will delete it promptly.
---
## 11. Security
We implement appropriate technical and organizational measures to protect your data:
- HTTPS on all websites
- AES-256-GCM encryption in Sortify for synced data
- No central database holding personal data for desktop apps
- Access to any collected data (e.g., contact form submissions) is restricted to authorized personnel only
Despite these measures, no internet transmission or electronic storage is 100% secure. We encourage you to contact us immediately at [[email protected]](mailto:[email protected]) if you believe there has been a security incident.
---
## 12. Changes to This Policy
We may update this Privacy Policy from time to time. When we do, we will update the "Last updated" date at the top of this page. For significant changes, we will post a notice on our website. Your continued use of our services after any changes constitutes your acceptance of the updated policy.
---
## 13. Contact Us
For any privacy-related questions or requests:
**MokingBird Oy**
Business ID: 3615646-1
Finland
- Privacy inquiries: [[email protected]](mailto:[email protected])
- General inquiries: [[email protected]](mailto:[email protected])
- Security issues: [[email protected]](mailto:[email protected])
## About MokingBird Research

MokingBird Research publishes technical articles, architecture deep-dives, and applied research from the engineering work behind our products.

Our research focus areas include:

- **Privacy-preserving synchronization** — how to build collaborative sync systems that never expose plaintext data to the server
- **Synthetic data generation** — practical approaches to generating high-quality training data from private source documents
- **Local-first architecture** — engineering patterns for offline-first, user-controlled software
- **Applied machine learning** — fine-tuning, evaluation, and deployment of language models in production

Research articles are written for technical readers: software engineers, ML practitioners, and architects working on similar problems.

---

## Contact

For research enquiries, collaborations, or technical questions: **[email protected]**

---

## Articles

### [Hybrid GPRO for Synthetic Data Generation](/research/articles/Hybrid-gpro-for-synthetic-data)

*mbDataGen · April 2026*

How MokingBird's mbDataGen uses a Hybrid GPRO reinforcement learning approach to generate high-quality, validated synthetic training data — entirely on your local hardware.

---

### [Sortify Intelligent Synchronization: Privacy-First Distributed Sync](/research/articles/sortify-intelligent-sync)

*Sortify Engineering · May 2026*

How Sortify builds a client-side distributed synchronization engine with end-to-end encryption, offline-first design, and multi-user conflict resolution — without ever sending plaintext workspace data to a server.

---

*MokingBird Oy — Finland*
Ask any ML practitioner what the hardest part of fine-tuning is, and you'll get the same answer: data. Not the model architecture. Not the training loop. The data.
High-quality, domain-specific training data is time-consuming to create manually, expensive to annotate professionally, and hard to find in public datasets — especially for specialized domains like legal, medical, or scientific applications. The data that's available is often noisy, misaligned with your task, or insufficient in volume.
mbDataGen was built to solve this. It generates synthetic training data from your own source documents — clean, validated, and grounded in your actual knowledge base.
---
## Why Synthetic Data?
Synthetic data has a reputation problem it doesn't entirely deserve. The concern is circular generation: if you generate data with a model and then fine-tune that same model on the generated data, you get drift, hallucination amplification, and quality degradation.
This concern is valid for naive approaches — generating thousands of random examples with no validation and feeding them directly into a training loop. It's not inherent to synthetic data as a concept.
The reason synthetic data fails is usually not the generation — it's the validation. Most pipelines skip validation or treat it as an afterthought. mbDataGen treats validation as the core of the product.
---
## The 5-Phase Pipeline
mbDataGen organizes the entire process into five phases:
### Phase 1: Extract
Load your source documents — the knowledge base that generated data will be grounded in. mbDataGen supports 17 document formats: PDF (with multi-engine parsing), DOCX, Excel, CSV, JSON, Markdown, PowerPoint, Email, images with OCR, web content, and more.
During extraction, the system parses structure: sections, headings, tables, code blocks, and relationships between document elements. This structure informs later phases — generated data that understands document structure is more useful than data that treats all text as a flat blob.
### Phase 2: Enrich
Add contextual metadata to extracted content: source attribution, document type, section relationships, entity extraction, topic classification. This enrichment is what allows the provenance system to work — every generated data point can be traced back to specific source passages.
### Phase 3: Generate
Produce candidate data using the GPRO-Hybrid RL approach (described in detail below). For each target data point, generate K=4 candidates and score them. Candidates are structured according to your output schema — instruction-following pairs, MCQ questions with rationales, preference pairs, structured extraction examples, or any custom schema you define.
### Phase 4: Validate
Run every candidate through the 5-stage validator. This is where mbDataGen distinguishes itself — the validation pipeline is not a simple heuristic filter but a multi-stage quality gate that evaluates different dimensions of data quality independently.
### Phase 5: Deploy
Export approved records to your training format. Every output includes an HMAC-signed RunManifest — a cryptographically verifiable record of how each data point was produced. When you need to audit your training data or certify its provenance, the RunManifest provides the chain of custody.
---
## GPRO-Hybrid RL: The Generation Engine
The core of mbDataGen's generation step is **GPRO-Hybrid RL** — an original reward learning approach developed by MokingBird.
Standard data generation with an LLM produces one output per prompt. The quality of that output depends entirely on prompt engineering. There's no mechanism for the system to distinguish a good output from a mediocre one.
GPRO-Hybrid RL changes this by generating **K=4 candidate outputs** for each data point and scoring all of them using a hybrid reward function:
```
Total Reward = 0.7 × Field/Process Reward + 0.3 × Outcome/Overall Reward
```
**Field/Process Reward (70% weight):**
Evaluates each field of the generated output independently. For an MCQ question, this means scoring: Is the question grammatically correct? Is the question answerable from the source? Is the correct answer actually correct? Are the distractors plausible but clearly wrong? Are all required fields present and properly formatted?
Field-level scoring catches micro-quality issues that overall quality scores miss. A data point can look good at a high level while containing a subtly wrong distractor answer or a malformed JSON field.
**Outcome/Overall Reward (30% weight):**
Evaluates the data point holistically. Is this a useful training example? Does it test the right concepts? Is there diversity relative to other generated examples? Would a model that learned from this example be better at the target task?
The K=4 candidates are compared, and the highest-scoring candidate is selected for validation. This process — generating multiple candidates and selecting the best — is a form of rejection sampling with learned scoring, and it reliably produces higher-quality output than single-shot generation.
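The weighting and selection step can be sketched in a few lines of Python. This is a minimal illustration, not mbDataGen's implementation: the `Candidate` type, the idea of averaging per-field scores into a single field reward, and the sample scores are all assumptions made for the example. Only the 0.7 / 0.3 weights and K=4 come from the text.

```python
from dataclasses import dataclass

FIELD_WEIGHT = 0.7    # weight on the field/process reward (from the formula above)
OUTCOME_WEIGHT = 0.3  # weight on the outcome/overall reward


@dataclass
class Candidate:
    """Hypothetical container for one generated candidate and its scores."""
    text: str
    field_scores: list[float]  # one score in [0, 1] per field-level check
    outcome_score: float       # holistic score in [0, 1]


def total_reward(c: Candidate) -> float:
    """Combine the scores with the 0.7 / 0.3 weighting.

    Averaging the field scores is an assumption; the article does not say
    how individual field checks are aggregated.
    """
    field_reward = sum(c.field_scores) / len(c.field_scores)
    return FIELD_WEIGHT * field_reward + OUTCOME_WEIGHT * c.outcome_score


def select_best(candidates: list[Candidate]) -> Candidate:
    """Best-of-K selection: keep the candidate with the highest total reward."""
    return max(candidates, key=total_reward)


# K=4 candidates with made-up scores.
candidates = [
    Candidate("A", [1.0, 1.0, 0.5, 1.0], 0.6),
    Candidate("B", [1.0, 1.0, 1.0, 1.0], 0.5),
    Candidate("C", [0.5, 0.5, 1.0, 1.0], 0.9),
    Candidate("D", [1.0, 0.0, 1.0, 1.0], 0.8),
]
best = select_best(candidates)
```

In this made-up run, candidate B wins even though C has the best holistic score, because the 70% weight on field-level checks penalizes C's two weak fields more than its outcome score can recover.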
---
## The 5-Stage Validator
After generation selects the best candidate, that candidate enters the validation pipeline. Five stages each assess a different quality dimension:
### Stage 1: Schema Validation
Does the output conform to the required schema? Are all required fields present? Are field types correct? Are values within expected ranges?
This catches structural failures — malformed JSON, missing fields, type errors — before they enter the training dataset.
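As a rough illustration of what a structural check like this involves, the sketch below validates a decoded MCQ record against a small schema. The field names (`question`, `options`, `answer_index`, `rationale`) and the error messages are hypothetical; mbDataGen's actual schema format is not documented here.

```python
# Hypothetical MCQ schema: field name -> expected Python type after JSON decoding.
REQUIRED_FIELDS = {
    "question": str,
    "options": list,
    "answer_index": int,
    "rationale": str,
}


def schema_errors(record: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the record passes."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif type(record[name]) is not expected_type:
            # Exact type match, so True/False is not accepted as an int.
            errors.append(f"wrong type for {name}")
    if errors:
        return errors  # range checks only make sense once types are right
    if len(record["options"]) != 4:
        errors.append("expected exactly 4 options")
    elif not 0 <= record["answer_index"] < 4:
        errors.append("answer_index out of range")
    return errors


good = {
    "question": "Which layer normalizes activations?",
    "options": ["Dropout", "LayerNorm", "Softmax", "Embedding"],
    "answer_index": 1,
    "rationale": "LayerNorm normalizes activations across features.",
}
bad_index = {**good, "answer_index": 7}
no_rationale = {k: v for k, v in good.items() if k != "rationale"}
```

Returning a list of errors rather than a boolean makes it easier to log why a record was rejected.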
### Stage 2: Distribution Validation
Does the generated dataset, taken as a whole, match realistic distributions? For classification tasks: are label proportions reasonable? For question generation: is there appropriate coverage across difficulty levels, question types, and topic areas? For instruction-following: is there variety in instruction types and response styles?
Distribution validation catches a subtle failure mode: a dataset that passes all per-example quality checks but is heavily skewed — 90% easy questions, all from one document section — and would produce a model with systematic blind spots.
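A very simple version of such a check is to flag any label whose share of the dataset exceeds a ceiling. The sketch below does only that; the `max_share` threshold of 0.6 and the function name are assumptions for illustration, and a real distribution validator would likely compare against a target distribution rather than a single cap.

```python
from collections import Counter


def skewed_labels(labels: list[str], max_share: float = 0.6) -> dict[str, float]:
    """Return each label whose share of the dataset exceeds max_share."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: count / total
        for label, count in counts.items()
        if count / total > max_share
    }


# A dataset with 90% easy questions, as in the failure mode described above.
difficulty = ["easy"] * 9 + ["hard"]
skew = skewed_labels(difficulty)

# A balanced dataset produces no flags.
balanced = skewed_labels(["easy", "medium", "hard"] * 3)
```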
### Stage 3: Deduplication
Near-duplicate examples in training data waste compute and can cause overfitting. mbDataGen identifies semantic near-duplicates (not just exact matches) and flags them for review or removal.
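The sketch below shows the general shape of near-duplicate flagging. Note the substitution: it uses Jaccard similarity over word trigrams, which only detects lexical overlap, as a dependency-free stand-in for the semantic comparison described above (which would typically rely on embeddings). The function names and the 0.5 threshold are assumptions.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Split text into lowercase word n-grams (a single short tuple if text has fewer than n words)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_near_duplicates(examples: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose shingle sets overlap at or above threshold."""
    sets = [shingles(e) for e in examples]
    return [
        (i, j)
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


examples = [
    "Which planet is closest to the sun",
    "Which planet is closest to the sun today",
    "How does gradient descent update weights",
]
flagged = flag_near_duplicates(examples)
```

The pairwise comparison is quadratic in the number of examples, which is fine for a sketch but would need an index (for example MinHash or an approximate nearest-neighbor search) at dataset scale.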
### Stage 4: Grounding Validation
Can each generated claim be traced back to a source passage? For factual content, can the generated answer be verified against the source document?
This stage is critical for preventing hallucination propagation. A generated example that contains a plausible-but-false claim — if it passes all structural and distribution checks — can introduce false information into a fine-tuned model's behavior. Grounding validation checks generation against source.
### Stage 5: Novelty Validation
Does this data point add value over what already exists in the dataset? If a very similar example already passed validation, is the marginal utility of this one sufficient to include it?
Novelty validation maximizes information density per training example.
**Scoring thresholds:**
- Score ≥ 90%: AUTO_APPROVE
- Score ≥ 70% and below 90%: REQUIRE_REVIEW (human review queue)
- Score < 70%: AUTO_REJECT
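The routing logic implied by these thresholds is small enough to show directly. This sketch assumes the validator produces a single combined score expressed as a fraction in [0, 1], and that fractional scores just under 90% fall into the review band; how the five stage scores are combined into that number is not specified in this article.

```python
APPROVE_THRESHOLD = 0.90  # scores at or above this are approved automatically
REVIEW_THRESHOLD = 0.70   # scores at or above this (but below approve) go to human review


def decide(score: float) -> str:
    """Map a combined validation score in [0, 1] to a routing decision."""
    if score >= APPROVE_THRESHOLD:
        return "AUTO_APPROVE"
    if score >= REVIEW_THRESHOLD:
        return "REQUIRE_REVIEW"
    return "AUTO_REJECT"


decisions = [decide(s) for s in (0.95, 0.895, 0.70, 0.69)]
```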
---
## The RunManifest: Data Provenance
Every dataset exported by mbDataGen includes an **HMAC-signed RunManifest** — a structured metadata document that records:
- Source documents used (with hashes for integrity verification)
- Generation parameters (model, temperature, prompt version, K value)
- Validation scores per stage for each data point
- Timestamp and hardware fingerprint of the generation run
- Selection rationale for each accepted record
The HMAC signature makes tampering detectable: any change to the manifest after signing invalidates the signature when it is checked against the signing key. This matters when:
- Your organization audits AI training data for compliance
- You need to demonstrate that training data was grounded in authorized sources
- You want to reproduce or extend a dataset months later
- You are submitting a model for certification and need to document its training data provenance
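To make the signing idea concrete, here is a minimal sketch using Python's standard `hmac` and `hashlib` modules. The manifest fields, the choice of SHA-256, and the canonical-JSON serialization are assumptions for illustration; the actual RunManifest format and key handling in mbDataGen are not described in this article.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Serialize the manifest deterministically and return a hex HMAC-SHA256 tag."""
    # sort_keys and fixed separators give the same bytes for the same content.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, key: bytes, signature: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"demo-key-not-for-production"  # hypothetical key, for the example only
manifest = {
    "source_hashes": ["3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b"],
    "generation": {"k": 4, "temperature": 0.7},
    "records": 2,
}
signature = sign_manifest(manifest, key)

# Changing any value after signing breaks verification.
tampered = {**manifest, "records": 3}
```

Because HMAC is a symmetric construction, whoever verifies the manifest needs the same secret key that signed it. That suits internal audits; proving provenance to an outside party who should not hold the key would call for an asymmetric signature instead.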
---
## Hardware Requirements and Output Schema
**Minimum hardware:** 6GB VRAM, 16GB RAM
**Recommended:** 8GB+ VRAM, 32GB RAM
**Storage:** Depends on model size and dataset volume
mbDataGen supports any output schema you define. Built-in schemas include:
- **Instruction-following pairs** — `{"instruction": "...", "input": "...", "output": "..."}`
- **MCQ with rationale** — Full Jogg quiz format with 4 options, correct answer, explanation
- **Preference pairs** — `{"prompt": "...", "chosen": "...", "rejected": "..."}` for DPO/ORPO training
- **Structured extraction** — Any custom JSON schema you define
If your target task needs a different format, you define the schema and mbDataGen generates to it.
---
## Use Cases
**Fine-tuning a domain-specific Q&A model.** You have a corpus of 10,000 internal technical documents. Using mbDataGen, generate 50,000 instruction-following pairs grounded in those documents. Fine-tune a base model on the result. The trained model answers questions about your internal systems with accuracy that general models cannot achieve.
**Building an educational AI quiz system.** Define an MCQ schema with question, four options, correct answer, and rationale. Feed in curriculum documents. mbDataGen generates a question bank that covers the curriculum with appropriate difficulty distribution and is validated for factual accuracy against the source.
**Creating preference data for alignment.** Generate instruction-response pairs, then use mbDataGen's comparative generation to create a chosen/rejected pair for each instruction, scoring which response is higher quality. Use the resulting preference dataset for DPO or ORPO fine-tuning.
**Augmenting sparse datasets.** You have 200 real labeled examples in a specialized domain — enough to establish quality but not enough to fine-tune reliably. Use those 200 examples as grounding signals to generate 5,000 validated synthetic examples with the same quality characteristics.
---
## Fully Local, Your Data Stays Yours
mbDataGen generates from your documents on your hardware. Source documents never leave your machine. Generated datasets are written to local files. The RunManifest is a local file.
If you have compliance requirements around training data — where it comes from, what it contains, who can access it — mbDataGen's local-first architecture and provenance system address those requirements by design.
---
## Where DataGen Fits in the Ecosystem
mbDataGen is not isolated tooling. It is the middle layer in a coherent end-to-end flow:
1. **mbRAG** — Retrieve and contextualize information from your source documents
2. **mbDataGen** — Generate structured training data grounded in that retrieved knowledge
3. **mbFT** — Fine-tune a model on the generated dataset to adapt it to your domain
This pipeline reduces the handoff friction between knowledge, data, and model. Instead of three disconnected tools with incompatible formats and separate configuration approaches, the MokingBird Node coordinates all three in one workspace.
---
## Download Free
mbDataGen is available as part of MokingBird AI — free to download.
Full pipeline features (all validation stages, HMAC provenance, all output schemas) are available in the Premium tier.
Download at [ai.mokingbird.xyz](https://ai.mokingbird.xyz).
Modern synchronization is deceptively difficult. At small scale, syncing data across devices seems straightforward: upload local changes, download remote updates, and keep everything "in sync." But the moment an application becomes: * offline-first, * collaborative, * privacy-focused, * encrypted, * multi-device, * and backend-light, synchronization becomes one of the hardest engineering problems in the entire system. At Sortify, we faced this challenge directly. We wanted users to: * fully own their data, * store encrypted workspaces in their own cloud providers, * collaborate across devices and users, * work offline, * and still experience reliable synchronization without destructive data loss. We intentionally chose not to build a traditional centralized sync backend. Instead, we designed a synchronization engine centered around: * encrypted snapshot storage, * client-side merge orchestration, * optimistic concurrency control, * deterministic conflict handling, * and replayable local intent journaling. This article explores the architecture, the failures we encountered, the distributed systems lessons we learned, and the synchronization model we ultimately implemented. --- # The Problem With Traditional Sync Models Most synchronization systems follow one of two architectures. ## Centralized Server Synchronization The most common approach is: * all clients communicate with a central backend, * the backend stores authoritative state, * the backend resolves conflicts, * clients become relatively thin. This model works well for: * real-time collaborative editors, * enterprise SaaS platforms, * heavily coordinated multi-user systems. But it comes with tradeoffs: * infrastructure cost, * operational complexity, * trust requirements, * and major privacy implications. The server often becomes: * the owner of user data, * the merge authority, * and the visibility layer for all collaborative content. That did not align with Sortify's philosophy. 

---

## Pure File Synchronization

At the opposite end is simple file synchronization:

* upload a database file,
* overwrite remote copy,
* download latest version later.

This is simple, but dangerously naive under concurrency.

If two users upload snapshots concurrently:

* later uploads overwrite earlier uploads,
* changes disappear,
* and synchronization becomes non-deterministic.

Initially, our architecture resembled this model more than we wanted to admit.

That became obvious once real-world collaborative edge cases appeared.

---

# Our Constraints

Before designing the synchronization system, we defined several non-negotiable principles.

## 1. User-Owned Data

Sortify users should remain in control of their own storage.

Instead of storing workspace databases on our servers, we integrate with providers such as:

* Google Drive
* Dropbox
* OneDrive

This means:

* users control retention,
* users control deletion,
* users control account access,
* and Sortify itself does not become the permanent owner of workspace content.

---

## 2. End-to-End Encryption

Workspace snapshots are encrypted before upload using **AES-256-GCM** with a 256-bit workspace-specific key and a 96-bit random IV generated per upload.

The encryption key is stored in platform-native secure storage: **iOS Keychain** on Apple devices, **Android Keystore** on Android devices. It never leaves the device and is never transmitted to any server, including MokingBird's own infrastructure.

Cloud providers store opaque encrypted blobs rather than readable workspace content.

This ensures:

* cloud providers cannot inspect workspace data,
* MokingBird cannot inspect workspace contents in the ordinary course of operations,
* synchronization remains privacy-preserving by architecture rather than by policy,
* and user trust boundaries remain clear.

---

## 3. Offline-First Behavior

Users must be able to:

* create rooms,
* move items,
* rename entities,
* update metadata,
* and continue working offline.

Synchronization therefore cannot assume permanent connectivity.

---

## 4. Multi-User Collaboration

Workspaces are collaborative.

Multiple users may:

* edit the same workspace,
* move items simultaneously,
* rename rooms concurrently,
* or sync from multiple devices.

This introduces distributed synchronization challenges even without a traditional backend.

---

# The First Major Failure

Our early synchronization engine used a snapshot-authoritative merge strategy.

The logic was conceptually simple:

* pull remote snapshot,
* compare against local database,
* apply remote state,
* delete rows missing remotely.

This appeared reasonable initially.

But it created a catastrophic edge case.

---

## The "Missing Means Delete" Problem

Suppose the following sequence occurs:

* a user creates several items offline,
* those items exist only locally,
* synchronization runs before the upload succeeds,
* the remote snapshot does not contain those items yet.

The pull engine interpreted this as:

> "these rows do not exist remotely, therefore they should be deleted locally."

As a result:

* items vanished,
* rooms disappeared,
* counters dropped to zero,
* and local state was erased despite never being intentionally deleted.

This revealed a critical distributed systems truth:

> Absence is not deletion.

Especially in eventually consistent systems.

---

# Understanding the Real Problem

The issue was not simply:

* timestamps,
* ordering,
* or retries.

The deeper issue was architectural.

We had combined:

* snapshot synchronization,
* partial journaling,
* and destructive reconciliation semantics.

That combination was unsafe under concurrency.

---

# Reframing the Architecture

We eventually shifted toward a fundamentally different model:

> client-side merge with snapshot publishing guarded by optimistic concurrency.

This became the foundation of the new synchronization engine.

---

# The New Synchronization Model

The final architecture follows this sequence:

1. Read remote metadata/version token.
2. Pull latest snapshot if remote changed.
3. Merge locally using deterministic policies.
4. Replay unsynced local intent journal.
5. Export and encrypt merged snapshot.
6. Upload conditionally using optimistic concurrency tokens.
7. Retry on precondition failure.

This transformed synchronization from:

> blind overwrite synchronization

into:

> distributed optimistic synchronization.

---

# Client-Side Merge With Snapshot Storage

One of the most unusual characteristics of Sortify's synchronization engine is where merge logic happens.

Traditional systems:

* merge on the server.

Sortify:

* merges entirely on the client.

Cloud providers are treated as:

* encrypted blob stores,
* versioned snapshot containers,
* not synchronization authorities.

This architecture preserves privacy while still allowing convergence across clients.

---

# Optimistic Concurrency Control (OCC)

A major architectural upgrade was introducing optimistic concurrency control.

Without OCC:

* two users can upload snapshots simultaneously,
* later uploads overwrite earlier snapshots,
* remote history becomes unstable.

We solved this by using provider-issued version tokens.

Examples include:

* Google Drive revision metadata,
* Dropbox rev identifiers,
* OneDrive etags.

Before uploading a snapshot, Sortify verifies:

> "Is the remote file still the same version I merged against?"

If not:

* upload is rejected,
* the client pulls latest state again,
* merges again,
* and retries safely.

This small change fundamentally transformed synchronization reliability.

---

# Replayable Local Intent Journal

Another critical architectural addition was the local change journal.

Every mutation now:

* updates local tables,
* and appends a journal entry in the same transaction.

Examples include:

* item updates,
* room renames,
* photo changes,
* membership changes,
* tombstone deletes.

The journal represents:

> local user intent not yet safely published remotely.

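
The same-transaction rule can be illustrated with a minimal sketch, assuming a SQLite-backed local store. Everything named here is hypothetical and chosen for illustration only: the `items` table, the journal columns, and the functions `open_db`, `rename_item`, and `unsynced_entries` are not Sortify's actual schema or API. The point it shows is that the row update and the journal append either both commit or both roll back, so the journal can never miss a mutation that reached the local tables.

```python
import json
import sqlite3
import time


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE items (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            updated_at REAL NOT NULL
        );
        CREATE TABLE change_journal (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            entity_id TEXT NOT NULL,
            op TEXT NOT NULL,
            payload TEXT NOT NULL,
            synced INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def rename_item(conn: sqlite3.Connection, item_id: str, new_name: str) -> None:
    """Update the row and append the journal entry in one transaction."""
    now = time.time()
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the table update and the journal append cannot be separated.
    with conn:
        conn.execute(
            "UPDATE items SET name = ?, updated_at = ? WHERE id = ?",
            (new_name, now, item_id),
        )
        conn.execute(
            "INSERT INTO change_journal (entity_id, op, payload) VALUES (?, ?, ?)",
            (item_id, "item_update", json.dumps({"name": new_name, "updated_at": now})),
        )


def unsynced_entries(conn: sqlite3.Connection) -> list:
    """Return journal entries not yet published, in the order they were made."""
    return conn.execute(
        "SELECT seq, entity_id, op, payload FROM change_journal "
        "WHERE synced = 0 ORDER BY seq"
    ).fetchall()
```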

During synchronization:

* remote snapshot changes are applied first,
* then unsynced journal entries are replayed.

This guarantees that local intent is preserved during merge operations.

---

# Tombstone Convergence

Traditional delete behavior creates major synchronization hazards.

Originally, some user-facing deletes physically removed rows from the database.

That approach becomes incompatible with eventual consistency.

Instead, Sortify moved toward tombstone-based convergence.

Rather than deleting rows, a delete sets:

```text
is_active = 0
deleted_at = timestamp
```

Deletes become:

* explicit state transitions,
* synchronizable events,
* and conflict-resolvable operations.

Most importantly:

> remote absence alone is never interpreted as delete intent.

This eliminated the entire class of "items vanished" failures.

---

# Deterministic Merge Policy

Distributed synchronization becomes dangerous when merge outcomes are non-deterministic.

If two clients resolve conflicts differently:

* synchronization never converges.

We therefore implemented deterministic merge policies:

* **Last-Write-Wins by `updated_at`:** the entity with the higher timestamp wins.
* **Lexical tie-breaker on equal timestamps:** when `updated_at` values are identical, the winning version is determined by lexical comparison of `updated_by` (user identifier) or device ID. This comparison is deterministic: every device running the merge on the same two records will produce the same winner.
* **Deletion semantics:** a tombstone (`is_active = 0`) with an `updated_at` timestamp participates in LWW the same way any other update does. An explicit delete can be overridden by a later update, and vice versa, by timestamp.

Even equal timestamps now produce stable, identical results regardless of which device runs the merge. This is subtle, but essential: non-deterministic conflict resolution causes perpetual divergence because two clients can both believe they published the authoritative state while holding different data.

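
The three policies above can be sketched in a few lines. This is an illustration under stated assumptions, not Sortify's merge engine: the `Record` shape, the `device_id` field, the functions `pick_winner` and `merge`, and the choice that the lexically greater `updated_by` wins a tie are all hypothetical (the article only says the comparison is lexical and deterministic). Two properties matter: the winner does not depend on argument order, and a record present on only one side is kept, because absence is not deletion.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Record:
    id: str
    name: str
    is_active: int    # 0 marks a tombstone
    updated_at: int   # assumed to be epoch milliseconds
    updated_by: str   # user identifier
    device_id: str    # hypothetical secondary tie-breaker


def _order_key(rec: Record) -> Tuple[int, str, str]:
    # Timestamp first (Last-Write-Wins), then lexical tie-breakers.
    return (rec.updated_at, rec.updated_by, rec.device_id)


def pick_winner(a: Record, b: Record) -> Record:
    """Return the same winner no matter which side is passed first.

    If every key component is equal, the two versions are assumed to be
    the same write, so returning either is safe.
    """
    return a if _order_key(a) >= _order_key(b) else b


def merge(local: Dict[str, Record], remote: Dict[str, Record]) -> Dict[str, Record]:
    """Non-destructive merge: rows missing on one side are never dropped."""
    merged = dict(local)
    for rec_id, remote_rec in remote.items():
        if rec_id in merged:
            # Tombstones take part in LWW like any other update.
            merged[rec_id] = pick_winner(merged[rec_id], remote_rec)
        else:
            merged[rec_id] = remote_rec
    return merged
```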

---

# Bounded Retry Orchestration

Optimistic concurrency naturally introduces retries.

Consider:

* User A syncs,
* User B syncs milliseconds later,
* both began from the same remote snapshot.

One upload succeeds.

The other receives a precondition failure.

Rather than overwriting:

* Sortify retries automatically,
* re-pulls latest state,
* merges again,
* republishes safely.

Retries are bounded to avoid:

* infinite loops,
* battery drain,
* synchronization storms.

---

# Implementation Phases

The rebuild of the synchronization engine was structured as eight discrete implementation phases, each targeting a specific failure class.

| Phase | Change | Failure Class Addressed |
|---|---|---|
| 1 | Non-destructive pull merge | Delete-on-missing data loss |
| 2 | `workspace_sync_state` persistence | Implicit sync position causing redundant work |
| 3 | OCC tokens on Google Drive, Dropbox, OneDrive | Blind overwrite race conditions |
| 4 | Bounded pull-merge-push retry loop | Unrecovered precondition failures |
| 5 | Complete `change_journal` coverage across all mutation paths | Local intent invisible to merge engine |
| 6 | Deterministic LWW tie-breaker | Non-deterministic conflict divergence |
| 7 | iCloud gated as unsupported for collaborative sync | False concurrency guarantees on iCloud Drive |
| 8 | Tombstone delete routing through `is_active = 0` | Hard-delete rows defeating eventual consistency |

All eight phases are shipped in the current release.

---

# Provider-Aware Conditional Uploads

Different providers expose concurrency semantics differently.

For example:

## Google Drive

### OAuth Scope Decision

Sortify originally requested the `drive.file` OAuth scope, which limits access to files the app itself created. This appears to be the minimum necessary scope and is appropriate for single-user workspaces.

Under `drive.file`, however, collaborative multi-user workspaces fail. When a workspace owner creates the shared workspace folder and invites a collaborator, the collaborator's app cannot access the folder, because the folder was created by the owner's Sortify instance, not the collaborator's. The `drive.file` scope does not permit a second user's app to read or write files it did not create, even if those files are in a folder explicitly shared with the collaborator at the Google Drive level.

Collaboration requires each workspace member's app to read and write the encrypted snapshot in the shared workspace folder regardless of which member created it. This requires the `https://www.googleapis.com/auth/drive` scope.

Sortify uses this scope exclusively for Sortify workspace folders and Sortify sync artifacts. It does not enumerate unrelated Drive content, access files outside the workspace folder hierarchy, or store any data in the user's Drive other than the encrypted workspace blob and associated metadata file.

### Concurrency Semantics

Google Drive synchronization leverages:

* revision-aware file metadata,
* `If-Match` conditional upload semantics (etag-based precondition),
* and persistent remote token tracking in `workspace_sync_state`.

An upload that specifies an etag that no longer matches the current remote file is rejected by Google Drive with a precondition failure. Sortify treats this as a signal to re-pull, re-merge, and retry.

---

## Dropbox

Dropbox exposes `rev` identifiers on every file. Conditional uploads are supported natively: an upload specifying `mode: update` and an expected `rev` value is accepted only if the current remote `rev` matches. If the file has changed since the `rev` was captured, the request is rejected.

This is functionally equivalent to compare-and-swap: upload succeeds only if the file is still at the version the client merged from.

---

## OneDrive

OneDrive exposes `eTag` and `cTag` fields via the Microsoft Graph API. Conditional uploads are supported using `If-Match: <etag>` request headers. An upload that specifies an etag that no longer matches the remote file's current etag is rejected with a precondition failure, triggering the retry loop.

---

## iCloud: Gated as Unsupported for Collaborative Sync

iCloud Drive does not expose the optimistic concurrency primitives that Sortify's sync architecture depends on.

Google Drive, Dropbox, and OneDrive all provide explicit version tokens and conditional upload semantics (etag, rev, If-Match). iCloud Drive does not expose equivalent public API primitives for file-based conditional uploads, making it impossible to implement the compare-and-swap guarantee that prevents concurrent overwrites.

CloudKit (Apple's structured database sync layer) does support record versioning and conflict detection, but adopting it would require a fundamentally different sync path: structured record sync rather than encrypted blob sync, with Apple-specific implementation logic that breaks the provider-neutral architecture.

The current `ICloudService` implementation contains placeholder methods (`_saveRecord`, `_fetchRecord`, `_queryRecords`) that simulate storage behavior without enforcing real server-side version checks. These stubs exist to preserve the provider interface shape but do not constitute a production-safe collaborative sync implementation.

Collaborative workspace flows (workspace creation, invitation acceptance, and multi-device sync) are explicitly gated and blocked on iCloud in the current release.

The design principle behind this decision:

> providing no guarantee is safer than providing a false one.

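
The compare-and-swap behavior shared by the three supported providers, together with the bounded retry loop, can be sketched against an in-memory stand-in. Nothing below calls a real provider API: `InMemoryBlobStore`, `PreconditionFailed`, `publish`, and the `max_attempts` default are hypothetical names and values used only to show the control flow. A provider that cannot implement something like `conditional_write` cannot take part in this loop safely, which is the reasoning behind the iCloud gate.

```python
import itertools
from typing import Callable, Tuple


class PreconditionFailed(Exception):
    """The expected version token no longer matches the remote one."""


class InMemoryBlobStore:
    """Stand-in for a provider that issues a new token on every write."""

    def __init__(self) -> None:
        self._blob = b""
        self._token = "v0"
        self._versions = itertools.count(1)

    def read(self) -> Tuple[bytes, str]:
        return self._blob, self._token

    def conditional_write(self, blob: bytes, expected_token: str) -> str:
        # Same role as If-Match (etag) or an expected rev: reject stale writers.
        if expected_token != self._token:
            raise PreconditionFailed(
                f"expected {expected_token}, remote is {self._token}"
            )
        self._blob = blob
        self._token = f"v{next(self._versions)}"
        return self._token


def publish(
    store: InMemoryBlobStore,
    merge_fn: Callable[[bytes], bytes],
    max_attempts: int = 3,
) -> str:
    """Pull, merge, and conditionally upload, retrying a bounded number of times."""
    for _ in range(max_attempts):
        remote_blob, token = store.read()   # pull the latest snapshot and token
        merged = merge_fn(remote_blob)      # merge happens on the client
        try:
            return store.conditional_write(merged, token)
        except PreconditionFailed:
            continue                        # someone else published first
    raise RuntimeError(f"gave up after {max_attempts} precondition failures")
```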

---

## Provider Support Summary

| Provider | Version Token | Conditional Upload Mechanism | Collaborative Sync |
|---|---|---|---|
| **Google Drive** | revision / etag | `If-Match` header | Supported |
| **Dropbox** | `rev` identifier | `mode: update` + expected `rev` | Supported |
| **OneDrive** | `eTag` / `cTag` | `If-Match` via Microsoft Graph | Supported |
| **iCloud Drive** | Not exposed publicly | Not supported | **Gated (unsupported)** |

---

# Why We Chose Snapshot Synchronization Instead of Delta Server Sync

A natural question is:

> why not upload only incremental changes?

The answer lies in our architectural priorities.

Delta synchronization typically requires:

* server-side coordination,
* change feeds,
* remote merge orchestration,
* and trusted backend infrastructure.

Sortify intentionally avoids centralized synchronization ownership.

By using:

* encrypted snapshots,
* local merge orchestration,
* and optimistic concurrency,

we preserve:

* user-controlled storage,
* end-to-end privacy,
* backend simplicity,
* and offline resilience.

This approach trades:

* bandwidth efficiency,

for:

* privacy,
* architectural independence,
* and deployment simplicity.

For Sortify's goals, that tradeoff was worth it.

---

# The Importance of Sync State Persistence

Another major improvement was introducing durable synchronization state.

We added `workspace_sync_state`, a persisted table with one row per workspace. It tracks:

```text
workspace_id          stable app-level workspace identifier
provider_file_id      provider-side file reference for the encrypted snapshot
provider_metadata_id  provider-side file reference for the metadata file
remote_version_token  etag / rev / revision captured at last successful sync
last_sync_attempt_at  timestamp of most recent sync attempt
last_sync_success_at  timestamp of last successful round-trip
last_error            structured error classification from last failure
retry_count           consecutive failure count for backoff logic
sync_status           current state (idle, syncing, failed, provider_blocked)
```

Before this table existed, sync state was implicit: the engine had to re-discover its position on every cycle by re-reading provider metadata. Errors were only visible in logs. There was no way to distinguish "remote is unchanged, skip the download" from "we have never synced this workspace before."

With `workspace_sync_state`, the engine always knows:

* whether the remote has changed since the last successful merge,
* which version it last merged from (for the OCC precondition),
* and whether the provider is in a blocked/error state requiring backoff.

Synchronization state is now explicit rather than implicit, which is a prerequisite for reliable observability and provider portability.

---

# What the Architecture Looks Like Today

Today, synchronization behaves approximately like this:

1. Resolve remote metadata and version token.
2. Determine whether remote changed.
3. Download and decrypt remote snapshot if needed.
4. Merge remote state into local state.
5. Replay local unsynced journal entries.
6. Export merged encrypted snapshot.
7. Attempt conditional upload.
8. Retry on precondition failure.
9. Persist new synchronization tokens.
10. Mark journal entries synchronized.

This produces eventual consistency without requiring centralized merge infrastructure.

---

# Remaining Challenges

No synchronization system is perfect. Our architecture still has tradeoffs.

## Snapshot Size

Full snapshot uploads are heavier than true delta synchronization.

For small-to-medium collaborative workspaces, this is acceptable.

Larger workspaces may eventually require:

* compression,
* chunked snapshots,
* or incremental export optimization.

---

## Retry Storms

Heavy concurrent synchronization can create repeated retries under rapid edits.

Future improvements may include:

* randomized retry jitter,
* batching,
* smarter scheduling,
* and sync coalescing.

---

## Clock Dependency

Last-Write-Wins still depends on timestamps.

Future improvements could explore:

* logical clocks,
* hybrid logical clocks,
* or operation vectors.

At current scale, timestamp-based convergence remains a practical compromise.

---

## Sync Frequency and Provider Backoff

Sync runs on a configurable interval driven by the user's sync frequency setting (`30s`, `1m`, `5m`, `15m`). The default for new installs is `5m`. The scheduler reads the setting on startup and recreates the timer when the setting changes, without requiring an app restart.

When a provider returns a hard configuration error, such as a 403 from a disabled API endpoint, the sync engine applies a provider-level cooldown (10–15 minutes) rather than retrying on every scheduler tick. This prevents log noise and unnecessary battery/network usage during provider misconfiguration states. The cooldown is tracked in `workspace_sync_state.sync_status` as `provider_blocked` with a backoff expiry.

---

# The Most Important Lesson

The biggest architectural lesson from this journey was simple:

> synchronization correctness is less about "syncing data" and more about preserving user intent safely under concurrency.

That realization changed everything.

---

# Final Thoughts

Building synchronization without a central merge backend is difficult.

Building it while preserving:

* privacy,
* offline support,
* encrypted storage,
* collaboration,
* and user-controlled infrastructure,

is even harder.

But we believe this architecture represents the right balance for Sortify.

Instead of centralizing ownership of user data, we built:

* client-side intelligence,
* deterministic convergence,
* optimistic synchronization,
* and privacy-preserving collaboration.

The result is not merely "cloud sync."

It is a distributed synchronization architecture designed around a simple principle:

> users should not have to surrender ownership of their data to collaborate safely.