How We Indexed 363K Artist Profiles in 12 Weeks

My Favorite Bands started with a clear vision: a platform where music fans could track their favorite artists, discover new ones, and see what people around them were listening to. Think Letterboxd for music: personal, social, and data-rich.

The target was ambitious but intentionally narrow: 12 weeks from concept to a production-ready MVP with clean architecture and enough functionality to validate the product with real users.

The goal was not to build Spotify-scale infrastructure. The goal was to build a technically solid product with a focused feature set, strong foundations, and room to evolve.

Here's what we built, how we built it, and the technical decisions that made the difference between shipping and stalling.

The Problem Space

Building a music discovery platform sounds straightforward until you look closely at the data layer.

Artist metadata is messy. There is no single canonical source. Bands split, rename, release music under side projects, or appear under slightly different spellings across datasets. Genres are subjective. Touring schedules change constantly.

Before writing production code, we needed to answer three questions:

Where does the artist data come from, and how do we keep it current?
How do we make 300K+ records feel instant to users typing into a search box?
How do we build social functionality without prematurely introducing infrastructure complexity?

Those three questions shaped nearly every architectural decision that followed.

Weeks 1 and 2: Discovery Sprint Before Code

The single best decision on this project was refusing to write production code for the first two weeks.

Instead, we ran a structured discovery sprint focused on narrowing scope and validating architecture decisions before implementation.

Days 1 and 2: Mapped the core user journey from signup to the first "aha" moment. Defined the exact flows required for launch: account creation, artist search, follows, and personalized feeds. Everything else was explicitly out of scope.
Days 3 and 4: Designed the data model. Artist, User, Follow, Post, and supporting relationships were modeled before schema implementation began. MongoDB document structures were validated against expected query patterns rather than theoretical normalization.
Day 5: Defined API contracts. Every endpoint, request payload, and response shape was agreed upon before implementation. This allowed frontend and backend development to move independently without blocking.
Week 2: Prototyped the ingestion pipeline using a 10,000-record test dataset. Validated normalization logic, indexing strategy, and search responsiveness early enough to avoid architectural rework later.
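
To make the modeling step concrete, here is a minimal sketch of how a few of those entities could be written down before any schema exists. The field names, the optional fields, and the to_document helper are illustrative assumptions, not the project's actual model. Follow is simplified to artist follows only.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Artist:
    artist_id: str                      # stable ID (assumed to come from the pipeline)
    name: str
    genres: list[str] = field(default_factory=list)
    image_url: Optional[str] = None     # not every artist has every field
    popularity: Optional[float] = None


@dataclass
class Follow:
    user_id: str
    artist_id: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Post:
    post_id: str
    artist_id: str
    body: str
    created_at: datetime


def to_document(record) -> dict:
    """Convert a dataclass to a MongoDB-style dict, omitting fields that are None."""
    return {key: value for key, value in asdict(record).items() if value is not None}
```

Dropping unset fields, rather than storing them as nulls, lets a sparse artist and a richly documented one share a collection without padding.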

That front-loaded investment paid for itself repeatedly. The remaining 10 weeks became execution against a known plan instead of continuous mid-build architecture discovery.

The Data Pipeline: Getting to 363,793 Artists

The core technical challenge was not the social layer. It was the data.

We needed hundreds of thousands of structured artist profiles (names, genres, images, discography links, social accounts, and tour metadata) normalized into a consistent schema and queryable with fast response times.

We built a Python ingestion pipeline that:

Pulled from multiple upstream sources using asynchronous request handling to manage rate limits efficiently without sequential blocking. Sources included music metadata APIs, public datasets, and structured web sources.
Normalized records into a canonical schema using deterministic normalization rules and fuzzy matching heuristics to resolve duplicate artists, spelling variations, inconsistent genre naming, and incomplete fields.
Computed derived metadata at ingest time, including popularity scores, genre clusters, and similar-artist relationships. Precomputing these values simplified query logic and reduced runtime overhead.
Loaded incrementally into MongoDB using stable artist IDs and idempotent upserts, allowing the pipeline to be rerun safely without creating duplicates.
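
The normalization and idempotent-load steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production pipeline. The 0.9 similarity threshold, the hash-based ID scheme, and the source, source_id, and name_normalized field names are all hypothetical.

```python
import hashlib
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Deterministic rule: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def is_probable_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy heuristic: compare the similarity ratio of the normalized names."""
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold


def stable_artist_id(source: str, source_id: str) -> str:
    """The same upstream record always maps to the same ID, so reruns are safe."""
    digest = hashlib.sha1(f"{source}:{source_id}".encode("utf-8")).hexdigest()
    return digest[:16]


def build_upsert(record: dict) -> tuple[dict, dict]:
    """Return a (filter, update) pair for update_one(filter, update, upsert=True)."""
    artist_id = stable_artist_id(record["source"], record["source_id"])
    fields = {
        "name": record["name"],
        "name_normalized": normalize_name(record["name"]),
        "genres": sorted({g.strip().lower() for g in record.get("genres", [])}),
    }
    return {"_id": artist_id}, {"$set": fields}
```

Because the filter targets a deterministic _id and the update uses $set, applying the same record twice leaves one document in the same state. That property is what makes a rerun safe.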

By the end of week 6, the database contained 363,793 indexed artist profiles.

The pipeline runs on a schedule and refreshes artist metadata automatically.
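
Each refresh has to pull from upstream sources concurrently without overwhelming them. One common way to do that with asyncio is a semaphore that caps the number of requests in flight, sketched below. The fetch_one callable and the limit of 5 are assumptions for illustration. A semaphore bounds concurrency rather than requests per second, so a source with a strict per-second quota would need additional throttling.

```python
import asyncio


async def fetch_all(ids, fetch_one, max_concurrent=5):
    """Fetch every ID concurrently, with at most max_concurrent requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(artist_id):
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await fetch_one(artist_id)

    # gather returns results in the same order as the input IDs.
    return await asyncio.gather(*(bounded(artist_id) for artist_id in ids))


async def fake_fetch(artist_id):
    """Stand-in for a real HTTP call to an upstream metadata API."""
    await asyncio.sleep(0.01)
    return {"source_id": artist_id, "name": f"Artist {artist_id}"}
```

A call such as asyncio.run(fetch_all(["1", "2", "3"], fake_fetch)) returns one record per ID, in input order.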

The Indexing Strategy

363K documents is not large by database standards, but index design still matters when users expect search to feel instantaneous.

We designed indexes around actual access patterns rather than around the schema itself.

Indexed normalized artist-name fields supported fast autocomplete-style lookups while avoiding expensive regex scans.
Genre and popularity compound indexes optimized discovery queries like "popular jazz artists" into efficient indexed reads.
User and timestamp compound indexes on follow and activity data supported feed generation efficiently on the most read-heavy paths in the system.
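
As a rough sketch, the three index families above could be declared as data and applied at startup. The collection and field names here are assumptions. The ensure_indexes helper expects a synchronous PyMongo-style database object; with Motor, create_index would need to be awaited. The prefix filter shows one way to express "starts with" as a range, which an ordinary ascending index can serve.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / pymongo.DESCENDING

# Collection name -> list of index key specifications (field names are assumed).
INDEXES = {
    "artists": [
        [("name_normalized", ASCENDING)],
        [("genres", ASCENDING), ("popularity", DESCENDING)],
    ],
    "follows": [
        [("user_id", ASCENDING), ("created_at", DESCENDING)],
    ],
}


def ensure_indexes(db) -> None:
    """Create each index; re-creating an identical index is a no-op in MongoDB."""
    for collection_name, specs in INDEXES.items():
        for keys in specs:
            db[collection_name].create_index(keys)


def prefix_filter(prefix: str) -> dict:
    """Range filter for 'name starts with prefix' on the normalized name field."""
    normalized = " ".join(prefix.lower().split())
    # "\uffff" sorts after any other Basic Multilingual Plane character.
    return {"name_normalized": {"$gte": normalized, "$lt": normalized + "\uffff"}}
```

Passing prefix_filter("radio") to find() on the artists collection, with a small limit, would give an autocomplete-style lookup bounded by the index range.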

At beta-scale traffic, most indexed reads executed in single-digit milliseconds, and end-to-end responses, including API and rendering time, typically stayed well under 100ms.

The Application Architecture

Backend: Python and FastAPI

FastAPI was the right backend framework for two reasons.

First, it is async-native. Database queries, external API calls, and other I/O-heavy operations can be awaited without blocking the event loop, so a single worker keeps serving other requests while one waits on I/O.

Second, automatic OpenAPI generation gave the frontend a live API specification directly from the application code without requiring additional tooling or documentation maintenance.

MongoDB access used Motor, the async MongoDB driver, throughout the application stack.

That mattered because many endpoints required multiple concurrent queries (artist data, related artists, recent activity, and user relationship state) all within a single request.
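
The pattern looks roughly like the sketch below, which uses asyncio.gather to run four independent lookups at once. The repo object and its method names are hypothetical. FakeRepo is an in-memory stand-in so the example runs without a database; a real version would await Motor queries in those methods.

```python
import asyncio


async def load_artist_page(artist_id, user_id, repo):
    """Run the four independent lookups concurrently and assemble one response."""
    artist, related, activity, is_following = await asyncio.gather(
        repo.get_artist(artist_id),
        repo.get_related(artist_id),
        repo.get_recent_activity(artist_id),
        repo.is_following(user_id, artist_id),
    )
    return {
        "artist": artist,
        "related": related,
        "activity": activity,
        "is_following": is_following,
    }


class FakeRepo:
    """In-memory stand-in for a data-access layer backed by Motor."""

    async def get_artist(self, artist_id):
        await asyncio.sleep(0.01)
        return {"_id": artist_id, "name": "Example Artist"}

    async def get_related(self, artist_id):
        await asyncio.sleep(0.01)
        return ["related-1", "related-2"]

    async def get_recent_activity(self, artist_id):
        await asyncio.sleep(0.01)
        return []

    async def is_following(self, user_id, artist_id):
        await asyncio.sleep(0.01)
        return True
```

With the lookups running concurrently, the total wait is close to that of the slowest query rather than the sum of all four.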

Frontend: Next.js and React

Next.js provided server-rendered public artist pages for SEO while still allowing highly interactive client-side features for feeds, follows, and user interactions.

The App Router architecture let us progressively stream UI rendering so users saw page structure immediately while slower data loaded asynchronously in the background.

That approach significantly improved perceived responsiveness without introducing unnecessary frontend complexity.

Database: MongoDB

MongoDB was a strong fit for the artist data model because artist documents varied significantly in structure and completeness.

Some artists contained extensive metadata. Others had only basic information. The document model allowed each record to store exactly the data it possessed without introducing large numbers of nullable columns or complex relational workarounds.

Infrastructure: AWS, Docker, and Nginx

The backend was containerized from the beginning.

Development used Docker Compose locally while production deployments ran Docker containers on AWS EC2 instances behind Nginx.

Using effectively the same runtime environment across development and production dramatically reduced deployment inconsistencies and environment-specific debugging.

The deployment pipeline runs automatically on every push to main: tests, build, image publish, and deployment.

A reviewed change merged to main is typically in production within minutes.

The Social Layer: Feeds Without Premature Complexity

Social features are where platforms often become unnecessarily complex early.

Large-scale social platforms eventually evolve toward fan-out architectures, queue systems, and heavily precomputed feeds.

For an MVP, that would have been the wrong tradeoff.

We deliberately optimized for:

simple, correct, maintainable, and fast enough.

The feed is generated synchronously at request time from indexed follow queries. For a user following a few hundred artists, generating the most recent feed items is a straightforward indexed read operation that executes comfortably within MVP-scale performance requirements.
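
A minimal sketch of that read path follows, assuming posts carry an artist_id and a created_at timestamp. The field names, the default limit of 20, and both helper functions are illustrative assumptions. feed_query describes the filter, sort, and limit that would be passed to the database. build_feed is an in-memory equivalent that is convenient for unit tests.

```python
def feed_query(followed_ids: list[str], limit: int = 20) -> dict:
    """Describe the read: posts by followed artists, newest first, capped."""
    return {
        "filter": {"artist_id": {"$in": followed_ids}},
        "sort": [("created_at", -1)],
        "limit": limit,
    }


def build_feed(user_id, follows, posts, limit=20):
    """In-memory equivalent of the two reads: look up follows, then recent posts."""
    followed = {f["artist_id"] for f in follows if f["user_id"] == user_id}
    matching = [p for p in posts if p["artist_id"] in followed]
    return sorted(matching, key=lambda p: p["created_at"], reverse=True)[:limit]
```

Nothing is precomputed or queued here. The feed is always derived from current follow and post data, which keeps correctness easy to reason about at the cost of doing the work on every request.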

At significantly larger scale, a different architecture would make sense.

At MVP scale, the simpler architecture shipped faster, was easier to debug, and gave us more time to validate whether users actually wanted the product.

This became a recurring principle throughout the project:

Build for the next 18 months, not the next 10 years.

AI-Accelerated Development

AI tooling played a meaningful role during development, but not as a replacement for engineering judgment.

Its biggest impact was reducing repetitive implementation overhead and speeding up iteration cycles.

We used Claude primarily for:

Boilerplate generation: API scaffolding, TypeScript interfaces, and repetitive implementation patterns.
Pipeline iteration: Refining normalization logic and quickly testing alternate approaches to data cleanup and transformation.
Test generation: Creating baseline unit tests and identifying unhandled edge cases during iteration.

The productivity gain was substantial, especially on repetitive or low-risk implementation work.

We estimate that AI tooling cut roughly three weeks of calendar time from the build schedule. It did so by reducing friction throughout development, not by writing application logic autonomously.

What Shipped in 12 Weeks

By the end of week 12, the production MVP included:

363,793 indexed artist profiles with normalized metadata and fast search
User accounts with authentication, profiles, and settings
Follow system with follower/following relationships
Personalized feeds driven by followed artists and users
Artist discovery by genre, popularity, and related artists
Artist pages with bios, links, images, and metadata
Automated CI/CD pipeline from commit to deployment
Monitoring and logging for uptime, errors, and application health

The product was stable, demo-ready, and positioned for ongoing iteration.

Lessons From the Build

1. Discovery before implementation saves enormous time later

Two weeks of architecture and scope definition prevented months of refactoring and feature drift.

Every serious project we run now begins with a structured discovery phase.

2. Index around access patterns, not theoretical schema purity

The query patterns were known before the application launched.

Designing indexes around real user behavior early made performance predictable instead of reactive.

3. Build for realistic scale horizons

The synchronous feed model was the right architectural decision for the stage of the product.

Simple systems ship faster, are easier to maintain, and preserve time for validating the business before scaling infrastructure prematurely.

4. AI tooling is a multiplier for experienced engineers

The value came from accelerating execution, not replacing technical decision-making.

Used correctly, AI tooling reduces friction and compresses iteration cycles.

5. At MVP stage, maintainability and iteration speed matter more than theoretical scale

The codebase should make future iteration easier, not force a rewrite six months later.

That principle shaped nearly every technical decision in the project.

If you're building a data-heavy consumer platform and need to move from concept to production quickly without sacrificing technical quality, let's talk.