How We Built an AI That Understands Your Practice

Joseph Frantz
The core technical challenge of AI time tracking is classification: given a signal (an email, a calendar event, a browser session), determine which client and matter it belongs to, what type of work it represents, and how long it took.

This sounds straightforward until you consider the reality of how professional services firms actually work. A single attorney might handle 40 active matters across 15 clients. The same opposing counsel might appear on three different cases. An email thread might span two matters. A calendar event titled “Call with Sarah” could be a client meeting, an internal sync, or a personal appointment.

Getting this right at 85–95% accuracy is what makes passive time capture viable. Here’s how we approached it.

The signal pipeline

Every input source — email, calendar, browser, desktop app, Chrome extension — feeds into a unified signal pipeline. Each signal arrives as a structured event with metadata: sender, recipients, subject, timestamp, duration, URL, application name.

The pipeline normalizes these signals into a common format before classification. An email from Gmail and an email from Outlook look identical by the time they reach the classifier. A Zoom meeting and a Teams meeting are both “video call with attendees X, Y, Z for N minutes.”

This normalization layer is critical. It means the classification engine doesn’t need to understand 15 different input formats — it understands one, and the pipeline handles the translation.
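The idea can be sketched with a small, illustrative example. The `Signal` schema and the normalizer functions below are invented for this post, not TimeSentry's actual data model; they show how source-specific payloads collapse into one format before classification.

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    """Hypothetical common signal format (field names are illustrative)."""
    source: str            # "email", "calendar", "browser", ...
    actor: str             # the timekeeper this signal belongs to
    participants: list     # normalized (lowercased) email addresses
    subject: str
    timestamp: float       # epoch seconds
    duration_min: int
    extra: dict = field(default_factory=dict)  # URL, app name, etc.

def normalize_gmail(msg: dict) -> Signal:
    """Translate a Gmail-style payload into the common format."""
    return Signal(
        source="email",
        actor=msg["owner"],
        participants=[a.lower() for a in msg["to"] + [msg["from"]]],
        subject=msg["subject"],
        timestamp=msg["ts"],
        duration_min=0,
    )

def normalize_zoom(evt: dict) -> Signal:
    """A Zoom meeting becomes 'video call with attendees for N minutes'."""
    return Signal(
        source="calendar",
        actor=evt["host"],
        participants=[a.lower() for a in evt["attendees"]],
        subject=evt["topic"],
        timestamp=evt["start"],
        duration_min=evt["minutes"],
    )
```

By the time a `Signal` reaches the classifier, it no longer matters whether it started life in Gmail, Outlook, Zoom, or Teams.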

The classification stack

Classification runs in three stages, each adding context:

Stage 1: Entity resolution. Before we can classify the work, we need to know who’s involved. The system maintains a graph of known entities — clients, contacts, opposing counsel, co-counsel, internal team members — built from the firm’s practice management system and enriched over time from observed communications. When an email arrives from “sarah.chen@example.com,” entity resolution maps that address to a known contact, links it to one or more client/matter relationships, and passes those candidates to the next stage.

Stage 2: Matter classification. With entity context established, the AI model evaluates the full signal against the candidate matters. It considers the subject line, the body content (for email), the attendee list (for meetings), the URL patterns (for browser activity), and the timekeeper’s recent work history. The output is a ranked list of matter candidates with confidence scores.

Stage 3: Activity classification. The final stage determines the billing category — research, correspondence, drafting, review, conference, etc. This is informed by both the signal type and the content. An email with attachments mentioning “attached draft” maps to drafting or review. A calendar event with multiple external attendees maps to conference. A long browser session on a legal research platform maps to research.
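The heuristics above reduce to a rule table. This is a simplified sketch of that mapping, not the production classifier, which also weighs content features:

```python
def classify_activity(signal: dict) -> str:
    """Rule-of-thumb activity mapping described above (illustrative)."""
    if signal["type"] == "email" and "attached draft" in signal.get("text", "").lower():
        return "drafting/review"
    if signal["type"] == "calendar" and signal.get("external_attendees", 0) > 1:
        return "conference"
    if signal["type"] == "browser" and "research" in signal.get("domain", ""):
        return "research"
    return "correspondence"  # default for plain messages
```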

Each stage is independently testable and independently improvable. When accuracy drops on entity resolution, we can diagnose and fix it without touching the activity classifier.

The learning loop

Static models don’t work for this problem. Every firm has different naming conventions, different matter structures, different workflows. A model trained on one firm’s data would perform poorly at another.

TimeSentry’s classification engine maintains a per-company learning layer that adapts to each firm’s specific patterns. When a timekeeper corrects a mapping — moving an entry from Matter A to Matter B — that correction feeds back into the model as a training signal. Over the first few weeks of use, the system rapidly converges on each firm’s specific conventions.
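One way to picture the per-firm layer is as a store of per-sender priors that corrections nudge over time. The class below is an invented sketch of that feedback loop, not TimeSentry's implementation; the counts act as training signals the classifier can blend into its scores.

```python
from collections import defaultdict

class FirmLearningLayer:
    """Per-firm adaptation sketch: corrections shift a per-sender prior."""

    def __init__(self):
        # sender -> matter -> accumulated training signal
        self.prior = defaultdict(lambda: defaultdict(int))

    def record_correction(self, sender: str, wrong: str, right: str):
        """A timekeeper moved an entry from `wrong` to `right`."""
        self.prior[sender][right] += 2   # reinforce the corrected matter
        self.prior[sender][wrong] -= 1   # penalize the mistaken one

    def best_guess(self, sender: str):
        counts = self.prior.get(sender)
        if not counts:
            return None  # no history yet for this sender
        return max(counts, key=counts.get)
```

After a handful of corrections, `best_guess` converges on the firm's actual convention for that sender, which is the behavior the prose describes at a much larger scale.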

This is also why we built bulk remap. During onboarding, when the model is still learning and error rates are highest, timekeepers can select 20 misclassified entries, reassign them in one action, and generate 20 training signals simultaneously. The correction velocity in the first week directly determines how quickly the model reaches production accuracy.
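Mechanically, bulk remap is a single action that fans out into one training signal per entry. A minimal sketch (entry fields are illustrative):

```python
def bulk_remap(entries: list, target_matter: str) -> list:
    """Reassign N misclassified entries in one action.

    Each reassignment doubles as a training signal
    (entry_id, old_matter, new_matter) for the learning layer.
    """
    signals = []
    for entry in entries:
        old = entry["matter"]
        entry["matter"] = target_matter
        signals.append((entry["id"], old, target_matter))
    return signals
```

Selecting 20 entries and remapping them yields 20 corrections in one pass, which is exactly the first-week correction velocity the prose describes.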

Explainability as a feature

We made a deliberate engineering decision early on: every classification must be explainable. Click any AI-generated time entry and you can see the full reasoning chain — which signals contributed, what confidence score each candidate received, why the winning classification won.

This wasn’t just a UX decision. It was an engineering constraint that shaped the entire architecture. We couldn’t use opaque models that produce accurate results without interpretable reasoning, because attorneys need to trust and verify every entry that goes on a bill.

The explainability requirement pushed us toward a hybrid approach: structured feature extraction combined with language model reasoning. The structured features (entity graph lookups, historical pattern matching, rule-based heuristics) provide the interpretable backbone. The language model handles the ambiguous cases — the emails that could belong to two matters, the meetings with mixed-purpose attendees — and surfaces its reasoning in natural language.
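The reasoning chain attached to each entry can be modeled as a plain data structure. The fields below are illustrative, but they capture the three things the prose says every entry must expose: contributing signals, per-candidate confidence, and why the winner won.

```python
from dataclasses import dataclass

@dataclass
class Explanation:
    """Reasoning chain attached to an AI-generated entry (fields illustrative)."""
    contributing_signals: list   # e.g. ["subject match", "attendee overlap"]
    candidate_scores: dict       # matter -> confidence
    rationale: str               # human-readable summary

def explain(entry_signals: list, scores: dict) -> Explanation:
    """Build a natural-language rationale from the candidate scores."""
    winner = max(scores, key=scores.get)
    ordered = sorted(scores.values(), reverse=True)
    margin = ordered[0] - (ordered[1] if len(ordered) > 1 else 0.0)
    return Explanation(
        contributing_signals=entry_signals,
        candidate_scores=scores,
        rationale=f"'{winner}' won by a margin of {margin:.2f}",
    )
```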

Where accuracy breaks down

We’re transparent about the cases that are still hard:

  • New clients with no history. The learning layer needs signal to learn from. The first few entries for a brand-new client will have lower confidence until the model has seen enough examples.
  • Multi-matter communications. An email thread that genuinely spans two matters is ambiguous even to a human reviewer. The system flags these as low-confidence for manual review rather than guessing.
  • Personal vs. professional. The boundary between personal and professional activity isn’t always clear from metadata alone. We err on the side of not capturing entries where the signal is ambiguous, rather than populating timesheets with non-billable noise.

These edge cases are where the review step matters most. TimeSentry is designed to get the clear cases right automatically and surface the ambiguous cases for human judgment. That’s the appropriate division of labor between AI and professional expertise.
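That division of labor reduces to a routing decision on confidence. The cutoff below is an invented placeholder, not a product setting:

```python
def route(confidence: float, auto_threshold: float = 0.85) -> str:
    """Clear cases post automatically; ambiguous cases go to human review."""
    return "auto" if confidence >= auto_threshold else "review"
```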


The classification engine improves every week as we process more signals and incorporate more feedback. See the latest accuracy metrics and capabilities in our changelog.
