Methodology — placement.solutions

The sourcing, normalization, scoring, and auditing rules behind the placement.solutions legal-hiring data index. Every record we publish is meant to be reproducible from its sources. This essay is the contract.

Abstract

The placement.solutions index is a structured, daily-refreshed view of legal-sector hiring activity in the United States, covering roughly 1,200 firm dossiers and a posting ledger in the low five figures of active openings on any given day. We measure four object classes: postings, lateral moves, firm dossiers, and candidate dossiers. Every record is published with a confidence rating, a source URL, and the timestamps that bracket when we first observed and last verified it. The point of the rigor is that a buyer, a journalist, a litigation funder, or an underwriter can audit the record without permission. Each section below specifies one part of how the index is assembled. Where we still have open questions we say so, and where industry input would improve the system we ask for it.

01 · Data sourcing

Every record originates from one of four source classes, listed below in descending order of evidentiary weight (not volume). Source class is part of the schema and carries into every downstream artifact.

Public-record filings. State bar admissions and reciprocity grants, federal and state court appearance records, pro hac vice filings, and corporate-governance filings that name in-house counsel. The slowest-moving class and the most decisive. A bar admission only updates when an attorney has submitted to the discipline of a jurisdiction; a docket appearance only updates when the lawyer has signed a document for an actual client in an actual matter. We capture filings from all 50 state bars, the federal district and circuit courts, and the administrative bodies that maintain practitioner rolls (USPTO, Tax Court, and the immigration bar). Latency runs one to four days for federal dockets, two to six weeks for state bar rolls.

Career-page feeds. Hiring channels operated by firms on their own canonical infrastructure. We harvest directly from each firm's own published feed, never through an aggregator that strips provenance. Every active feed maps to exactly one firm we have a dossier for; when a firm switches its underlying career-page system we re-fingerprint and re-attach the feed so its identity travels with the firm. Latency is real-time on creation and two to seven days on takedown. The index currently pulls from approximately 1,775 distinct feeds across the AmLaw 200 and a deliberate selection of mid-market firms with between 25 and 200 attorneys.

Public press and announcements. Firm newsroom posts, partner-promotion press releases, lateral-arrival notices, and the trade-press hits those announcements seed in the same news cycle. ALM publications, Bloomberg Law, Law360, Reuters Legal, and the regional legal trade press are the spine of this class, supplemented by firm-operated newsrooms and bar-section publications. Latency is hours. We dedupe by canonicalizing the announcement URL and matching person-plus-firm-plus-event-date inside a 14-day window.
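The announcement dedup rule can be sketched as follows. This is a minimal illustration rather than the production matcher: the tracking-parameter prefix, the record field names (url, person, firm, event_date), and the choice to treat either a canonical-URL match or a person-plus-firm match inside the window as sufficient are all assumptions.

```python
from datetime import date
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes carry no identity.
TRACKING_PREFIXES = ("utm_",)


def canonicalize_url(url: str) -> str:
    """Normalize scheme, host case, 'www.', trailing slash, and tracking params."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Force https and drop the fragment so equivalent links compare equal.
    return urlunsplit(("https", host, path, urlencode(sorted(query)), ""))


def same_announcement(a: dict, b: dict, window_days: int = 14) -> bool:
    """True when two observations plausibly describe the same announcement."""
    if canonicalize_url(a["url"]) == canonicalize_url(b["url"]):
        return True
    same_subject = (a["person"], a["firm"]) == (b["person"], b["firm"])
    gap = abs((a["event_date"] - b["event_date"]).days)
    return same_subject and gap <= window_days
```

A trade-press write-up on a different URL, dated nine days after the firm's own post about the same person and firm, would match under this sketch; one dated a month later would not.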

Conference and association rosters. CLE faculty lists, panel rosters, named honorees, bar-section leadership announcements, and similar publicly published professional records. A lawyer named to next spring's faculty has been placed in a category by people who work alongside them, which makes this a strong signal for sub-practice membership even when it is silent on employer or seniority.

Refresh runs in a single coordinated window between 04:00 and 05:00 UTC daily. By midnight Eastern, every announcement from the prior business day has either landed or been embargoed into the next cycle. A customer reading at 7 AM Eastern is reading numbers that already include the previous evening's promotions, press releases, and bar updates. Earlier and we ship stale numbers; later and we cut into the morning the customer uses the data.

What we do not source from: broker-only feeds that decline to disclose origin, anything behind an authentication wall, and records that depend on private compensation negotiations or unpublished placement files. The rule costs us coverage and buys us auditability.

02 · Schema and entity resolution

Records are typed and normalized to a unified schema before reaching a customer. Practice areas resolve to a 47-node taxonomy that is mutually exclusive at the leaf and collectively exhaustive across the AmLaw 200 and NLJ 500. The taxonomy paper, which documents every node, the mapping pipeline, and the audit regime, is published at our research desk. Its SemVer string travels in every record's response payload so a buyer building joins can pin to a version.

Firm dossiers. Each of the 1,200 firms we cover carries a dossier that is the canonical entity for everything else: legal name, common name, prior names through mergers, headquarters and branch offices with mapped state codes, AmLaw or NLJ rank where applicable, attorney-count band, firm-named practice groups, canonical career-page URL, public newsroom URL, and a structural-pattern flag for firms operating under predecessor names after a 2024-or-later merger. Firms outside the dossier set still appear in postings and moves, but at a lower confidence floor.

Candidate-side normalization. Candidate dossiers normalize to a parallel schema: name, jurisdiction admissions, practice-area history (mapped to the same 47-node taxonomy), seniority band, and an attached audit array. Candidate dossiers exist only for individuals who have consented to inclusion or are public figures whose practice information is already in the public record (named partners, bar-section leadership, court-appearance records). We do not assemble dossiers from scraped social profiles.

Entity resolution. The hardest problem in legal-hiring data is the same person and the same firm appearing under three names in three sources within the same week. We resolve entities through a deterministic-then-probabilistic pipeline. Deterministic matchers run first: bar number for attorneys, EIN for firms where available, exact name plus admission date for jurisdictional records, exact URL plus 14-day window for announcements. Probabilistic matchers handle the remainder using a blocked Jaro-Winkler comparison on names plus secondary attributes (firm, practice area, market, year of admission). Thresholds are class-specific: high for jurisdictional records, more lenient for press-release adjacency.
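A deterministic-then-probabilistic matcher of this shape might look like the sketch below. It is illustrative only: the blocking key (surname initial plus admission year), the record field names, and the threshold passed in by the caller are assumptions, and the Jaro-Winkler routine is a plain textbook implementation rather than the production comparator.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def block_key(rec: dict) -> tuple:
    # Hypothetical blocking key: surname initial plus year of admission.
    return (rec["name"].split()[-1][:1].upper(), rec["admit_year"])


def same_attorney(a: dict, b: dict, threshold: float) -> bool:
    # Deterministic stage: an identical bar number settles the question.
    if a.get("bar_number") and a.get("bar_number") == b.get("bar_number"):
        return True
    # Probabilistic stage: only compare names inside the same block.
    if block_key(a) != block_key(b):
        return False
    return jaro_winkler(a["name"].upper(), b["name"].upper()) >= threshold
```

Blocking keeps the pairwise comparison count manageable; the class-specific thresholds described above would be supplied per source class through the threshold argument.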

Every resolved entity is assigned a stable canonical ID. Canonical IDs survive renames and partial reorganizations, and they stay resolvable through mergers. When a firm merges, the resulting firm gets a new canonical ID and the predecessor IDs link to it through a successor relationship in the dossier; queries against a predecessor still resolve, with the successor explicit in the response.
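Successor resolution after a merger can be pictured with a small lookup. The IDs, the SUCCESSOR table, and the response field names below are hypothetical; the sketch only shows how a query against a predecessor ID could resolve while keeping the successor explicit.

```python
# Hypothetical successor links: two predecessor firms merged into firm_c.
SUCCESSOR = {"firm_a": "firm_c", "firm_b": "firm_c"}


def resolve(canonical_id: str) -> dict:
    """Follow successor links until the current entity is reached."""
    chain = [canonical_id]
    while chain[-1] in SUCCESSOR:
        nxt = SUCCESSOR[chain[-1]]
        if nxt in chain:
            raise ValueError(f"successor cycle at {nxt}")
        chain.append(nxt)
    return {
        "requested_id": canonical_id,
        "current_id": chain[-1],
        "successor_chain": chain[1:],  # empty when the ID is already current
    }
```

Calling resolve("firm_a") returns firm_c as the current ID with the chain preserved, while resolve("firm_c") returns an empty chain.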

Deduplication. Postings dedupe on a content fingerprint plus a temporal window. Two postings with the same firm, title (stripped of office suffixes and seniority shorthand), primary practice node, and office location, posted within 21 days, collapse to one record with both source URLs preserved. Lateral-move announcements use a stricter 14-day window and require firm-of-arrival agreement; trade-press write-ups of a firm-newsroom announcement attach as corroborating signals rather than duplicates.
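The posting rule can be sketched as a fingerprint plus a rolling window. Everything specific here is an assumption: the title-normalization patterns, the field names, and the decision to measure the 21 days from the most recent posting already in the cluster.

```python
import hashlib
import re
from datetime import date


def normalize_title(title: str) -> str:
    """Strip office suffixes and seniority shorthand (illustrative patterns)."""
    t = title.lower()
    t = re.sub(r"\s*\(.*?\)\s*$", "", t)    # trailing "(Chicago)"
    t = re.sub(r"\s+[-|]\s+.*$", "", t)     # trailing " - Chicago office"
    t = re.sub(r"\b(senior|junior|sr|jr|mid-level)\b\.?", "", t)
    return " ".join(t.split())


def fingerprint(p: dict) -> str:
    key = "|".join(
        [p["firm_id"], normalize_title(p["title"]), p["practice_node"], p["office"]]
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def collapse(postings: list, window_days: int = 21) -> list:
    """Merge same-fingerprint postings that fall inside the window."""
    latest = {}  # fingerprint -> most recent cluster for that fingerprint
    out = []
    for p in sorted(postings, key=lambda p: p["posted"]):
        fp = fingerprint(p)
        cur = latest.get(fp)
        if cur and (p["posted"] - cur["last_posted"]).days <= window_days:
            cur["source_urls"].append(p["url"])  # keep both source URLs
            cur["last_posted"] = p["posted"]
        else:
            cur = {
                "fingerprint": fp,
                "first_posted": p["posted"],
                "last_posted": p["posted"],
                "source_urls": [p["url"]],
            }
            latest[fp] = cur
            out.append(cur)
    return out
```

Two postings for the same role that differ only in "Senior" versus "Sr." and an office suffix collapse to one record with two source URLs; a repost outside the window starts a new record.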

03 · Confidence scoring

Every record carries a published confidence rating in three tiers. The tier is exposed as a structured field on every API response (confidence_tier) and is computed from a continuous score (confidence_score) between 0 and 1 that is also exposed in the same payload. A buyer can filter on either field, and a buyer can recompute the score from the signal contributions in the response.

High. Continuous score 0.80 to 1.00. Two or more independent source classes corroborate the fact, including at least one public-record filing or one career-page feed observation. Recommended threshold for downstream automated workflows where a false positive carries real cost: outbound recruiter contact, underwriting decisions, syndicated content.
Medium. Continuous score 0.50 to 0.79. One strong-class source plus at least one corroborating signal in a different class, or two strong-class sources where one was timestamped more than 30 days after the other. Suitable for prospecting and editorial research where a human will verify before action.
Low. Continuous score below 0.50. Single-source observations, signals from a class that is structurally noisy on the question being asked, or records still inside the post-publication corroboration window. Surfaced in the index because absence is its own signal, and excluded by default from most customer-facing queries.

Three published tiers. The contract is fixed: low is any continuous score below 0.50, medium is 0.50 through 0.79 inclusive, high is 0.80 and above. Because both fields travel in every response, a buyer can filter on the discrete bucket or recompute against a custom threshold without losing information.
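On the consumer side, the tier contract reduces to two cut points. The sketch below is a minimal reading of it; treating a score that falls strictly between 0.79 and 0.80 as medium is an assumption, since the published bands are stated to two decimals.

```python
def confidence_tier(score: float) -> str:
    """Map a continuous score in [0, 1] to the published three-tier label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence_score must lie between 0 and 1")
    if score < 0.50:
        return "low"
    if score < 0.80:
        return "medium"
    return "high"


def passes(score: float, custom_threshold: float) -> bool:
    """Filter on a buyer-chosen cut instead of the discrete bucket."""
    return score >= custom_threshold
```

A buyer who finds the high tier too permissive for outbound contact could, for example, call passes(score, 0.90) while still reporting the published tier alongside it.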

The continuous score is a rules-based plus signal-based ensemble. Rules-based components are explicit gates: a public-record filing alone never falls below medium on the field it supports; a record with no filing and no career-page observation is capped at medium regardless of press hits; field-level disagreement caps field-level confidence and surfaces the disagreement in a structured array. Signal-based components are weighted by class strength and decayed by recency under a class-specific exponential. Calibration is anchored to roughly 18 months of confirmed historical placements: at the most recent fit, 0.85 corresponds to a confirmed event within 90 days about 84% of the time, 0.50 to roughly 47%, and 0.20 to roughly 18%. We re-fit quarterly and publish the curve in the changelog.

Source class	Class weight (max)	Floor when sole source
Public-record filing	0.30	medium
Career-page feed	0.25	medium
Firm-newsroom announcement	0.22	medium
Trade-press hit	0.10	low
Conference roster	0.08	low
Multi-class corroboration	0.07 (bonus)	n/a
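One way to picture how the gates and the weighted, decayed signals interact is the sketch below. It uses the class weights and sole-source floors from the table; everything else is an assumption made for illustration: the half-lives, the per-signal strength field, taking the best signal per class, and adding contributions linearly. It does not attempt to reproduce the published calibration or the tier descriptions.

```python
CLASS_WEIGHT = {
    "public_record": 0.30,
    "career_page": 0.25,
    "firm_newsroom": 0.22,
    "trade_press": 0.10,
    "conference_roster": 0.08,
}
# "Floor when sole source" column: medium is expressed as a 0.50 floor.
SOLE_SOURCE_FLOOR = {"public_record": 0.50, "career_page": 0.50, "firm_newsroom": 0.50}
CORROBORATION_BONUS = 0.07
MEDIUM_CAP = 0.79
# Assumed half-lives in days; the real class-specific decay is not published here.
HALF_LIFE_DAYS = {
    "public_record": 365,
    "career_page": 30,
    "firm_newsroom": 90,
    "trade_press": 60,
    "conference_roster": 180,
}


def confidence_score(signals: list) -> float:
    """Combine signals of the form {class, strength in [0, 1], age_days}."""
    best = {}
    for s in signals:
        decay = 0.5 ** (s["age_days"] / HALF_LIFE_DAYS[s["class"]])
        contribution = CLASS_WEIGHT[s["class"]] * s["strength"] * decay
        best[s["class"]] = max(best.get(s["class"], 0.0), contribution)
    total = sum(best.values())
    if len(best) >= 2:
        total += CORROBORATION_BONUS
    if len(best) == 1:
        (only_class,) = best
        total = max(total, SOLE_SOURCE_FLOOR.get(only_class, 0.0))
    # Gate: no filing and no career-page observation caps the record at medium.
    if not (best.keys() & {"public_record", "career_page"}):
        total = min(total, MEDIUM_CAP)
    return min(1.0, max(0.0, total))
```

Under these assumptions a lone fresh public-record filing scores 0.50 because the floor lifts it, while a lone trade-press hit stays at 0.10.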

04 · Source-chain audit trail

Every record carries three timestamps and a signal array. The timestamps are first_seen_at, the moment we first observed the record in any source; last_verified_at, the most recent time we re-confirmed it against an authoritative source class; and next_recheck_at, the scheduled next re-verification under the class-specific decay schedule. The signal array, exposed as signals[], lists every contributing observation: source URL, source class, timestamp, and the contribution that signal made to the continuous confidence score.
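A buyer-side audit of these fields might look like the sketch below. The payload is invented for illustration, the key names inside signals[] (source_url, source_class, observed_at, contribution) are assumptions, and so is the premise that the published score equals the sum of the listed contributions.

```python
import json
import math
from datetime import datetime

# Invented example payload; only the top-level field names come from the text.
PAYLOAD = """
{
  "confidence_score": 0.72,
  "confidence_tier": "medium",
  "first_seen_at": "2025-01-06T04:30:00+00:00",
  "last_verified_at": "2025-02-10T04:30:00+00:00",
  "next_recheck_at": "2025-02-17T04:30:00+00:00",
  "signals": [
    {"source_url": "https://example.com/filing", "source_class": "public_record",
     "observed_at": "2025-01-06T04:30:00+00:00", "contribution": 0.28},
    {"source_url": "https://example.com/careers/1", "source_class": "career_page",
     "observed_at": "2025-01-09T04:30:00+00:00", "contribution": 0.24},
    {"source_url": "https://example.com/news/1", "source_class": "firm_newsroom",
     "observed_at": "2025-01-10T04:30:00+00:00", "contribution": 0.20}
  ]
}
"""


def audit(record: dict, tol: float = 1e-6) -> dict:
    """Check timestamp ordering and recompute the score from signals[]."""
    first = datetime.fromisoformat(record["first_seen_at"])
    last = datetime.fromisoformat(record["last_verified_at"])
    nxt = datetime.fromisoformat(record["next_recheck_at"])
    recomputed = min(1.0, sum(s["contribution"] for s in record["signals"]))
    return {
        "timestamps_ordered": first <= last <= nxt,
        "recomputed_score": recomputed,
        "score_matches": math.isclose(
            recomputed, record["confidence_score"], abs_tol=tol
        ),
    }


result = audit(json.loads(PAYLOAD))
```

If the recomputed value and the published value diverge, the gap itself is worth raising, since it suggests a gate or bonus that the signal array does not itemize.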

If a source URL goes dead, we keep the entry and mark it stale rather than deleting it. The historical fact that we observed the page on a given date is part of the audit trail and is often the only remaining handle after a firm has reorganized its newsroom. We additionally archive a hashed snapshot of the rendered content of every URL we cite. The archive is internal and is available on legitimate audit request from regulators, the firm of record, or an enterprise customer with an executed data-processing addendum.

When sources disagree on a field, the disagreement is published in a parallel disagreements[] array. The most common disagreement is practice-group assignment, where a firm announcement places a lateral in one practice but the early docket pattern places them in another. Roughly 6% of recently confirmed lateral records carry at least one disagreement. We do not paper over them. Journalists and litigation-funder principals have told us repeatedly that the disagreements are often the early signal that a firm's stated strategy and its actual hiring pattern have diverged. Webhook-tier customers can subscribe to disagreement-detection events.

Source-chain retention runs 7 years on the signal array and the rendered-content snapshots, after which both age out under our data-minimization policy unless a customer or regulator has a documented retention requirement that extends further.

05 · Quality assurance

QA runs three parallel loops: sampling-and-relabeling on the practice-area taxonomy, inter-rater agreement on subjective fields, and drift monitoring on the index as a whole.

Sampling protocol. Each week we draw a stratified random sample of 200 newly classified postings, weighted to over-represent boundary leaves where we expect classification difficulty (the boundary between general commercial litigation and complex commercial litigation, for example, or between fund formation and private-funds regulatory). The sample is independently relabeled by a panel of three legal-data analysts. We compute a confusion matrix at the leaf level, compute Cohen's kappa against the production classifier, and feed disagreements back into the rule bank that governs the deterministic stage of classification.

Inter-rater agreement. Panelists label independently before consulting one another, then meet to resolve disagreements. Kappa above 0.85 is the operating target for taxonomy mapping; 0.75 is the floor, and a leaf that falls below it is treated as structurally ambiguous and given a structural-pattern flag in the schema rather than a forced classification. Across the most recent 12-week window, mean kappa on taxonomy mapping was 0.88 and the lowest weekly kappa was 0.83.
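For readers who want to reproduce the agreement statistic, Cohen's kappa for two label sequences (for example, panel consensus versus the production classifier) can be computed as below. The 0.85 and 0.75 cut points come from the text; the status labels returned by leaf_status are hypothetical names.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def leaf_status(kappa: float) -> str:
    if kappa > 0.85:
        return "on_target"
    if kappa >= 0.75:
        return "below_target"
    return "structurally_ambiguous"
```

With four items where the raters agree on three, and label frequencies of 2/2 and 3/1, observed agreement is 0.75, expected agreement is 0.50, and kappa is 0.50.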

Drift monitoring. A drift dashboard tracks the share of postings that change classification between schema releases, the share of records that fall below the medium confidence tier from one quarter to the next, and the geographic and practice distribution of records relative to the prior quarter's distribution. Material drift, defined as a greater-than-3% delta on any of these metrics, triggers a root-cause review before the next release ships.
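The material-drift rule amounts to a thresholded comparison between two releases. In this sketch the metric names are hypothetical, and reading "greater-than-3% delta" as more than three percentage points of absolute change in a share is an assumption.

```python
def material_drift(previous: dict, current: dict, threshold: float = 0.03) -> dict:
    """Return the metrics whose share moved by more than the threshold."""
    flagged = {}
    for name, old in previous.items():
        delta = current[name] - old
        if abs(delta) > threshold:
            flagged[name] = round(delta, 6)
    return flagged


prev_q = {"reclassified_share": 0.010, "below_medium_share": 0.120}
curr_q = {"reclassified_share": 0.015, "below_medium_share": 0.160}
flagged = material_drift(prev_q, curr_q)
```

Here only below_medium_share is flagged; a non-empty result is what would trigger the root-cause review.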

An adversarial review runs monthly: a separate analyst constructs queries designed to expose records the system should have caught and did not, and records the system claims to cover but fails to surface. Findings go into the changelog and into the open-questions section below.

06 · Refresh and freshness guarantees

Three freshness contracts, indexed to product tier.

Daily refresh window. 04:00 to 05:00 UTC. Every active record is re-evaluated against its source classes, every new signal from the prior 24 hours is attached, every confidence score is recomputed, and every last_verified_at timestamp is advanced. Customers reading after 05:00 UTC see numbers that include the previous business day in full.

Sub-24-hour latency on the postings feed. Career-page feeds are re-polled hourly between full pulls. A new posting that appeared at 11 AM Eastern lands in the index by 5 PM Eastern at the latest, typically inside an hour. Hourly re-polls are restricted to classes that move hourly; we do not re-poll bar admissions hourly because they do not move hourly, and re-polling for the sake of motion is anti-quality.

Real-time webhook delivery. Push-tier customers receive HMAC-signed webhook events within 90 seconds of a record commit. The SLA is measured from the moment the record passes the dedup and confidence layers to the first delivery attempt leaving our edge. Delivery is at-least-once with idempotency keys; consumers dedupe by event ID. Disagreement-detection events, take-down events, and confidence-tier transitions all flow through the same channel.
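Because delivery is at-least-once, a consumer needs to tolerate repeats. A minimal sketch of event-ID dedup follows; the event_id field name and the in-memory set are assumptions, and a production consumer would keep seen IDs in durable storage.

```python
class IdempotentConsumer:
    """Process each webhook event once, keyed by its event ID."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def receive(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a repeat."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._handler(event)
        # Mark as seen only after the handler succeeds, so that a failed
        # attempt is retried when the event is redelivered.
        self._seen.add(event_id)
        return True
```

Recording the ID after the handler returns trades a small chance of double processing on a crash for the guarantee that a failed event is not silently dropped.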

Freshness is published per-record as last_verified_at and per-feed as a feed-health endpoint. A buyer can audit a feed's actual freshness against the contract without contacting support.

07 · Privacy and compliance

Privacy is a sourcing constraint before it is a compliance posture. We do not scrape behind authentication. The entire index can be reconstructed from sources visible without a login. This rules out professional-network scraping, account-protected newsrooms, and any source whose terms of service restrict programmatic access behind an authenticated session. We respect robots.txt directives, including delay directives, and publish a contact address for sites that want to opt out beyond what robots.txt expresses.
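Honoring robots.txt, including delay directives, can be sketched with Python's standard-library parser. The user-agent string and the one-second fallback delay are assumptions; the sketch parses an in-memory robots.txt rather than fetching one.

```python
from urllib.robotparser import RobotFileParser

AGENT = "placement-bot"  # hypothetical user-agent string


def build_policy(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser: RobotFileParser, url: str, default_delay: float = 1.0):
    """Return None when the URL is disallowed, else the delay to observe."""
    if not parser.can_fetch(AGENT, url):
        return None
    delay = parser.crawl_delay(AGENT)
    return {
        "url": url,
        "delay_seconds": float(delay) if delay is not None else default_delay,
    }


policy = build_policy("User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n")
```

With this policy, a newsroom URL is planned with a five-second delay and anything under /private/ is skipped entirely.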

FCRA awareness. The index is not a consumer report and is not used to make consumer-credit, employment, insurance, or housing eligibility decisions about identified consumers. Customers using our data for lawyers' professional liability (LPL) underwriting work at the firm-aggregate level: posting volume, practice mix, geographic concentration. We do not sell candidate-level data to LPL underwriters, and enterprise contracts include FCRA-aware acceptable-use clauses. The posture aligns with the data-handling expectations of carriers and brokers operating under ALAS, Beazley, and similar professional-liability frameworks.

GLB awareness. Where customer use cases intersect with financial-services personnel data, we operate under the GLB safeguards rule's data-handling expectations: encryption in transit and at rest, role-based access on the customer side, and contractual restrictions on onward transfer. None of the data in the index is non-public personal information under GLB; the posture is conservative because financial-services customers expect it from any vendor in their data supply chain.

GDPR posture for EU candidate dossiers. Candidate dossiers covering EU-resident individuals are processed under a legitimate-interest basis specific to professional-context information and balanced against the candidate's reasonable expectations of professional visibility. EU candidates can request deletion through the privacy contact; deletions propagate within 30 days and are replicated to enterprise customer mirrors on the next refresh cycle. We do not process special-category data, non-public matters, or inferred protected characteristics.

Data lineage. Every record's full lineage is retained for 7 years: source URLs, observation timestamps, the classifier version that mapped it, the calibration version that scored it, and the schema version it was ingested under. The lineage is what lets a regulator or a customer's compliance team ask, three years from now, where a particular fact came from and on what basis it was reported.

08 · Security

Security is a vendor relationship before it is a product feature. The customer-facing API and the underlying data pipeline run as separate services with separate credentials and separate blast radii.

Webhook signatures. Every webhook delivery carries an HMAC-SHA256 signature over the raw body, computed with a secret unique to the customer endpoint. The signature is in the X-Placement-Signature header along with a delivery timestamp; consumers verify both before processing.
API authentication. Bearer tokens scoped per environment (production, sandbox) and per role within the customer organization. Tokens are issued from a console with audit logging on every issuance, rotation, and revocation.
IP allowlisting. Every API key supports an optional IP allowlist. Enterprise contracts default to mandatory allowlisting on the production key; sandbox keys remain unrestricted to keep developer onboarding low-friction.
Rotation policies. Keys roll on a 90-day default cadence with email reminders at 30, 14, and 3 days. Webhook signing secrets roll on the same cadence with a 7-day overlap window during which both old and new signatures verify successfully.
Encryption. TLS 1.3 in transit; AES-256 at rest on every persistent store; per-customer encryption keys for enterprise tenants.
SOC 2 roadmap. Type 1 attestation is targeted for 2026 Q4; Type 2 follows in 2027 Q3 after the required observation window. The audit scope covers the customer-facing API, the data pipeline, the credential-issuance console, and the human-process controls around source ingestion. Until the formal report is available we publish the control framework, the responsible parties, and the gap-remediation status as a security posture page.
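On the receiving end, webhook verification with a rotation overlap might be sketched as follows. The signature is computed over the raw body as described above; the hex encoding, the five-minute timestamp tolerance, and passing the timestamp as a separate Unix-seconds value are assumptions, since the exact header layout is not specified here.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, raw_body: bytes) -> str:
    """HMAC-SHA256 over the raw request body, hex-encoded (encoding assumed)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify(raw_body, signature_hex, timestamp, active_secrets,
           now=None, tolerance_seconds=300):
    """Accept a delivery if it is fresh and any active secret matches.

    During a rotation overlap, active_secrets holds both the old and the
    new signing secret, so either signature verifies.
    """
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_seconds:
        return False  # stale or future-dated delivery
    return any(
        hmac.compare_digest(sign(secret, raw_body), signature_hex)
        for secret in active_secrets
    )
```

hmac.compare_digest is used instead of == so that the comparison time does not leak how much of the signature matched.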

Customer-side incident response is a contractual obligation: a security event affecting customer data triggers notification within 72 hours of confirmed impact, and notification carries the affected record IDs, the suspected vector, and the remediation timeline.

09 · Versioning and deprecation

Two versioned surfaces, both governed by Semantic Versioning.

Taxonomy. The 47-node practice taxonomy carries a SemVer string in every response that uses it. Splitting a leaf is a major-version event because a buyer's downstream join keys may change. Renaming a leaf is a minor-version event because the canonical ID is stable and the rename is metadata. Adding a new leaf is a minor-version event when the new leaf does not displace an existing one and a major-version event when it does. Quarterly drift, the share of records that remap between releases, has stayed under 2% since first publication.
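The taxonomy bump rules can be written down as a small lookup. The change-type names are hypothetical labels for the cases described above, and resetting lower components on a bump follows ordinary Semantic Versioning practice.

```python
MAJOR_CHANGES = {"split_leaf", "add_leaf_displacing"}
MINOR_CHANGES = {"rename_leaf", "add_leaf"}


def next_version(current: str, changes: list) -> str:
    """Return the version a release with these taxonomy changes would carry."""
    major, minor, patch = (int(part) for part in current.split("."))
    unknown = set(changes) - MAJOR_CHANGES - MINOR_CHANGES
    if unknown:
        raise ValueError(f"unrecognized change types: {sorted(unknown)}")
    # The highest-impact change in the release decides the bump.
    if any(change in MAJOR_CHANGES for change in changes):
        return f"{major + 1}.0.0"
    if changes:
        return f"{major}.{minor + 1}.0"
    return current
```

A release that renames one leaf and splits another is therefore a major release, which is the signal a buyer pinning join keys needs.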

API surface. The customer-facing API carries a SemVer string in the URL prefix and in every response. Breaking changes ship behind a new major version; non-breaking additions ship as minor releases within the current major version. We do not use header-only versioning because URL-based versioning makes accidental upgrades impossible, which is the property a buyer cares about when their pipeline is in production.

Deprecation timeline. Every deprecation carries a minimum 12-month support window from the date of public announcement to the date of removal. The deprecation appears in the changelog, in the response headers (Deprecation per RFC 9745 and Sunset per RFC 8594), and in a quarterly deprecation digest emailed to every customer with a key issued against the deprecated surface. Major customer accounts can request extended support beyond the 12-month minimum on a contractual basis.
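A client can check the support window mechanically from the Sunset header, which RFC 8594 defines as an HTTP-date. In the sketch below the header value and the announcement date are invented, and treating 12 months as 365 days is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MIN_SUPPORT_DAYS = 365  # assumption: "12 months" read as 365 days


def sunset_report(headers: dict, announced: datetime, now: datetime) -> dict:
    """Summarize how long a deprecated surface remains available."""
    sunset = parsedate_to_datetime(headers["Sunset"])
    return {
        "sunset": sunset,
        "days_remaining": (sunset - now).days,
        "window_honored": (sunset - announced).days >= MIN_SUPPORT_DAYS,
    }


report = sunset_report(
    {"Sunset": "Thu, 31 Dec 2026 23:59:59 GMT"},
    announced=datetime(2025, 12, 1, tzinfo=timezone.utc),
    now=datetime(2026, 6, 1, tzinfo=timezone.utc),
)
```

Wiring a check like this into a client's response handling would turn the header into an alert well before the removal date.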

The changelog is maintained at changelog.html, in reverse-chronological order, with every entry tagged by the version it shipped in and the surfaces it touched.

10 · Open questions and invitation

The system is not finished. Three categories of open question are worth naming.

Cross-jurisdictional reciprocity. Bar reciprocity in the United States is patchier than the public posture suggests, and an attorney admitted under reciprocity sometimes shows up in the index as a fresh admission and sometimes as an annotation on the original. We are working on a structured reciprocity field that distinguishes the two. Practitioners familiar with reciprocity records, particularly in jurisdictions where grants are not searchable through standard channels, are invited to contact us.

Counsel-title semantics. The counsel title means structurally different things at different firms: a senior-associate plateau at some, a partner-track parking lot at others, a permanent non-equity track at others. Our current schema flattens the title into a seniority band and a structural-pattern flag. A richer counsel-title model that captures firm-specific semantics is in development; comparative-firm data on counsel structure would meaningfully improve it.

Small-firm coverage below 25 attorneys. Firms below 25 attorneys are systematically under-covered: thinner public-record footprint, more variable career-page infrastructure. We have a partial strategy at 25 to 200 attorneys; below 25 the coverage drops sharply. Customers whose use case requires the long tail of small firms can tell us which firms specifically; we expand on a use-case-driven basis rather than uniformly.

Industry input is welcome on every section above. Practitioners, firm marketers, journalists, recruiters, underwriters, and researchers who want to challenge a leaf, supply firm-naming evidence we missed, surface a calibration anomaly, or correct a record can write to hunter@placement.solutions. Substantive submissions land in the editorial queue; surviving submissions ship in a future minor release with attribution.
