Why the screenplay is a hard training corpus
Screenplays are famously bad training data. Not because the text is bad — it’s some of the best dialogue ever written — but because the structure is locked inside a proprietary binary or a flat PDF. Parsing a PDF screenplay into anything resembling “scene 14, spoken by MARA, reacting to the previous line” is a non-trivial research project on its own, repeated per dataset.
ScreenJSON is, among other things, the correct answer to that parsing problem. Because the schema is structural from the start, every scene, every cue, every line of dialogue, every character, every transition is already a typed node with a stable UUID. The structure is the interface.
Retrieval-ready, out of the box
The schema has a dedicated analysis block designed specifically for retrieval and AI workflows. It’s optional and discardable (a document without analysis is still canonical), but when present it gives you three things:
- Embeddings, grouped by target UUID (scene, element, or character), each with its model name, dimensions, source field, language, token count, and creation timestamp.
- Passages, retrieval-sized chunks of text keyed back to the scene and element UUIDs they cover, with token counts and optional sliding-window overlap.
- Summaries, either document-scoped or scene-scoped, marked as human-written or machine-generated, with the producing model recorded.
An analysis.settings block also records how the analysis was produced (the model, chunk size, overlap, and tokeniser), because reproducibility matters.
{
  "analysis": {
    "embeddings": {
      "7a1c0b4e-…": [{
        "id": "…",
        "model": "text-embedding-3-large",
        "dimensions": 1536,
        "values": [0.023, -0.041, 0.017, …],
        "source": "text",
        "lang": "en",
        "tokens": 420,
        "created": "2026-01-14T10:30:00Z"
      }]
    },
    "passages": [{
      "id": "…",
      "scene": "7a1c0b4e-…",
      "elements": ["9d3f-…", "a1c2-…"],
      "text": { "en": "MARA and ELLIS argue about the heist timing…" },
      "tokens": 256,
      "overlap": 32
    }],
    "summaries": [{
      "id": "…",
      "scope": "scene",
      "target": "7a1c0b4e-…",
      "generated": true,
      "model": "claude-opus-4-7",
      "text": { "en": "MARA pressures ELLIS to commit." },
      "created": "2026-01-14T10:45:00Z"
    }],
    "settings": {
      "model": "text-embedding-3-large",
      "size": 512,
      "overlap": 64,
      "tokeniser": "cl100k"
    }
  }
}
You can compute this once at ingestion time and point a RAG stack at it directly.
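Because embeddings are grouped by target UUID, a brute-force similarity scan over the block is short to write. The following is a minimal sketch, assuming the document has been loaded into a Python dict and that the query vector was produced by the same embedding model; `cosine` and `rank_targets` are hypothetical helper names, not part of any ScreenJSON tool.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(doc, query_vector, model, top_k=5):
    """Return up to top_k (score, target_uuid) pairs, best match first.

    Only vectors recorded with the given model name and the same length
    as the query are compared; mixing models would make scores meaningless.
    """
    scored = []
    embeddings = doc.get("analysis", {}).get("embeddings", {})
    for target_id, vectors in embeddings.items():
        for entry in vectors:
            if entry.get("model") != model:
                continue
            if len(entry["values"]) != len(query_vector):
                continue
            scored.append((cosine(entry["values"], query_vector), target_id))
    scored.sort(reverse=True)
    return scored[:top_k]
```

For a catalogue of any real size you would hand the vectors to a vector database instead; the scan is mainly useful for tests and small documents.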
Typical pipelines
A RAG pipeline over a studio catalogue
- Ingest screenplays into ScreenJSON with Greenlight.
- As a pipeline step, chunk into passages (scene-grouped is a sensible default) and embed.
- Store the ScreenJSON (with analysis populated) in your document store. The passage collection goes into a vector database (Pinecone, Weaviate, pgvector, Qdrant) keyed by passage UUID.
- On query, retrieve top-k passages, look up their source element UUIDs in the canonical document for provenance, and hand the LLM both the passage text and the structured metadata (scene heading, character list, scene tags).
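The scene-grouped chunking step can be sketched as follows. This is an illustration under stated assumptions, not Greenlight's implementation: scenes are assumed to carry `id` and `body`, elements `id` and a language-keyed `text`, and `count_tokens` is a whitespace stand-in that you would replace with your real tokeniser. No sliding-window overlap is applied, so `overlap` is recorded as 0.

```python
import uuid


def count_tokens(text):
    # Stand-in tokeniser (assumption): counts whitespace-separated words.
    return len(text.split())


def scene_passages(scene, max_tokens=512, lang="en"):
    """Split one scene into passages that never cross the scene boundary."""
    passages = []
    ids, parts, used = [], [], 0

    def flush():
        nonlocal ids, parts, used
        if parts:
            passages.append({
                "id": str(uuid.uuid4()),
                "scene": scene["id"],
                "elements": ids,
                "text": {lang: " ".join(parts)},
                "tokens": used,
                "overlap": 0,
            })
        ids, parts, used = [], [], 0

    for element in scene.get("body", []):
        text = element.get("text", {}).get(lang, "")
        if not text:
            continue
        n = count_tokens(text)
        # Start a new passage when adding this element would exceed the budget.
        if parts and used + n > max_tokens:
            flush()
        ids.append(element["id"])
        parts.append(text)
        used += n
    flush()
    return passages
```

An element longer than `max_tokens` becomes its own oversized passage here; whether to split inside an element is a policy choice this sketch leaves open.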
The key property: because every retrieved passage points back to concrete scene and element UUIDs, you can cite. “This claim comes from scene 14, line 207, spoken by MARA.” That’s difficult to do reliably when your chunking tool has lost the structure.
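A sketch of that provenance lookup, assuming the canonical document keeps scenes under `document.scenes`, each with an `id`, a `heading`, and a `body` of elements that carry their own `id`. Those field names are inferred from the examples on this page and should be checked against the specification; `resolve_provenance` is a hypothetical helper.

```python
def resolve_provenance(doc, passage, lang="en"):
    """Map a retrieved passage back to its scene and source elements."""
    scenes = {scene["id"]: scene for scene in doc["document"]["scenes"]}
    scene = scenes.get(passage["scene"])
    if scene is None:
        raise KeyError(f"passage points at unknown scene {passage['scene']!r}")

    wanted = set(passage["elements"])
    elements = [
        {
            "id": element["id"],
            "type": element.get("type"),
            "character": element.get("character"),
            "text": element.get("text", {}).get(lang),
        }
        for element in scene.get("body", [])
        if element.get("id") in wanted
    ]
    # Elements come back in script order, which is what a citation needs.
    return {
        "scene": scene["id"],
        "heading": scene.get("heading"),
        "elements": elements,
    }
```

The returned dict is what you would pass to the LLM alongside the passage text, so that the model can quote element UUIDs rather than paraphrase locations.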
A per-character dialogue corpus
Because the schema indexes characters at the document root and every dialogue element references a character UUID, projecting out “every line spoken by a given character, across every script in the corpus” is a query, not a project.
jq --arg id "8d2f-mara" '
  [ .document.scenes[].body[]
    | select(.type == "dialogue" and .character == $id)
    | .text.en ]
' screenplay.json
Build a per-character fine-tuning set by running that query over every script in your catalogue and concatenating the results. Because UUIDs are stable, you can rebuild the dataset later without re-parsing the source scripts.
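The same projection, run across a directory of documents, might look like the sketch below. It assumes one ScreenJSON file per script with a `.json` extension, and it takes a set of character UUIDs in case the same character carries a different UUID in each document; `lines_for_character` and `build_corpus` are hypothetical names.

```python
import json
from pathlib import Path


def lines_for_character(doc, character_id, lang="en"):
    """Every dialogue line spoken by one character UUID in one document."""
    return [
        element["text"][lang]
        for scene in doc["document"]["scenes"]
        for element in scene.get("body", [])
        if element.get("type") == "dialogue"
        and element.get("character") == character_id
        and lang in element.get("text", {})
    ]


def build_corpus(catalogue_dir, character_ids, lang="en"):
    """Collect dialogue records for the given character UUIDs across a catalogue."""
    records = []
    for path in sorted(Path(catalogue_dir).glob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        for character_id in sorted(character_ids):
            for line in lines_for_character(doc, character_id, lang):
                records.append({
                    "source": path.name,
                    "character": character_id,
                    "text": line,
                })
    return records
```

Keeping `source` and `character` on every record means the resulting dataset can still be traced back to the originating script.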
Training data with structural signal
A model trained on ScreenJSON sees richer signal than one trained on flat text:
- Scene boundaries are explicit (heading + scene id).
- Character identity is resolved across cues (no more “is ‘MARA’ the same as ‘MARA (V.O.)’?”).
- Origin markers are typed: V.O., O.S., O.C., FILTER are enum values, not parenthetical conventions.
- Parentheticals are separated from dialogue.
- Transitions are separated from action.
If your training harness ignores this structure, it still has plain text to fall back on. If it uses the structure, the model is trained on explicit scene, speaker, and element-type labels instead of having to infer them from layout conventions.
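One way a harness could expose that structure is to serialise each scene with explicit tags. The tag format below is an illustrative choice, not something the schema prescribes, and it assumes elements carry `type`, an optional `character`, and a language-keyed `text`; `scene_to_tagged_text` is a hypothetical helper.

```python
from html import escape


def scene_to_tagged_text(scene, lang="en"):
    """Render a scene as tagged text, one element per line.

    The element's own type is used as the tag name, so no element
    vocabulary is hard-coded here.
    """
    lines = [f'<scene id="{escape(scene["id"])}">']
    for element in scene.get("body", []):
        text = element.get("text", {}).get(lang)
        if text is None:
            continue
        tag = escape(element["type"])
        attrs = ""
        if element.get("character"):
            attrs = f' character="{escape(element["character"])}"'
        lines.append(f"  <{tag}{attrs}>{escape(text)}</{tag}>")
    lines.append("</scene>")
    return "\n".join(lines)
```

Dropping the tags again recovers the plain-text fallback, so the same records can feed both kinds of harness.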
On hallucination and citation
When you build an internal tool that answers questions about a script — “summarise Mara’s emotional arc”, “find the scene where Ellis betrays the team” — hallucination is the whole game. ScreenJSON’s UUID discipline makes it tractable.
Every LLM response can be structured as:
{
  "answer": "…",
  "citations": [
    { "scene": "<uuid>", "element": "<uuid>", "excerpt": "…" }
  ]
}
Validate citations against the source document: does scene <uuid> exist, does it contain element <uuid>, does element.text match the excerpt? If any check fails, you have a hallucination, and you can refuse the response automatically. If they pass, you can render a viewer (via screenjson-ui) that highlights the exact cited span.
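The three checks translate into a short validator. A minimal sketch, assuming the response has the shape shown above and the document keeps scenes under `document.scenes` with elements in `body`. Matching the excerpt as a whitespace-normalised substring is a design choice made here, not a ScreenJSON rule, and `validate_citations` is a hypothetical name.

```python
def _normalise(text):
    # Collapse runs of whitespace so line wrapping does not cause false failures.
    return " ".join(text.split())


def validate_citations(doc, response, lang="en"):
    """Return a list of (citation_index, reason) for every failed citation.

    An empty list means every citation resolved to a real scene and
    element and its excerpt appears in that element's text.
    """
    scenes = {scene["id"]: scene for scene in doc["document"]["scenes"]}
    failures = []
    for index, citation in enumerate(response.get("citations", [])):
        scene = scenes.get(citation.get("scene"))
        if scene is None:
            failures.append((index, "unknown scene"))
            continue
        element = next(
            (e for e in scene.get("body", []) if e.get("id") == citation.get("element")),
            None,
        )
        if element is None:
            failures.append((index, "element not in scene"))
            continue
        excerpt = _normalise(citation.get("excerpt", ""))
        source = _normalise(element.get("text", {}).get(lang, ""))
        # An empty excerpt would match anything, so it counts as a failure.
        if not excerpt or excerpt not in source:
            failures.append((index, "excerpt mismatch"))
    return failures
```

A caller would typically reject the whole response when the list is non-empty, and also when `citations` is missing or empty, since an uncited answer cannot be verified at all.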
On privacy and training consent
A substantial fraction of the world’s screenplays were never released to the public. A substantial fraction are under NDA. A substantial fraction are the intellectual property of writers who have opinions about AI training.
ScreenJSON’s encryption layer gives you a technical lever here. Text runs can be AES-256-CTR-encrypted while leaving structure, UUIDs, and metadata visible. You can index structure, run classification, and build search indexes without ever decrypting the text. Training requires a deliberate, auditable step: decrypt, train, re-encrypt. That’s a different policy conversation from “we had everything in plain text on an S3 bucket.”
Next
- Tool: screenjson-cli — the engine that produces the data.
- Tool: Greenlight — batch pipelines.
- How-to: Generate embeddings for semantic search
- How-to: Extract dialogue for a single character
- How-to: Feed ScreenJSON into an LLM for analysis
- How-to: Build a screenplay search index
- Specification: analysis