Migrating content to Structured Text

The goal of this guide is to teach you how to migrate an existing DatoCMS project to Structured Text fields. To illustrate the process, we'll use an example project that you can clone on your account to follow each step.

In a hurry? Download the final result!

If you prefer to skip the tutorial and just take a look at the final code, head over to this GitHub repo.

Setup

First of all, to follow this guide, make sure to clone this example project into your own DatoCMS account.

Done? Great! Now open the terminal, create a new directory for the migration project, and install the DatoCMS CLI:

mkdir structured-text-migrations
cd structured-text-migrations
npm init --yes
npm i --save-dev typescript @datocms/cli
npx tsc --init
mkdir -p migrations/utils

Now let's set up a profile for the CLI:

$ datocms profile:set
Config file not present in "datocms.config.json", will be created from scratch
Requested to configure profile "default"
* Level of logging to use for the profile (NONE, BASIC, BODY, BODY_AND_HEADERS) [NONE]:
* Directory where script migrations will be stored [./migrations]:
* API key of the DatoCMS model used to store migration data [schema_migration]:
* Path of the file to use as migration script template (optional):
Writing "datocms.config.json"... done

We're also creating a local .env file containing the full-access API token to the project, so that the datocms CLI can communicate with it (make sure you don't commit it to your repo!):

echo 'DATOCMS_API_TOKEN=<YOUR_READWRITE-TOKEN>' > .env

High-level strategy & Project skeleton

This is the content schema of the cloned project:

The fields we want to convert into Structured Text are the following:

  • HTML Article > Content (HTML multi-paragraph text);

  • Markdown Article > Content (Markdown multi-paragraph text);

  • Modular Content Article > Content (Modular content);

To do that, we're going to write three migration scripts (one for each model) and test the result inside a sandbox environment.

For every field, the high-level plan will be the same:

  1. Create a new Structured Text field for the model;

  2. For every article, take the old content, convert it to Structured Text and save it in the new field;

  3. Destroy the old field.
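The three steps above can be sketched on plain in-memory data. This is only an illustration of the plan, not the real migration: `FieldStore`, `DemoRecord` and `migrateField` are hypothetical names, and the actual scripts in this guide perform each step through the DatoCMS API.

```typescript
// Hypothetical in-memory stand-ins; the real migration talks to the DatoCMS API.
type FieldStore = Set<string>;
type DemoRecord = { id: string; [fieldApiKey: string]: unknown };

function migrateField(
  fields: FieldStore,
  records: DemoRecord[],
  fieldApiKey: string,
  convert: (oldValue: unknown) => unknown,
): void {
  const tempApiKey = `structured_text_${fieldApiKey}`;

  // Step 1: create the new field next to the old one
  fields.add(tempApiKey);

  // Step 2: convert the old content and save it in the new field
  for (const record of records) {
    record[tempApiKey] = convert(record[fieldApiKey]);
  }

  // Step 3: destroy the old field...
  fields.delete(fieldApiKey);
  for (const record of records) {
    // ...and rename the new field so that it takes the old API key
    record[fieldApiKey] = record[tempApiKey];
    delete record[tempApiKey];
  }
  fields.delete(tempApiKey);
  fields.add(fieldApiKey);
}
```

Keeping both fields alive until the very last step means the old content is still available if a conversion throws halfway through.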

Inside the migrations/utils directory, we're adding some functions that we're going to use for all three migrations:

  • createStructuredTextFieldFrom creates a new Structured Text field with the same label and API key as an existing field, but prefixed with structured_text_ (basically, step 1 of our plan);

  • getAllRecords fetches all the records of a specific model using the nested option, so that for modular content fields we get the full payload of the inner block records instead of just their ID (that's the first bit of step 2);

  • swapFields destroys the old field, and renames the new Structured Text field as the old one (that's step 3 of our plan);

Lastly, since:

  • some API calls expect the model ID and not the model API key, and

  • model IDs are different on each environment, and

  • we want our migrations to work on any environment

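we can avoid hardcoding model IDs by writing a getModelIdsByApiKey function that returns an object mapping each API key to its model (and therefore to its ID). Here are all the utilities, together with a findOrCreateUploadWithUrl helper that we'll need later to convert images: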

// ./migrations/utils/createStructuredTextFieldFrom.ts
import { Client, SimpleSchemaTypes } from '@datocms/cli/lib/cma-client-node';

export default async function createStructuredTextFieldFrom(
  client: Client,
  modelApiKey: string,
  fieldApiKey: string,
  modelBlockIds: SimpleSchemaTypes.ItemTypeIdentity[],
): Promise<SimpleSchemaTypes.Field> {
  const legacyField = await client.fields.find(
    `${modelApiKey}::${fieldApiKey}`,
  );
  const newApiKey = `structured_text_${fieldApiKey}`;
  const label = `${legacyField.label} (Structured-text)`;
  console.log(`Creating ${modelApiKey}::${newApiKey}`);
  return client.fields.create(modelApiKey, {
    label,
    api_key: newApiKey,
    field_type: 'structured_text',
    fieldset: legacyField.fieldset,
    validators: {
      structured_text_blocks: {
        item_types: modelBlockIds,
      },
      structured_text_links: { item_types: [] },
    },
  });
}
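
// ./migrations/utils/getAllRecords.ts
import { Client } from '@datocms/cli/lib/cma-client-node';

export default async function getAllRecords(
  client: Client,
  modelApiKey: string,
) {
  // Note: items.list() returns a single page of results. That is enough for
  // the example project; for larger projects, iterate over all pages with
  // client.items.listPagedIterator() instead.
  const records = await client.items.list({
    filter: { type: modelApiKey },
    nested: true,
  });
  console.log(`Found ${records.length} records!`);
  return records;
}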

// ./migrations/utils/swapFields.ts
import { Client } from '@datocms/cli/lib/cma-client-node';

export default async function swapFields(
  client: Client,
  modelApiKey: string,
  fieldApiKey: string,
) {
  const oldField = await client.fields.find(`${modelApiKey}::${fieldApiKey}`);
  const newField = await client.fields.find(
    `${modelApiKey}::structured_text_${fieldApiKey}`,
  );
  // destroy the old field
  await client.fields.destroy(oldField.id);
  // rename the new field
  await client.fields.update(newField.id, {
    api_key: fieldApiKey,
    label: oldField.label,
    position: oldField.position,
  });
}

// ./migrations/utils/getModelIdsByApiKey.ts
import { Client, SimpleSchemaTypes } from '@datocms/cli/lib/cma-client-node';

export default async function getModelIdsByApiKey(
  client: Client,
): Promise<Record<string, SimpleSchemaTypes.ItemType>> {
  const models = await client.itemTypes.list();
  // index every model (and block model) by its API key
  return models.reduce(
    (acc, itemType) => ({
      ...acc,
      [itemType.api_key]: itemType,
    }),
    {},
  );
}

// ./migrations/utils/findOrCreateUploadWithUrl.ts
import { Client } from '@datocms/cli/lib/cma-client-node';
import path from 'path';

export default async function findOrCreateUploadWithUrl(
  client: Client,
  url: string,
) {
  let upload;
  if (url.startsWith('https://www.datocms-assets.com')) {
    const pattern = path.basename(url).replace(/^[0-9]+\-/, '');
    const matchingUploads = await client.uploads.list({
      filter: {
        fields: {
          filename: {
            matches: {
              pattern,
              case_sensitive: false,
              regexp: false,
            },
          },
        },
      },
    });
    upload = matchingUploads.find((u) => {
      return u.url === url;
    });
  }
  if (!upload) {
    upload = await client.uploads.createFromUrl({ url });
  }
  return upload;
}
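To see why a lookup keyed by API key keeps the migrations environment-independent, here is a minimal sketch of the same reduce on made-up data. `DemoModel`, `indexByApiKey` and the IDs are hypothetical; the real utility works on the models returned by the API.

```typescript
// Minimal stand-in for a model object (only the properties we need here)
type DemoModel = { id: string; api_key: string };

function indexByApiKey(models: DemoModel[]): Record<string, DemoModel> {
  return models.reduce<Record<string, DemoModel>>(
    (acc, model) => ({ ...acc, [model.api_key]: model }),
    {},
  );
}

// The same schema in two environments: identical API keys, different IDs
const primaryModels = indexByApiKey([
  { id: '1001', api_key: 'html_article' },
  { id: '1002', api_key: 'image_block' },
]);
const sandboxModels = indexByApiKey([
  { id: '2001', api_key: 'html_article' },
  { id: '2002', api_key: 'image_block' },
]);

// The migration code only ever mentions the API key
console.log(primaryModels['image_block']?.id, sandboxModels['image_block']?.id);
```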

Migrating HTML content

Let's create the first migration script:

> datocms migrations:new convertHtmlArticles
Created migrations/1612281851_convertHtmlArticles.ts

Replace the content of the file with the following skeleton, which uses the utilities we just created:

// ./migrations/1612281851_convertHtmlArticles.ts
import getModelIdsByApiKey from './utils/getModelIdsByApiKey';
import createStructuredTextFieldFrom from './utils/createStructuredTextFieldFrom';
import htmlToStructuredText from './utils/htmlToStructuredText';
import getAllRecords from './utils/getAllRecords';
import swapFields from './utils/swapFields';
import convertImgsToBlocks from './utils/convertImgsToBlocks';
import { Client, SimpleSchemaTypes } from '@datocms/cli/lib/cma-client-node';

type HtmlArticleType = SimpleSchemaTypes.Item & {
  title: string;
  content: string;
};

export default async function convertHtmlArticles(client: Client) {
  const modelIds = await getModelIdsByApiKey(client);
  await createStructuredTextFieldFrom(client, 'html_article', 'content', [
    modelIds.image_block.id,
  ]);
  const records = (await getAllRecords(
    client,
    'html_article',
  )) as HtmlArticleType[];
  for (const record of records) {
    const structuredTextContent = await htmlToStructuredText(
      record.content,
      convertImgsToBlocks(client, modelIds),
    );
    await client.items.update(record.id, {
      structured_text_content: structuredTextContent,
    });
    if (record.meta.status !== 'draft') {
      await client.items.publish(record.id);
    }
  }
  await swapFields(client, 'html_article', 'content');
}

A couple of notes:

  • Inside the HTML field there might be image tags (<img />). Structured Text does not have a specific node to handle images because it offers block nodes, which are a more powerful primitive. This means that, during the transformation process, we'll need to convert those <img /> tags into block records of type "Image" (that's the same block currently used by the Modular Content field). For this reason, in the createStructuredTextFieldFrom call we pass the image_block model ID, so that the newly created Structured Text field accepts this type of block;

  • Inside the for loop we perform the actual record updates, and make sure we republish the updated records (unless they were in draft).

So what is left to do is to implement the htmlToStructuredText() function.

The datocms-html-to-structured-text package offers a parse5ToStructuredText function that is meant to be used in NodeJS environments to perform the conversion from HTML to Structured Text (parse5 is a popular HTML parser for NodeJS).

Internally, parse5ToStructuredText takes the parse5 Document, converts it into a hast tree, and then converts the hast tree into a dast tree (that's the format of our Structured Text document). All these conversions might seem like overkill, but we will see later how having hast as an intermediate representation comes in handy.
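To get a feel for the last conversion step, here is a deliberately tiny sketch of turning a hast-like tree into a dast document. It is not the library's implementation: the `MiniHast` and `MiniDast` types are simplified, hypothetical versions of the real ones, and only paragraphs with strong/em marks are handled.

```typescript
// Simplified node types, loosely modelled on hast and dast (not the real typings)
type MiniHast =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; children: MiniHast[] };

type Span = { type: 'span'; value: string; marks?: string[] };
type Paragraph = { type: 'paragraph'; children: Span[] };
type MiniDast = {
  schema: 'dast';
  document: { type: 'root'; children: Paragraph[] };
};

// HTML tags that become marks on dast spans
const MARKS: Record<string, string> = { strong: 'strong', em: 'emphasis' };

// Flattens inline hast nodes into spans, carrying the active marks along
function toSpans(node: MiniHast, marks: string[] = []): Span[] {
  if (node.type === 'text') {
    return marks.length > 0
      ? [{ type: 'span', value: node.value, marks }]
      : [{ type: 'span', value: node.value }];
  }
  const mark = MARKS[node.tagName];
  const nextMarks = mark ? [...marks, mark] : marks;
  const spans: Span[] = [];
  for (const child of node.children) {
    spans.push(...toSpans(child, nextMarks));
  }
  return spans;
}

// Treats every top-level node as a paragraph
function miniHastToDast(paragraphs: MiniHast[]): MiniDast {
  return {
    schema: 'dast',
    document: {
      type: 'root',
      children: paragraphs.map(
        (p): Paragraph => ({ type: 'paragraph', children: toSpans(p) }),
      ),
    },
  };
}
```

With this sketch, the tree for `<p>Hello <strong>world</strong></p>` becomes a root with one paragraph containing two spans, the second one carrying a `strong` mark.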

Let's install some dependencies:

npm install --save-dev parse5 \
datocms-html-to-structured-text \
datocms-structured-text-utils \
unist-utils-core@1.0.5

Now we have everything we need to build our htmlToStructuredText function:

// ./migrations/utils/htmlToStructuredText.ts
import { parse } from 'parse5';
import {
  parse5ToStructuredText,
  Options,
} from 'datocms-html-to-structured-text';
import { validate } from 'datocms-structured-text-utils';

export default async function htmlToStructuredText(
  html: string,
  settings: Options,
) {
  if (!html) {
    return null;
  }
  const result = await parse5ToStructuredText(
    parse(html, {
      sourceCodeLocationInfo: true,
    }),
    settings,
  );
  // make sure the converted document is valid Structured Text
  const validationResult = validate(result);
  if (!validationResult.valid) {
    throw new Error(validationResult.message);
  }
  return result;
}

Please note that we use the validate function from the datocms-structured-text-utils package to make sure that the final result is valid Structured Text.

Converting image tags into blocks

The code above will convert most of the HTML correctly, but any images present in the content will be skipped.

As we already noted before, that's because Structured Text does not have a specific node to handle images. Instead, it offers block nodes, which can handle images and much more. We have to pass some additional settings to the parse5ToStructuredText function to tell it how to convert <img /> tags to block nodes:

// ./migrations/utils/convertImgsToBlocks.ts
import {
  buildBlockRecord,
  Client,
  SimpleSchemaTypes,
} from '@datocms/cli/lib/cma-client-node';
import { visit, find } from 'unist-utils-core';
import {
  HastNode,
  HastElementNode,
  CreateNodeFunction,
  Context,
  Options,
} from 'datocms-html-to-structured-text';
import findOrCreateUploadWithUrl from './findOrCreateUploadWithUrl';

export default function convertImgsToBlocks(
  client: Client,
  modelIds: Record<string, SimpleSchemaTypes.ItemType>,
): Options {
  return {
    preprocess: (tree: HastNode) => {
      const liftedImages = new WeakSet();
      const body = find(
        tree,
        (node: HastNode) =>
          (node.type === 'element' && node.tagName === 'body') ||
          node.type === 'root',
      );
      visit<HastNode, HastElementNode & { children: HastNode[] }>(
        body,
        (node, index, parents) => {
          if (
            node.type !== 'element' ||
            node.tagName !== 'img' ||
            liftedImages.has(node) ||
            parents.length === 1
          ) {
            return;
          }
          const imgParent = parents[parents.length - 1];
          imgParent.children.splice(index, 1);
          let i = parents.length;
          let splitChildrenIndex = index;
          let childrenAfterSplitPoint: HastNode[] = [];
          while (--i > 0) {
            const parent = parents[i];
            const parentsParent = parents[i - 1];
            childrenAfterSplitPoint =
              parent.children.splice(splitChildrenIndex);
            splitChildrenIndex = parentsParent.children.indexOf(parent);
            let nodeInserted = false;
            if (i === 1) {
              splitChildrenIndex += 1;
              parentsParent.children.splice(splitChildrenIndex, 0, node);
              liftedImages.add(node);
              nodeInserted = true;
            }
            splitChildrenIndex += 1;
            if (childrenAfterSplitPoint.length > 0) {
              parentsParent.children.splice(splitChildrenIndex, 0, {
                ...parent,
                children: childrenAfterSplitPoint,
              });
            }
            if (parent.children.length === 0) {
              splitChildrenIndex -= 1;
              parentsParent.children.splice(
                nodeInserted ? splitChildrenIndex - 1 : splitChildrenIndex,
                1,
              );
            }
          }
        },
      );
    },
    // now that images are top-level, convert them into `block` dast nodes
    handlers: {
      img: async (
        createNode: CreateNodeFunction,
        node: HastNode,
        _context: Context,
      ) => {
        if (node.type !== 'element' || !node.properties) {
          return;
        }
        const { src: url } = node.properties;
        const upload = await findOrCreateUploadWithUrl(client, url);
        return createNode('block', {
          item: buildBlockRecord({
            item_type: { id: modelIds.image_block.id, type: 'item_type' },
            image: {
              upload_id: upload.id,
            },
          }),
        });
      },
    },
  };
}

A few notes:

  • We use the handlers option to specify how to convert <img /> hast nodes to dast block nodes (the default behavior, as we saw, is to simply skip them);

  • The block node should contain a block record of type Image (that's the same block currently used by the Modular Content field), which in turn has a single-asset image field. In the img handler we call findOrCreateUploadWithUrl to find or create an asset starting from the src attribute of the image, and feed it to the image field. Luckily, handlers can be async functions, so we can easily perform asynchronous operations inside them;

  • Since in the dast format a block node can only be at root level, we use the preprocess option to tweak the hast tree and lift every image node up to the root (in case they're inside paragraphs or other tags).
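The lifting step is the least obvious part, so here is a simplified sketch of the idea. It is an illustration under reduced assumptions, not the preprocess function above: `MiniNode` and `liftImages` are hypothetical, and only images that are direct children of a top-level element are handled (the real code also walks deeper ancestors).

```typescript
// Simplified hast-like node type (not the real typings)
type MiniNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; children: MiniNode[] };

function isImg(node: MiniNode): boolean {
  return node.type === 'element' && node.tagName === 'img';
}

// Lifts <img> nodes that are direct children of a top-level element up to the
// root, splitting the parent around each image. Deeper nesting is not handled.
function liftImages(rootChildren: MiniNode[]): MiniNode[] {
  const result: MiniNode[] = [];
  for (const node of rootChildren) {
    if (node.type !== 'element' || isImg(node)) {
      result.push(node);
      continue;
    }
    const tagName = node.tagName;
    let pending: MiniNode[] = [];
    // Emits the siblings collected so far as a copy of the parent element;
    // nothing is emitted when there are none, so no empty parents are left
    const flush = (): void => {
      if (pending.length > 0) {
        result.push({ type: 'element', tagName, children: pending });
        pending = [];
      }
    };
    for (const child of node.children) {
      if (isImg(child)) {
        flush();
        result.push(child);
      } else {
        pending.push(child);
      }
    }
    flush();
  }
  return result;
}
```

A paragraph such as `<p>before <img /> after</p>` ends up as three root-level nodes: a paragraph, the image, and a second paragraph.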

We can test the migration with the following command from the Terminal, which will clone the primary environment into a sandbox, and run the migration:

datocms migrations:run --destination=with-structured-text
✔ Running 1612281851_convertHtmlArticles.ts...
Done!

Success! The article content is correctly converted to structured text.

Migrating Markdown content

Now that we know how to perform the HTML-to-Structured-Text conversion, we only have to make some minor changes to make it work for Markdown content as well.

As we just saw, the datocms-html-to-structured-text package knows how to convert an hast tree (HTML) to a dast tree (Structured Text), so if we can convert a Markdown string to hast, then the rest of the code will be basically the same.

Luckily, hast is part of the unified ecosystem, which also includes:

  • an analogous specification for representing Markdown in a syntax tree, called mdast;

  • a tool to convert Markdown strings to mdast;

  • a tool to convert mdast trees to hast.

Let's install all the packages we need:

npm install --save-dev unified@9 remark-parse@9 mdast-util-to-hast@10

We can now create a function similar to htmlToStructuredText called markdownToStructuredText that connects all the dots:

// ./migrations/utils/markdownToStructuredText.ts
import unified from 'unified';
import toHast from 'mdast-util-to-hast';
import parse from 'remark-parse';
import {
  hastToStructuredText,
  Options,
  HastRootNode,
} from 'datocms-html-to-structured-text';
import { validate } from 'datocms-structured-text-utils';

export default async function markdownToStructuredText(
  markdown: string,
  options: Options,
) {
  if (!markdown) {
    return null;
  }
  const mdastTree = unified().use(parse).parse(markdown);
  const hastTree = toHast(mdastTree) as HastRootNode;
  const result = await hastToStructuredText(hastTree, options);
  const validationResult = validate(result);
  if (!validationResult.valid) {
    throw new Error(validationResult.message);
  }
  return result;
}

We can now create a new migration script:

> datocms migrations:new convertMarkdownArticles
Created migrations/1612340785_convertMarkdownArticles.ts

And basically copy the previous migration, just replacing the name of the model (from html_article to markdown_article), and the call to htmlToStructuredText with a call to markdownToStructuredText:

// ./migrations/1612340785_convertMarkdownArticles.ts
import getModelIdsByApiKey from './utils/getModelIdsByApiKey';
import createStructuredTextFieldFrom from './utils/createStructuredTextFieldFrom';
import markdownToStructuredText from './utils/markdownToStructuredText';
import convertImgsToBlocks from './utils/convertImgsToBlocks';
import getAllRecords from './utils/getAllRecords';
import swapFields from './utils/swapFields';
import { Client, SimpleSchemaTypes } from '@datocms/cli/lib/cma-client-node';

type MdArticleType = SimpleSchemaTypes.Item & {
  title: string;
  content: string;
};

export default async function (client: Client) {
  const modelIds = await getModelIdsByApiKey(client);
  await createStructuredTextFieldFrom(client, 'markdown_article', 'content', [
    modelIds.image_block.id,
  ]);
  const records = (await getAllRecords(
    client,
    'markdown_article',
  )) as MdArticleType[];
  for (const record of records) {
    const structuredTextContent = await markdownToStructuredText(
      record.content,
      convertImgsToBlocks(client, modelIds),
    );
    await client.items.update(record.id, {
      structured_text_content: structuredTextContent,
    });
    if (record.meta.status !== 'draft') {
      await client.items.publish(record.id);
    }
  }
  await swapFields(client, 'markdown_article', 'content');
}

We can now run the new migration inside the sandbox environment we already created for the first migration:

> datocms migrations:run --source=with-structured-text --in-place
✔ Running 1612340785_convertMarkdownArticles.ts...
Done!

Migrating Modular Content fields

To migrate Modular Content fields into Structured Text fields, we must acknowledge that both fields allow nested block records. The difference between the two is that Modular Content is basically an array of block records, while in Structured Text block records live inside the dast tree, in nodes of type block. In other words, our task here is, for every Modular Content field, to transform an array of block records into a single dast document. It's up to us to decide how to convert each block we encounter into one or more nodes of our dast document.

Let's take a look at the project schema again:

The existing Modular Content field supports three block types:

  • Text (which in turn contains a text Markdown field);

  • Code (which has two fields, one that contains the actual code and another that stores the language);

  • Image (which, as we already know, contains a single-asset field called image).

Here's the code for our migration:

// ./migrations/1612340785_convertModularArticles.ts
import { Document, Node, validate } from 'datocms-structured-text-utils';
import getModelIdsByApiKey from './utils/getModelIdsByApiKey';
import createStructuredTextFieldFrom from './utils/createStructuredTextFieldFrom';
import getAllRecords from './utils/getAllRecords';
import swapFields from './utils/swapFields';
import markdownToStructuredText from './utils/markdownToStructuredText';
import convertImgsToBlocks from './utils/convertImgsToBlocks';
import { Client, SimpleSchemaTypes } from '@datocms/cli/lib/cma-client-node';

type ModularArticleType = SimpleSchemaTypes.Item & {
  title: string;
  content: any;
};

export default async function (client: Client) {
  const modelIds = await getModelIdsByApiKey(client);
  await createStructuredTextFieldFrom(
    client,
    'modular_content_article',
    'content',
    [modelIds.image_block.id, modelIds.text_block.id, modelIds.code_block.id],
  );
  const records = (await getAllRecords(
    client,
    'modular_content_article',
  )) as ModularArticleType[];
  for (const record of records) {
    const rootNode = {
      type: 'root',
      children: [] as Node[],
    };
    for (const block of record.content) {
      switch (block.relationships.item_type.id) {
        case modelIds.text_block.id: {
          const markdownSt = await markdownToStructuredText(
            block.text,
            convertImgsToBlocks(client, modelIds),
          );
          if (markdownSt) {
            rootNode.children = [
              ...rootNode.children,
              ...markdownSt.document.children,
            ];
          }
          break;
        }
        case modelIds.code_block.id: {
          rootNode.children.push({
            type: 'code',
            language: block.language,
            code: block.code,
          });
          break;
        }
        default: {
          delete block.id;
          delete block.meta;
          delete block.createdAt;
          delete block.updatedAt;
          rootNode.children.push({
            type: 'block',
            item: block,
          });
          break;
        }
      }
    }
    const result = {
      schema: 'dast',
      document: rootNode,
    } as Document;
    const validationResult = validate(result);
    if (!validationResult.valid) {
      throw new Error(validationResult.message);
    }
    await client.items.update(record.id, {
      structured_text_content: result,
    });
    if (record.meta.status !== 'draft') {
      await client.items.publish(record.id);
    }
  }
  await swapFields(client, 'modular_content_article', 'content');
}

Every time we need to convert a Modular Content field, we start by creating an empty dast root node (that is, one with no children: the rootNode constant).

Then, for every block contained in the modular content (the inner for loop), we accumulate children inside the root node:

  • If it is a Text block (the first case of the switch), we use the markdownToStructuredText function to convert its Markdown content into a dast tree, then take the children of the resulting root node and add them to our accumulator;

  • Since dast supports nodes of type code, if we encounter a Code block (the second case), we simply convert it to a code node, and add it to the accumulator;

  • If we find an Image block (the default case), we wrap the block into a dast block node, and add it to the accumulator as it is.
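The accumulation logic can be tried out in isolation with a self-contained sketch. Everything here is a simplified assumption: `DemoBlock`, `DemoNode`, `textToParagraphs` (a naive stand-in for markdownToStructuredText) and `blocksToDast` are hypothetical names, and real block records carry more attributes than shown.

```typescript
// Simplified, hypothetical block shapes for the three block types
type DemoBlock =
  | { kind: 'text'; text: string }
  | { kind: 'code'; language: string; code: string }
  | { kind: 'image'; uploadId: string };

type DemoNode =
  | { type: 'paragraph'; children: { type: 'span'; value: string }[] }
  | { type: 'code'; language: string; code: string }
  | { type: 'block'; item: DemoBlock };

// Naive stand-in for markdownToStructuredText: one paragraph per chunk of
// text separated by blank lines (no real Markdown parsing happens here)
function textToParagraphs(text: string): DemoNode[] {
  return text
    .split(/\n{2,}/)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map(
      (value): DemoNode => ({
        type: 'paragraph',
        children: [{ type: 'span', value }],
      }),
    );
}

function blocksToDast(blocks: DemoBlock[]) {
  const children: DemoNode[] = [];
  for (const block of blocks) {
    switch (block.kind) {
      case 'text':
        // a Text block may expand into several root-level nodes
        children.push(...textToParagraphs(block.text));
        break;
      case 'code':
        // a Code block maps one-to-one to a code node
        children.push({
          type: 'code',
          language: block.language,
          code: block.code,
        });
        break;
      default:
        // anything else (here, Image) is wrapped as-is in a block node
        children.push({ type: 'block', item: block });
        break;
    }
  }
  return {
    schema: 'dast' as const,
    document: { type: 'root' as const, children },
  };
}
```

Note how one source block can produce several nodes (a Text block with two paragraphs yields two paragraph nodes), which is why the result is built by accumulation rather than by a one-to-one map.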

Wrapping up

Once you get to know the Structured Text format, converting to and from its dast tree representation becomes quite straightforward. The DatoCMS API, coupled with migrations and sandbox environments, makes it easy to apply any kind of transformation to your content.

You can download the final code from this GitHub repo.