Public | Automated Build

Last pushed: 3 months ago
Short Description
Docker image for NYT ingredient-phrase-tagger
Full Description

CRF Ingredient Phrase Tagger

This repo contains scripts to extract the Quantity, Unit, Name, and Comments
from unstructured ingredient phrases. We use it on Cooking to format
incoming recipes. Given the following input:

1 pound carrots, young ones if possible
Kosher salt, to taste
2 tablespoons sherry vinegar
2 tablespoons honey
2 tablespoons extra-virgin olive oil
1 medium-size shallot, peeled and finely diced
1/2 teaspoon fresh thyme leaves, finely chopped
Black pepper, to taste

Our tool produces something like:

    "qty":     "1",
    "unit":    "pound"
    "name":    "carrots",
    "other":   ",",
    "comment": "young ones if possible",
    "input":   "1 pound carrots, young ones if possible",
    "display": "<span class='qty'>1</span><span class='unit'>pound</span><span class='name'>carrots</span><span class='other'>,</span><span class='comment'>young ones if possible</span>",

We use a conditional random field model (CRF) to extract tags from labelled
training data, which was tagged by human news assistants. We wrote about our
approach on the New York Times Open blog. More information about
CRFs can be found here.

On a 2012 Macbook Pro, training the model takes roughly 30 minutes for 130k
examples using the CRF++ library.



brew install crf++
python install


docker pull mtlynch/ingredient-phrase-tagger

Quick Start

The most common usage is to train the model with a subset of our data, test the
model against a different subset, then visualize the results. We provide a shell
script to do this, at:


You can edit this script to specify the size of your training and testing set.
The default is 20k training examples and 2k test examples.



To train the model, we must first convert our input data into a format which
crf_learn can accept:

bin/generate_data --data-path=input.csv --count=1000 --offset=0 > tmp/train_file

The count argument specifies the number of training examples (i.e. ingredient
lines) to read, and offset specifies which line to start with. There are
roughly 180k examples in our snapshot of the New York Times cooking database
(which we include in this repo), so it is useful to run against a subset.

The output of this step looks something like:

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

1/2          I1      L4      NoCAP  NoPAREN  B-QTY
cup          I2      L4      NoCAP  NoPAREN  B-UNIT
sugar        I3      L4      NoCAP  NoPAREN  B-NAME

2            I1      L8      NoCAP  NoPAREN  B-QTY
tablespoons  I2      L8      NoCAP  NoPAREN  B-UNIT
dry          I3      L8      NoCAP  NoPAREN  B-NAME
white        I4      L8      NoCAP  NoPAREN  I-NAME
wine         I5      L8      NoCAP  NoPAREN  I-NAME

Next, we pass this file to crf_learn, to generate a model file:

crf_learn template_file tmp/train_file tmp/model_file


To use the model to tag your own arbitrary ingredient lines (stored here in
input.txt), you must first convert it into the CRF++ format, then run against
the model file which we generated above. We provide another helper script to do

python bin/ input.txt > results.txt

The output is also in CRF++ format, which isn't terribly helpful to us. To
convert it into JSON:

python bin/ results.txt > results.json

See the top of this README for an example of the expected output.



Apache 2.0.

Docker Pull Command