llama.cpp Experiment (wip)

This is part of my effort to update my resume using an LLM. I want to experiment with llama.cpp to see if I can use it to generate my resume in JSON format.

I was inspired by these 2 websites/articles:

Based on those, I want to experiment and see whether I can use llama.cpp to generate output in JSON format.

The article I linked above describes a couple of alternative methods to constrain an LLM to JSON output. I don’t like the “Prompt Engineering” approach, since it seems unreliable. After reading the article, I’d like to try the “Grammar”/“JSON Schema” approach.
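To get a feel for what the grammar approach looks like, here is a toy GBNF grammar (my own illustration, not the json.gbnf file that ships with llama.cpp): a grammar file defines production rules starting from a root symbol, and generation is constrained so the output has to match them.

# (illustration only) a grammar that restricts the model's answer to "yes" or "no"
cat > yes_no.gbnf <<'EOF'
root ::= "yes" | "no"
EOF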

The scripts on that page require llama.cpp, which I currently don't have and don't know at all, so I will need to install it first.

Prerequisite - Install llama.cpp

Steps:

  1. Clone the repo
# go to your project directory in your computer, clone the repo
git clone git@github.com:ggerganov/llama.cpp.git
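# enter the repo directory before building
cd llama.cpp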
  2. Build llama.cpp (using CUDA)
# build using make
make

# I have NVIDIA GPU, so I will build the CUDA version
make LLAMA_CUDA=1
# I got this error:
# bin/sh: 1: nvcc: not found

# I need to install nvidia-cuda-toolkit
sudo apt install nvidia-cuda-toolkit
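# (my addition) quick check that the toolkit is now on the PATH before rebuilding
nvcc --version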

# try to build again
make LLAMA_CUDA=1
# I got another error:
# Makefile:657: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided
# via environment variable CUDA_DOCKER_ARCH, e.g. by running "export CUDA_DOCKER_ARCH=compute_XX" on Unix-like
# systems, where XX is the minimum compute capability that the code needs to run on.
# A list with compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus .  Stop.

# I need to set CUDA_DOCKER_ARCH
# see the list here: https://developer.nvidia.com/cuda-gpus
# my GPU is an RTX 3090, which has compute capability 8.6, hence compute_86
# then try to build again - but clean first to ensure the build includes CUDA
export CUDA_DOCKER_ARCH=compute_86
make clean
make LLAMA_CUDA=1

# built successfully !

Prerequisite - Install LLM Models

  • I need models to run with llama.cpp. I previously used ollama, which makes downloading models very simple. Now I need to understand how to download the models manually.
  • I’m open to any model. I want to try either the llama3 models, which seem very powerful, or phi3, which is more lightweight.
  • Q: It looks like llama.cpp works with the GGUF format, which is different from ollama’s blob files. How do I download models in that format?
    • Note: It looks like I can find those models on Hugging Face, but I need to understand how to download them and convert them to GGUF format.
    • A: Found the instructions from PrunaAI (see below).
  • Q: What model to start with?
    • A: I want to start with the phi3 model, since it is smaller and hopefully faster to download and run. I want the 128k variant because I want more context/tokens (i.e., room for my resume).
  • Q: I want to start with the phi3 model, but Microsoft doesn’t provide a GGUF version of Phi-3-mini-128k-instruct: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
    • A: I found one done by PrunaAI: https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed
    • NOTE: Shoutout to PrunaAI (https://huggingface.co/PrunaAI): they make the models smaller and give very good instructions to noobs like me! Very thankful for that!

Download the model from Hugging Face (instructions credit to PrunaAI)

Reference Instructions

Steps/Rationale:

  • I’m using the Hugging Face CLI (huggingface-cli) to download the model. I prefer the CLI so it can be automated later.
  • See my notes below for the steps!
# ensure that huggingface-cli is already installed
# if not, instruction is here:  https://huggingface.co/docs/huggingface_hub/en/guides/cli
# install hf_transfer: `pip3 install hf_transfer` to speed up downloads on a fast internet connection
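# (my addition) note: hf_transfer is only used when this environment variable is set
export HF_HUB_ENABLE_HF_TRANSFER=1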

# ensure you're logged in
huggingface-cli login
# insert token from hugging face, from: https://huggingface.co/settings/token

# validate you're logged in
huggingface-cli whoami

# change to model directory first
cd models/

# download the model
huggingface-cli download PrunaAI/Phi-3-mini-128k-instruct-GGUF-smashed Phi-3-mini-128k-instruct.IQ3_M.gguf --local-dir . --local-dir-use-symlinks False
# Unfortunately, I got this error:
# Repository Not Found for url: https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-smashed/resolve/main/Phi-3-mini-128k-instruct.IQ3_M.gguf.

# I need to find the correct model name
huggingface-cli download PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/blob/main/Phi-3-mini-128k-instruct.Q5_K_M.gguf  --local-dir . --local-dir-use-symlinks False
# I got this error:
# huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/blob/main/Phi-3-mini-128k-instruct.Q5_K_M.gguf'. Use `repo_type` argument if needed.

# third try
huggingface-cli download PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/resolve/main/Phi-3-mini-128k-instruct.Q5_K_M.gguf  --local-dir . --local-dir-use-symlinks False
# I got same error

# and now I just realized that I used it incorrectly. The right way is probably this:
huggingface-cli download PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed Phi-3-mini-128k-instruct.IQ3_M.gguf --local-dir . --local-dir-use-symlinks False
# and it works !!!!

# download llama3 8b model as well, why not
# downloading it from FaradayDotDev
huggingface-cli download FaradayDotDev/llama-3-8b-Instruct-GGUF llama-3-8b-Instruct.Q5_K_M.gguf  --local-dir . --local-dir-use-symlinks False

# However, I just realized that you can also download models manually from the Hugging Face website :( ...
# in any case, it is good to know that you can download them using the CLI for automation purposes
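# (my addition) quick sanity check that the .gguf files actually landed here
ls -lh *.gguf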
  • Note: I just realized that I can download the model manually from the Hugging Face website. That is probably easier if you only need to download it once, but if you need to automate it, the CLI is the way to go.

Prerequisite - Experiment with llama.cpp

# I have the model now, let's experiment with llama.cpp

# show help
./main -h | less
# the help shows an interactive mode, which looks interesting

# first I tried with interactive mode
./main -i
# apparently I need to specify the model


# I tried interactive mode with the model, but it gave long, weird output
./main -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf -i
# long weird random output

# I'm now trying the example from the page I linked above
./main -ngl 35 -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<s>[INST] {prompt} [/INST]"
# looks promising - it gives some results
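# (my addition, treat as an assumption) note that {prompt} above is a placeholder from the
# article and should be replaced with an actual question; also, the [INST] tags are the
# Mistral-style template, while Phi-3 instruct models reportedly expect a
# <|user|> ... <|end|> <|assistant|> template, so something like this may work better:
./main -ngl 35 -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p $'<|user|>\nWho is the president of the US?<|end|>\n<|assistant|>'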

# it looks like llama.cpp isn't using CUDA - let me see if I can make it use the GPU
# according to llama.cpp, I can select which GPU to use by setting the environment variable CUDA_VISIBLE_DEVICES
# and based on this: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars - I can get the value from nvidia-smi
# use nvidia-smi -L to list all GPUs
nvidia-smi -L

# the output is something like this:
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-1f000d41-bbba-2144-c214-e4a4fac78d5b)
# so I can set:
# export CUDA_VISIBLE_DEVICES=GPU-1f000d41-bbba-2144-c214-e4a4fac78d5b


# now try again with instruction mode
./main -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 --instruct "Who is the president of the US"
# didn't work, it just showed me the help text (--instruct is a flag and doesn't take the prompt as an argument)

# try again without prompt
export CUDA_VISIBLE_DEVICES=GPU-1f000d41-bbba-2144-c214-e4a4fac78d5b
./main -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 --instruct


# apparently to use CUDA I'll need -ngl <some number>, and llama.cpp will try to offload as many layers as possible
# from: https://github.com/ggerganov/llama.cpp/blob/master/docs/token_generation_performance_tips.md
./main -ngl 10000 -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 --instruct
# now I see a different warning - maybe I didn't build llama.cpp cleanly before
# warning: not compiled with GPU offload support
# I'm running `make clean` and `make LLAMA_CUDA=1` again

# I'm trying again with the freshly built llama.cpp
./main -ngl 10000 -m models/Phi-3-mini-128k-instruct.IQ3_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 --instruct
# it works !!
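# (my addition) to double-check that the GPU is actually being used, watch the memory usage
# from another terminal while the model is loaded; the llama.cpp startup log should also
# mention layers being offloaded to the GPU (exact wording varies between versions)
nvidia-smi --query-gpu=name,memory.used --format=csv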

Experiment - Use llama.cpp to create a JSON file from a specific prompt


./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1  \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --grammar-file grammars/json.gbnf \
  -p "An extremely detailed description of the 10 best ethnic dishes in json format only: "

Output:

An extremely detailed description of the 10 best ethnic dishes in json format only: {" dishes": [{"name": "Pad Thai", "description": "Stir-fried rice noodles with shrimp, tofu, and peanuts"}, {"name": "Chicken Tikka Masala", "description": "Marinated chicken cooked in a creamy tomato sauce"}, {"name": "Jamaican Jerk Chicken", "description": "Spicy jerk seasoning on grilled chicken with Caribbean flair"}, {"name": "Kung Pao Chicken", "description": "Stir-fried chicken with peanuts, vegetables, and chili peppers"}, {"name": "Fajitas", "description": "Sizzling beef or chicken strips cooked with onions and bell peppers"}, {"name": "Sushi", "description": "Raw fish wrapped in seaweed with soy sauce and wasabi"}, {"name": "Tacos al pastor", "description": "Marinated pork cooked on a rotisserie with pineapple and onion"}, {"name": "Currywurst", "description": "Grilled sausage sliced and smothered in spicy tomato-based curry sauce"}, {"name": "Feijoada", "description": "Hearty stew of beans, beef, and pork from Brazil"}, {"name": "Tagine", "description": "Slow-cooked lamb or chicken with dried fruits and spices"}]} <|eot_id|> [end of text]

Pretty good! But after doing some reading, I realized that this works by constraining the tokens the model is allowed to generate so that the output follows the given grammar. We still need the prompt to be very specific to get the output we want. I think this is a good start, but I need to experiment more to get the output I want.

Experiment - Try with JSON schema

JSON Resume schema reference: https://github.com/jsonresume/resume-schema

# download schema, save it as jsonresume.schema.json
curl https://raw.githubusercontent.com/jsonresume/resume-schema/master/schema.json > jsonresume.schema.json

# also use it as the base for the prompt file
cp jsonresume.schema.json prompt_with_jsonresume.txt

Now add this text to the beginning and end of prompt_with_jsonresume.txt:

Create a resume for Dr. Evil from Austin Power movie, in json format, using this json schema:

...
...
the schema here
...
...

Please ensure to output ONLY VALID JSON:
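For repeatability, the same file can be built with a small shell snippet instead of hand-editing; a minimal sketch that mirrors the wording above:

{
  echo "Create a resume for Dr. Evil from Austin Power movie, in json format, using this json schema:"
  echo
  cat jsonresume.schema.json
  echo
  echo "Please ensure to output ONLY VALID JSON:"
} > prompt_with_jsonresume.txt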


Now try with json schema


./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1  \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --json-schema jsonresume.schema.json \
  --file prompt_with_jsonresume.txt

Unfortunately, I got this error:

terminate called after throwing an instance of 'nlohmann::json_abi_v3_11_3::detail::parse_error'
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'j'

Let's try without the json-schema parameter first:

./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1  \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --file prompt_with_jsonresume.txt

**It works!** I think the original one failed because the output contained an invalid character. I should play with the prompt more and maybe combine it with a grammar file as well.


./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1  \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --grammar-file grammars/json.gbnf \
  --json-schema jsonresume.schema.json \
  --file prompt_with_jsonresume.txt

Unfortunately, I still got that invalid-character error even when I specified the grammar file. I should do more research.


./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1  \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --grammar-file grammars/json.gbnf \
  --json-schema ./jsonresume.schema.json \
  --file prompt_with_jsonresume.txt

Oh, I found the issue: --json-schema expects the actual schema, not a file path. There is a trick in llama.cpp/examples/main/README.md for external schemas:

  • --json-schema SCHEMA: Specify a JSON schema to constrain model output to (e.g. {} for any JSON object, or {"items": {"type": "string", "minLength": 10, "maxLength": 100}, "minItems": 10} for a JSON array of strings with size constraints).
    • If a schema uses external $refs, you should use --grammar "$( python examples/json_schema_to_grammar.py myschema.json )" instead.

./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1  \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --grammar-file grammars/json.gbnf \
  --grammar "$( python examples/json_schema_to_grammar.py jsonresume.schema.json )" \
  --file prompt_with_jsonresume.txt

It kinda works! Unfortunately, the output is worse than before.

{"$schema":"http://json-schema.org/draft-04/schema#","basics":{"name":"Dr. Evil","label":"","image":"","email":"","phone":"","url":"","summary":"","location":{"address":"","postalCode":"","city":"","countryCode":"","region":"""}," :{ }},"profiles":[] "]," :{ }},"work":[],"volunteer":[],"education":[{"institution":"","studyType":"","startDate": "1989-02-15","endDate": "1994-06-15","score":"","courses":[] "}]," :{ "}},
  " :{} },"awards":{},"certificates":{},"publications":{},"skills":{},"languages":{},"interests":{},"references":{},"projects":{},"meta":{"canonical":"","version":"1.0.0","lastModified":""}} ]} <|eot_id|> [end of text]

Maybe it is because the schema is too complex. I should try a simpler schema.

Ideas

  • Try with simpler schemas and multiple levels of questions (see the sketch below):
    • e.g. first ask for name, address, and phone numbers
    • then ask for all the companies they worked for
    • then, for each company, ask for the position, start date, end date, etc.
    • it's going to be rather expensive, though
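A minimal sketch of what that first, simpler step could look like; the trimmed-down schema and prompt below are my own, with field names borrowed from the JSON Resume "basics" section:

# write a much smaller schema covering only the basic contact details
cat > basics.schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "email": { "type": "string" },
    "phone": { "type": "string" },
    "summary": { "type": "string" }
  },
  "required": ["name", "email", "phone"]
}
EOF

# pass the schema contents (not the path) to --json-schema, per the README note above
./main \
  --model models/llama-3-8b-Instruct.Q5_K_M.gguf \
  --ctx-size 0 \
  --temp 0.8 \
  --repeat_penalty 1.1 \
  --n-predict -1 \
  --threads 4 \
  --n-gpu-layers 2000000 \
  --json-schema "$(cat basics.schema.json)" \
  -p "Create the basic contact details for Dr. Evil from Austin Power movie, in json format only: "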