Ollama API Tutorial - AI chatbots 100% local to use in your own projects

p.kaczmarek2 07 Feb 2025 21:41 13 2103 Cool? (+3)

📢 Listen (AI):

Screenshot of the Ollama Chat App with a model dropdown and chat fields.

.
How to make a project based on the latest language models available locally, such as deepseek-r1, llama, qwen, gemma and mistral? How does Ollama's uniform interface based on HTTP requests work? Here I will try to demonstrate this. We will learn how to send chat requests containing both text and images. The method discussed will allow you to run simple chatbots/assistants from virtually any device capable of making HTTP requests - so even from a Raspberry or from boards with ESP8266/ESP32....

As an introduction, let me remind you of related topics in the series. I have already been running the language model API from OpenAI on the ESP8266:
ESP and ChatGPT - how to use OpenAI API on ESP8266? GPT-3.5, PlatformIO .
I have also already presented a way to run language models on your own machine based on the aforementioned Ollama WebUI:
ChatGPT locally? AI/LLM assistants to run on your computer - download and install .
I have also shown DeepSeek derived models based on this:
Running a miniaturised DeepSeek-R1 on consumer hardware - Ollama WebUI

Now it's time for an intermediate topic - here I will show how to use the Ollama API, quite analogous to the API from OpenAI, only here the server will also be our machine.

Initial requirements .
In the theme I assume we already have the WebUI set up along with some sample models downloaded, this can be done according to my old tutorial .
As a check that everything is working, we fire up our WebUI in Docker and ask it a question:

User interface with content on calculating copper trace resistance.

.
We make sure that everything works. Only then can we move on.

Basics of the API .
First we need to know the port on which Ollama is set up. This is not the same port that WebUI is on. It is easy to find it in Docker:

Docker Desktop screen showing the ollama container highlighted

.
Then, as a test, you can look in your browser at the address listing the available models:


http://192.168.0.213:11434/api/tags

.
We should receive the list in JSON format:

Screenshot of a browser displaying an API model list.

.
This confirms that the API is working and allows us to move on to the next step.

Listing models in practice .
So let's make a simple chat imitation application for a test. I decided to make it in C# - based on WinForms, or Windows windows. I wrote in Visual Studio 2017, but of course you can pick up a newer version.
We start by inserting the available models into the ComboBox dropdown list. We retrieve the models from the aforementioned API via HttpWebRequest, which returns JSON for us to parse via Newtonsoft.Json. We install this library beforehand:

Screenshot from Visual Studio showing the NuGet Package Manager with Newtonsoft.Json installed.

.
And now the code - all in one file, without a visual editor:

.
Subsequently the web request HttpWebRequest fetched data, we read it into the StreamReader stream and then create a JSON object from it using JObject.Parse. There we iterate through the array of models.
It works:

Screenshot of an application with a dropdown menu for selecting language models.

.

Simple query and response streaming .
We already have the models available. Now it is time to put them to use.
An endpoint is used to simply complete the text:


http://192.168.0.213:11434/api/generate

.
We send a POST request with the data in the format:

.
There is also an optional stream argument - this specifies whether the response is streamed. We rather care about this, then there is a better effect and you can see in real time what is happening.
This way we get a stream of JSON files in the response - word by word. This is the same effect as in ChatGPT. We have to add the received words to the displayed window ourselves.

.
The done field specifies whether the given JSON is the last fragment of the response.
Updated code (I added a text box, button, etc):

.
Now StreamReader reads line by line - and we convert these lines into separate JSONs.
Example result:

Ollama Chat App interface showing selected language model and sample conversation exchange.

.

Household minichat - basis .
The chat allows you to send a conversation history split between the user and the assistant. An endpoint is used to create the chat:


http://192.168.0.213:11434/api/chat

.
A POST request with data in JSON format needs to be sent there. At the very least, we need model selection (model field), conversation history (user messages and AI).
An example of the JSON sent:

.
In response, we will receive a JSON stream of this format:

.
The last JSON received will be different:

.
I've added a text box, button, etc to the code here, but that's the least important.
Updated code:

.
The most interesting thing is the SendButton_Click method - this is where the request is sent. I do this in a blocking way, so I suspend the GUI for the duration of the query, but this is just a demo version.
The result:

Window of Ollama Chat App with a dropdown menu for model selection and a text field.

.

His own minichat - a history of exchanges .
One thing left to do is to keep the full history of the conversation. Then the query looks like this for example:

.
So we separate the messageHistory into a window class and update it when the AI response is received.

.
With each query we resend it. It's time to check the AI's memory:

Screenshot of the Ollama Chat App application showing a chat conversation.

.
It works!
Forgive the lack of spaces in the conversation, it's just a display issue. You can immediately have a longer conversation.
Now one thing is worth noting - we can freely modify the text that the user or the model has written.

Recognition of photos .
Some AI models also support photos - llava, for example. Photos can be sent to them encoded in Base64. We append them to the chat query or autocomplete as an image array. The example query then looks like this:

.
Endpoint:


http://192.168.0.213:11434/api/generate

.
The answer will be obtained in the format:

.
It is easy to convert bytes in C# to a string in Base64 format, we have a function ready for this:

.
In this way, we can convert our chat so that it also supports images. By the way, we will use the drag&drop mechanism to be able to simply drag files onto our window:

.
Result:

Screenshot of the Ollama chat application with a model dropdown and Send button.

Ollama chat app interface with a model selection box, text entry window, image field, and Send button.

.
Analogously, you can attach images to a chat conversation. We then place them in the chat history, example below: .

.
Of course, the reliability of the llava model itself is a separate topic, which I have already presented on the forum:
Minitest: robot vision? Multimodal AI LLaVA and workshop image analysis - 100% local .

Summary .
Ollama offers a unified system that allows multiple large language models to be run, both downloaded from the official project website and added manually from the GGUF file . These usually only support text, but there also happen to be 'multimodal' models, i.e. also supporting images, which we include here encoded in Base64 format. The API discussed here supports the possibility of 'streaming' operation, i.e. previewing responses in real time, which strongly resembles the operation of ChatGPT and allows us, in the event of a change of mind, to interrupt the generation of a response earlier and change the query.
The API demonstration shown here was based on the C# language and WinForms, but it could just as well be realised on another platform, perhaps I'll present that soon too. So far I have another project in the frame, but I'll just say for now that it's quite related to electronics, details in the next topic.
Have you already used the Ollama API in your projects? .
PS: For more information I would refer you to the Ollam documentation , and especially their own description of the API . .

About Author

p.kaczmarek2 wrote 13169 posts with rating 10976 , helped 605 times. Been with us since 2014 year.

Comments

Add a comment

kjoxa 08 Feb 2025 03:04

Great post, thanks! What is the configuration of the machine that the example code was communicating with? The response rate is accelerated? [Read more]

p.kaczmarek2 08 Feb 2025 08:32

I presented details of the machine used and a test of the speed of response here: Running the miniaturised DeepSeek-R1 on consumer hardware - Ollama WebUI . Response rates for different model sizes: ... [Read more]

gulson 09 Feb 2025 09:01

The most important thing is that our data does not leak. Thanks for the tutorial. [Read more]

krzbor 09 Feb 2025 12:29

I have a question - how do these simplified deepseek models cope with the Polish language? [Read more]

p.kaczmarek2 09 Feb 2025 14:14

At the moment, in my spare time, I'm working on an electronics exam for AI - I want to automatically see how these models will cope with various tasks. For this there will be a ready-made program so that... [Read more]

krzbor 09 Feb 2025 18:57

How about the 32b model? Unfortunately the 14b performed poorly. [Read more]

RebellionArts 09 Feb 2025 21:36

Hey, maybe instead of just putting up LLM models tell how they can be trained. What does such a training file look like, where to get the data from. I am putting up models myself, I even had to deal with... [Read more]

p.kaczmarek2 10 Feb 2025 01:23

I haven't delved into the topic of training/fine-tuning modei yet. Test with 32b deepseek-r1 for @krzbor : . I don't think these models were trained for Polish.... [Read more]

krzbor 10 Feb 2025 09:18

Thanks for the test. As I read 32b already requires a good graphics card or a lot of patience (when operating on RAM). The results are still poor. However, I noticed that even the weakest model understood... [Read more]

Kera62 11 Feb 2025 02:30

Hello, "Ol lama API Tutorial-chatbots AI 100% locally for use in your own projects" I may be too old (62 years old) to go into this topic, but what I've read here is really fascinating, especially... [Read more]

Jacek Rutkowski 12 Feb 2025 05:32

. Unfortunately, everything is for cash and armaments. What is 'civilian' and 'penny-wise' is beta testing or corpo and military.... [Read more]

katakrowa 12 Feb 2025 15:09

If anyone wants to test different models without, as it were, complicated games of manual configuration and editing JSON files I recommend: https://lmstudio.ai/ The Windows application installs like... [Read more]

p.kaczmarek2 13 Jun 2025 09:02

I add to the mini-program the option to download a new model, just endpoint for this: http://192.168.0.213:11434/api/pull . I will then attach the new code to the post. https://obr... [Read more]

FAQ

TL;DR: A single laptop with 64 GB RAM can stream word-by-word replies from Llama3 over HTTP, and “our data does not leak” [Elektroda, gulson, post #21431307], using Ollama’s /api/chat endpoint—no cloud needed. With three endpoints and one JSON body you have a private, upgradeable AI chatbot.

Why it matters: Makers can embed modern LLMs in IoT, test rigs or desktop tools without vendor lock-in or usage fees.

Quick Facts

• Default REST port: 11434 [Elektroda, p.kaczmarek2, post #21429505]
• Key endpoints: /api/tags, /api/generate, /api/chat, /api/pull [Elektroda, p.kaczmarek2, post #21577977]
• RAM needed: approx. 8-10 GB for 7 B, 28 GB VRAM or ≥64 GB system RAM for 32 B [DeepSeek Paper, 2024]
• Typical stream rate: ≈15 tokens / s on i7-6700HQ + GTX 1060 [Elektroda, p.kaczmarek2, post #21429787]
• Security: traffic stays on-device; no external API calls [Elektroda, gulson, post #21431307]

What exactly is the Ollama API and why use it locally instead of OpenAI?

Ollama wraps multiple GGUF models behind a small HTTP server. Running it locally removes cloud latency, eliminates per-token fees, and keeps all prompts on your machine [Elektroda, p.kaczmarek2, post #21429505]

Which ports and endpoints do I need to open?

The daemon listens on port 11434. Core routes: GET /api/tags (list models), POST /api/generate (single prompt), POST /api/chat (conversational history) and POST /api/pull (download new model) [Elektroda, p.kaczmarek2, #21429505; #21577977].

How do I list installed models from C#?

Send GET http://host:11434/api/tags, parse the returned JSON array "models" and drop the names into your UI [Elektroda, p.kaczmarek2, post #21429505]

How can I stream tokens like ChatGPT?

Set "stream": true in the POST body. The server then sends one JSON line per token until "done": true arrives. Forgetting this flag forces you to wait for the full answer [Elektroda, p.kaczmarek2, post #21429505]

Can you give me a 3-step WinForms example?

Call /api/tags and populate ComboBox.
On Send, POST to /api/chat with model, messages[], "stream":true.
Read lines with StreamReader, append json["message"]["content"] to TextBox [Elektroda, p.kaczmarek2, post #21429505]

How do I send an image to a multimodal model like LLaVA?

Convert the file to Base64 (Convert.ToBase64String), then include an "images" array in the JSON body alongside "prompt". Use /api/generate; response structure mirrors text generation [Elektroda, p.kaczmarek2, post #21429505]

What hardware spec gives smooth replies?

A 6-year-old laptop (i7-6700HQ, 64 GB RAM, GTX 1060) streams around 15 tokens/s with 7 B models [Elektroda, p.kaczmarek2, post #21429787] Larger 32 B models need ≥24 GB VRAM or run 10× slower on CPU [DeepSeek Paper, 2024].

How well do DeepSeek-R models handle Polish?

The 1.5 B model answered only in English, while 32 B produced grammatically correct Polish but still mistranslated technical terms like “mach” for “obwód” [Elektroda, p.kaczmarek2, post #21432862]

How do I download a new model programmatically?

POST {"name":"model-name"} to /api/pull and poll for status until the server returns "completed":true. The model appears in /api/tags once ready [Elektroda, p.kaczmarek2, post #21577977]

What common errors should I watch for?

Empty response: you forgot "prompt" or "messages".
GUI freeze: synchronous StreamReader blocks UI; move to async.
Endless stream: "done":true never received—usually a model crash; restart the daemon [Elektroda, p.kaczmarek2, post #21429505]

Is my data really private?

Yes. All inference runs on your hardware and the API never calls external servers. “Our data does not leak” [Elektroda, gulson, post #21431307]

Can I fine-tune or train my own model for Ollama?

Ollama currently loads pre-quantised GGUF files. You must fine-tune elsewhere (e.g., LoRA in PyTorch), convert to GGUF, then drop it into the models folder. Author p.kaczmarek2 hasn’t covered this yet [Elektroda, p.kaczmarek2, post #21432862]

Are there easier GUIs than coding from scratch?

Yes. LM Studio offers a one-click installer, model catalogue, and built-in HTTP API toggle for Windows and macOS [Elektroda, katakrowa, post #21436454]