{"id":1030,"date":"2024-06-26T11:54:19","date_gmt":"2024-06-26T18:54:19","guid":{"rendered":"https:\/\/lucidbeaming.com\/blog\/?p=1030"},"modified":"2025-12-04T13:41:12","modified_gmt":"2025-12-04T21:41:12","slug":"skulls-composing-music-with-computer-vision-and-a-custom-yolo5-ai-model","status":"publish","type":"post","link":"https:\/\/lucidbeaming.com\/blog\/skulls-composing-music-with-computer-vision-and-a-custom-yolo5-ai-model\/","title":{"rendered":"Skulls: composing music with computer vision and a custom YOLO5 AI model"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/skull-detail-1024x576.jpg\" alt=\"\" class=\"wp-image-1032\" srcset=\"https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/skull-detail-1024x576.jpg 1024w, https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/skull-detail-653x367.jpg 653w, https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/skull-detail-768x432.jpg 768w, https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/skull-detail.jpg 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>A few years ago, I built a primitive computer vision music player (Oracle) using analog video and a basic threshold detector with an Arduino. Since then, outboard AI vision modules have gotten much more specialized and powerful. I decided to try an advanced build of Oracle using the new Grove Vision AI Module V2.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"494\" height=\"496\" src=\"https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/objects.png\" alt=\"\" class=\"wp-image-1035\" style=\"width:234px;height:auto\" srcset=\"https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/objects.png 494w, https:\/\/lucidbeaming.com\/blog\/wp-content\/uploads\/2024\/06\/objects-150x150.png 150w\" sizes=\"auto, (max-width: 494px) 100vw, 494px\" \/><\/figure>\n<\/div>\n\n\n<p>This post describes the approach and build, as well as a few pitfalls to avoid. Seeed sent me one of their boards for free and that was the motivation to try this out. Ultimately, I want to use the lessons learned here to finish a more comprehensive build of Oracle with more capability. This particular project is called Skulls because of the plastic skulls and teeth used for training and inference targeting.<\/p>\n\n\n\n<p>The components are a <a href=\"https:\/\/wiki.seeedstudio.com\/grove_vision_ai_v2\/\" title=\"\">Grove Vision AI Module V2<\/a> (retails for about $26) with an Xiao ESP32 C3 as a controller and interface. The data from the object recognition gets passed to an old Raspberry Pi 3 model A+ using MQTT. The Pi runs <a href=\"https:\/\/mosquitto.org\/\" title=\"\">Mosquitto<\/a> as an MQTT broker and client, as well as <a href=\"https:\/\/lucidbeaming.com\/blog\/running-fluidsynth-on-a-raspberry-pi-zero-w\/\" title=\"\">Fluidsynth<\/a> to play the resulting music. A generic 8226 board is used as a WiFi access point to connect the two assemblies wirelessly.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What worked<\/h3>\n\n\n\n<p>Assembling the hardware was very simple. The AI Module is very small and mated well with an ESP32. Each board has a separate USB connector. The AI Module needs that for uploading models and checking the camera feed. 
<h3 class=\"wp-block-heading\">What worked<\/h3>\n\n\n\n<p>Assembling the hardware was very simple. The AI Module is very small and mated well with an ESP32. Each board has a separate USB connector. The AI Module needs that for uploading models and checking the camera feed. The ESP32 worked with the standard Arduino IDE. I added the <a href=\"https:\/\/wiki.seeedstudio.com\/XIAO_ESP32C3_Getting_Started\/#software-setup\" title=\"\">custom board libraries<\/a> from Seeed to ensure compatibility.<\/p>\n\n\n\n<p>In the beginning I did most of the AI work directly connected to the AI Module and not through the ESP32. It was the only way to see a video feed of what I was getting.<\/p>\n\n\n\n<p>One of the reasons I put so much effort into this project was to use custom AI models. I wasn&#8217;t interested in doing yet another demo of face recognition or pets or whatever. I&#8217;m interested in exploring new human-machine interfaces for creative output. This particular module has the ability to use custom models.<\/p>\n\n\n\n<p>So, I tried to follow the Seeed instructions for creating a model. It was incredibly time-consuming and there were many problems. The most effective tip I can offer is to use the actual camera connected to the board to generate training images AND to clean up those images in Photoshop or Gimp. I went through a lot of trial and error with parameters and context. Having clean images fixed a lot of the recognition issues. I generated and annotated 176 images for training. That took 5-6 hours and the actual training in the Colab notebook took 2-3 hours with different options.<\/p>\n\n\n\n<p>Here is my recipe:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use a simple Arduino sketch to record JPEGs from the camera onto an SD card.<\/li>\n\n\n\n<li>In an image editor, apply Reduce Noise and Levels to the images to normalize them. Don&#8217;t use &#8220;Auto Levels&#8221; or any other automatic toning.<\/li>\n\n\n\n<li>The images will be 240px &#215; 240px. Leave them that size. Don&#8217;t export larger.<\/li>\n\n\n\n<li>In <a href=\"https:\/\/roboflow.com\/\" title=\"\">Roboflow<\/a>, choose &#8220;Object Detection&#8221;, not &#8220;Instance Segmentation&#8221;, for the project.<\/li>\n\n\n\n<li>When annotating, keep consistent spacing between your bounding box and the edges of your object.<\/li>\n\n\n\n<li>Yes, you can annotate multiple objects in a single image. It&#8217;s recommended.<\/li>\n\n\n\n<li>For preprocessing, I chose &#8220;Filter Null&#8221; and &#8220;Grayscale&#8221;.<\/li>\n\n\n\n<li>For augmentation, I chose &#8220;Rotate 90&#8221;, &#8220;Rotation&#8221;, and &#8220;Cutout&#8221;. I did NOT use &#8220;Mosaic&#8221;, even though the Seeed Wiki recommends it. That treatment already happens in the Colab training script.<\/li>\n\n\n\n<li>I exported the dataset using JSON &gt; COCO. None of the other options were relevant.<\/li>\n\n\n\n<li>The <a href=\"https:\/\/colab.research.google.com\/github\/seeed-studio\/sscma-model-zoo\/blob\/main\/notebooks\/en\/Gesture_Detection_Swift-YOLO_192.ipynb\" title=\"\">example Google Colab notebook<\/a> I used was the rock\/paper\/scissors version (Gesture_Detection_Swift-YOLO_192). I only had a few objects and it was the most relevant.<\/li>\n\n\n\n<li>I left the image size at 192&#215;192 and trained for 150 epochs. The resulting TFLite INT8 model was 10.9 MB.<\/li>\n\n\n\n<li>I used the recommended web tool to connect directly to the AI Module and upload the model. It took multiple tries.<\/li>\n\n\n\n<li>On the ESP32 I installed MQTT and used that to transmit the data to my Raspberry Pi (see the sketch after this list). I did not use the on-board WiFi\/MQTT setup of the AI Module.<\/li>\n<\/ul>\n\n\n\n
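<p>That last step is only a small amount of glue code. Here is a minimal sketch of the idea using the common PubSubClient library; the WiFi credentials, broker address, and topic are placeholders, and readDetectedClass() stands in for however you read results back from the AI Module.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Minimal sketch of the ESP32 glue: publish each detection as a class id.\n\/\/ Placeholders: WiFi credentials, broker IP, topic, and readDetectedClass().\n#include &lt;WiFi.h&gt;\n#include &lt;PubSubClient.h&gt;\n\nconst char *ssid   = \"skulls-ap\";      \/\/ the generic ESP8266 access point\nconst char *pass   = \"********\";\nconst char *broker = \"192.168.4.2\";    \/\/ Raspberry Pi running Mosquitto\n\nWiFiClient wifi;\nPubSubClient mqtt(wifi);\n\n\/\/ Placeholder: return the class id of the latest detection, or -1 if none.\nint readDetectedClass() {\n  return -1;\n}\n\nvoid setup() {\n  WiFi.begin(ssid, pass);\n  while (WiFi.status() != WL_CONNECTED) delay(250);\n  mqtt.setServer(broker, 1883);\n}\n\nvoid loop() {\n  if (!mqtt.connected()) mqtt.connect(\"skulls-esp32\");\n  mqtt.loop();\n  int cls = readDetectedClass();\n  if (cls &gt;= 0) {\n    char payload[8];\n    snprintf(payload, sizeof(payload), \"%d\", cls);\n    mqtt.publish(\"skulls\/objects\", payload);  \/\/ the Pi maps this to a note\n  }\n  delay(100);\n}<\/code><\/pre>\n\n\n\n<p>In this sketch the payload is just a bare class id, so the MQTT traffic stays tiny and all of the musical decisions happen on the Pi.<\/p>\n\n\n\n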
<p>This was a difficult project because of very confusing and incomplete documentation at multiple stages. It&#8217;s clear to me that the larger companies don&#8217;t actually want us to be able to do all this ourselves. There were times it felt intentionally obfuscated to force me to buy a premium tier or some unrelated commercial application. I&#8217;m glad I did it though, because I learned some important concepts and limitations of AI training.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Demo<\/h3>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube\"><div style=\"display: contents;\" >\n\n<div data-mode=\"normal\" data-oembed=\"1\" data-provider=\"youtube\" id=\"arve-youtube-7ifflzaazou\" class=\"arve\">\n\t<div class=\"arve-inner\">\n\t\t<div class=\"arve-embed arve-embed--has-aspect-ratio\">\n\t\t\t<div class=\"arve-ar\" style=\"padding-top:56.250000%\"><\/div>\n\t\t\t<iframe allow=\"accelerometer &apos;none&apos;;autoplay &apos;none&apos;;bluetooth &apos;none&apos;;browsing-topics &apos;none&apos;;camera &apos;none&apos;;clipboard-read &apos;none&apos;;clipboard-write;display-capture &apos;none&apos;;encrypted-media &apos;none&apos;;gamepad &apos;none&apos;;geolocation &apos;none&apos;;gyroscope &apos;none&apos;;hid &apos;none&apos;;identity-credentials-get &apos;none&apos;;idle-detection &apos;none&apos;;keyboard-map &apos;none&apos;;local-fonts;magnetometer &apos;none&apos;;microphone &apos;none&apos;;midi &apos;none&apos;;otp-credentials &apos;none&apos;;payment &apos;none&apos;;picture-in-picture;publickey-credentials-create &apos;none&apos;;publickey-credentials-get &apos;none&apos;;screen-wake-lock &apos;none&apos;;serial &apos;none&apos;;summarizer &apos;none&apos;;sync-xhr;usb &apos;none&apos;;web-share;window-management &apos;none&apos;;xr-spatial-tracking &apos;none&apos;;\" allowfullscreen=\"\" class=\"arve-iframe fitvidsignore\" credentialless data-arve=\"arve-youtube-7ifflzaazou\" data-lenis-prevent=\"\" data-src-no-ap=\"https:\/\/www.youtube-nocookie.com\/embed\/7iffLzaAzOU?feature=oembed&amp;iv_load_policy=3&amp;modestbranding=1&amp;rel=0&amp;autohide=1&amp;playsinline=0&amp;autoplay=0\" frameborder=\"0\" height=\"0\" loading=\"lazy\" name=\"\" referrerpolicy=\"strict-origin-when-cross-origin\" sandbox=\"allow-scripts allow-same-origin allow-presentation allow-popups allow-popups-to-escape-sandbox\" scrolling=\"no\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/7iffLzaAzOU?feature=oembed&#038;iv_load_policy=3&#038;modestbranding=1&#038;rel=0&#038;autohide=1&#038;playsinline=0&#038;autoplay=0\" title=\"\" width=\"0\"><\/iframe>\n\t\t\t\n\t\t<\/div>\n\t\t\n\t<\/div>\n\t\n\t\n\t<script type=\"application\/ld+json\">{\"@context\":\"http:\\\/\\\/schema.org\\\/\",\"@id\":\"https:\\\/\\\/lucidbeaming.com\\\/blog\\\/skulls-composing-music-with-computer-vision-and-a-custom-yolo5-ai-model\\\/#arve-youtube-7ifflzaazou\",\"type\":\"VideoObject\",\"embedURL\":\"https:\\\/\\\/www.youtube-nocookie.com\\\/embed\\\/7iffLzaAzOU?feature=oembed&iv_load_policy=3&modestbranding=1&rel=0&autohide=1&playsinline=0&autoplay=0\"}<\/script>\n\t\n<\/div>\n<\/div><figcaption class=\"wp-element-caption\">A demo of different sounds and arrangements produced by the assembly.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p>I&#8217;ll use this knowledge to finish a new build of the actual Oracle music composition platform I started. This particular demo is interesting, but somewhat unpredictable and technically fragile. I found the research on generative music to be the most interesting part. 
As for the AI, I&#8217;m sure all this will be simplified and optimized in the future. I just hope the technology stays open enough for artists to use independently.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few years ago, I built a primitive computer vision music player (Oracle) using analog video and a basic threshold detector with an Arduino. Since then, outboard AI vision modules have gotten much more specialized and powerful. I decided to try an advanced build of Oracle using the new Grove Vision AI Module V2.<\/p>\n","protected":false},"author":1,"featured_media":1032,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[2,3],"tags":[34,22,11,6,5],"class_list":["post-1030","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-building","category-playing","tag-ai","tag-coding","tag-interactive","tag-raspberry-pi","tag-synthesizer"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/posts\/1030","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/comments?post=1030"}],"version-history":[{"count":4,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/posts\/1030\/revisions"}],"predecessor-version":[{"id":1037,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/posts\/1030\/revisions\/1037"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/media\/1032"}],"wp:attachment":[{"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/media?parent=1030"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/categories?post=1030"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lucidbeaming.com\/blog\/wp-json\/wp\/v2\/tags?post=1030"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}