Local LLaMA on a Mac M1

Mar 12, 2023

It's now possible to run the 13B parameter LLaMA LLM from Meta on a (64GB) Mac M1 laptop. So that's what I did.

The process is fairly simple thanks to a pure C/C++ port of the LLaMA inference code (a little less than 1000 lines, found here). The lower memory requirement comes from 4-bit quantization (here) and support for mixed f16/f32 precision. You can think of both techniques as ways of sacrificing numerical precision in mathematical operations in exchange for speed and a smaller memory footprint.
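To make the idea concrete, here's a minimal sketch of block-wise 4-bit quantization in Python. This is not the exact scheme the C/C++ port uses; it just illustrates the principle: store one f32 scale per block of weights, and round each weight to a 4-bit integer relative to that scale (roughly 4.5 bits per weight instead of 32).

```python
import numpy as np

def quantize_q4(weights, block_size=32):
    """Quantize f32 weights to 4-bit ints, one f32 scale per block (illustrative)."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block maps the max-magnitude value in the block to +/-7
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4(q, scales):
    """Recover approximate f32 weights from the 4-bit ints and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_q4(w)
w_approx = dequantize_q4(q, scales)
```

The reconstructed weights are close but not exact; that per-weight rounding error is the precision being traded away for a model that fits in laptop RAM.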

These optimization techniques are fairly similar to what I believe Fabrice Bellard (ffmpeg, QuickJS, QEMU) is doing behind the scenes with TextSynth.

Results (prompt in plaintext, completion in italic) – note that LLaMA is not instruction-tuned:

Bio for Matt Rickard (matt-rickard): Author, Photographer, Educator, Cyclist, Traveler, Tinkerer, Putterer.

Matt Rickard is an educator, photographer, and author. He is the author of Elegant Cycling Clothing, which has been described as “the most complete and authoritative book ever published on cycling clothing” by CyclingNews.com and “a must for cycling fans” by Sports Illustrated.

# extract_emails uses a regex to parse emails from text and returns them in a list
def extract_emails(text):
    emails = []
    # match a word between a dot an a @
    emails_regex = r'([a-zA-Z0-9_\.\-]+)@([a-zA-Z0-9_\.\-]+)'
    # add the email to the list of emails
    match = re.findall(emails_regex, text)
    for email in match:
        # check if email is already in the list
        if email in emails:
            # make sure we don't have duplicates
            continue

(this is a good conclusion for a blog post):

lol. I think it was a good way to end the post.
TheBigO 14 May 2014 at 11:56
Hahahaha oh my gosh. I literally laughed out loud.
I actually burst out laughing while reading this in class. I kept having to hide behind my arm so my teacher wouldn't see me.
Hahaha, I am so happy I was able to make you laugh. Haha. It was a funny topic to write a post about. Thanks for reading it!
Hahahahahaha