Sometimes a cosmic ray might hit the sign bit of the register and flip it to a negative value. So it is useful to pass it through a rectifier to ensure it's never negative, even in this rare case.
Indeed, we should call all idempotent functions twice just in case the first incantation fails to succeed.
In all seriousness, this is not at all how resilience to cosmic interference works in practice, and the probability of any executed instruction or any other bit being flipped is far greater than that of the one specific bit you are addressing.
The vectors don't need to be orthogonal due to the use of non-linearities in neural networks. The softmax in attention lets you effectively pack as many vectors in 1D as you want and unambiguously pick them out.
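A toy numpy sketch of that point (my own, with made-up numbers, not anything from the article): put N items behind N scalar keys on a single axis, and a sharp enough softmax can pick any one of them back out, even though scalar keys are as non-orthogonal as it gets.

```python
import numpy as np

# Toy illustration: store N "values" behind N scalar keys laid out along a
# single dimension, then use a sharp softmax to look one of them up.
N = 1000
keys = np.linspace(-1.0, 1.0, N)           # N distinct positions on one axis
values = np.arange(N)                      # payload we want to retrieve

def attend(query, temperature=1e-3):
    # Softmax over negative squared distance to the query; a small temperature
    # makes the distribution concentrate on the closest key.
    scores = -(keys - query) ** 2 / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                # soft lookup

target = 123
print(attend(keys[target]))                # ~123.0: the non-linearity separates
                                           # items that are not orthogonal at all
```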
I became a bit lost between "C is a constant that determines the probability of success" and then they set C between 4 and 8. Probability should be between 0 and 1, so how does it relate to C?
Wow, I think I might just have grasped one of the sources of the problems we keep seeing with LLMs.
Johnson-Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space with a dimension based on the number of points.
It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).
The embedding dimensions needed to fulfil Takens' theorem are related to the original manifold's dimension and not the number of points.
It's quite probable that we observe violations of topological features of the original manifold when using our too-low-dimensional embedded version to interpolate.
I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:
=== AI in use ===
If you want to resolve an attractor down to a spatial scale rho, you need about
n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).
The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension
k ≳ (d_B / ε^2) * log(C / rho).
So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale
rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),
below which you can’t keep all distances separated: points that are far on the true attractor will show up close after projection. That’s called “folding” and might be the source of some of the problems we observe.
=== AI end ===
Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
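For anyone who wants to poke at this, here is a rough numpy sketch (mine; iid random points rather than an actual attractor, but it shows the k-dependence): project points down with a random Gaussian JL-style map and watch the worst pairwise distance ratios as k shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 1000, 500
X = rng.normal(size=(n, D))                      # n sample points in a high-dim space

def pairwise_dists(M):
    # Condensed pairwise Euclidean distances, numpy only.
    sq = (M * M).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (M @ M.T)
    i, j = np.triu_indices(len(M), k=1)
    return np.sqrt(np.maximum(d2[i, j], 0.0))

d_hi = pairwise_dists(X)
for k in (4, 16, 64, 256):
    P = rng.normal(size=(D, k)) / np.sqrt(k)     # random Gaussian (JL-style) projection
    d_lo = pairwise_dists(X @ P)
    ratio = d_lo / d_hi
    print(k, ratio.min(), ratio.max())
# For large k the ratios hug 1 (distances preserved); for small k the worst
# pairs shrink to a small fraction of their true distance -- the "folding"
# described above.
```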
If someone is bored and would like to discuss this, feel free to email me.
Can you share an actual example demonstrating this potential pathology?
Like many things in ML, this might be a problem in theory but empirically it isn’t important, or is very low on the stack rank of issues with our models.
You can also imagine a similar thing on binary vectors. There, two vectors are "orthogonal" if they share no bits that are set to one. So you can encode a huge number of concepts using only a small number of bits in modestly sized vectors, and most of them will be orthogonal.
I don't think so. For n=3 you can have 000, 001, 010, 100. All 4 (n+1) are pairwise orthogonal. However, I don't think js8 is correct as it looks like in 2^n you can't have more than n+1 mutually orthogonal vectors, as if any vector has 1 in some place, no other vector can have 1 in the same place.
It's not correct to call them orthogonal, because I don't think the definition is a dot product. But that aside, yes, an orthogonal basis can only have as many elements as there are dimensions. The article also mentions that, and then introduces "quasi-orthogonality", which means the dot product is not zero but very small. On bitstrings, it would correspond to overlap on only a small number of bits. I should have been clearer in my offhand remark. :-)
Your initial statement, that you can include a lot of information in a small number of bits, is still wrong. If you have a small number of bits, the overlap will be staggering. That may be OK, but not if you want to represent orthogonal (or even quasi-orthogonal) concepts.
Also, why do you believe dot product cannot be trusted?
Hmm, I think one correction: is (0,0,0) actually a vector? I think that, by definition, an n-dimensional space can have at most n vectors which are all orthogonal to one another.
By the original definition, they can share bits that are set to zero and still be orthogonal. Think of the bits as basis vectors – if they have none in common, they are orthogonal.
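A quick numpy sketch of that "small overlap = quasi-orthogonal" idea, with toy numbers I made up (20 set bits out of 10,000):

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_set, n_vecs = 10_000, 20, 2_000

# Each "concept" sets only 20 of 10,000 bits, chosen at random.
vecs = np.zeros((n_vecs, n_bits), dtype=np.float32)
for row in vecs:
    row[rng.choice(n_bits, size=n_set, replace=False)] = 1.0

overlaps = vecs @ vecs.T                       # shared set bits for every pair
i, j = np.triu_indices(n_vecs, k=1)
print((overlaps[i, j] == 0).mean())            # the vast majority of pairs share no bits
print(overlaps[i, j].max())                    # worst-case overlap is a few bits out of 20
```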
That really stretches the meaning of "syntactic". Humans have thoroughly evaluated LLMs and discovered many patterns that very cleanly map to what they would consider real-world concepts. Semantic properties do not require any human-level understanding; a Python script has specific semantics one may use to discuss its properties, and it has become increasingly clear that LLMs can reason (as in derive knowable facts, extract logical conclusions, compare it to different alternatives; not having a conscious thought process involving them) about these scripts not just by their syntactic but also semantic properties (of course bounded and limited by Rice's theorem).
> posed a fascinating question: How can a relatively modest embedding space of 12,288 dimensions (GPT-3) accommodate millions of distinct real-world concepts?
Because there is a large number of combinations of those 12k dimensions? You don’t need a whole dimension for “evil scientist” if you can have a high loading on “evil” and “scientist.” There is quickly a combinatorial explosion of expressible concepts.
I may be missing something but it doesn’t seem like we need any fancy math to resolve this puzzle.
Space embedding based on arbitrary points never resolves to specifics. Particularly downstream. Words are arbitrary, we remained lazy at an unusually vague level of signaling because arbitrary signals provide vast advantages for the sender and controller of the signal. Arbitrary signals are essentially primate dominance tools. They are uniquely one-way. CS never considered this. It has no ability to subtract that dark matter of arbitrary primate dominance that's embedded in the code. Where is this in embedded space?
LLMs are designed for Western concepts of attributes, not holistic, or Eastern. There's not one shred of interdependence, each prediction is decontextualized, the attempt to reorganize by correction only slightly contextualizes. It's the object/individual illusion in arbitrary words that's meaningless. Anyone studying Gentner, Nisbett, Halliday can take a look at how LLMs use language to see how vacant they are. This list proves this.
LLMs are the equivalent of circus act using language.
"Let's consider what we mean by "concepts" in an embedding space. Language models don't deal with perfectly orthogonal relationships – real-world concepts exhibit varying degrees of similarity and difference. Consider these examples of words chosen at random:
"Archery" shares some semantic space with "precision" and "sport"
"Fire" overlaps with both "heat" and "passion"
"Gelatinous" relates to physical properties and food textures
"Southern-ness" encompasses culture, geography, and dialect
"Basketball" connects to both athletics and geometry
"Green" spans color perception and environmental consciousness
"Altruistic" links moral philosophy with behavioral patterns"
Interdependence takes into account the Universe for each thing or idea. There is no such thing as probabilistic in a healthy mind. A probabilistic approach is unhealthy.
edit: looking into this, in terms of the brain and arbitrariness this is likely highly paradoxical, even oxymoronic
>>isn’t learning the probabilistic relationships between tokens an attempt to approximate those exact semantic relationships between words?
This is really a poor manner of resolving the conduit metaphor condition to arbitrary signals, to falsify them as specific, which is always impossible. This is simple linguistics via animal signal science. If you can't duplicate any response with a high degree of certainty from output, then the signal is only valid in the most limited time-space condition, and yet it is still arbitrary. CS has no understanding of this.
Basil Bernstein's 1973 studies comparing English and math comprehension differences in class.
Halliday's Language and Society Vol 10
Primate Psychology Maestripieri
Apes and Evolution Tuttle
Symbolic Species Deacon
Origin of Speech MacNeilage
That's the tip of the iceberg
edit: As CS doesn't understand the parasitic or viral aspects of language and simply idealizes it, it can't access it. It's more of a black box than the coding of these. I can't understand how CS assumed this would ever work. It makes no sense to exclude the very thing that language is and then automate it.
> The implications of these geometric properties are staggering. Let's consider a simple way to estimate how many quasi-orthogonal vectors can fit in a k-dimensional space. If we define F as the degrees of freedom from orthogonality (90° - desired angle), we can approximate the number of vectors as [...]
If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2---that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,288 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
Spherical codes are kind of obscure: I haven't heard of them before, and Wikipedia seems to have barely heard of them. And most of the Google results seem to be about playing golf in small dimensions (i.e., how many can we optimally pack in n<32 dimensions?).
People do indeed rediscover previously existing math, especially when the old content is hidden under non-obvious jargon.
The problem with saying something is LLM generated is it cannot be proven and is a less-helpful way of saying it has errors.
Pointing out the errors is a more helpful way of stating problems with the article, which you have also done.
In that particular picture, you're probably correct to interpret it as C vs N as stated.
> The problem with saying something is LLM generated is it cannot be proven and is a less-helpful way of saying it has errors.
It's a very helpful way of saying it shouldn't be bothered to be read. After all, if they couldn't be bothered to write it, I can't be bothered to read it.
Agreed. What writing is better for understanding the geometric and information properties of high-dimensional vector spaces + spherical codes?
There's a lot of beautiful writing on these topics on the "pure math" side, but it's hard to figure out what results are important for deep learning and to put them in a form that doesn't take too much of an investment in pure math.
I think the first chapter of [1] is a good introduction to general facts about high-dimensional stuff. I think this is where I first learned about "high-dimensional oranges" and so on.
For something more specifically about the problem of "packing data into a vector" in the context of deep learning, last year I wrote a blog post meant to give some exposition [2].
One really nice approach to this general subject is to think in terms of information theory. For example, take the fact that, for a fixed epsilon > 0, we can find exp(C d) vectors in R^d with all pairwise inner products smaller than epsilon in absolute value. (Here C is some constant depending on epsilon.) People usually find this surprising geometrically. But now, say you want to communicate a symbol by transmitting d numbers through a Gaussian channel. Information theory says that, on average, I should be able to use these d numbers to transmit C d nats of information. (C is called the channel capacity, and depends on the magnitude of the noise and e.g. the range of values I can transmit.) The statement that there exist exp(C d) vectors with small inner products is related to a certain simple protocol to transmit a symbol from an alphabet of size exp(C d) with small error rate. (I'm being quite informal with the constants C.)
[1] https://people.math.ethz.ch/~abandeira//BandeiraSingerStrohm... [2] https://cgad.ski/blog/when-numbers-are-bits.html
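If anyone wants to see the exp(C d) statement numerically, here is a quick check (mine, with the same informality about constants): random unit vectors in R^d have pairwise inner products concentrated near 0 with spread about 1/sqrt(d), so you can draw a lot of them before any pair exceeds a fixed epsilon.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2048, 5_000

# n random unit vectors in R^d.
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

G = np.abs(V @ V.T)                     # |pairwise inner products|
np.fill_diagonal(G, 0.0)
print(G.max())                          # largest off-diagonal |<v_i, v_j>|; stays well under 0.2 here
# Typical inner products are ~1/sqrt(d) ~= 0.02, and even the worst of roughly
# 12.5 million pairs stays comfortably below eps = 0.2 -- consistent with being
# able to pack on the order of exp(C d) such vectors before a fixed eps is hit.
```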
I think the author is too focused on the case where all vectors are orthogonal and as a consequence overestimates the amount of error that would be acceptable in practice. The challenge isn't keeping orthogonal vectors almost orthogonal, but keeping the distance ordering between vectors that are far from orthogonal. Even much smaller values of epsilon can give you trouble there.
So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
Since vectors are usually normalized to the surface of an n-sphere and the relevant distance for outputs (via loss functions) is cosine similarity, "near orthogonality" is what matters in practice. This means during training, you want to move unrelated representations on the sphere such that they become "more orthogonal" in the outputs. This works especially well since you are stuck with limited precision floating point numbers on any realistic hardware anyways.
Btw. this is not an original idea from the linked blog or the youtube video it references. The relevance of this lemma for AI (or at least neural machine learning) was brought up more than a decade ago by C. Eliasmith as far as I know. So it has been around long before architectures like GPT that could actually be realistically trained on such insanely high dimensional world knowledge.
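A tiny numpy sketch (mine, not from the post) of that "push unrelated representations toward orthogonality" idea: projected gradient descent on squared cosine similarity, snapping back onto the unit sphere after each step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
u = rng.normal(size=d); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)

lr = 0.1
for step in range(200):
    cos = u @ v                       # cosine similarity (both are unit vectors)
    # Gradient of cos^2 w.r.t. v, ignoring the norm constraint; we re-project after.
    v -= lr * 2.0 * cos * u
    v /= np.linalg.norm(v)            # snap back onto the unit sphere

print(abs(u @ v))                     # essentially 0: v has been made orthogonal to u
```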
> Since vectors are usually normalized to the surface of an n-sphere (...)
In classification tasks, each feature is normalized independently. Otherwise, an entry with features Foo and Bar would, depending on the value of Bar, be made out to be less Foo when normalized.
These vectors are not normalized onto n-spheres, and their codomain ends up being a hypercube.
I agree the OPs argument is a bad one. But I’m still optimistic about the representational capacity of those 20k dimensions.
Sort of trivial but fun thing: you can fit billions of concepts into this much space, too. Let's say four bits of each component of the vector are important, going by how some providers do fp4 inference and it isn't entirely falling apart. So an fp4 dimension-12K vector takes up 6KB, like a few pages of UTF-8 text, more compressed text, or 3K tokens from a 64K-token vocabulary. How many possible multi-page 'thoughts' are there? A lot!
(And in handling one token, the layers give ~60 chances to mix in previous 'thoughts' via the attention mechanism, and mix in stuff from training via the FFNs! You can start to see how this whole thing ends up able to convert your Bash to Python or do word problems.)
Of course, you don't expect it to be 100% space-efficient, detailed mathematical arguments aside. You want blending two vectors with different strengths to work well, and I wouldn't expect the training to settle into the absolute most efficient way to pack the RAM available. But even if you think of this as an upper bound, it's a very different reference point for what 'ought' to be theoretically possible to cram into a bunch of high-dimensional vectors.
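The back-of-envelope behind those numbers, assuming a GPT-3-ish width of 12,288 (my assumption; the comment just says 12K):

```python
# Back-of-envelope for the sizes quoted above.
dims = 12_288                 # embedding width (assumed)
bits_per_component = 4        # fp4
vector_bytes = dims * bits_per_component // 8
print(vector_bytes)           # 6144 bytes = 6 KB

# A 64K-token vocabulary means 16 bits = 2 bytes per token id.
print(vector_bytes // 2)      # ~3072 token ids fit in the same 6 KB
```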
This set of intuitions, and the Johnson-Lindenstrauss lemma in particular, is what powers a lot of the research effort behind SAEs (Sparse Autoencoders) in the field of mechanistic interpretability in AI safety.
A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html
Where can I read the actual paper? Where is it published?
That is the actual paper, it's published on transformer-circuits.pub.
It's not peer-reviewed?
How would you peer-review something like that? Or rather, how would you even _reproduce_ any of that? Some inception-based model with some synthetic data? Where is the data? I'm sure the paper was written in good faith, but all it leads to is just more crackpottery. If I had a penny for each time I've heard that LLMs are quantum in nature "because softmax is essentially a wave function collapse" and "because superposition," then I would have more than one penny!
There are several [replications and comments from reviewers](https://transformer-circuits.pub/2022/toy_model/index.html#c...), so yes, it is.
Google Scholar claims 380 citations, which is, I think, a respectable number of peers to have reviewed it.
That's not at all how peer review works.
It's not how pre-publication peer review works. There, the problem is that many papers aren't worth reading, but to determine whether it's worth reading or not, someone has to read it and find out. So the work of reading papers of unknown quality is farmed out over a large number of people each reading a small number of randomly-assigned papers.
If somebody's paper does not get assigned as mandatory reading for random reviewers, but people read it anyway and cite it in their own work, they're doing a form of post-publication peer review. What additional information do you think pre-publication peer review would give you?
Peer review would encourage less hand-wavy language and more precise claims. Reviewers would penalize the authors for bringing up bizarre analogies to physics concepts for seemingly no reason. They would criticize the fact that the authors spend the whole post talking about features without a concrete definition of a feature.
The sloppiness of the circuits thread blog posts has been very damaging to the health of the field, in my opinion. People first learn about mech interp from these blog posts, and then they adopt a similarly sloppy style in discussion.
Frankly, the whole field currently is just a big circle jerk, and it's hard not to think these blog posts are responsible for that.
I mean do you actually think this kind of slop would be publishable in NeurIPS if they submitted the blog post as it is?
"peer review would encourage less hand wavy language and more precise claims"
In theory, yes. Let's not pretend actual peer review would do this.
So you think that this blog post would make it into any of the mainstream conferences? I doubt it.
IME: most of the reviewers in the big ML conferences are second-year phd students sent into the breach against the overwhelming tide of 10k submissions... Their review comments are often somewhere between useless and actively promoting scientific dishonesty.
Sometimes we get good reviewers, who ask questions and make comments which improve the quality of a paper, but I don't really expect it in the conference track. It's much more common to get good reviewers in smaller journals, in domains where the reviewers are experts and care about the subject matter. OTOH, the turnaround for publication in these journals can take a long time.
Meanwhile, some of the best and most important observations in machine learning never went through the conference circuit, simply because the scientific paper often isn't the best venue for broad observation... The OG paper on linear probes comes to mind. https://arxiv.org/pdf/1610.01644
Of the papers submitted to a conference, it might be that reviewers don't offer suggestions that would significantly improve the quality of the work. Indeed the quality of reviews has gone down significantly in recent years. But if Anthropic were going to submit this work to peer review, they would be forced to tighten it up significantly.
The linear probe paper is still written in a format where it could reasonably be submitted, and indeed it was submitted to an ICLR workshop.
What? Yes it is! This is exactly how peer review works! People look at the paper, read it, and then reproduce it, poke holes, etc.
Peer review has nothing to do with "being published in some fancy-looking formatted PDF in some journal after passing an arbitrary committee" or whatever, it's literally review by your peers.
Now, do I have problems with this specific paper and how it's written in a semi-magical way that surely requires the reader suspend disbelief? For sure, but that's completely independent of the "peer-review" aspect of it.
If you believe that citation is the same as review, I have stuff to sell you.
Reviewing a paper can easily take 3 weeks full time work.
Looking at a paper and assuming it is correct, followed by citing it, can literally take seconds.
I'm a researcher and there are definitely two modes of reading papers: review mode and usage mode.
Unless it’s part of a link review farm. I haven’t looked, and you are probably correct; but I would do a bit of research before making any assumptions
Tangential, but the ChatGPT vibe of most of the article is very distracting and annoying. And I say this as someone who consistently uses AI to refine my English. However, I try to avoid letting it reformulate too dramatically, asking it specifically to only fix grammar and non-idiomatic parts while keeping the tone and formulation as much as possible.
Beyond that, this mathematical observation is genuinely fascinating. It points to a crucial insight into how large language models and other AI systems function. By delving into the way high-dimensional data can be projected into lower-dimensional spaces while preserving its structure, we see a crucial mechanism that allows these models to operate efficiently and scale effectively.
Ironically, the use of "fascinating", "crucial" and "delving" in your second paragraph, as well as its overall structure, make it read very much like it was filtered through ChatGPT
I think that was satire
Correct
You have to hope so.
Not liking a writing style because LLMs use it is not a good reason. What exactly do you dislike about it?
Mostly these sentences. QuillBot finds these are 100% AI-generated, but I'm not sure how much we can trust it.
> While exploring this question, I discovered something unexpected that led to an interesting collaboration with Grant and a deeper understanding of vector space geometry.
> When I shared these findings with Grant, his response exemplified the collaborative spirit that makes the mathematics community so rewarding. He not only appreciated the technical correction but invited me to share these insights with the 3Blue1Brown audience. This article is that response, expanded to explore the broader implications of these geometric properties for machine learning and dimensionality reduction.
> The fascinating history of this result speaks to the interconnected nature of mathematical discovery.
> His work consistently inspires deeper exploration of mathematical concepts, and his openness to collaboration exemplifies the best aspects of the mathematical community. The opportunity to contribute to this discussion has been both an honor and a genuine pleasure.
I don't know how to express it, maybe it's because I'm not a native English speaker, but my brain has become used to this kind of tone in AI-generated content and I find it distracting to read. I don't mean to diminish this blog post, which is otherwise very interesting. I'm just pointing out an increasing (and understandable) trend of relying on AI to "improve" prose, but I think it sometimes leads to a uniformity of style, which I find a bit sad.
Wikipedia has a great article[0] which describes the signs of AI writing, and why it prefers not to have those styles in their articles. I agree with almost all of it, and it's far more detailed than I could be in a HN post.
Reading LLM text feels a lot like watching a Dragon Ball Z filler episode.
[0] - https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
What a great article, thanks!
I liked this bit, among others:
> LLMs overuse the 'rule of three'—"the good, the bad, and the ugly". This can take different forms from "adjective, adjective, adjective" to "short phrase, short phrase, and short phrase".[2]
> While the 'rule of three', used sparingly, is common in creative, argumentative, or promotional writing, it is less appropriate for purely informational texts, and LLMs often use this structure to make superficial analyses appear more comprehensive.
> Examples:
> "The Amaze Conference brings together global SEO professionals, marketing experts, and growth hackers to discuss the latest trends in digital marketing. The event features keynote sessions, panel discussions, and networking opportunities."
It is a social signal of not really caring about the content, just like people who post in large swathes of bad grammar and spelling (relative to the local environment).
It is completely reasonable to read that signal, and completely reasonable to conclude that you shouldn't ask me to care more about your content than you did as the "creator".
Moreover, it suggests that your "content" may just be a prompt to an AI, and there is no great value to caching the output of an AI on the web, or asking me to read it. In six months I could take the same prompt and get something better.
Finally, if your article looks like it was generated by AI, AI is still frankly not really at the point where long form output of it on a technical concept is quite a safe thing to consume for deep understanding. It still makes a lot of little errors fairly pervasively, and significant errors not only still occur often, but when they do they tend to corrupt the entire rest of the output as the attention mechanism basically causes the AI to justify its errors with additional correct-sounding output. And if your article sounds like it was generated with AI, am I justified in assuming that you've prevented this from happening with your own expertise? Statistically, probably not.
Disliking the "default tone" of the AI may seem irrational but there are perfectly rational reasons to develop that distaste. I suppose we should be grateful that LLM's "default voice" has been developed into something that wasn't used by many humans prior to them staking a claim on it. I've heard a few people complain about getting tagged as AIs for using emdashes or bullet points prior to AIs, but it's not a terribly common complaint.
Man, that's like, just your opinion, man. Go with the flow man! Peace.
Not OP, but now I intentionally try to use the word "delve" whenever I can.
For sure. I love "delve", it's a useful word.
I've also always been proud of using em and en dashes correctly—including using en dashes for ranges like 12–2pm—but nearly everyone thinks they're an LLM exclusive... so now I really go out of my way to use them just out of spite.
Which parts felt GPT'y to you? The list-happy style?
For me, the GPT feeling started with "tangential" and ended with "effectively".
A key error is that there are literally nowhere close to billions of concepts. It's a misunderstanding of what a concept is as used by us humans. There are an unlimited number of instances and entities, but the concepts we use to think about them are very limited by comparison.
Language models don't "pack concepts" into the C dimension of one layer (I guess that's where the 12k number came from), neither do they have to be orthogonal to be viewed as distinct or separate. LLMs generally aren't trained to make distinct concepts far apart in the vector space either. The whole point of dense representations, is that there's no clear separation between which concept lives where. People train sparse autoencoders to work out which neurons fire based on the topics involved. Neuronpedia demonstrates it very nicely: https://www.neuronpedia.org/.
The sparse autoencoder work is /exactly/ premised on the kind of near-orthogonality that this article talks about. It's called the 'superposition hypothesis' originally: https://transformer-circuits.pub/2022/toy_model/index.html
The SAE's job is to try to pull apart the sparse, nearly-orthogonal 'concepts' from a given embedding vector, by decomposing the dense vector into sparse activations over an over-complete basis. They tend to find that this works well, and even allows matching embedding spaces between different LLMs efficiently.
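For anyone who hasn't seen one, a minimal sketch of an SAE forward pass and loss (a simplification with made-up sizes and an untrained random dictionary, not the exact setup from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feats = 768, 8 * 768             # over-complete: many more features than dims

# Randomly initialized SAE parameters (in a real SAE these are trained).
W_enc = rng.normal(scale=0.02, size=(d_model, d_feats))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(scale=0.02, size=(d_feats, d_model))

def sae_forward(x, l1_coeff=1e-3):
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # feature activations (ReLU encoder)
    x_hat = f @ W_dec                       # reconstruction of the dense activation
    # Reconstruction error plus an L1 penalty that, once trained against,
    # pushes most feature activations to exactly zero (the sparse part).
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * f.sum(axis=-1).mean()
    return f, x_hat, loss

x = rng.normal(size=(32, d_model))          # a batch of residual-stream activations
f, x_hat, loss = sae_forward(x)
print(f.shape, float(loss))                 # (32, 6144) feature activations per example
```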
Agreed, if you relax the requirement for perfect orthogonality, then, yes, you can pack in much more info. You basically introduced additional (fractional) dimensions clustered with the main dimensions. Put another way, many concepts are not orthogonal, but have some commonality or correlation.
So nothing earth shattering here. The article is also filled with words like "remarkable", "fascinating", "profound", etc. that make me feel like some level of subliminal manipulation is going on. Maybe some use of an LLM?
It's... really not what I meant. This requirement does not have to be relaxed, it doesn't exist at all.
Semantic similarity in embedding space is a convenient accident, not a design constraint. The model's real "understanding" emerges from the full forward pass, not the embedding geometry.
I'm speaking more in general conceptual terms, not about the specifics of LLM architecture.
My intuition for this problem is much simpler — assuming there’s some rough hierarchy of concepts, you can guesstimate how many concepts can exist in a 12,000-d space by counting combinations of the dimensions. In that world, each concept is mutually orthogonal with every other concept in at least some dimension. While that doesn’t mean their cosine distance is large, it does mean you’re guaranteed a function that can linearly separate the two concepts.
It means you get 12,000! (Factorial) concepts in the limit case, more than enough room to fit a taxonomy
You can only get 12,000! concepts if you pair each concept with an ordering of the dimensions, which models do not do. A vector in a model that has [weight_1, weight_2, ... weight_12000] is identical to the vector [weight_2, weight_1, ..., weight_12000] within the larger model.
Instead, a naive mental model of a language model is to have a positive, negative or zero trit in each axis: 3^12,000 concepts, which is a much lower number than 12000!. Then in practice, almost every vector in the model has all but a few dozen identified axes zeroed because of the limitations of training time.
You’re right. I gave the wrong number. My model implies 2^12000 concepts, because you choose whether or not to include each concept to form your dimensional subspace.
I’m not even referring to the values within that subspace yet, and so once you pick a concept you still get the N degrees of freedom to create a complex manifold.
The main value of the mental model is to build an intuition for how “sparse” high dimensional vectors are without resorting to a 3D sphere.
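To put the counts in this sub-thread side by side (and to show where the 10^43741 figure below comes from), a quick log10 comparison:

```python
import math

# Number of decimal digits in each count discussed above.
print(round(math.lgamma(12_001) / math.log(10)))  # log10(12000!)   ~ 43,741
print(round(12_000 * math.log10(3)))              # log10(3^12000)  ~  5,725
print(round(12_000 * math.log10(2)))              # log10(2^12000)  ~  3,612
```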
That number is far, far, far greater than the number of atoms in the universe (~10^43741 >>>>>>>> ~10^80).
Say there are 10^80 atoms; then there are like 2^(10^80) possible things, and 2^(2^(10^80)) groupings/categorizations/orderings of the things, and so on. You can go higher, and the number of possibilities goes up really fast.
Not surprising since concepts are virtual. There is a person, a person with a partner is a couple. A couple with a kid is a family. That’s 5 concepts alone.
I’m not sure you grok how big a number 10^43741 is.
If we assume that a "concept" is something that can be uniquely encoded as a finite string of English text, you could go up to concepts that are so complex that every single one would take all the matter in the universe to encode (so say 10^80 universes, each with 10^80 particles), and out of 10^43741 concepts you’d still have 10^43741 left undefined.
A concept space of 10^43741 needs about 43741 * log2(10) ≈ 145,000 bits to identify each concept uniquely (by the information-theoretic concept of a bit, which is more a lower bound on what we traditionally think of as bits in the computer world than a match), or about 18,000-ish "bytes", which you can approximate reasonably as a "compressed text size". There's a couple orders of magnitude of fiddling around the edges you can do there, but you still end up with human-sized quantities of information to identify specific concepts in a space that size, rather than massively-larger-than-the-universe sized.
Things like novels come from that space. We sample it all the time. Extremely, extremely sparsely, of course.
Or to put it another way, in a space of a given size, identifying a specific component takes the log2 of the space's size in bits to identify a concept, not something the size of the space itself. 10^43741 is a very large space by our standards, but the log2 of it is not impossibly large.
If it seems weird for models to work in this space, remember that as the models themselves in their full glory are clocking in at multiple hundreds of gigabytes that the space of possible AIs using this neural architecture is itself 2^trillion-ish, which makes 10^43741 look pedestrian. Understanding how to do anything useful with that amount of possibility is quite the challenge.
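A quick back-of-the-envelope check of the log2 arithmetic above (the byte figure only depends on whether you round log2(10) down to 3):

```python
import math

bits = 43741 * math.log2(10)          # log2(10^43741)
print(f"{bits:,.0f} bits  ~ {bits / 8:,.0f} bytes")
# -> about 145,000 bits, i.e. roughly 18 KB; the "16000-ish bytes" figure
#    above comes from rounding log2(10) down to 3.
```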
Somehow that's still an understatement
> While that doesn’t mean their cosine distance is large
There’s a lot of devil in this detail.
A continuing, probably unending, opportunity/tragedy is the under-appreciation of representation learning / embeddings.
The magic of many current valuable models is simply that they can combine abstract "concepts" like "ruler" + "male" and get "king."
This is perhaps the easiest way to understand the lossy text compression that constitutes many LLMs. They're operating in the embedding space, so abstract concepts can be manipulated between input and output. It's like compiling C using something like LLVM: there's an intermediate representation. (obviously not exactly because generally compiler output is deterministic).
This is also present in image models: "edge" + "four corners" is square, etc.
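A toy illustration of that kind of embedding arithmetic, with hand-made 3-d vectors rather than anything a real model learned (the words and axes here are made up purely for the example):

```python
import numpy as np

# Toy "embeddings" with axes (royalty, male, female); real models learn
# thousands of entangled axes, but the arithmetic is the same idea.
emb = {
    "ruler":  np.array([1.0, 0.0, 0.0]),
    "male":   np.array([0.0, 1.0, 0.0]),
    "female": np.array([0.0, 0.0, 1.0]),
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
}

def nearest(v, exclude=()):
    # highest cosine similarity among the stored words, skipping the inputs
    return max((w for w in emb if w not in exclude),
               key=lambda w: np.dot(v, emb[w]) /
                             (np.linalg.norm(v) * np.linalg.norm(emb[w])))

print(nearest(emb["ruler"] + emb["male"], exclude={"ruler", "male"}))      # king
print(nearest(emb["king"] - emb["male"] + emb["female"],
              exclude={"king", "male", "female"}))                         # queen
```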
If you ever played 20 Questions you know that you don't need 1000 dimensions for a billion concepts. These huge vectors can represent way more complex information than just a billion concepts.
In fact they can pack complete poems, with or without typos, and you can ask where in the poem the typo is, which is exactly what happens if you paste one into GPT: somewhere in an internal layer it will distinguish exactly that.
That's not the vector doing that though it is the model. The model is like a trillion dimensional vector.
With binary vectors, 20 dimensions will get you just over a million concepts. For a billion you’ll need 30 questions.
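For the record, the arithmetic behind those two figures:

```python
# 20 yes/no questions index 2^20 outcomes; 30 take you past a billion.
print(2**20, 2**30)   # 1048576 1073741824
```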
If vectors live in an effectively lower-dimensional space than they could, they don't live up to their n-dimensional potential.
Sometimes these things are patched with cosine distance (or even Pearson correlation); see https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally we wouldn't need to, and the vectors would occupy the full space.
I am kind of surprised that the original article does not mention batch normalization and similar operations - these are pretty much created to automatically de-bias and de-correlate values at each layer.
This is like saying computers fit a billion numbers in 32 bits. Each dimension adds a new degree of freedom.
Not to completely plug my own work here, but I also wrote about this for a slightly more mathematical audience (and uhh, a much shorter post): "There are exponentially many vectors with small inner product"
https://lmao.bearblog.dev/exponential-vectors/
For those who are interested in the more "math-y" side of things.
For what it's worth, I don't fully understand the connection between the JL lemma and this "exponentially many vectors" statement, other than the fact that their proof relies on similar concentration behavior.
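A small numerical sketch of the "exponentially many vectors" statement (toy sizes; the achievable epsilon here scales like sqrt(log n / d), so truly tiny inner products need larger d than this):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1_024, 5_000   # several times more vectors than dimensions

# Random unit vectors: pairwise inner products concentrate near zero at
# scale ~1/sqrt(d), which is why exponentially many of them can coexist
# with all pairwise inner products below a fixed epsilon.
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

gram = x @ x.T
np.fill_diagonal(gram, 0.0)
print(f"max |<u,v>| over all pairs: {np.abs(gram).max():.3f}")  # well below 1
```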
What's the point of the relu in the loss function? Its inputs are nonnegative anyway.
Let's try to keep things positive.
I wondered the same. Seems like it would just make a V-shaped loss around the zero, but abs has that property already!
RELU would have made it flat below zero ( _/ not \/). Adding the abs first just makes RELU do nothing.
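A two-line check of that point (numpy stands in here for whatever framework the article actually used):

```python
import numpy as np

x = np.linspace(-3, 3, 7)
relu = lambda v: np.maximum(v, 0.0)

# abs already produces nonnegative values, so a following relu changes nothing.
print(np.array_equal(relu(np.abs(x)), np.abs(x)))  # True
```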
In reality it's probably not a ReLU; modern LLMs use GELU or something more advanced.
Sometimes a cosmic ray might hit the sign bit of the register and flip it to a negative value. So it is useful to pass it through a rectifier to ensure it's never negative, even in this rare case.
Indeed, we should call all idempotent functions twice just in case the first incantation fails to succeed.
In all seriousness, this is not at all how resilience to cosmic interference works in practice, and the probability of any executed instruction or even any other bit being flipped is far greater than the one specific bit you are addressing.
I thought the belt and braces approach was a valuable contribution to AI safety. Better safe than sorry with these troublesome negative numbers!
Well, I guess it's helping to distinguish authors who are doing arithmetic they understand from ones who are copying received incantations around...
The vectors don't need to be orthogonal due to the use of non-linearities in neural networks. The softmax in attention lets you effectively pack as many vectors in 1D as you want and unambiguously pick them out.
Newbie question: when training networks, what mechanism makes the language's concepts end up (almost) orthogonal to each other?
The dimensions should actually be closer to 12000 * (no. of tokens * no. of layers / x)
(where x is a number dependent on architectural features like MLA, GQA, ...)
There is this thing called KV cache which holds an enormous latent state.
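A rough sketch of how big that latent state is, assuming the commonly cited GPT-3 figures (96 layers, d_model = 12288) and plain multi-head attention with no MLA/GQA-style compression; these numbers are assumptions for illustration, not anything from the article:

```python
# Size of the attention KV cache for a GPT-3-scale model (assumed figures).
layers, d_model, context = 96, 12_288, 2_048

kv_values_per_token = 2 * d_model * layers          # keys + values, every layer
total_values = kv_values_per_token * context
print(f"{kv_values_per_token:,} cached values per token")       # 2,359,296
print(f"{total_values:,} values for a {context}-token context")
print(f"~{total_values * 2 / 1e9:.1f} GB at fp16")               # 2 bytes each
```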
I got a bit lost between "C is a constant that determines the probability of success" and then setting C between 4 and 8. Probability should be between 0 and 1, so how does it relate to C?
It's the epsilon^-2 term that actually talks about success, but that is tightly linked with the C term. If you want to decrease epsilon, C goes up.
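One common way the JL bound is written makes the trade-off explicit: k ≳ C·ln(N)/ε², where a larger C buys a smaller failure probability for the random projection. A quick sketch (the C = 8 default is just an illustrative choice, not the article's value):

```python
import math

# Dimensions sufficient to preserve all pairwise distances among N points
# within a factor 1 +/- eps, in the common form k >= C * ln(N) / eps^2.
def jl_dim(n_points: int, eps: float, C: float = 8.0) -> int:
    return math.ceil(C * math.log(n_points) / eps**2)

for eps in (0.3, 0.1, 0.05):
    print(eps, jl_dim(n_points=1_000_000, eps=eps))
# required k grows like 1/eps^2: roughly 1.2k, 11k, and 44k dimensions here
```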
Wow, I think I might just have grasped one of the sources of the problems we keep seeing with LLMs.
Johnson–Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space whose dimension depends on the number of points.
It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).
The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension, not to the number of points.
It's quite probable that we observe violations of topological features of the original manifold when using our too-low-dimensional embedded version to interpolate.
I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:
=== AI in use === If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).
The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension
k ≳ (d_B / ε^2) * log(C / rho).
So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale
rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),
below which you can't keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That's called "folding" and might be the source of some of the problems we observe.
=== AI end ===
Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
If someone is bored and would like to discuss this, feel free to email me.
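A small numeric illustration of the rho* relation above, with made-up values for d_B and epsilon (placeholders, not estimates for any real model):

```python
import math

# rho* ~ C * exp(-(eps^2 / d_B) * k): the smallest scale on the attractor
# that a fixed k-dimensional JL-style projection can still keep separated.
def min_resolvable_scale(k: int, d_B: float, eps: float, C: float = 1.0) -> float:
    return C * math.exp(-(eps**2 / d_B) * k)

for k in (128, 1_024, 12_288):
    print(k, round(min_resolvable_scale(k, d_B=20.0, eps=0.1), 4))
# 128 -> ~0.94, 1024 -> ~0.60, 12288 -> ~0.002: raising k pushes the folding
# scale down; points closer than rho* may collide after projection.
```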
So basically the map projection problem [1] in higher dimensions?
[1] https://en.m.wikipedia.org/wiki/Map_projection
Worse. Map projection means that you cannot have a mapping that preserves elements of the internal geometry: angles and such.
Violation of topology means that a surface wrongly is mapped to one intersecting itself: Think Klein Bottle.
https://en.wikipedia.org/wiki/Klein_bottle
Can you share an actual example demonstrating this potential pathology?
Like many things in ML, this might be a problem in theory but empirically it isn’t important, or is very low on the stack rank of issues with our models.
They don’t capture concepts at all. They capture writings of concepts.
The universe packs in even more concepts: only 3 or 4 dimensions
Interesting that the practical C values were well below the theoretical bounds.
You can also imagine a similar thing with binary vectors. There, two vectors are "orthogonal" if they share no bits that are set to one. So you can encode a huge number of concepts using only a small number of bits in modestly sized vectors, and most of them will be orthogonal.
If they are only orthogonal if they share no bits that are set to one, only one vector, the complement, will be orthogonal, no?
Edit: this is wrong as respondents point out. Clearly I shouldn't be commenting before having my first coffee.
I don't think so. For n=3 you can have 000, 001, 010, 100. All 4 (n+1) are pairwise orthogonal. However, I don't think js8 is correct as it looks like in 2^n you can't have more than n+1 mutually orthogonal vectors, as if any vector has 1 in some place, no other vector can have 1 in the same place.
It's not correct to call them orthogonal, because the definition here isn't a dot product. But that aside, yes, an orthogonal basis can only have as many elements as there are dimensions. The article also mentions that, and then introduces "quasi-orthogonality", which means the dot product is not zero but very small. On bitstrings, it would correspond to overlapping on only a small number of bits. I should have been clearer in my offhand remark. :-)
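A quick sketch of that bitstring picture (sizes are arbitrary): sparse binary codes overlap on at most a couple of set bits, which is the "quasi-orthogonal" behaviour being described.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10_000, 20, 1_000   # d-bit vectors, k set bits each, n "concepts"

codes = np.zeros((n, d), dtype=np.int32)
for i in range(n):
    codes[i, rng.choice(d, size=k, replace=False)] = 1

# Overlap of set bits between distinct codes; "quasi-orthogonal" here means
# this overlap is tiny compared to k.
overlaps = codes @ codes.T
np.fill_diagonal(overlaps, 0)
print("max overlap:", overlaps.max(), "out of", k, "set bits")  # typically 2-3
```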
Your initial statement is still wrong, that you can include a lot of information in a small number of bits. If you have a small number of bits, the overlap will be staggering. Now, that may be OK, but not if you want to represent orthogonal (or even quasi-orthogonal) concepts.
Also, why do you believe dot product cannot be trusted?
Hmm, I think one correction: is (0,0,0) actually a vector? I think that, by definition, an n-dimensional space can have at most n vectors which are all orthogonal to one another.
By the original definition, they can share bits that are set to zero and still be orthogonal. Think of the bits as basis vectors – if they have none in common, they are orthogonal.
For example, 1010 and 0101 are orthogonal, but 1010 and 0011 are not (share the 3rd bit). Though calling them orthogonal is not quite right.
Why not? The 1010 and 0101 are orthogonal.
Your definition of orthogonal is incorrect, in this case.
In the case of binary vectors, don't forget you are working with the finite field of two elements {0, 1}, and use XOR.
Is the definition of dimensions here the same as 2D, 3D, 4D, etc or some other abstract mathematical concept?
*string representations of concepts
Ok.
Now try to separate the "learning the language" from "learning the data".
If we have a model pre-trained on language, does it then learn concepts faster, at the same rate, or differently?
Can we compress just the data, lossily, into an LLM-like kernel that regenerates the input to a given level of fidelity?
There are no "real-world concepts" or "semantic meaning" in LLMs, there are only syntactic relationships among text tokens.
That really stretches the meaning of "syntactic". Humans have thoroughly evaluated LLMs and discovered many patterns that very cleanly map to what they would consider real-world concepts. Semantic properties do not require any human-level understanding; a Python script has specific semantics one may use to discuss its properties, and it has become increasingly clear that LLMs can reason (as in derive knowable facts, extract logical conclusions, compare it to different alternatives; not having a conscious thought process involving them) about these scripts not just by their syntactic but also semantic properties (of course bounded and limited by Rice's theorem).
Do you learn anything from reading books, or is everything you know entirely derived from personal experience?
> posed a fascinating question: How can a relatively modest embedding space of 12,288 dimensions (GPT-3) accommodate millions of distinct real-world concepts?
Because there is a large number of combinations of those 12k dimensions? You don’t need a whole dimension for “evil scientist” if you can have a high loading on “evil” and “scientist.” There is quickly a combinatorial explosion of expressible concepts.
I may be missing something but it doesn’t seem like we need any fancy math to resolve this puzzle.
Space embedding based on arbitrary points never resolves to specifics. Particularly downstream. Words are arbitrary, we remained lazy at an unusually vague level of signaling because arbitrary signals provide vast advantages for the sender and controller of the signal. Arbitrary signals are essentially primate dominance tools. They are uniquely one-way. CS never considered this. It has no ability to subtract that dark matter of arbitrary primate dominance that's embedded in the code. Where is this in embedded space?
LLMs are designed for Western concepts of attributes, not holistic or Eastern ones. There's not one shred of interdependence; each prediction is decontextualized, and the attempt to reorganize by correction only slightly contextualizes. It's the object/individual illusion in arbitrary words that's meaningless. Anyone studying Gentner, Nisbett, or Halliday can take a look at how LLMs use language to see how vacant they are. This list proves this. LLMs are the equivalent of a circus act using language.
"Let's consider what we mean by "concepts" in an embedding space. Language models don't deal with perfectly orthogonal relationships – real-world concepts exhibit varying degrees of similarity and difference. Consider these examples of words chosen at random: "Archery" shares some semantic space with "precision" and "sport" "Fire" overlaps with both "heat" and "passion" "Gelatinous" relates to physical properties and food textures "Southern-ness" encompasses culture, geography, and dialect "Basketball" connects to both athletics and geometry "Green" spans color perception and environmental consciousness "Altruistic" links moral philosophy with behavioral patterns"
aren’t outputs literally conditioned on prior textual context? how is that lacking interdependence?
isn’t learning the probabilistic relationships between tokens an attempt to approximate those exact semantic relationships between words?
Interdependence takes into account the Universe for each thing or idea. There is no such thing as probabilistic in a healthy mind. A probabilistic approach is unhealthy.
https://pubmed.ncbi.nlm.nih.gov/38579270/
edit: looking into this, in terms of the brain and arbitrariness this is likely highly paradoxical, even oxymoronic
>>isn’t learning the probabilistic relationships between tokens an attempt to approximate those exact semantic relationships between words?
This is really a poor manner of resolving the conduit-metaphor condition of arbitrary signals, to falsify them as specific, which is always impossible. This is simple linguistics via animal signal science. If you can't duplicate any response with a high degree of certainty from the output, then the signal is only valid in the most limited time-space condition, and yet it is still arbitrary. CS has no understanding of this.
> Arbitrary signals are essentially primate dominance tools.
What should I read to better understand this claim?
> LLMs are the equivalent of circles act using language.
Circled apes?
Basil Bernstein's 1973 studies comparing English and math comprehension differences in class
Halliday's Language and Society Vol 10
Primate Psychology (Mastripietri)
Apes and Evolution (Tuttle)
Symbolic Species (Deacon)
Origin of Speech (MacNeilage)
That's the tip of the iceberg
edit: As CS doesn't understand the parasitic or viral aspects of language and simply idealizes it, it can't access it. It's more of a black box than the coding of these. I can't understand how CS assumed this would ever work. It makes no sense to exclude the very thing that language is and then automate it.