@chauvinSimon
@remi-or @ArthurZucker @McPatate thanks for the great content and format!

One correction I am not 100% sure about:

  • Just below the first figure, it says: "Computing QKᵀ requires O(n²d) operations".
  • Because Q and K are of shape n × A, shouldn't it say "requires O(n²A) operations"?
  • I was confused on first reading, but maybe d is used because all h heads are counted, with d = Ah.
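
To make the two readings concrete, here is a minimal sketch that counts the multiply-adds needed to form the n × n score matrix. It assumes d = Ah as guessed above; the function name `qkt_multiply_adds` and the numeric values are illustrative only and do not come from the post.

```python
def qkt_multiply_adds(n: int, head_dim: int, num_heads: int = 1) -> int:
    """Count multiply-adds to form the n x n score matrix Q K^T for each head."""
    # Each of the n * n entries is a dot product over head_dim elements.
    per_head = n * n * head_dim
    return num_heads * per_head


n, A, h = 1024, 64, 8  # illustrative sizes, not taken from the post
d = A * h              # assumed relation between model dim and head dim

single_head = qkt_multiply_adds(n, A)   # one head: n^2 * A
all_heads = qkt_multiply_adds(n, A, h)  # all heads: h * n^2 * A

# Summed over heads, the count matches n^2 * d.
assert all_heads == n * n * d
```

Under these assumptions, O(n²A) describes a single head and O(n²d) describes all h heads together, so both expressions can be right depending on what is being counted.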

@pcuenca (Member) left a comment

In the context of the post, I think the line is referring to a single attention head. Personally, I think we could just use O(n^2) without loss of generality :)
