I hope libraries are figuring out how to archive today’s absolutely remarkable but potentially illicitly created AIs.
Large language models like GPT-3 are trained by hoovering up all the text on the internet. Image synthesis AIs are a language model plus another AI trained on all the images that are similarly hoovered up. It’s all pretty indiscriminate.
FOR EXAMPLE: Andy Baio and Simon Willison built an interactive explorer for some of the training images in Stable Diffusion (exploring 12 million of the 2.3 billion images included) - unsurprisingly there’s a lot of commercial art there. And that’s why you can say “in the style of David Hockney” or whatever in an image prompt and it comes back looking like a previously-unknown Hockney print.
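For a sense of what that exploration involves under the hood: the LAION metadata is published as parquet files, and a caption search is a few lines of pandas. A minimal sketch - the shard filename is illustrative, and URL and TEXT are, as far as I know, the column names the published metadata uses:

```python
# A minimal sketch: searching a public LAION metadata shard for an artist's
# name, roughly what the Baio/Willison explorer does at scale.
# Assumes a local copy of one parquet shard with URL and TEXT columns.
import pandas as pd

df = pd.read_parquet("laion2B-en-part-00000.parquet", columns=["URL", "TEXT"])

# Case-insensitive caption search; na=False skips rows with missing captions.
hits = df[df["TEXT"].str.contains("David Hockney", case=False, na=False)]

print(f"{len(hits)} captions mention Hockney in this shard")
print(hits.head(10).to_string(index=False))
```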
ASIDE:
Take a moment to visit Everest Pipkin’s project Lacework (2020), in which they viewed, personally, every single one of the one million 3-second videos in the MIT Moments in Time dataset.
Very slowly, over and over, my body learns the rules and edges of the dataset. I come to understand so much about it; how each source is structured, how the videos are found, the words that are caught in the algorithmic gathering.
I don’t think anyone, anywhere will have such an understanding of what constitutes an AI, and given the growth in datasets, I don’t think anyone could ever again.
“Repetition is devotional,” says Pipkin.
It brings tears to my eyes. So good!
Who owns style?
When it comes to code the problem is even more pointed because code often explicitly has a license attached. GitHub Copilot is an amazing code autocompletion AI – it’s like pair programming. (I can see a near-term future where being a human engineer is more like being an engineering manager today, and you spend your days briefing and reviewing pull requests from your team of AI copilot juniors.)
But it’s trained on GPL code. When code is licensed under the GPL, the authors say that it’s free to use, but any code based on it must also be licensed as GPL. Viral freedom. Now, if I learn how to code by reading GPL code and then go on to work on proprietary code, that’s fine. But used as AI training data?
Legally GitHub Copilot is probably in the clear but it’s also probably not what the authors of the open source, GPL code would have intended.
Simon Willison talks about vegan datasets: “I’m not qualified to speak to the legality of this. I’m personally more concerned with the morality.” It’s a useful distinction.
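What might assembling a vegan code dataset actually involve? A hypothetical sketch: scan each file’s header for GPL markers and drop matches before anything reaches the training corpus. (Real license detection is far more involved than a regex; the directory and patterns here are illustrative only.)

```python
# Hypothetical sketch of a "vegan dataset" filter: drop source files whose
# headers look GPL-licensed before they reach a training corpus. A real
# pipeline would use proper license detection, not a header regex.
import re
from pathlib import Path

GPL_MARKERS = re.compile(
    r"GNU (Affero |Lesser )?General Public License"
    r"|SPDX-License-Identifier:\s*(A?GPL|LGPL)",
    re.IGNORECASE,
)

def is_probably_gpl(path: Path, header_chars: int = 4096) -> bool:
    # License notices conventionally sit in the first few KB of a file.
    head = path.read_text(errors="ignore")[:header_chars]
    return bool(GPL_MARKERS.search(head))

corpus = [p for p in Path("scraped_code").rglob("*.py") if not is_probably_gpl(p)]
print(f"kept {len(corpus)} files after the license filter")
```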
There’s a lot to figure out. Have I Been Trained? is a tool to bring some transparency: as an artist you can search for your own work in the image synthesis training data. It’s the first in a series of tools from a new organisation called Spawning, also including Source+:
… Dryhurst and Herndon are developing a standard they’re calling Source+, which is designed as a way of allowing artists to opt into - or out of - allowing their work to be used as training data for AI. (The standard will cover not just visual artists, but musicians and writers, too.)
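The mechanics of Source+ aren’t spelled out here, but the shape is easy to imagine: before an image enters a training set, the crawler asks a registry whether the creator has opted out. A purely hypothetical sketch - the endpoint and response format below are invented for illustration:

```python
# Purely hypothetical: a dataset crawler honouring a Source+-style opt-out.
# The registry URL and response shape are invented, not Spawning's real API.
import requests

OPT_OUT_REGISTRY = "https://example.org/optout/check"  # hypothetical endpoint

def may_train_on(image_url: str) -> bool:
    resp = requests.get(OPT_OUT_REGISTRY, params={"url": image_url}, timeout=10)
    resp.raise_for_status()
    # Assume the registry answers {"opted_out": true/false} per URL.
    return not resp.json().get("opted_out", False)

urls = ["https://example.com/artwork.jpg"]
training_set = [u for u in urls if may_train_on(u)]
print(f"{len(training_set)} of {len(urls)} URLs cleared for training")
```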
Provenance, attribution, consent, and being compensated for one’s labour (and being able to opt in/out of the market) are all important values. But I can’t quite visualise the eventual shape of the accommodation. The trained AIs are just too valuable; the voices of artists, creatives, and coders are just too diffuse.
v buckingham calls this copyright laundering, as previously discussed in this post about ownership, in which I also said:
Maybe there is a market for a future GPT-PD, where PD stands for public domain, and the AI model is guaranteed to be trained only on public domain and out-of-copyright works.
And litigiously cautious megacorporations like Apple will use GPT-PD for their AI needs, such as autocomplete and auto-composing emails and how Siri has conversations and so on.
The consequence will be that Gen Beta will communicate with the lilt and cadence of copyright-expired Victorian novels, and anyone older (like us) will carry textual tells marking us as born in the Pre-Attribution Age.
Perhaps:
GPT-3 and the LAION-5B dataset, with their gotta-catch-em-all approaches to hoovering up training data, will in the future be seen as just a blip.
ALSO we’re poisoning the groundwater.
Attribution or not, GPT-3, DALL-E, Stable Diffusion and the rest were trained on an internet where synthesised text and images were mostly absent.
DALL-E at least watermarks its output with a rainbow telltale in the bottom right, so these can be excluded from future sets of training data, but other synthesisers don’t.
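A crawler that wanted to honour that telltale could check the bottom-right corner for a multicoloured strip. A rough, hypothetical detector - the strip size and thresholds below are guesses, not DALL-E’s actual spec:

```python
# Rough, hypothetical detector for a DALL-E-style rainbow strip in the
# bottom-right corner - the kind of check a crawler might use to keep
# synthetic images out of future training data. Thresholds are guesses.
from PIL import Image

def has_rainbow_corner(path: str, strip_w: int = 80, strip_h: int = 16) -> bool:
    img = Image.open(path).convert("RGB").convert("HSV")
    w, h = img.size
    corner = img.crop((w - strip_w, h - strip_h, w, h))
    # Count distinct saturated hues; a multicoloured strip has several.
    hues = {hue // 32 for hue, sat, val in corner.getdata()
            if sat > 100 and val > 100}
    return len(hues) >= 4

print(has_rainbow_corner("candidate.png"))
```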
What freaky feedback loops come about when models are being trained on data swept up monthly, but the data has a high proportion of output from previous models?
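A back-of-envelope simulation makes the worry concrete: if the synthetic share of new content keeps compounding, the crawl ends up substantially model output echoing model output. All numbers below are invented:

```python
# Toy simulation of the feedback loop. Suppose the corpus grows by a fixed
# amount of new text each month, and the synthetic share of that new text
# rises as the tools spread. Every number here is made up for illustration.
corpus_human, corpus_synth = 100.0, 0.0
synth_share_of_new = 0.05  # assumption: 5% of new content is model output today

for month in range(1, 37):
    new_text = 10.0
    corpus_human += new_text * (1 - synth_share_of_new)
    corpus_synth += new_text * synth_share_of_new
    synth_share_of_new = min(0.9, synth_share_of_new * 1.1)  # adoption compounds
    if month % 12 == 0:
        frac = corpus_synth / (corpus_human + corpus_synth)
        print(f"year {month // 12}: {frac:.1%} of the crawl is synthetic")
```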
Long story short, today’s AIs are unique, trained as they are on data that is pure - barely contaminated by synthetic output - yet unethically harvested.
Given all of the above, they are perhaps the most complete models we’ll ever get? Future datasets will be edited (for consent) and will be muddied (by synthetic output).
And given that: we have an obligation to save them, right? Troubling provenance or no.
In a funny way I’m reminded of the immortal cell line of Henrietta Lacks – the moral framework wasn’t in place in 1951 to see what we see clearly now: that it wasn’t ok to collect and appropriate Lacks’ cells. But the HeLa cancer cell line has been used in all kinds of advances over the years, and at the point where the moral framework was established, the choice was made to keep the cell line going. (I’d love to learn more about the moral philosophy of this one.)
Tricky.
Anyway.
How does a library save a snapshot of the current DALL-E, the current GPT-3, the current Stable Diffusion? Complete, usable, and frozen.
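The “frozen” part, at least, is tractable: a content-addressed manifest lets an archive verify a snapshot bit-for-bit decades later. A minimal sketch - the directory name is illustrative, and “usable” also demands the code, tokenizer, and inference environment, which this doesn’t capture:

```python
# A minimal sketch of the "frozen" part: a content-addressed manifest of
# model weight files, so an archive can later prove a snapshot is intact.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

snapshot = Path("stable-diffusion-v1-4")  # illustrative directory name
manifest = {str(p.relative_to(snapshot)): sha256_of(p)
            for p in sorted(snapshot.rglob("*")) if p.is_file()}

Path("MANIFEST.json").write_text(json.dumps(manifest, indent=2))
print(f"hashed {len(manifest)} files")
```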
There’s going to be pressure to not retain these AIs, given the stolen words, art, and code inside them. If not that, then the march of upgrades: version 1.1, version 2, a database migration, and at a certain point the mostly proprietary tooling to access the original version of the synthesis models will be gone too. It won’t seem important.
How can they be kept for future research? And for just, you know, history.
I hope there are librarians and archivists working on this, today. I hope that folks from the Internet Archive are already in conversation with OpenAI.
And:
What happens when we find, buried in the model weights, data that is as culturally sensitive as - say - some of the objects appropriated and kept in the British Museum? What arguments are there to be had about data, in centuries to come?