Generative AI in Digital Painting and Large Scale Dataset Creation

Introduction and greetings

  • Members introduce themselves and exchange pleasantries

Generative AI for digital painting

  • Discussion on the potential for creating artist-focused digital painting tools using Generative AI
  • Current methods involve hacking with automatic tools and using Photoshop to achieve desired results

Large scale dataset creation for LLM model training

  • Member asks if anyone has experience creating large scale datasets

The description and link can be mismatched because of extraction errors.

  • https://twitter.com/alexwan55/status/1653437581768663040?t=dDWO7Li2FAECVcszYGsF6A&s=19 - The message in the same link discusses the impracticality of covering all modes of adversarial attack for any model with many parameters due to the curse of dimensionality.
  • https://huggingface.co/aashay96/indic-BloomLM - The link is mentioned in the message as an example of a large scale dataset (300GB) for llm model training.
  • https://github.com/EleutherAI/lm-evaluation-harness: Can you please look at this?
  • https://42papers.com/ - The creator of this website and other websites, including artcompute.com and mindsjs.com, is mentioned in the message. The message also asks for suggestions on specific topics related to model training and deployment.
  • For folks interested in Generative Art, cool results from inpainting and other SD tricks there: https://chat.whatsapp.com/GThJJhoF3cL7QCmrfIoY8J and https://github.com/unum-cloud/usearch. The message also suggests that separate collections may be the cleanest way.
  • https://chat.whatsapp.com/GThJJhoF3cL7QCmrfIoY8J - For folks interested in Generative Art, cool results from inpainting and other SD tricks there.
  • https://github.com/unum-cloud/usearch - No context provided.
  • https://github.com/NimbleBoxAI/ChainFury - From NimbleBox folks in India, [PHONE REMOVED] and friends — low-code for making chat experiences in particular.
  • https://github.com/NimbleBoxAI/ChainFury - A low-code tool for making chat experiences, mentioned in a message from NimbleBox folks in India along with a phone number and a shout-out to Nirant.
  • https://huggingface.co/gemasphi/laprador-document-encoder: Request for invite link to a group and inquiry about the use of langchain or llama index for hybrid embeddings.
  • https://github.com/jerryjliu/llama_index/blob/main/examples/query/CustomRetrievers.ipynb - Examples of custom retrievers for Llama Index, a decent Retriever design mentioned in the conversation.
  • https://python.langchain.com/en/latest/modules/indexes/getting_started.html: The message mentions that the indexes feel like a semantic search interface and thanks the discussion for helping understand the strengths of Langchain and/or Haystack.
  • https://docs.haystack.deepset.ai/docs/document_store - discussed in relation to the strengths of Langchain and Haystack, and discovered as a potential solution for creating a single API for vector DBs. Also mentioned as an example of Haystack’s strong design, particularly in the implementation of PromptNode compared to Langchain.
  • https://vitess.io/ is a tool for sharding, as mentioned in a message about Vitess being a great tool and a primer for sharding at https://aws.amazon.com/what-is/database-sharding/.
  • https://vitess.io/ is a tool for sharding, as explained in a message praising its usefulness.
  • https://www.youtube.com/watch?v=3MqJzMvHE3E - Jeremy Howard introduces Modular, an AI programming language that is 3000x faster than equivalent Python code for matrix multiplication. The programming language is called Mojo and is not Free or Open Source. Razorpay was exploring managed Vitess at https://planetscale.com/.
  • https://open.substack.com/pub/semianalysis/p/google-we-have-no-moat-and-neither?r=2gao6&utm_medium=ios&utm_campaign=post (from an article discussing Google’s lack of a competitive advantage and mentioning Jeremy’s support for TF Swift)
  • https://lmsys.org/blog/2023-03-30-vicuna/ - The message on the webpage talks about using GPT-4 as a judge to rate LLM outputs and the need for further evaluation.
  • If you’re curious about the reason why it’s not a good idea to train models on chatgpts output like Alpaca did, check out this tweet: https://twitter.com/Shahules786/status/1650898925178720256. The message is related to the topic of LLM evaluation and benchmarking.
  • https://github.com/databrickslabs/dolly/blob/master/training/trainer.py - A link to a Python script for training a GPT4-fork finetuned on Python/code, with context about the difference between Generative Pretraining and GPT training objectives.
  • The message mentions “DM’d my list” and includes two URLs: https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table and https://www.reddit.com/r/StableDiffusion/comments/137ex2j/controlnet_tile_can_generate_details_for_each/.
  • https://www.reddit.com/r/StableDiffusion/comments/137ex2j/controlnet_tile_can_generate_details_for_each/ - Discussion about GPT4 for coding in Rust/working with libraries.
  • https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf - A resource from Stanford that explains the big picture of pretraining, fine-tuning, and RLHF in natural language processing. The message suggests that one needs to deep dive into each of those areas for more details.
  • The Twitter link https://twitter.com/pbteja1998/status/1654095756200931328?t=Q6vtkqrBGqOTgRE39s30Gg&s=08 is related to a discussion about preferring Flash Attention over multi-query attention for certain use-cases.
  • Interesting read from Simon Wilson on the disappearing moats in closed source models: https://simonwillison.net/2023/May/4/no-moat/ (related to StarCoder’s use of GPT2 model)
  • https://simonwillison.net/2023/May/4/no-moat/ - Simon Wilson’s article on moats in closed source models disappearing, mentioned in a message about resources for learning about GPT models and evaluating open source models for context-based conversation.
  • LM-evaluation-harness for Huggingface compatible models: https://github.com/EleutherAI/lm-evaluation-harness - recommended as a resource for evaluating open source models comparable to GPT models for context-based conversation, in response to a query about learning more about GPT model training and training for a specific domain. The message also mentions Dolly and OpenAssist as other options, and notes that GPT-JT is comparable to text-davinci-003.
  • LM-evaluation-harness for Huggingface compatible models: https://github.com/EleutherAI/lm-evaluation-harness - The message discusses the comparison of GPT-JT to text-davinci-003 and the surprise at the progress made through training data selection and parameters.
  • The Twitter link https://twitter.com/lmsysorg/status/1653843200975704069?s=46 is mentioned and the message implies that the content of the link is reasonable.