Is DocProduct open source?

Yes — re-search/DocProduct is open source, released under the MIT license.

What language is DocProduct written in?

re-search/DocProduct is primarily written in Jupyter Notebook.

How popular is DocProduct?

re-search/DocProduct has 570 stars on GitHub.

Where can I find DocProduct?

re-search/DocProduct is on GitHub at https://github.com/re-search/DocProduct.

re-search/DocProduct

BERT + GPT-2: a very expensive WebMD search bar

A hackathon project that pipelines two language models to retrieve and generate medical answers, with the authors explicitly begging you not to use it for actual medical advice.

★ 570 stars · Jupyter Notebook · Domain Apps · Language Models

What it does

DocProduct takes a medical question, encodes it with a fine-tuned BioBERT model, runs a FAISS similarity search over 700k Q&A pairs scraped from medical forums including Reddit and WebMD, then feeds the retrieved context to a fine-tuned GPT-2 (117M parameters) to generate an answer. The pieces are glued together with custom Keras feedforward networks and a lot of TensorFlow version contortion.
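The retrieve-then-generate flow can be illustrated with a toy example. This is only a sketch using the Python standard library: the three-dimensional vectors stand in for BioBERT embeddings, the brute-force cosine search stands in for the FAISS index, and the names `normalize`, `top_k`, and the prompt layout are hypothetical, not taken from the repository.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, corpus, k=2):
    """Return (score, index) pairs for the k corpus vectors most similar to query."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(doc))), idx)
        for idx, doc in enumerate(corpus)
    )
    return heapq.nlargest(k, scored)


# Toy "embeddings"; in the real pipeline these would come from the encoder.
answers = ["Drink fluids and rest.", "Apply ice to the ankle.", "See a dentist."]
answer_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
question_vec = [0.8, 0.2, 0.1]

hits = top_k(question_vec, answer_vecs, k=2)
retrieved = [answers[idx] for _, idx in hits]

# Assumed prompt layout: retrieved answers first, then the question.
# A generator model would continue this text to produce the final answer.
prompt = " ".join(retrieved) + " Question: what helps with a cold?"
```

At 700k entries a linear scan like this would still work but gets slow per query, which is presumably why the project reaches for a dedicated similarity-search library.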

The interesting bit

The training trick is the clever part: instead of standard negative sampling, the authors compute every question-answer dot product in a batch, apply a softmax across each row, and take the cross-entropy against a ground-truth pairing matrix. Every other answer in the batch therefore acts as a negative for a given question. It’s a neat workaround for the fact that the embeddings change every training step, which rules out NCE loss.
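A minimal sketch of that loss in plain Python, assuming the ground-truth pairing matrix is the identity (question i belongs with answer i in the batch). The function name `in_batch_softmax_loss` is hypothetical, and the project itself does this in TensorFlow rather than with Python lists.

```python
import math


def in_batch_softmax_loss(q_embs, a_embs):
    """Mean cross-entropy where the correct answer for row i is column i."""
    losses = []
    for i, q in enumerate(q_embs):
        # Dot product of question i with every answer in the batch.
        logits = [sum(x * y for x, y in zip(q, a)) for a in a_embs]
        # Numerically stable log of the softmax denominator for this row.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        # Negative log-probability assigned to the paired answer.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Two toy question embeddings; answers either line up with them or are swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = in_batch_softmax_loss(aligned, aligned)
loss_swapped = in_batch_softmax_loss(aligned, aligned[::-1])
```

The loss is lower when each question scores highest against its own answer, so minimizing it pulls true pairs together and pushes the other in-batch answers away, with no separate pool of precomputed negatives to keep in sync.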

Key highlights

Scraped and wrangled 700k medical Q&A pairs from six different forums, each with its own HTML mess
Re-implemented BERT in TF 2.0 alpha and got it talking to a TF 1.x GPT-2 model via tf.compat.v1.disable_eager_execution
Top-6 finalist in the #PoweredByTF 2.0 Challenge; presented to the TensorFlow team
Provides Colab notebooks for retrieval, training, and an “experimental” end-to-end pipeline
Authors are upfront: “IT SHOULD NOT TO BE USED FOR ACTIONABLE MEDICAL ADVICE” (their caps, their wisdom)

Caveats

Built on TF 2.0.0-alpha0, which is now archaeological; expect dependency pain
The full pipeline is explicitly labeled experimental in the README
Over a terabyte of generated TFRecords/CSV/checkpoints, but the actual model weights live on OneDrive

Verdict

Worth a look if you’re researching medical NLP retrieval architectures or need a case study in mashing BERT and GPT-2 together. Skip it if you want production code, current dependencies, or, heaven forbid, actual medical advice.
