Using GenAI Tools in Statistical Research Workflow

16 Mar 2026 by Zhenke Wu

Here is the link to the guest lecture slides on “Using GenAI Tools in Statistical Research Workflow”. The lecture was given to students in Michigan Biostatistics 620, Introduction to Health Data Science (Primary Instructor: Peter Song).

The prompts used in the demos are available here.

Create a quick search shortcut for use on the command line

16 Dec 2023 by Zhenke Wu

1. Quick search on command line?

Want to quickly search in your directories? For example, you want to search for .R files that contain a string "latent", and print out the results. You may hope to do something like:

lookfor "latent" ".R"

Or you want to just look for the files without specifying the file type by something like:

lookfor "latent"

Or you want to print the results page by page using less:

lookfor latent | less (the quotation marks around latent are not needed)

If achieved, this could speed up your debugging process when you need to find certain variable names or keywords that you identified as useful.

1.1. Back-story

Here is an excellent solution I have been using (see more explanation at the end):

One line of code that have saved lots of time cumulatively: search for pattern $1 in files that end with $2. Thanks @rafalab for sharing this sometime ago!
19:31:10 ~/bin
$ more lookfor
#!/bin/bash
rg --no-heading -i "$1" --iglob \*$2
— Zhenke Wu (吴振科) (@ZhenkeWu) April 11, 2019

This was motivated by Rafa Irizarry, my professor at Hopkins at the time, who taught my first-year linear regression class, where we ran R and compiled files in Emacs (I had not been exposed to Unix before my PhD):

Why did I wait 20+ years to write this? I've looked it up like 500 times!

$ less ~/bin/lookfor
#!/bin/bash
grep -r --include=\*.$2 $1 ./ | less

Don't leave for tomorrow the Unix shell script you can write today!
— Rafael Irizarry (@rafalab) October 9, 2018

As you can see by comparing the code in the two tweets, I use a different tool, rg (ripgrep), which is faster, but both versions should work.

Too many outputs? You may append "| less" when you use the command, e.g., lookfor latent | less, to page through the results when lookfor finds very many files, or long files, that contain the target string latent. However, piping through less makes the results lose their color highlighting. This is why I do not include | less by default: the highlighted results are easier on my eyes.
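One possible way to keep the highlighting while paging is sketched below. It is an assumption on my part, not part of the original script: ripgrep turns color off when its output is not a terminal, so the sketch forces color with --color=always and asks less to pass the color codes through with -R. The function name lookfor_paged is hypothetical.

```shell
#!/bin/bash
# Hypothetical variant of lookfor (the name lookfor_paged is an assumption).
# --color=always keeps match highlighting even though output goes to a pipe;
# --line-number restores the line numbers rg omits when it is piped;
# less -R renders the ANSI color codes instead of printing them literally.
lookfor_paged() {
  rg --color=always --line-number --no-heading -i "$1" --iglob "*$2" | less -R
}
```

After sourcing this definition in your shell, lookfor_paged latent .R would page through the colored results.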

2. Set things up

Create the file at /full/path/to/your/file. For example, I created a file ~/bin/lookfor containing the following bash code:
```
#!/bin/bash
rg --no-heading -i "$1" --iglob \*$2
```
Make the file executable:
```
chmod +x /full/path/to/your/file
```
Create a symbolic link to the file by following this. The symlink may live in a different location, /usr/local/bin/name_of_new_command. I created a symbolic link with the same name, lookfor (it does not have to be the same).
```
sudo ln -s /full/path/to/your/file /usr/local/bin/name_of_new_command
```

2.1. Example

For example, because I created the lookfor file at ~/bin/lookfor, I needed to replace /full/path/to/your/file with ~/bin/lookfor; I wanted to type lookfor to execute the file, so I replaced name_of_new_command with lookfor.

Note that we just created a symbolic link in a different directory, /usr/local/bin, that points to the now-executable file ~/bin/lookfor.

Now you may type lookfor with appropriate arguments to do the quick search, save your time, and be less cranky!
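To try the three steps without touching /usr/local/bin or needing sudo, here is a minimal sandbox sketch. The temporary directories and the variable name sandbox are hypothetical stand-ins for the real paths above, and the sketch only checks that the shell resolves the new command name; it does not run a search.

```shell
#!/bin/bash
# Sandbox walk-through of the three steps; all paths are hypothetical
# temporary locations, not the ones used in the post.
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/src" "$sandbox/bin"

# Step 1: create the script file (quoted 'EOF' keeps $1 and $2 literal).
cat > "$sandbox/src/lookfor" <<'EOF'
#!/bin/bash
rg --no-heading -i "$1" --iglob \*$2
EOF

# Step 2: make it executable.
chmod +x "$sandbox/src/lookfor"

# Step 3: link it under the new command name in a directory on PATH.
ln -s "$sandbox/src/lookfor" "$sandbox/bin/lookfor"
PATH="$sandbox/bin:$PATH"

# Show which file the shell now resolves for the command name.
command -v lookfor
```

The PATH change lasts only for the current shell session, so removing the sandbox directory afterwards undoes everything.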

3. More explanation (by GPT-4)

#!/bin/bash
rg --no-heading -i "$1" --iglob \*$2	

This Bash script snippet utilizes rg, which is the command for Ripgrep, a fast search tool. The script performs a case-insensitive search for the pattern specified by $1 in files that match the glob pattern *$2.

Here’s a breakdown of the options used:

--no-heading: This option tells Ripgrep not to group matches under a file-name heading; instead, the file path is printed as a prefix on every matching line.
-i: This flag makes the search case-insensitive, meaning it will match both upper and lower case letters.
$1: This is a placeholder for the first argument passed to the script. It represents the search pattern.
--iglob \*$2: The --iglob option restricts the search to files whose names match a glob, ignoring case. The backslash keeps the shell from expanding *, so Ripgrep receives the glob *$2, where * matches any sequence of characters and $2 is a placeholder for the second argument to the script, typically a file extension.

Overall, when you run this script with two arguments, it searches for the first argument (a text pattern) in all files that match the pattern given by the second argument (typically a file extension), without regard to case, and outputs each matching line prefixed by the path of its file. If the second argument is omitted, the glob is just *, which places no restriction on the file name.
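To see what the word \*$2 becomes before Ripgrep receives it, here is a small sketch that prints the glob instead of running a search. The helper name show_glob is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original post): show the glob that the
# word \*$2 in lookfor turns into for a given second argument.
show_glob() {
  set -- "pattern" "$1"   # mimic lookfor's two positional arguments
  printf '%s\n' \*$2      # same word as in the script, printed instead of passed to rg
}

show_glob ".R"   # prints *.R
show_glob        # prints * (no second argument, so any file name matches)
```

Because the backslash quotes the *, the shell does not expand it against the files in the current directory; the expansion is left to Ripgrep.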

Building R Package for Reproducibility: Why, When and How

11 Sep 2020 by Zhenke Wu

As part of our regular group meeting, we discussed R package building for reproducible and widely cited research. In my opinion, statistical theory and methods often get used by more analysts and researchers when there is a public and freely available package. Such work will likely receive more critical feedback, but this is generally considered a good sign for useful data science products.

The slides are here; please feel free to send comments.

Personal Tips for Academic Statisticians Working in OSX

11 Jul 2019 by Zhenke Wu

This will be updated regularly as I collect tips adopted over the years. Check back later if you find them useful.

Math Equations

We do plenty of math, so I’d like to test out MathJax support.

Here is an example of MathJax inline rendering — $ 1/x^{2} $. And here is a block rendering:

\[r_{XY} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}}\]

Now, if we’d like to get serious, we’d do something involving multiline aligned equations, like

\[\begin{align} \mathcal{N}(t, \mu, \sigma) &= \mathrm{normal} \newline &= \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(t-\mu)^2}{2 \sigma^2}} \end{align}\]

or even an inline formula like $ \sum_{t=0}^{\infty} \frac{x^t}{t!} = e^x$.

Or we could try defining a command, like this. $ \newcommand{\water}{\mathrm{H}_{2}\mathrm{O}} $

Buffer slides off the sides of our tubes like $\water$ off a duck’s back.

Or a more fancy set of equations:

\[\begin{align} \mbox{Union: } & A\cup B = \{x\mid x\in A \mbox{ or } x\in B\} \\ \mbox{Concatenation: } & A\circ B = \{xy\mid x\in A \mbox{ and } y\in B\} \\ \mbox{Star: } & A^\star = \{x_1x_2\ldots x_k \mid k\geq 0 \mbox{ and each } x_i\in A\} \\ \end{align}\]

Or to write the case likelihood function of the PLCM model (Wu et al. 2015):

\[Pr(\boldsymbol{M}_i \mid I_i=1) = \sum_{\ell=1}^L\pi_\ell\theta_\ell^{M_{i\ell}}(1-\theta_\ell)^{1-M_{i\ell}}\prod_{j\neq \ell}\psi_j^{M_{ij}}(1-\psi_j)^{1-M_{ij}}\]

One can also use some doses of number theory…

{% include JB/video id="0Oazb7IWzbA" provider="youtube" %}