【Posted】: 2021-07-12 17:19:45
【Question】:
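I'm trying to write a Python script that "cleans up" scanned images before they are processed with Tesseract. Besides the text, the images contain dust, scanning artifacts, stray lines in the page margins, and so on. Here's what a typical page looks like.
Here is what I have so far. It uses cv2.connectedComponentsWithStats to remove small specks of dust, uses morphological structuring elements to remove horizontal and vertical lines, and then tries to crop the image to the text. It's better than nothing, since it does remove some of the noise, but it sometimes removes actual text as well and leaves some lines in the page margins: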
import logging

import cv2
import numpy as np

# Assumed to be defined elsewhere in the script: path, out_filename, debug
image = cv2.imread(path, 0)  # flag 0 loads the image as grayscale
logging.info('Opening image ' + path)
logging.info('Converting to grayscale...')
_, blackAndWhite = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY_INV)

# Find and exclude small elements
logging.info('Removing small dotted regions (dust, etc.)...')
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, None, None, None, 8, cv2.CV_32S)
sizes = stats[1:, -1]  # CC_STAT_AREA column, skipping the background (label 0)
img2 = np.zeros(labels.shape, np.uint8)
for i in range(0, nlabels - 1):
    if sizes[i] >= 40:  # keep only components of at least 40 pixels
        img2[labels == i + 1] = 255
image = cv2.bitwise_not(img2)
cv2.imwrite(out_filename, image)
logging.info('Writing the modified image...')

# ------ START CROPPING ----- #
image = cv2.imread(out_filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Gaussian blur, then Otsu's threshold
blur = cv2.GaussianBlur(gray, (5, 5), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
logging.info('Applying Otsu\'s Threshold')

# Detect long horizontal and vertical strokes with morphological opening
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 4))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 32))
detected_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
detected_vlines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
for l in [detected_lines, detected_vlines]:
    cnts = cv2.findContours(l, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]  # return shape differs between OpenCV versions
    for c in cnts:
        # Paint over each detected line with a 50-pixel-wide stroke
        cv2.drawContours(thresh, [c], -1, (0, 0, 0), 50)
        cv2.drawContours(image, [c], -1, (255, 255, 255), 50)

# Create rectangular structuring element and dilate
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))
dilate = cv2.dilate(thresh, kernel, iterations=4)
logging.info('Dilating text regions')

try:
    # Find contours of the dilated text regions
    cnts, hierarchy = cv2.findContours(dilate, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    logging.info('Extracting contours')

    # Search for contours and append their corner coordinates into an array
    arr = []
    for i, c in enumerate(cnts):
        x, y, w, h = cv2.boundingRect(c)
        # Exclude oddly shaped elements
        if w / h > 8 or h / w > 1.6:
            continue
        arr.append((x, y))
        arr.append((x + w, y + h))

    # Calculate the coordinates and crop the image
    logging.info('Cropping the image')
    x, y, w, h = cv2.boundingRect(np.asarray(arr))
    image = image[y:y + h, x:x + w]

    if debug:
        logging.info('Showing the image (press "q" to continue)')
        label = "STAGE FOUR: CROPPED IMAGE"
    logging.info('Writing to ' + out_filename)
except cv2.error:
    # If no region survives the filter, the uncropped image is written instead
    pass

cv2.imwrite(out_filename, image)
I'm fairly new to image processing and don't have much experience. I'd love to hear suggestions on how to improve the algorithm!
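The dust-removal loop in the question can also be expressed without a per-label Python loop. What follows is a minimal sketch, not the asker's code: `keep_large_components` and `min_area` are hypothetical names, and `labels` and `stats` are assumed to be the arrays returned by `cv2.connectedComponentsWithStats`. It relies only on NumPy so that it can be tried on a toy array.

```python
import numpy as np


def keep_large_components(labels, stats, min_area=40):
    """Return a uint8 mask (255 = kept) of components with area >= min_area.

    Assumptions: `labels` is the label image and `stats` the statistics
    array from cv2.connectedComponentsWithStats, where the last column is
    the area (cv2.CC_STAT_AREA) and label 0 is the background.
    """
    areas = stats[:, -1]
    keep = np.flatnonzero(areas >= min_area)
    keep = keep[keep != 0]  # never keep the background label
    # Mark every pixel whose label is in the keep list
    return np.where(np.isin(labels, keep), 255, 0).astype(np.uint8)


# Toy example: label 1 covers 4 pixels, label 2 covers 1 pixel
toy_labels = np.array([[0, 1, 1],
                       [0, 1, 1],
                       [2, 0, 0]])
toy_stats = np.array([[0, 0, 3, 3, 4],   # background
                      [1, 0, 2, 2, 4],   # label 1
                      [0, 2, 1, 1, 1]])  # label 2
toy_mask = keep_large_components(toy_labels, toy_stats, min_area=2)
```

With the question's variables this would be called as `img2 = keep_large_components(labels, stats, 40)`, which is intended to produce the same mask as the original loop.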
【Comments】:
-
Removing small dots may also remove the dot over the letter "i".
-
Welcome to Stack Overflow. Questions like this are generally not on-topic on Stack Overflow; they should be asked on codereview.stackexchange.com instead.
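On the first comment's point about the dot over "i": one possible guard, sketched here under assumptions, is to keep a small component when it sits just above a larger one that it overlaps horizontally. The names `labels_to_keep`, `min_area`, and `max_gap` are hypothetical, the threshold values are placeholders that would need tuning per scan resolution, and each row of `stats` is assumed to follow the (x, y, width, height, area) column order that `cv2.connectedComponentsWithStats` uses, with row 0 as the background.

```python
def labels_to_keep(stats, min_area=40, max_gap=15):
    """Return the sorted labels to keep.

    A component is kept if its area is >= min_area, or if it is small but
    lies at most max_gap pixels above a large component whose bounding box
    it overlaps horizontally (for example the dot of an "i" or "j").
    Row 0 (assumed to be the background) is always skipped.
    """
    large = [i for i in range(1, len(stats)) if stats[i][4] >= min_area]
    keep = set(large)
    for i in range(1, len(stats)):
        if i in keep:
            continue
        x, y, w, h, _ = stats[i]
        for j in large:
            bx, by, bw, bh, _ = stats[j]
            overlaps = x < bx + bw and bx < x + w  # horizontal overlap
            gap = by - (y + h)  # vertical distance down to the large box
            if overlaps and 0 <= gap <= max_gap:
                keep.add(i)
                break
    return sorted(keep)


# Toy example: a stem with a dot above it, plus an isolated speck of dust
toy_stats = [
    (0, 0, 100, 100, 9000),  # background
    (10, 20, 4, 20, 80),     # stem of an "i"
    (10, 12, 4, 4, 12),      # dot of the "i", 4 pixels above the stem
    (60, 50, 3, 3, 9),       # isolated dust
]
kept = labels_to_keep(toy_stats)
```

This heuristic would not rescue small marks that have nothing large directly below them, such as periods and commas, so it is only a partial answer to the concern.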